Data Science AMJ Roadmap
1. Use Case: Profanity Filter
Problem: Detect profane words and flag such content.
RoadMap:
- Feasibility Track (effort: one week)
- review existing methods: look-up based and model-based
- create positive and negative test data
- run profanity, better_profanity, and profanity_check
- ensemble them
- evaluate the performance on individual and ensemble approaches on test data
- design document
- Production Track (effort: one week)
- Design Review and Documentation (telemetry, and data models for capturing feedback)
- Write test cases for positive and negative examples
- Add profanity flagging tag to existing Auto Tagging Module
- And possibly write an "alert event" if a profanity tag turns positive
The rest of the flow is the same as Productionizing Auto Tagging
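The ensemble step above can be sketched as a majority vote over boolean detectors. In the real spike the callables would wrap the `profanity`, `better_profanity`, and `profanity_check` libraries; the wordlist detectors below are hypothetical stand-ins so the sketch stays self-contained.

```python
# Majority-vote ensemble over several boolean profanity detectors.
# The two wordlists are illustrative stand-ins for the real libraries.

WORDLIST_A = {"damn", "crap"}    # stand-in for the `profanity` lookup
WORDLIST_B = {"damn", "idiot"}   # stand-in for `better_profanity`

def detector_a(text: str) -> bool:
    return any(w in WORDLIST_A for w in text.lower().split())

def detector_b(text: str) -> bool:
    return any(w in WORDLIST_B for w in text.lower().split())

def ensemble_flag(text: str, detectors) -> bool:
    """Flag content when a majority of detectors report profanity."""
    votes = sum(1 for d in detectors if d(text))
    return votes * 2 > len(detectors)

detectors = [detector_a, detector_b]
print(ensemble_flag("what a damn mess", detectors))   # both vote -> True
print(ensemble_flag("clean sentence here", detectors))  # no votes -> False
```

Evaluating this vote against the positive/negative test data gives the individual-vs-ensemble comparison called for in the feasibility track.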
Task Dependency Graph
1 → 2
(1a | 1b) → 1c → 1d → 1e → 1f
2a → 2b → 2c → 2d
External Dependencies
- DS VM
DS Resources:
- Tech Manager
- One DS developer (can be split between two)
- Solution Manager: Mohit
TBD: Align it with AMJ Sprint Cycles
2. Use Case: Content Reuse
Problem: Story
RoadMap:
Experiment Track
- Data Sourcing (effort: one week)
- Pick Maths, NCERT Grade 10 Textbook (which is the basis for NCF or NCERT framework in the platform), convert it to JSON data using
- Google ocr-uri
- pypdf2
- Filter and prepare MVC test data
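The roadmap does not pin down a schema for the textbook's JSON "digital twin", so the shape below is an assumption: one object per chapter, with section-keyed text as it comes out of Google OCR or pypdf2. The chapter/section values are illustrative.

```python
import json

def build_digital_twin(pages):
    """pages: list of (chapter, section, text) tuples from OCR/pypdf2.

    Returns a nested dict: chapter -> section -> extracted text.
    This schema is an assumption, not a platform data model.
    """
    twin = {}
    for chapter, section, text in pages:
        twin.setdefault(chapter, {})[section] = text
    return twin

pages = [
    ("Real Numbers", "1.1", "Euclid's division lemma ..."),
    ("Real Numbers", "1.2", "The Fundamental Theorem of Arithmetic ..."),
    ("Polynomials", "2.1", "Geometrical meaning of the zeroes ..."),
]
twin = build_digital_twin(pages)
print(json.dumps(twin, indent=2))
```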
- Taxonomy Enrichment (effort: one week)
- Extract reference Taxonomy
- Take the reference framework and map the taxonomy pointers to the corresponding section in the digital twin of the textbook
- Index this data in ES (effort: one week)
- Index enriched taxonomy
- Get the pointer text object, given a taxonomy pointer
- Write ES Query
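The pointer lookup could be a simple term query in the ES query DSL. The index mapping isn't specified in the roadmap, so the field names (`pointer`, `section_text`) and the pointer string below are assumptions.

```python
# Sketch of the lookup: given a taxonomy pointer, fetch its enriched
# text object from ES. Field names and the pointer format are assumed,
# not the actual mapping.
def pointer_lookup_query(pointer: str) -> dict:
    return {
        "query": {
            "term": {"pointer": pointer}  # exact match on a keyword field
        },
        "_source": ["pointer", "section_text"],  # return only the text object
        "size": 1,
    }

body = pointer_lookup_query("ncf/maths/g10/real-numbers/1.2")
print(body["query"]["term"]["pointer"])
```

The dict would be passed as the request body to the ES search endpoint for the enriched-taxonomy index.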
- Test and Iterate (two weeks)
- measure retrieval performance, benchmark against the existing baseline, and iterate
- If performance is acceptable
- Design Review
- Prod Roadmapping
Task Dependency Graph
(1a | 1b) → 2 → 3 → 4 → 5
External Dependencies
- DS VM
DS Resources:
- Tech Manager
- Two DS developers
- Solution Manager: Mohit
TBD: Align it with AMJ Sprint Cycles
3. (Potential) Use Case: EQB Curation Smart Tagging
Context: Context and Architecture
Problem: Figure out a way to assess the quality of the question bank, and use those quality tags to rank the questions
Experiment Track:
- Manual Rules
- Check if Marking Schema (meant for a teacher) can be graduated to an Answer (that can be shown to a student)
- Learn Model based Rules (variable depending on scope)
- NLP + Image Processing to detect if a question/answer has special symbols like
- polynomial
- matrix
- shapes ...
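The text half of the special-symbol rule can start as regex patterns over the question text; the image-processing half is separate. The patterns below are illustrative, not tuned production rules.

```python
import re

# Text-side sketch of the special-symbol rules: regexes that flag
# likely polynomial or matrix notation in a question's text.
SYMBOL_RULES = {
    "polynomial": re.compile(r"[a-z]\s*\^\s*\d"),  # e.g. "x^2"
    "matrix": re.compile(r"\bmatrix\b|\bdeterminant\b", re.IGNORECASE),
}

def detect_symbols(question_text: str):
    """Return the symbol tags whose rule fires on the text."""
    return [tag for tag, rule in SYMBOL_RULES.items() if rule.search(question_text)]

print(detect_symbols("Find the zeroes of x^2 + 5x + 6"))  # ['polynomial']
print(detect_symbols("Compute the determinant of A"))     # ['matrix']
```

The fired tags are what would be written back to Google Sheets for EQB validation in the later steps.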
- Load Rules (effort: 2 days)
- Code/Load the rules/models
- Apply the Rubrics and Summarize the results (effort: 2 days)
- Populate this data back to Google Sheets (as a means of sharing data with EQB team)
- Evaluate
- Pick a sample and see if rules are holding well against the data
- Validate the generated tag by EQB team.
- Summarize the results
- If results are acceptable,
- Design Review for Prod
DS Resources:
- Tech Manager
- One/Two DS developers depending on availability and urgency
- Solution Manager: Suren
Task Dependency Graph
(1 | 2) → 3 → 4 → 5 → 6 → 7
External Dependencies
- Rubric from EQB team
- Bandwidth from EQB team for providing validation inputs
- Making the Question and Marking Scheme available for easy view by Validation team
- DS VM
TBD: Align it with AMJ Sprint Cycles
4. Use Case: Taxonomy Alignment
Context: Context and Architecture
Problem: A Textbook will be created that uses a Framework (Taxonomy) available in the platform, and question-answer packets have to be created against the chapters and topics of that Textbook. However, the Taxonomy terms available in the Question Bank may not align completely with the Taxonomy on the platform. How do we align them?
Experiment Track:
- Manual Mapping
- Extract Taxonomy Terms from Question Bank
- Done using scripts.
- Select a corresponding Framework for a given board, subject, grade, and medium
- All existing Frameworks were extracted and kept in a spreadsheet
- Manually align the terms
- Once the terms are mapped, every Question can be tagged to Framework on the platform
- Mapping via Search
- Take data at the end of previous step as ground truth
- Use Enriched Taxonomy from Content Reuse Use Case (see above)
- Enrich Question Data (extract Text data from the Questions, do Smart Tagging)
- Treat Mapping as a Search Problem: given a Question, which Taxonomy Term should it be mapped to?
- Validate based on ground truth above
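The mapping-as-search step can be sketched as scoring each platform taxonomy term against a question and picking the best match. A real run would query the enriched ES index instead; the token-overlap scoring, terms, and question below are illustrative.

```python
# Toy "mapping as search": rank taxonomy terms by Jaccard overlap
# with the question's tokens and return the top term. Stand-in for
# the ES-backed retrieval described in the roadmap.
def tokenize(text: str):
    return set(text.lower().split())

def best_taxonomy_term(question: str, terms):
    q = tokenize(question)
    scored = [(len(q & tokenize(t)) / len(q | tokenize(t)), t) for t in terms]
    return max(scored)[1]

terms = ["Real Numbers", "Polynomials", "Quadratic Equations"]
print(best_taxonomy_term("Find the roots of the quadratic equation", terms))
```

Accuracy of the predicted term against the manually mapped ground truth from the previous step is the validation metric.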
Production Track:
- Proposal and Review
- Define the data models
- Ingest the Platform taxonomy terms, or maintain a lookup (from step 3 above)
DS Resources:
- Tech Manager
- One/Two DS developers depending on availability and urgency
- Solution Manager: Suren
Task Dependency Graph
(1 | 2) → 3 → 4 → 5 → 6 → 7
External Dependencies
- Rubric from EQB team
- Bandwidth from EQB team for providing validation inputs
- Making the Question and Marking Scheme available for easy view by Validation team
- DS VM
TBD: Align it with AMJ Sprint Cycles
5. (Explore) Use Cases: Content Reuse and EQB Search
Problem: Content Reuse and EQB are two sides of the same coin; both are search problems at their core. We did a quick-and-dirty POC here to try Deep Learning models. The results were very encouraging, but we have to be cautious as the model is probably overfit; we have not done a thorough evaluation.
Experiment Track A
- Validate the results from previous spike, and see if they are generalizing well.
Experiment Track B
- Data Sourcing (effort: one week)
- prepare query, document data pairs (done)
- prepare query, document data pairs based on Taxonomy Enrichment words, and Enriched Content Model data (use normalized words only, without lemmatization and stemming)
- Model Sourcing (effort: 3 days)
- Download ELMo and Universal Sentence Encoder models (from TensorFlow Hub), and work with them
- Update the architecture (2 days)
- Instead of using BiLSTMs, use ConvNets, as query documents are bags of words and not really sentences
- Use Google USE as phrase/sentence representations, and learn a simple classifier (replace BiLSTMs with a feed-forward network)
- Model Search (two weeks)
- use a DNN with a single dense layer as the baseline
- Use AutoML or NAS or Ray for automatic architecture search
- Test and Iterate (two weeks)
- measure retrieval performance, benchmark against the existing baseline, and iterate
- If performance is acceptable
- Design Review
- Prod Roadmapping
Task Dependency Graph
A → B
(B1a | B1b) → B2 → (B3a | B3b ) → (B4a | B4b) → B5 → B6
External Dependencies
- DS VM
DS Resources:
- Tech Manager
- Two DS developers
- Solution Manager: Mohit
TBD: Align it with AMJ Sprint Cycles