Data Science AMJ Road Map
1. Use Case: Profanity Filter
Problem: Detect profane words and flag content that contains them.
RoadMap:
Feasibility Track (effort: one week)
review existing methods: look-up based and model-based
create positive and negative test data
run the profanity, better_profanity, and profanity_check libraries
ensemble them
evaluate the performance of the individual and ensemble approaches on the test data
design document
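The ensemble and evaluation steps above can be sketched as a majority vote over several detectors, scored with precision and recall on labeled test data. This is a minimal sketch only: the three lookup detectors, their blocklists, and the test sentences are hypothetical stand-ins, and the real libraries (profanity, better_profanity, profanity_check) would be wrapped behind the same text-to-bool interface.

```python
from typing import Callable, List, Sequence, Tuple

# A detector is any function that maps a text to True (profane) or False.
Detector = Callable[[str], bool]


def make_lookup_detector(blocklist: set) -> Detector:
    """Build a look-up based detector from a word list (hypothetical stand-in)."""
    def detect(text: str) -> bool:
        return any(token in blocklist for token in text.lower().split())
    return detect


# Hypothetical detectors with deliberately incomplete word lists.
DETECTORS: List[Detector] = [
    make_lookup_detector({"darn", "heck"}),
    make_lookup_detector({"heck", "dang"}),
    make_lookup_detector({"darn", "dang"}),
]


def majority_vote(detectors: Sequence[Detector], text: str) -> bool:
    """Flag the text when more than half of the detectors flag it."""
    votes = sum(1 for detect in detectors if detect(text))
    return votes * 2 > len(detectors)


def precision_recall(detector: Detector,
                     data: Sequence[Tuple[str, bool]]) -> Tuple[float, float]:
    """Compute (precision, recall) of a detector on (text, is_profane) pairs."""
    tp = fp = fn = 0
    for text, label in data:
        pred = detector(text)
        if pred and label:
            tp += 1
        elif pred and not label:
            fp += 1
        elif not pred and label:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical positive and negative test data.
TEST_DATA = [
    ("what the heck", True),
    ("darn it", True),
    ("dang", True),
    ("hello world", False),
    ("nice day", False),
]

if __name__ == "__main__":
    for i, det in enumerate(DETECTORS):
        print(f"detector {i}:", precision_recall(det, TEST_DATA))
    print("ensemble:",
          precision_recall(lambda t: majority_vote(DETECTORS, t), TEST_DATA))
```

On this toy data each individual detector misses one positive example, while the majority vote recovers all three. Whether the same holds on real data is exactly what the feasibility track is meant to find out.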
Production Track (effort: one week)
Design Review and Documentation (telemetry, and data models for capturing feedback)
Write test cases for positive and negative examples
Add profanity flagging tag to existing Auto Tagging Module
And possibly write an "alert event" if a profanity tag turns positive
Rest of the flow is same as Productionizing Auto Tagging
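One possible shape for the tagging and alerting steps is sketched below. The `TaggedContent` class, the `profanity` tag name, the event payload, and the list used as an alert sink are all assumptions for illustration; the actual Auto Tagging Module interfaces are not described in this document.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class TaggedContent:
    """Hypothetical container for a content item and its tags."""
    content_id: str
    text: str
    tags: Dict[str, bool] = field(default_factory=dict)


def add_profanity_tag(item: TaggedContent,
                      detector: Callable[[str], bool],
                      alert_sink: List[dict]) -> TaggedContent:
    """Set the profanity tag and record an alert event when it is positive."""
    flagged = bool(detector(item.text))
    item.tags["profanity"] = flagged
    if flagged:
        # Hypothetical event payload; a real system would publish this
        # to whatever telemetry or event pipeline the platform uses.
        alert_sink.append({"event": "profanity_alert",
                           "content_id": item.content_id})
    return item


if __name__ == "__main__":
    alerts: List[dict] = []
    item = TaggedContent("c1", "what the heck")
    add_profanity_tag(item, lambda text: "heck" in text.lower(), alerts)
    print(item.tags, alerts)
```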
Task Dependency Graph
1 → 2
(1a | 1b) → 1c → 1d → 1e → 1f
2a → 2b → 2c → 2d
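For planning purposes, a dependency graph like this one can be encoded and grouped into batches of tasks that can run in parallel. The sketch below assumes that "(1a | 1b)" means the two tasks are independent and both precede 1c, and that track 2 starts only after 1f; the task labels follow the graph above.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List, Set

# Each task maps to the set of tasks it depends on (its predecessors).
DEPS: Dict[str, Set[str]] = {
    "1c": {"1a", "1b"},
    "1d": {"1c"},
    "1e": {"1d"},
    "1f": {"1e"},
    "2a": {"1f"},  # assumption: "1 → 2" means track 2 waits for all of track 1
    "2b": {"2a"},
    "2c": {"2b"},
    "2d": {"2c"},
}


def parallel_batches(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Return tasks grouped into batches; tasks in a batch have no open dependencies."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


if __name__ == "__main__":
    for step, batch in enumerate(parallel_batches(DEPS), start=1):
        print(step, batch)
```

The first batch contains 1a and 1b together, which is where the work can be split between two developers.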
External Dependencies
DS VM
DS Resources:
Tech Manager
One DS developer (can be split between two)
Solution Manager: Mohit
TBD: Align it with AMJ Sprint Cycles
2. Use Case: Content Reuse
Problem: Story
RoadMap:
Experiment Track
Data Sourcing (effort: one week)
Pick Maths, NCERT Grade 10 Textbook (which is the basis for NCF or NCERT framework in the platform), convert it to JSON data using
Google ocr-uri
pypdf2
Filter and prepare MVC test data
Taxonomy Enrichment (effort: one week)
Extract reference Taxonomy
Take the reference framework and map the taxonomy pointers to the corresponding section in the digital twin of the textbook
Index this data in ES (effort: one week)
Index enriched taxonomy
Get the pointer text object, given a taxonomy pointer
Write ES Query
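A minimal sketch of the lookup query follows, built as a plain request body so it does not depend on any particular client version. The field names (`taxonomy_pointer`, `section_title`, `text`) are hypothetical, and the exact-match `term` query assumes the pointer field is mapped as a keyword.

```python
import json
from typing import Any, Dict


def build_pointer_query(taxonomy_pointer: str, size: int = 1) -> Dict[str, Any]:
    """Build an Elasticsearch request body that fetches the text for one pointer."""
    return {
        "size": size,
        # Exact match on the pointer; assumes a keyword-mapped field.
        "query": {"term": {"taxonomy_pointer": taxonomy_pointer}},
        # Return only the fields needed downstream (hypothetical names).
        "_source": ["taxonomy_pointer", "section_title", "text"],
    }


if __name__ == "__main__":
    # Hypothetical pointer value, for illustration only.
    body = build_pointer_query("maths/grade10/real-numbers")
    print(json.dumps(body, indent=2))
```

The resulting dictionary can be passed as the search body to whichever Elasticsearch client the team uses.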
Test and Iterate (two weeks)
measure retrieval performance, benchmark against the existing baseline, and iterate
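The document does not name the retrieval metrics to use. As one common choice, the sketch below computes recall@k and mean reciprocal rank (MRR) over a set of queries, so the same function can score both the baseline and the new system. The document ids are made up.

```python
from typing import Dict, List, Sequence, Set, Tuple


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents found in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs: List[Tuple[Sequence[str], Set[str]]], k: int) -> Dict[str, float]:
    """Average both metrics over (ranked results, relevant set) pairs."""
    n = len(runs)
    if n == 0:
        return {"recall@k": 0.0, "mrr": 0.0}
    return {
        "recall@k": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }


if __name__ == "__main__":
    # Hypothetical results for two queries.
    runs = [(["d3", "d1", "d2"], {"d1"}), (["d5", "d4"], {"d5"})]
    print(evaluate(runs, k=1))
```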
If performance is acceptable
Design Review
Prod Roadmapping
Task Dependency Graph
(1a | 1b) → 2 → 3 → 4 → 5
External Dependencies
DS VM
DS Resources:
Tech Manager
Two DS developers
Solution Manager: Mohit
TBD: Align it with AMJ Sprint Cycles
3. (Potential) Use Case: EQB Curation Smart Tagging
Context: Context and Architecture
Problem: Figure out a way to assess the quality of the question bank, and use those quality tags to rank the questions
Experiment Track:
Manual Rules
Check if the Marking Scheme (meant for a teacher) can be graduated to an Answer (that can be shown to a student)
Learn model-based rules (effort: variable, depending on scope)
NLP + Image Processing to detect if a question/answer has special symbols like
polynomial
matrix
shapes ...
Load Rules (effort: 2 days)
Code/Load the rules/models
Apply the Rubrics and Summarize the results (effort: 2 days)
Populate this data back to Google Sheets (as a means of sharing data with EQB team)
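The load, apply, and summarize steps can be sketched as named rule functions run over each question, producing one row per question plus a pass count per rule. The two rules and the `id`, `question`, and `marking_scheme` fields are hypothetical placeholders; the real rubric comes from the EQB team. Writing the rows to Google Sheets is left out.

```python
from collections import Counter
from typing import Callable, Dict, List

Question = Dict[str, str]
Rule = Callable[[Question], bool]

# Hypothetical manual rules, keyed by the quality tag they produce.
RULES: Dict[str, Rule] = {
    "has_marking_scheme": lambda q: bool(q.get("marking_scheme", "").strip()),
    "question_not_trivial": lambda q: len(q.get("question", "").split()) >= 3,
}


def apply_rules(questions: List[Question], rules: Dict[str, Rule]) -> List[dict]:
    """Return one row per question with a True/False value for every rule."""
    rows = []
    for q in questions:
        row = {"id": q["id"]}
        for name, rule in rules.items():
            row[name] = bool(rule(q))
        rows.append(row)
    return rows


def summarize(rows: List[dict], rules: Dict[str, Rule]) -> Dict[str, int]:
    """Count how many questions pass each rule."""
    passed = Counter()
    for row in rows:
        for name in rules:
            if row[name]:
                passed[name] += 1
    return {name: passed[name] for name in rules}


if __name__ == "__main__":
    # Hypothetical sample questions.
    sample = [
        {"id": "q1", "question": "Find the roots of x^2 - 5x + 6 = 0",
         "marking_scheme": "Factorise; state both roots"},
        {"id": "q2", "question": "Solve", "marking_scheme": ""},
    ]
    rows = apply_rules(sample, RULES)
    print(rows)
    print(summarize(rows, RULES))
```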
Evaluate
Pick a sample and see if rules are holding well against the data
Have the EQB team validate the generated tags.
Summarize the results
If results are acceptable,
Design Review for Prod
DS Resources:
Tech Manager
One/Two DS developers depending on availability and urgency
Solution Manager: Suren
Task Dependency Graph
(1 | 2) → 3 → 4 → 5 → 6 → 7
External Dependencies
Rubric from EQB team
Bandwidth from EQB team for providing validation inputs
Making the Question and Marking Scheme available for easy view by Validation team
DS VM
TBD: Align it with AMJ Sprint Cycles
4. Use Case: Taxonomy Alignment
Context: Context and Architecture
Problem: A Textbook will be created that uses a Framework (Taxonomy) available in the platform, and question-answer packets have to be created against the chapters and topics of that Textbook. However, the Taxonomy terms available in the Question Bank may not align completely with the Taxonomy on the platform. How do we align them?
Experiment Track:
Manual Mapping
Extract Taxonomy Terms from Question Bank
Done using scripts.
Select the corresponding Framework for a given board, subject, grade, and medium
All existing Frameworks were extracted and kept in a spreadsheet
Manually align the terms
Once the terms are mapped, every Question can be tagged to Framework on the platform
Mapping via Search
Take data at the end of previous step as ground truth
Use Enriched Taxonomy from Content Reuse Use Case (see above)
Enrich Question Data (extract Text data from the Questions, do Smart Tagging)
Treat Mapping as a Search Problem: given a Question, which Taxonomy Term should it be mapped to?
Validate based on ground truth above
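As a baseline for the search formulation, the sketch below ranks taxonomy terms for a question by TF-IDF cosine similarity and checks the top hit against the manually mapped ground truth. The scoring method is an illustrative choice, not something this roadmap prescribes, and the term ids, descriptions, and questions are made up.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase and split into alphanumeric tokens (no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(terms: Dict[str, str]) -> Tuple[Dict[str, Counter], Dict[str, float]]:
    """Index taxonomy terms: token counts per term plus smoothed IDF weights."""
    docs = {term_id: Counter(tokenize(text)) for term_id, text in terms.items()}
    df = Counter()
    for counts in docs.values():
        df.update(counts.keys())
    n = len(docs)
    idf = {tok: math.log((1 + n) / (1 + d)) + 1.0 for tok, d in df.items()}
    return docs, idf


def vectorize(counts: Counter, idf: Dict[str, float]) -> Dict[str, float]:
    """TF-IDF vector; tokens unseen at index time are dropped."""
    return {tok: c * idf[tok] for tok, c in counts.items() if tok in idf}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_terms(question: str, docs: Dict[str, Counter],
               idf: Dict[str, float]) -> List[str]:
    """Return taxonomy term ids, best match first (ties broken by id)."""
    qv = vectorize(Counter(tokenize(question)), idf)
    scored = [(cosine(qv, vectorize(c, idf)), term_id) for term_id, c in docs.items()]
    return [term_id for _, term_id in sorted(scored, key=lambda s: (-s[0], s[1]))]


def accuracy_at_1(ground_truth: Dict[str, str], docs: Dict[str, Counter],
                  idf: Dict[str, float]) -> float:
    """Share of questions whose top-ranked term matches the manual mapping."""
    if not ground_truth:
        return 0.0
    hits = sum(1 for q, term in ground_truth.items()
               if rank_terms(q, docs, idf)[0] == term)
    return hits / len(ground_truth)


if __name__ == "__main__":
    # Hypothetical enriched taxonomy terms and manually mapped questions.
    terms = {
        "quadratic_equations": "quadratic equations roots discriminant",
        "triangles": "triangles similarity congruence pythagoras theorem",
    }
    truth = {
        "Find the roots of the quadratic equation x^2 - 5x + 6 = 0": "quadratic_equations",
        "Prove that the two triangles are similar": "triangles",
    }
    docs, idf = build_index(terms)
    print(accuracy_at_1(truth, docs, idf))
```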
Production Track:
Proposal and Review
Define the data models
Ingest the Platform taxonomy terms or have a lookup (from step 3 above)
DS Resources:
Tech Manager
One/Two DS developers depending on availability and urgency
Solution Manager: Suren
Task Dependency Graph
(1 | 2) → 3 → 4 → 5 → 6 → 7
External Dependencies
Rubric from EQB team
Bandwidth from EQB team for providing validation inputs
Making the Question and Marking Scheme available for easy view by Validation team
DS VM
TBD: Align it with AMJ Sprint Cycles
5. (Explore) Use Cases: Content Reuse and EQB Search
Problem: Content Reuse and EQB are two sides of the same coin: both are search problems at their core. We did a quick and dirty POC here to try Deep Learning models. The results were very encouraging, but we have to be cautious because the model is probably overfit, and we have not yet done a thorough evaluation.
Experiment Track A
Validate the results from previous spike, and see if they are generalizing well.
Experiment Track B
Data Sourcing (effort: one week)
prepare query, document data pairs (done)
prepare query, document data pairs based on Taxonomy Enrichment words and Enriched Content Model data (use normalized words only, without lemmatization or stemming)
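A minimal sketch of the normalization for these pairs is shown below: Unicode normalization, lowercasing, and punctuation removal, with inflected forms left untouched. It assumes English text (the token pattern keeps only ASCII letters and digits), and the pair layout is a hypothetical choice.

```python
import re
import unicodedata
from typing import Dict, List


def normalize_words(text: str) -> List[str]:
    """Lowercase and tokenize without lemmatization or stemming."""
    text = unicodedata.normalize("NFKC", text).lower()
    # Assumption: English content, so only ASCII letters and digits are kept.
    return re.findall(r"[a-z0-9]+", text)


def make_pair(query: str, document: str) -> Dict[str, List[str]]:
    """Build one (query, document) training pair of normalized word lists."""
    return {"query": normalize_words(query), "document": normalize_words(document)}


if __name__ == "__main__":
    # Hypothetical query and document text.
    print(make_pair("Real Numbers, and Polynomials!",
                    "Chapter 1: Real Numbers"))
```

Note that "numbers" stays as "numbers" rather than being reduced to "number", which is the behaviour the step above asks for.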
Model Sourcing (effort: 3 days)
Download ELMo and Universal Sentence Encoder models (from TensorFlow Hub), and work with them
Update the architecture (2 days)
Instead of using a BiLSTM, use ConvNets, as the query documents are bags of words and not really sentences
Use Google UST for phrase and sentence representations, and learn a simple classifier (replace the BiLSTMs with a feed-forward network)
Model Search (two weeks)
use DNN with single dense layer as baseline
Use AutoML or NAS or Ray for automatic architecture search
Test and Iterate (two weeks)
measure retrieval performance, benchmark against the existing baseline, and iterate
If performance is acceptable
Design Review
Prod Roadmapping
Task Dependency Graph
A → B
(B1a | B1b) → B2 → (B3a | B3b) → (B4a | B4b) → B5 → B6
External Dependencies
DS VM
DS Resources:
Tech Manager
Two DS developers
Solution Manager: Mohit
TBD: Align it with AMJ Sprint Cycles