Data Science AMJ Roadmap
1. Use Case: Profanity Filter
Problem: Detect profane words and flag such content.
RoadMap:
- Feasibility Track (effort: one week)
- review existing methods: look-up based and model-based
- create positive and negative test data
- run profanity, better_profanity, and profanity_check
- ensemble them
- evaluate the performance on individual and ensemble approaches on test data
- design document
- Production Track (effort: one week)
- Design Review and Documentation (telemetry, and data models for capturing feedback)
- Write test cases for positive and negative examples
- Add profanity flagging tag to existing Auto Tagging Module
- And possibly write an "alert event" if a profanity tag turns positive
The rest of the flow is the same as Productionizing Auto Tagging
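The ensemble step above can be sketched as a majority vote over boolean detectors. In the real spike the callables would wrap the `profanity`, `better_profanity`, and `profanity_check` libraries; the wordlist detectors below are hypothetical stand-ins so the sketch stays self-contained.

```python
# Majority-vote ensemble over several boolean profanity detectors.
# The two wordlists are illustrative stand-ins for the real libraries.

WORDLIST_A = {"damn", "crap"}    # stand-in for the `profanity` lookup
WORDLIST_B = {"damn", "idiot"}   # stand-in for `better_profanity`

def detector_a(text: str) -> bool:
    return any(w in WORDLIST_A for w in text.lower().split())

def detector_b(text: str) -> bool:
    return any(w in WORDLIST_B for w in text.lower().split())

def ensemble_flag(text: str, detectors) -> bool:
    """Flag content when a majority of detectors report profanity."""
    votes = sum(1 for d in detectors if d(text))
    return votes * 2 > len(detectors)

detectors = [detector_a, detector_b]
print(ensemble_flag("what a damn mess", detectors))   # both vote -> True
print(ensemble_flag("clean sentence here", detectors))  # no votes -> False
```

Evaluating this vote against the positive/negative test data gives the individual-vs-ensemble comparison called for in the feasibility track.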
Task Dependency Graph
1 → 2
(1a | 1b) → 1c → 1d → 1e → 1f
2a → 2b → 2c → 2d
External Dependencies
- DS VM
DS Resources:
- Tech Manager
- One DS developer (can be split between two)
- Solution Manager: Mohit
TBD: Align it with AMJ Sprint Cycles
2. Use Case: Content Reuse
Problem: Story
RoadMap:
Experiment Track
- Data Sourcing (effort: one week)
- Pick Maths, NCERT Grade 10 Textbook (which is the basis for NCF or NCERT framework in the platform), convert it to JSON data using
- Google ocr-uri
- pypdf2
- Filter and prepare MVC test data
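The roadmap does not pin down a schema for the textbook's JSON "digital twin", so the shape below is an assumption: one object per chapter, with section-keyed text as it comes out of Google OCR or pypdf2. The chapter/section values are illustrative.

```python
import json

def build_digital_twin(pages):
    """pages: list of (chapter, section, text) tuples from OCR/pypdf2.

    Returns a nested dict: chapter -> section -> extracted text.
    This schema is an assumption, not a platform data model.
    """
    twin = {}
    for chapter, section, text in pages:
        twin.setdefault(chapter, {})[section] = text
    return twin

pages = [
    ("Real Numbers", "1.1", "Euclid's division lemma ..."),
    ("Real Numbers", "1.2", "The Fundamental Theorem of Arithmetic ..."),
    ("Polynomials", "2.1", "Geometrical meaning of the zeroes ..."),
]
twin = build_digital_twin(pages)
print(json.dumps(twin, indent=2))
```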
- Taxonomy Enrichment (effort: one week)
- Extract reference Taxonomy
- Take the reference framework and map the taxonomy pointers to the corresponding section in the digital twin of the textbook
- Index this data in ES (effort: one week)
- Index enriched taxonomy
- Get the pointer text object, given a taxonomy pointer
- Write ES Query
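The pointer lookup could be a simple term query in the ES query DSL. The index mapping isn't specified in the roadmap, so the field names (`pointer`, `section_text`) and the pointer string below are assumptions.

```python
# Sketch of the lookup: given a taxonomy pointer, fetch its enriched
# text object from ES. Field names and the pointer format are assumed,
# not the actual mapping.
def pointer_lookup_query(pointer: str) -> dict:
    return {
        "query": {
            "term": {"pointer": pointer}  # exact match on a keyword field
        },
        "_source": ["pointer", "section_text"],  # return only the text object
        "size": 1,
    }

body = pointer_lookup_query("ncf/maths/g10/real-numbers/1.2")
print(body["query"]["term"]["pointer"])
```

The dict would be passed as the request body to the ES search endpoint for the enriched-taxonomy index.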
- Test and Iterate (two weeks)
- measure retrieval performance, benchmark against the existing baseline, and iterate
- If performance is acceptable
- Design Review
- Prod Roadmapping
Task Dependency Graph
(1a | 1b) → 2 → 3 → 4 → 5
External Dependencies
- DS VM
DS Resources:
- Tech Manager
- Two DS developers
- Solution Manager: Mohit
TBD: Align it with AMJ Sprint Cycles
3. (Potential) Use Case: EQB Curation Smart Tagging
Context: Context and Architecture
Problem: Figure out a way to assess the quality of the question bank, and use those quality tags to rank the questions
Experiment Track:
- Manual Rules
- Check if Marking Schema (meant for a teacher) can be graduated to an Answer (that can be shown to a student)
- Learn Model based Rules (variable depending on scope)
- NLP + Image Processing to detect if a question/answer has special symbols like
- polynomial
- matrix
- shapes ...
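The text half of the special-symbol rule can start as regex patterns over the question text; the image-processing half is separate. The patterns below are illustrative, not tuned production rules.

```python
import re

# Text-side sketch of the special-symbol rules: regexes that flag
# likely polynomial or matrix notation in a question's text.
SYMBOL_RULES = {
    "polynomial": re.compile(r"[a-z]\s*\^\s*\d"),  # e.g. "x^2"
    "matrix": re.compile(r"\bmatrix\b|\bdeterminant\b", re.IGNORECASE),
}

def detect_symbols(question_text: str):
    """Return the symbol tags whose rule fires on the text."""
    return [tag for tag, rule in SYMBOL_RULES.items() if rule.search(question_text)]

print(detect_symbols("Find the zeroes of x^2 + 5x + 6"))  # ['polynomial']
print(detect_symbols("Compute the determinant of A"))     # ['matrix']
```

The fired tags are what would be written back to Google Sheets for EQB validation in the later steps.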
- Load Rules (effort: 2 days)
- Code/Load the rules/models
- Apply the Rubrics and Summarize the results (effort: 2 days)
- Populate this data back to Google Sheets (as a means of sharing data with EQB team)
- Evaluate
- Pick a sample and see if rules are holding well against the data
- Validate the generated tag by EQB team.
- Summarize the results
- If results are acceptable,
- Design Review for Prod
DS Resources:
- Tech Manager
- One/Two DS developers depending on availability and urgency
- Solution Manager: Suren
Task Dependency Graph
(1 | 2) → 3 → 4 → 5 → 6 → 7
External Dependencies
- Rubric from EQB team
- Bandwidth from EQB team for providing validation inputs
- Making the Question and Marking Scheme available for easy view by Validation team
- DS VM
TBD: Align it with AMJ Sprint Cycles
4. Use Case: Taxonomy Alignment
Context: Context and Architecture
Problem: A Textbook will be created that uses a Framework (Taxonomy) available in the platform, and question-answer packets have to be created against the chapters and topics of that Textbook. However, the Taxonomy terms available in the Question Bank may not align completely with the Taxonomy on the platform. How do we align them?
Experiment Track:
- Manual Mapping
- Extract Taxonomy Terms from Question Bank
- Done using scripts.
- Select a corresponding Framework for a given board, subject, grade, and medium
- All existing Frameworks were extracted and kept in a spreadsheet
- Manually align the terms
- Once the terms are mapped, every Question can be tagged to Framework on the platform
- Mapping via Search
- Take data at the end of previous step as ground truth
- Use Enriched Taxonomy from Content Reuse Use Case (see above)
- Enrich Question Data (extract Text data from the Questions, do Smart Tagging)
- Treat Mapping as a Search Problem: given a Question, which Taxonomy Term should it be mapped to?
- Validate based on ground truth above
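The mapping-as-search step can be sketched as scoring each platform taxonomy term against a question and picking the best match. A real run would query the enriched ES index instead; the token-overlap scoring, terms, and question below are illustrative.

```python
# Toy "mapping as search": rank taxonomy terms by Jaccard overlap
# with the question's tokens and return the top term. Stand-in for
# the ES-backed retrieval described in the roadmap.
def tokenize(text: str):
    return set(text.lower().split())

def best_taxonomy_term(question: str, terms):
    q = tokenize(question)
    scored = [(len(q & tokenize(t)) / len(q | tokenize(t)), t) for t in terms]
    return max(scored)[1]

terms = ["Real Numbers", "Polynomials", "Quadratic Equations"]
print(best_taxonomy_term("Find the roots of the quadratic equation", terms))
```

Accuracy of the predicted term against the manually mapped ground truth from the previous step is the validation metric.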
Production Track:
- Proposal and Review
- Define the data models
- Ingest the Platform taxonomy terms, or maintain a lookup (from step 3 above)
DS Resources:
- Tech Manager
- One/Two DS developers depending on availability and urgency
- Solution Manager: Suren
Task Dependency Graph
(1 | 2) → 3 → 4 → 5 → 6 → 7
External Dependencies
- Rubric from EQB team
- Bandwidth from EQB team for providing validation inputs
- Making the Question and Marking Scheme available for easy view by Validation team
- DS VM
TBD: Align it with AMJ Sprint Cycles
5. (Explore) Use Cases: Content Reuse and EQB Search
Problem: Content Reuse and EQB are two sides of the same coin; both are search problems at their core. We did a quick-and-dirty POC here to try Deep Learning models. The results were very encouraging, but we have to be cautious as the model is probably overfit; we have not done a thorough evaluation.
Experiment Track A
- Validate the results from previous spike, and see if they are generalizing well.
Experiment Track B
- Data Sourcing (effort: one week)
- prepare query, document data pairs (done)
- prepare query, document data pairs based on Taxonomy Enrichment words, and Enriched Content Model data (use normalized words only, without lemmatization and stemming)
- Model Sourcing (effort: 3 days)
- Download ELMo and Universal Sentence Encoder models (from TensorFlow Hub), and work with them
- Update the architecture (2 days)
- Instead of using BiLSTMs, use ConvNets, as query documents are bags of words and not really sentences
- Use Google USE as phrase/sentence representations, and learn a simple classifier (replace BiLSTMs with a feed-forward network)
- Model Search (two weeks)
- use a DNN with a single dense layer as the baseline
- Use AutoML or NAS or Ray for automatic architecture search
- Test and Iterate (two weeks)
- measure retrieval performance, benchmark against the existing baseline, and iterate
- If performance is acceptable
- Design Review
- Prod Roadmapping
Task Dependency Graph
A → B
(B1a | B1b) → B2 → (B3a | B3b ) → (B4a | B4b) → B5 → B6
External Dependencies
- DS VM
DS Resources:
- Tech Manager
- Two DS developers
- Solution Manager: Mohit
TBD: Align it with AMJ Sprint Cycles