[Design Brainstorm] - MVC - Content Reuse
Introduction
As a Sourcing Org Contributor, when I use Add from Library, I will see the most relevant content, where relevance is defined by the match between My Textbook and the Textbooks in the Library.
As a Sourcing Org Contributor, I will get two options against each chapter (unit) in the textbook.
(New) Add from Library
Add New
(New) As a Sourcing Org Contributor, I will be able to Add from Library by Exploring and by viewing Suggestions.
Add from Library will offer two modes: “Explore” (or “Library”) and “Suggestions” (or “Recommended”).
(New) As a Sourcing Org Contributor, I will be able to Explore a pre-filtered set of textbooks, chapter (topic) wise.
(New) As a Sourcing Org Contributor, I will be able to Preview selected content quickly.
(New) As a Sourcing Org Contributor, I will Add Selected Content to any chapter (unit) in My Textbook.
(New) As a Sourcing Org Contributor, I will be able to view Suggested content, Preview, and Add to the Selected chapter (unit) in My Textbook.
Problem Statement
Explore: When a textbook TOC is uploaded, for each chapter/topic, show similar chapters/topics in other textbooks and the content linked to those topics
Based on the filters
Search for chapters/topics in other textbooks using the textbook nodes as the query
Suggest: When a textbook TOC is uploaded, for each chapter/topic, show 5 MVC Content items that can be linked to that chapter/topic
Search for enriched MVC Content using the textbook nodes as the query
Search for enriched MVC Content embeddings using the embedding of the textbook node query
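The two Suggest search strategies above could be sketched as Elasticsearch queries against the `mvc-content` index described later in this document. This is a minimal illustration, assuming the field names from that index structure; the exact fields searched and the query embedding are assumptions:

```python
def keyword_query(node_text):
    """Keyword search: match the textbook node text against MVC content metadata.
    The field list is illustrative, taken from the mvc-content index structure."""
    return {
        "query": {
            "multi_match": {
                "query": node_text,
                "fields": ["name", "description", "ml_keywords", "ml_level1Concepts"],
            }
        },
        "size": 5,  # show 5 MVC Content items per chapter/topic
    }

def vector_query(query_vector):
    """Semantic search: cosine similarity between the textbook-node embedding
    and the stored ml_content_text_vector (Elasticsearch 7.4 script_score)."""
    return {
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    "source": "cosineSimilarity(params.query_vector, 'ml_content_text_vector') + 1.0",
                    "params": {"query_vector": query_vector},
                },
            }
        },
        "size": 5,
    }
```

The `+ 1.0` keeps the script score non-negative, which Elasticsearch 7.x requires for `script_score`.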
MVC Content
Create a standard excel format for MVC Content
State
Type
Board [Important]
Grade [Important]
Subject [Important]
Medium [Important]
Textbook Name [Important]
Chapter No.
Chapter Name (Level 1) [Important]
Chapter Concept Name (Level 1) [Important]
Topic Name (Level 2) [Important]
Topic Concept Name (Level 2) [Important]
Sub Topic Name (Level 3) [Important]
Sub Topic Concept Name (Level 3) [Important]
Source [Important]
Content URL [Important]
The existing excel data needs to be preprocessed so that the standard structure can be achieved.
Please note that one Content URL can have multiple entries.
If a Content URL appears in multiple rows, those rows must be merged in the excel; there should be only one row per Content URL.
In many cases the Content URL is not valid, so those entries will be ignored.
Field values mentioned in the excel will be considered final; the rest of the data will be read from the Diksha Content Read API.
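The preprocessing rules above (one row per Content URL, invalid URLs dropped) could be sketched as follows. This is a stdlib-only illustration; the column name "Content URL" matches the standard format above, and the merge policy (first valid row wins, later duplicates fill blanks) is an assumption:

```python
from urllib.parse import urlparse

def preprocess(rows):
    """Apply the preprocessing rules: drop rows whose Content URL is not a
    valid http(s) URL, and keep exactly one row per Content URL."""
    merged = {}
    for row in rows:
        url = (row.get("Content URL") or "").strip()
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            continue  # ignore entries with an invalid Content URL
        if url not in merged:
            merged[url] = dict(row)  # first valid entry for this URL
        else:
            # merge duplicates: fill any fields the kept row left blank
            for key, value in row.items():
                if not merged[url].get(key):
                    merged[url][key] = value
    return list(merged.values())
```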
Design
Using Elasticsearch 7.4 to make use of vector indexing for semantic search.
Using strict mapping for better performance.
Implementation Flow
Develop a script that reads the data from Google Sheets and merges it into one CSV.
Develop a new mvc-content-create API which accepts the fields and values defined in the excel. It should handle both input formats:
Excel (xls)
JSON
Check whether the Content URL is valid.
Based on the Content URL, extract the Content ID and other properties of the content using Diksha's Content Read API.
Board, Grade, Subject and Medium will be picked from the excel, not from the Diksha content, and passed into the Auto Create event.
Trigger the Auto Create job with some extra parameters in the event JSON:
textbookname
level1name
level1concept
level2name
level2concept
level3name
level3concept
label (MVC)
source [Diksha, iDream, ToonMasti etc...]
sourceurl
The Auto Create job internally calls the Content Create API of Vidyadaan with the appropriate request.
The Auto Create job then internally triggers the Publish Pipeline of Vidyadaan.
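The flow above can be sketched as a small event builder: excel-supplied Board/Grade/Subject/Medium win over the Diksha content metadata, and the extra parameters are carried into the Auto Create event. The shape of the Content Read API result and the exact event envelope are assumptions; the field names come from the lists in this document:

```python
def build_auto_create_event(row, content_read_result):
    """Combine excel fields (authoritative for board/grade/subject/medium, per
    the design) with properties from the Diksha Content Read API into an
    Auto Create event. The response shape here is an assumption."""
    return {
        # excel values win over Diksha content metadata
        "board": row["Board"],
        "gradeLevel": row["Grade"],
        "subject": row["Subject"],
        "medium": row["Medium"],
        # extra parameters carried into the Auto Create job
        "textbookname": row["Textbook Name"],
        "level1name": row.get("Chapter Name (Level 1)"),
        "level1concept": row.get("Chapter Concept Name (Level 1)"),
        "level2name": row.get("Topic Name (Level 2)"),
        "level2concept": row.get("Topic Concept Name (Level 2)"),
        "level3name": row.get("Sub Topic Name (Level 3)"),
        "level3concept": row.get("Sub Topic Concept Name (Level 3)"),
        "label": "MVC",
        "source": row["Source"],
        "sourceurl": row["Content URL"],
        # remaining properties come from the Content Read API response
        "identifier": content_read_result["identifier"],
        "mimeType": content_read_result.get("mimeType"),
    }
```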
Content Create API of Vidyadaan Changes
Change the content definition in Neo4j to handle the additional columns mentioned below:
textbookname
level1name
level1concept
level2name
level2concept
level3name
level3concept
label (MVC)
source [Diksha, iDream, ToonMasti etc...]
sourceurl
The Content Create API will insert this data into Neo4j.
Publish Pipeline Changes
The Publish Pipeline will insert the data, including the additional columns above, into the Vidyadaan content ES.
If the label parameter exists and its value is MVC, the pipeline triggers the mvc-processor pipeline.
The values mentioned below need to be inserted into the MVC ES and Cassandra:
textbookname
level1name
level1concept
level2name
level2concept
level3name
level3concept
label (MVC)
source [Diksha, iDream, ToonMasti etc...]
sourceurl
Elastic Search Index Structure:
Name: mvc-content
Property | Data Type | Tokenization | Group | Description |
---|---|---|---|---|
name | Text | Yes | core - metadata | |
description | Text | Yes | | |
mimeType | Text | No | | |
contentType | Text | No | | |
resourceType | Text | No | | |
artifactUrl | Text | No | | |
streamingUrl | Text | No | | |
previewUrl | Text | No | | |
downloadUrl | Text | No | | |
framework | Text | No | | |
board | Text | Yes | | |
medium | Text | Yes | | |
subject | Text | Yes | | |
gradeLevel | Text | Yes | | |
keywords | Text | Yes | | |
source | Text | Yes | source - metadata | URI of the content. This is the public URI to access the source of the MVC. |
ml_level1Concepts | Text | Yes | ml - metadata | |
ml_level2Concepts | Text | Yes | | |
ml_level3Concepts | Text | Yes | | |
ml_contentText | Text | Yes | | Text extracted from the pdf, video or ecml content |
ml_keywords | Text | Yes | | Keywords identified from ml_contentText |
ml_content_text_vector | Dense vector | No | | Vector representation of ml_contentText and description using a pretrained ML model |
label | Text | Yes | | Tags that represent the content, e.g. MVC |
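The table above, together with the strict-mapping note in the Design section, implies a mapping roughly like the following. This is a sketch, not the final mapping: the choice of `keyword` for non-tokenized fields and the `dims` value (768, tied to whichever pretrained model is used) are assumptions:

```python
# Hypothetical mapping for the "mvc-content" index (Elasticsearch 7.4).
# Tokenized fields from the table become "text", non-tokenized become "keyword".
MVC_CONTENT_MAPPING = {
    "mappings": {
        "dynamic": "strict",  # strict mapping, per the Design section
        "properties": {
            # core metadata
            "name": {"type": "text"},
            "description": {"type": "text"},
            "mimeType": {"type": "keyword"},
            "contentType": {"type": "keyword"},
            "resourceType": {"type": "keyword"},
            "artifactUrl": {"type": "keyword"},
            "streamingUrl": {"type": "keyword"},
            "previewUrl": {"type": "keyword"},
            "downloadUrl": {"type": "keyword"},
            "framework": {"type": "keyword"},
            "board": {"type": "text"},
            "medium": {"type": "text"},
            "subject": {"type": "text"},
            "gradeLevel": {"type": "text"},
            "keywords": {"type": "text"},
            # source metadata
            "source": {"type": "text"},
            # ml metadata
            "ml_level1Concepts": {"type": "text"},
            "ml_level2Concepts": {"type": "text"},
            "ml_level3Concepts": {"type": "text"},
            "ml_contentText": {"type": "text"},
            "ml_keywords": {"type": "text"},
            "ml_content_text_vector": {"type": "dense_vector", "dims": 768},
            "label": {"type": "text"},
        },
    }
}
```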
Content-service
We can reuse the existing Content Search API code for MVC Search via one of two approaches.
Create a new route for MVC reuse in the existing API.
Pros
All the existing utilities and dependencies can be reused.
Manageability is easier, since both Search APIs are part of one project.
Cons
It could impact performance, though the impact is likely very small.
Deployment of Diksha will impact the Vidyadaan application as well.
The ES version has to be the same for both Diksha and Vidyadaan.
Create a new API for MVC reuse
Pros
No impact on existing Diksha Search Service
No deployment dependency, as the two are separate services.
The latest ES version can be used for Vidyadaan.
Cons
Maintainability would be an issue at both the code and DB level.
API Spec:
Request:
HTTP Verb: POST
Header Parameters
Content-Type: “application/json“
Authorization: “Bearer <auth-token>“
Request Parameters:
mode: soft/hard
filters
softConstraints
vector - search
{
"request": {
"mode": "explore",
"filters": {
"medium": [
"Telugu"
],
"gradeLevel": [
"Class 4",
"Class 5",
"Class 6"
],
"status": [
"Live"
],
"textbookName": [
"Science"
],
"level1Name": [
"Sorting Materials Into Groups"
],
"level1Concept": [
"Materials"
],
"level2Name": [
"Objects Around Us"
],
"level2Concept": [
"Various Objects"
]
}
}
}
Response
{
"id": "ekstep.mvc-composite-search.search",
"ver": "1.0",
"ts": "2020-05-21T22:23:43Z",
"params": {
"resmsgid": "c1658c85-e0a1-41ed-bd9a-72df223f505d",
"msgid": null,
"err": null,
"status": "successful",
"errmsg": null
},
"responseCode": "OK",
"result": {
"count": 3,
"content": [
{
"organisation": [
"Vidya2"
],
"channel": "sunbird",
"framework": "NCF",
"board": "State(Tamil Nadu)",
"subject": "English",
"medium": [
"Telugu"
],
"gradeLevel": [
"Class 4",
"Class 5",
"Class 6"
],
"name": "15_April_ETB",
"description": "Enter description for TextBook",
"language": [
"English"
],
"appId": "dev.dock.portal",
"contentEncoding": "gzip",
"identifier": "do_113025640118272000173",
"node_id": 5244,
"nodeType": "DATA_NODE",
"mimeType": "application/vnd.ekstep.content-collection",
"resourceType": "Book",
"contentType": [
"TextBook"
],
"objectType": "Content",
"textbookName": [
"Science"
],
"level1Name": [
"Sorting Materials Into Groups"
],
"level1Concept": [
"Materials"
],
"level2Name": [
"Objects Around Us"
],
"level2Concept": [
"Various Objects"
]
}
]
}
}
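The request parameters above (mode, filters, softConstraints) could be translated into an Elasticsearch bool query along these lines: in hard mode every filter must match as a filter clause, while soft constraints become boosted should-clauses that rank rather than exclude results. This is an illustrative sketch; the soft-constraint input shape (values plus boost per field) is an assumption:

```python
def build_search_query(filters, soft_constraints=None, mode="hard"):
    """Translate the MVC search request into an Elasticsearch bool query.
    filters: {field: [values]} applied as exact filter clauses.
    soft_constraints: {field: ([values], boost)} applied as boosted
    should-clauses when mode is "soft"."""
    query = {
        "bool": {
            "filter": [{"terms": {field: values}} for field, values in filters.items()]
        }
    }
    if mode == "soft" and soft_constraints:
        query["bool"]["should"] = [
            {"terms": {field: values, "boost": boost}}
            for field, (values, boost) in soft_constraints.items()
        ]
    return {"query": query}
```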
ML Workbench API:
Request:
POST /daggit/submit
{
"request":{
"input":{
"APP_HOME": "/daggit_home/content_reuse",
"content":[{
"subject": "Science",
"downloadUrl": "https://ntpproductionall.blob.core.windows.net/ntp-content-production/ecar_files/do_312533255910883328118977/muulyvaan-yogdaankrtaa-10th-vijnyaan_1532071348607_do_312533255910883328118977_2.0.ecar",
"language": ["English"],
"mimeType": "application/vnd.ekstep.ecml-archive",
"objectType": "Content",
"gradeLevel": ["Class 10"],
"artifactUrl": "https://ntpproductionall.blob.core.windows.net/ntp-content-production/content/do_312533255910883328118977/artifact/1529938853455_do_312533255910883328118977.zip",
"contentType": "Resource",
"identifier": "do_312533255910883328118977",
"graph_id": "domain",
"nodeType": "DATA_NODE",
"node_id": 575061},
{...},...]
},
"job":"diksha_content_keyword_tagging"
}
}
Response:
Text Vectorisation API:
Request:
Response:
action = getContentVec
MVC Content Create API
Request:
To create MVC content using JSON
To create MVC content using Excel
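For the JSON variant, a request body might look like the following. This is a hypothetical example: the "request"/"content" envelope mirrors the other request bodies in this document but is an assumption, while the field names come from the standard excel format and the Auto Create parameter list above:

```python
# Hypothetical request body for the JSON variant of mvc-content-create.
# Values are illustrative samples drawn from elsewhere in this document.
sample_request = {
    "request": {
        "content": {
            "board": "State (Tamil Nadu)",
            "gradeLevel": ["Class 6"],
            "subject": "Science",
            "medium": ["English"],
            "textbookname": "Science",
            "level1name": "Sorting Materials Into Groups",
            "level1concept": "Materials",
            "level2name": "Objects Around Us",
            "level2concept": "Various Objects",
            "label": "MVC",
            "source": "Diksha",
            "sourceurl": "https://diksha.gov.in/play/content/do_312533255910883328118977",
        }
    }
}
```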
Response:
MVC Processor Samza Job - Event JSON
Stage 1: This job is triggered from the Publish Pipeline if the label is “MVC“.
Stage 2: This event is triggered by the ML Keyword Extraction API.
Stage 3: This event is triggered by the ML Vectorization API.
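The three stages could be dispatched by a small router inside the job. A minimal sketch, assuming the event JSON carries "stage" and "label" fields (both assumptions, since the event schema is not finalised here):

```python
def route_event(event):
    """Route an mvc-processor event to a stage handler.
    The "stage" and "label" field names are assumptions about the event JSON."""
    stage = event.get("stage", 1)
    if stage == 1:
        # Publish Pipeline trigger: only process content labelled MVC
        if event.get("label") != "MVC":
            return "skipped"
        return "index_metadata"   # write metadata to MVC ES and Cassandra
    if stage == 2:
        return "store_keywords"   # keywords from the ML Keyword Extraction API
    if stage == 3:
        return "store_vector"     # embedding from the ML Vectorization API
    raise ValueError(f"unknown stage: {stage}")
```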
MVC Cassandra Table Modification