Introduction
This wiki explains the current design and implementation of collection tracking and monitoring, the challenges we face at scale, and the proposed design to handle them.
Background & Problem Statement
The Sunbird platform supports collection tracking and monitoring. It uses the APIs below to capture content tracking data, generate progress and score metrics, and provide the summary.
Content State Update API - To capture content progress and submit assessments.
Content State Read API - To read an individual content's consumption and assessment attempt status.
Enrolment List API - To access all the enrolment metrics of a given user.
The Content State Update API captures the content progress and assessment data. It generates events for score and overall progress computation by the activity-aggregator and assessment-aggregator jobs.
We have a single API (Content State Update) to capture all the tracking information. So it contains complex logic to identify whether a given input is a content progress update, an assessment submission, and so on.
In the end, all that the clients and report jobs need is the following map for every collection:
```
content_status = { "<content_id>": <status> }

// For example:
content_status = {
  "do_1234": 2, // Completed
  "do_1235": 2, // Completed
  "do_1236": 2, // Completed
  "do_1237": 1, // In Progress
  "do_1238": 0  // Not started
}
```
Key Design Problems:
Single API to capture all the tracking data.
Read after write of consumption data and basic summary.
Data is written to and fetched from multiple tables, leading to consistency issues between the API and the reporting jobs.
The low-level tables (user_content_consumption & user_assessments) grow at an exponential rate once we start tracking everything.
Archiving old data is not possible, as the APIs read data from the low-level tables.
Design
To handle the above design problems, we analyzed how similar products (like Netflix) track everything a user does (or views), and at scale. Based on the analysis, we have broken the APIs down into more granular APIs, each with a single DB update, so that each API can be scaled independently. In addition, we have designed the APIs as a general-purpose service (similar to the asset service/graph engine) onto which various use-cases can be mapped.
Following are a few of the Cassandra scale issues with various approaches:
Use a map datatype - We could have used a map datatype and updated the content status via the API. But this would result in multiple SSTables (one per addition) and tombstones (on updates & deletes). As this is the most heavily used API and the table would have billions of records, reads would slow down drastically and the entire cluster would be affected. We would have been forced to run compaction at regular intervals.
Use a frozen map datatype - With a frozen map datatype we would avoid multiple SSTable lookups and tombstones, but the API would not be able to append/add to the map; it must always replace the whole map. This would fail if there are two concurrent write requests for a user (which can happen if data is stored offline and synced to the server): only one write would succeed.
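The concurrency problem with whole-map replacement can be made concrete with a small sketch. This is an illustration only, with plain dicts standing in for the Cassandra row and the Redis hash; the function names are not actual service code.

```python
# Sketch: why whole-map replacement (frozen-map semantics) loses concurrent
# writes, while per-key merges (HSET-style) do not. The "store" dict is a
# stand-in for the DB row / cache hash; all names are illustrative.

def replace_map(store, key, new_map):
    # Frozen-map style: the whole value is overwritten.
    store[key] = new_map

def merge_keys(store, key, partial):
    # HSET-style: only the supplied fields are updated.
    store.setdefault(key, {}).update(partial)

# Two concurrent requests for the same user both read the same initial state
# and then write back.
initial = {"do_1234": 2}

# Full replacement: the second write clobbers the first.
store_a = {"user1": dict(initial)}
replace_map(store_a, "user1", {**initial, "do_1235": 1})   # request 1
replace_map(store_a, "user1", {**initial, "do_1236": 2})   # request 2
# => the do_1235 progress is lost

# Per-key merge: both updates survive regardless of order.
store_b = {"user1": dict(initial)}
merge_keys(store_b, "user1", {"do_1235": 1})               # request 1
merge_keys(store_b, "user1", {"do_1236": 2})               # request 2
```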
We have worked around the Cassandra scaling issues (reads spanning more than 2 SSTables, and tombstones) and the read-after-write scenario by placing a high-performance cache at the center. With this approach:
The update APIs update the content status in the low-level table and update the content status map in Redis (using HMSET). Concurrent requests are not a problem.
The read API reads the content status directly from Redis. Meanwhile, the content_status is updated in a frozen map field by the activity-aggregator job.
The job serializes requests per user and reads the status from the low-level table before computing the overall content_status. This ensures consistency between the API and reporting.
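The cache-centred flow above can be sketched as follows. Plain dicts stand in for the Redis hash and the low-level Cassandra table, and all function and key names are illustrative assumptions, not the actual service code.

```python
# Sketch of the cache-centred write/read flow. A dict stands in for the Redis
# hash (one hash per user/collection/context) and a list for the low-level
# user_content_consumption table; names are illustrative.

cassandra_ucc = []   # low-level consumption rows (append-only in this sketch)
redis_cache = {}     # cache_key -> {content_id: status}

def cache_key(user_id, collection_id, context_id):
    return f"{user_id}:{collection_id}:{context_id}"

def update_content_state(user_id, collection_id, context_id, content_id, status):
    # 1. Write the raw consumption row to the low-level table.
    cassandra_ucc.append((user_id, collection_id, context_id, content_id, status))
    # 2. Merge the single field into the cached content_status map
    #    (HMSET-like), so concurrent requests for different contents
    #    never conflict.
    redis_cache.setdefault(
        cache_key(user_id, collection_id, context_id), {})[content_id] = status

def read_content_state(user_id, collection_id, context_id):
    # Reads are served straight from the cache: read-after-write consistency
    # without touching the low-level table.
    return redis_cache.get(cache_key(user_id, collection_id, context_id), {})

update_content_state("user1", "course1", "batch1", "do_1234", 2)
update_content_state("user1", "course1", "batch1", "do_1237", 1)
print(read_content_state("user1", "course1", "batch1"))
# {'do_1234': 2, 'do_1237': 1}
```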
Content Consumption APIs
Assessment Consumption APIs
Viewer Service
The Viewer Service collects “content view updates” and generates events to process them and provide a summary to the users.
When a user starts viewing a content, a view entry is created. There are three stages when a user views a content: start, progress and end. For these three stages we have 3 API endpoints, one to capture the information at each stage.
An event is generated when a content view ends. The summary computation jobs read these events to process and compute the overall summary of the collection.
The computed summary is available from the API interface to download and view.
Summary Computation Jobs - Flink:
The Flink jobs read and compute the summary of a collection's consumption progress when a user's view ends. They also compute the score for the current view and the best score across all previous views.
The event is just a trigger to initiate the computation of the collection progress. The job fetches the raw data from the DB to compute the overall progress.
When an assessment-type content (e.g. QuestionSet) view ends, the ASSESS event data is expected to be submitted to the assessment submit API for score metrics computation.
Once the view ends, the progress and score are updated asynchronously by the Flink jobs.
Viewer Service - Database Design
```
user_content_consumption (
    userid text,
    collectionid text,          // currently labelled as courseid
    contextid text,             // currently labelled as batchid
    contentid text,
    last_access_time timestamp,
    last_completed_time timestamp,
    last_updated_time timestamp,
    progressdetails json,
    status int,
    PRIMARY KEY (userid, collectionid, contextid, contentid)
)

assessment_aggregator (
    user_id text,               // currently labelled as user_id
    collection_id text,         // currently labelled as courseid
    context_id text,            // currently labelled as contextid
    content_id text,
    attempt_id text,
    created_on timestamp,
    grand_total text,
    last_attempted_on timestamp,
    questions list<frozen<question>>,
    total_max_score double,
    total_score double,
    updated_on timestamp,
    PRIMARY KEY ((user_id, collection_id), context_id, content_id, attempt_id)
)

user_activity_agg (
    activity_type text,
    activity_id text,
    user_id text,
    context_id text,
    agg map<text, int>,
    content_status frozen<map<text, int>>,
    agg_last_updated map<text, timestamp>,
    PRIMARY KEY ((activity_type, activity_id, user_id), context_id)
)
```
A user can consume a content by searching for it on our platform (organically) or via a collection after enrolling in a course.
With Viewer-Service, we will also support tracking individual content consumption. The details below explain how the data is stored for content consumption in different scenarios.
The scenarios below cover current and future use cases. For each, we define the database read/write logic to save or fetch the required data from the user_content_consumption table.
Scenario 1: User consuming an individual content.
- Write API request: `{ userid: "<userid>", contentid: "<contentid>" }`
- Write DB query: `INSERT into ucc(userid, collectionid, contextid, contentid, status) values('<userid>','<contentid>','<contentid>','<contentid>','<status>')`
- Read API request: `{ userid: "<userid>", contentid: "<contentid>" }`
- Read DB query: `from ucc where userid='<userid>' and collectionid='<contentid>' and contextid='<contentid>' and contentid='<contentid>'`

Scenario 2: User consuming a content within a collection.
- Write API request: `{ userid: "<userid>", collectionid: "<collectionid>", contentid: "<contentid>" }`
- Write DB query: `INSERT into ucc(userid, collectionid, contextid, contentid, status) values('<userid>','<collectionid>','<collectionid>','<contentid>','<status>')`
- Read API request: `{ userid: "<userid>", collectionid: "<collectionid>" }`
- Read DB query: `from ucc where userid='<userid>' and collectionid='<collectionid>' and contextid='<collectionid>'`

Scenario 3: User consuming a content within a collection with a context (a batch, a program or a program batch).
- Write API request: `{ userid: "<userid>", collectionid: "<collectionid>", contextid: "<contextid>", contentid: "<contentid>" }` (contextid is a batchId, programId or programBatchId)
- Write DB query: `INSERT into ucc(userid, collectionid, contextid, contentid, status) values('<userid>','<collectionid>','<contextid>','<contentid>','<status>')`
- Read API request: `{ userid: "<userid>", collectionid: "<collectionid>", contextid: "<contextid>" }` (contextid is a batchId, programId or programBatchId)
- Read DB query: `from ucc where userid='<userid>' and collectionid='<collectionid>' and contextid='<contextid>'`
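The three scenarios differ only in how the missing key parts are filled in: when collectionid is absent it falls back to contentid, and when contextid is absent it falls back to collectionid. A minimal sketch of that derivation (function name is illustrative):

```python
# Sketch of the primary-key derivation implied by the three scenarios above:
# collectionid defaults to contentid (standalone content) and contextid
# defaults to collectionid (collection without a batch/program context).

def derive_ucc_key(userid, contentid, collectionid=None, contextid=None):
    collectionid = collectionid or contentid   # scenario 1: standalone content
    contextid = contextid or collectionid      # scenario 2: collection, no context
    return (userid, collectionid, contextid, contentid)

# Scenario 1: individual content
assert derive_ucc_key("u1", "c1") == ("u1", "c1", "c1", "c1")
# Scenario 2: content within a collection
assert derive_ucc_key("u1", "c1", collectionid="col1") == ("u1", "col1", "col1", "c1")
# Scenario 3: content within a collection and a context (batch/program)
assert derive_ucc_key("u1", "c1", collectionid="col1", contextid="b1") == ("u1", "col1", "b1", "c1")
```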
Config Modes
While the viewer-service, as an infra component, provides the flexibility to configure any tracking use-case, there is a need for preconfigured “modes” enabled at an instance level, to make better use of the service and to make any solution or use-case simpler to implement and understand.
Following are the modes in which the viewer service can be configured at an instance level:
Strict Context Mode
Any content or collection consumption is strictly tracked in the context in which it was consumed. This is the default mode currently set up in SunbirdEd.
Rahul completes content “single-digit-addition” as part of batch “batch-1“ within the course “class-1-maths”
Entry in DB
user_id | collection_id | context_id | content_id | status |
---|---|---|---|---|
Rahul | class-1-maths | batch-1 | single-digit-addition | 2 |
API responses for various other actions done by the user:
Action | Progress in Context |
---|---|
Rahul opens the “single-digit-addition” content after going into the course “class-1-maths” when batch-1 is active | Complete |
Rahul opens the “single-digit-addition” content after searching for it | Not Started |
batch-1 has expired and Rahul has rejoined batch-2 and opened the content “single-digit-addition” | Not Started |
Inverse: Rahul completes the “single-digit-addition” content after searching for it. Later, Rahul opens the same content via the ToC of the course “class-1-maths“ | Not Started |
Pros
Simpler logic to implement within clients, as they know the context in which the user is opening the content.
No extra actions or complex business rules needed on either client or server.
Cons
There might be cases where a program is launched with a course which the user is already part of or has completed. The user needs to redo the entire course (consumption outside of the context will not be considered).
The user experience will be complex if the user sees two courses (one within a program and another outside of it). The user might complete the course assuming it is part of the program and would probably realise later that he hasn’t taken it in the context of the program.
Multiple steps would be required for a user to reach a content within a program, making the discovery and browsing experience complex. (It may not always be possible to ensure that users ‘discover’ relevant content in the context they are supposed to.)
Carry Forward Mode
Any content or collection consumption is tracked at its own level and carried forward into any context. This is the expected mode for SunbirdCB. In this mode there are two sub-modes:
Full Carry Forward Mode
Rahul completes content “single-digit-addition” as part of batch “batch-1“ within the course “class-1-maths”
Entry in DB
user_id | collection_id | context_id | content_id | status |
---|---|---|---|---|
Rahul | single-digit-addition | single-digit-addition | single-digit-addition | 2 |
API responses for various other actions done by the user:
Action | Progress |
---|---|
Rahul opens the “single-digit-addition” content after going into the course “class-1-maths” when batch-1 is active | Complete |
Rahul opens the “single-digit-addition” content after searching for it | Complete |
batch-1 has expired and Rahul has rejoined batch-2 and opened the content “single-digit-addition” | Complete |
Rahul has joined a collection “class-2-maths” which contains the “single-digit-addition” content and opens the content | Complete |
Collection Carry Forward Mode
Rahul completes content “single-digit-addition” as part of batch “batch-1“ within the course “class-1-maths”
Entry in DB
user_id | collection_id | context_id | content_id | status |
---|---|---|---|---|
Rahul | class-1-maths | class-1-maths | single-digit-addition | 2 |
API responses for various other actions done by the user:
Action | Progress |
---|---|
Rahul opens the “single-digit-addition” content after going into the course “class-1-maths” when batch-1 is active | Complete |
Rahul opens the “single-digit-addition” content after searching for it | Not Started |
batch-1 has expired and Rahul has rejoined batch-2 and opened the content “single-digit-addition” | Complete |
Rahul has joined a collection “class-2-maths” which contains the “single-digit-addition” content and opens the content | Not Started |
Rahul has joined a program “program-abc“ which contains “class-1-maths” as one of its collections. He opens the “single-digit-addition” content within this context | Complete |
Pros
Simplicity: no extra actions or complex business rules needed on either client or server.
Straightforward, simple user experience.
Cons
Once a collection (course) has been completed, it stays complete irrespective of time period or context.
If a program wants to force all users to retake a course or content, the course or content has to be cloned, which duplicates assets on the platform.
If a content or collection is modified, it is still shown to the user as complete, which may be meaningless.
Copy Mode
In this mode, any content or collection that is tracked is copied over into a context based on the configured business rules.
Let’s assume the following business rule:
A user's content progress is copied into a course or a program if the user has completed the content in the last 3 months or within the program/course duration.
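The business rule above can be sketched as a predicate. This is only an interpretation of the rule: the function name, the record shape, and the 90-day approximation of “3 months” are all assumptions, not the actual implementation.

```python
# Sketch of the copy-mode business rule: copy a user's organic content
# progress into a course/program context only if the content was completed
# within the last ~3 months (approximated as 90 days) or within the
# context's own duration window. All names are illustrative.

from datetime import datetime, timedelta

def should_copy(completed_on, now, context_start=None, context_end=None,
                window=timedelta(days=90)):
    if completed_on is None:
        return False                      # nothing to copy
    if now - completed_on <= window:
        return True                       # completed in the last ~3 months
    if context_start and context_end:
        # completed within the program/course duration
        return context_start <= completed_on <= context_end
    return False

now = datetime(2024, 6, 1)
assert should_copy(datetime(2024, 5, 1), now)        # 1 month ago -> copy
assert not should_copy(datetime(2023, 1, 1), now)    # too old -> fresh start
```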
Copy All Mode
Rahul completes content “single-digit-addition”. “single-digit-addition” is added as part of course “class-1-maths”. Rahul joins the batch-1 of course “class-1-maths”. Rahul also completes “double-digit-addition” as part of “class-1-maths“. Rahul hasn’t started “triple-digit-addition“ as part of the course.
Entry in DB
user_id | collection_id | context_id | content_id | status |
---|---|---|---|---|
Rahul | single-digit-addition | single-digit-addition | single-digit-addition | 2 |
Rahul | class-1-maths | batch-1 | single-digit-addition | 2 (copied) |
API responses for various other actions done by the user:
Action | Progress |
---|---|
Rahul opens the “single-digit-addition” content after going into the course “class-1-maths” when batch-1 is active | Complete |
Rahul opens the “single-digit-addition” content after searching for it | Complete |
batch-1 has expired and Rahul has rejoined batch-2 after 1 month and opened the content “single-digit-addition” | Complete |
batch-1 has expired and Rahul has rejoined batch-2 after 3 months and opened the content “single-digit-addition” | Not Started |
Rahul opens the “double-digit-addition” content after searching for it | Not Started |
Rahul completes “triple-digit-addition” organically, and then opens the content as part of the course “class-1-maths” | Complete |
Pros
Opens up many complex use-cases
UX can be made simpler. A content or collection can be shown only once
Cons
The complexity on the server side grows enormously. There need to be multiple asynchronous jobs, such as:
Copy the progress after evaluating the business rules on actions like “enroll”
Copy the progress continuously whenever a content/collection is taken outside the context but within the acceptable business rules
Explainability of the data becomes very complex. Debugging any issue would be a nightmare, and support would be bombarded with explainability requests.
UX would become complex.
User would be confused why a progress is complete or not complete
If a collection is part of multiple programs, usability will be very challenging if the collection doesn’t satisfy the business rules
UX should be designed specifically for it
There are other use-cases like if the collection is complete and added to learner passbook, would the program also add another entry or how would the learner passbook experience be?
Move Mode
In this mode, any collection that is tracked is moved into a context based on the configured business rules. Move mode ensures that there is only one copy of progress for the active collection (open batch or active batch). It does not support moving individual content progress.
Rahul completes content “single-digit-addition” as part of batch “batch-1“ within the course “class-1-maths”. He now enrols into a program “Program 1“ which also has the course “class-1-maths“.
The program has set the following business rule: a user's collection progress is moved into a program if the user has an on-going batch for the same collection.
Action | Progress |
---|---|
Rahul opens the “single-digit-addition” content after going into the course “class-1-maths” when batch-1 is active | Complete |
Rahul opens the “single-digit-addition” content after searching for it | Not Started |
batch-1 has expired and Rahul has rejoined batch-2 and opened the content “single-digit-addition” | Not Started |
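The move rule can be sketched as a rewrite of the context on matching rows, which is what guarantees a single copy of the progress afterwards. The record shapes and names below are illustrative assumptions, not the actual job code.

```python
# Sketch of the move-mode rule: collection progress is moved into a program
# only when the user has an on-going batch for that collection. Moving means
# rewriting context_id, so exactly one copy of the progress remains.

def move_progress(db_rows, user_id, collection_id, program_id, ongoing_batches):
    moved = []
    for row in db_rows:
        if (row["user_id"] == user_id
                and row["collection_id"] == collection_id
                and row["context_id"] in ongoing_batches):
            # Rewrite the context from the batch to the program.
            moved.append({**row, "context_id": program_id})
        else:
            moved.append(row)
    return moved

rows = [{"user_id": "rahul", "collection_id": "class-1-maths",
         "context_id": "batch-1", "content_id": "single-digit-addition",
         "status": 2}]
result = move_progress(rows, "rahul", "class-1-maths", "program-1", {"batch-1"})
```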
Pros
Opens up complex use-cases reducing the cons of the copy mode
UX can be made simpler. A content or collection can be shown only once
Cons
There needs to be a post-enrol action that evaluates the business rule and moves the progress.
Explainability of the data is still complex. Support would be bombarded with explainability requests
UX would become complex.
User would be confused why a progress is complete or not complete or where his earlier course is.
If a collection is part of multiple programs, usability will be very challenging if the collection doesn’t satisfy the business rules
Setting the right context is also complex. It is tough for the UI to know whether a course is part of a program; this would require inverse lookup queries or APIs, and presenting the right choice to the user.
UX should be designed specifically for it
There are other use-cases like if the collection is complete and added to learner passbook, would the program also add another entry or how would the learner passbook experience be?
Extended Enrolment Consumption:
Every new instance adopting the Sunbird platform has to select one of the 3 context modes. This allows the application to prevent users from consuming the same content more than once, based on specific predefined rules.
The mode is a one-to-one mapping per instance.
With the extended design, tracking and monitoring of user consumption can be done for any new context, like a program, an event, etc.
Following are the different modes provided to a new instance:
Scenario 1: Carry Forward Consumption
- Write request: `{ "userid": "<<userid>>", "collectionid": "<<courseid>>", "contentid": "<<contentid>>" }` (Note: progress will be captured directly under the context)
- Read request: `{ "userId": "<<userid>>", "collectionId": "<<courseid>>", "contentId": "<<contentid>>" }`

Scenario 2: Copy Forward Consumption
- Write request: `{ "userid": "<<userid>>", "collectionid": "<<courseid>>", "contextid": "<<programid>>", "contentid": "<<contentid>>" }`
- Read request: `{ "userId": "<<userid>>", "collectionId": "<<courseid>>", "contentId": "<<contentid>>" }`

Scenario 3: Strict Mode Consumption
- Write request: `{ "userid": "<<userid>>", "collectionid": "<<courseid>>", "contextid": "<<programid>>", "contentid": "<<contentid>>" }`
Content View Lifecycle:
When a user views a content in the context of a collection and batch for the first time, its start, progress-update and end triggers are processed. Revisits (2nd to nth views) of the content are ignored and do not update the DB.
Shall we enable force ‘view end’ to handle the collection progress update sync issues?
The View Start API should insert the row only if it does not already exist.
The View Update and View End APIs should update the row only if it exists.
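The conditional write rules above can be sketched as follows. A dict keyed by the primary key stands in for the table; in Cassandra these guards would map to lightweight transactions (`INSERT ... IF NOT EXISTS`, `UPDATE ... IF EXISTS`). All names are illustrative.

```python
# Sketch: View Start inserts only if the row is absent (so revisits are
# ignored); View Update / View End touch only an existing row.

rows = {}

def view_start(key):
    if key in rows:
        return False               # revisit: ignored, no DB write
    rows[key] = {"status": 1}      # 1 = In Progress
    return True

def view_update(key, progress):
    if key not in rows:
        return False               # no start recorded: ignored
    rows[key]["progress"] = progress
    return True

def view_end(key):
    if key not in rows:
        return False
    rows[key]["status"] = 2        # 2 = Completed
    return True

k = ("user1", "course1", "batch1", "do_1234")
assert view_start(k) is True
assert view_start(k) is False      # second visit does not reset progress
assert view_update(k, 50) is True
assert view_end(k) is True and rows[k]["status"] == 2
assert view_end(("u2", "c", "b", "missing")) is False
```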
Handling collection and batch dependencies:
For view start, end and update, courseId and batchId are optional. This enables tracking progress for any content that is not part of a course.
This is handled in two ways:
If collectionId and batchId are part of the request, then individual content progress and overall collection progress are captured and computed.
If only userId and contentId are present, progress is captured only for that content.
Handling Collection Data types in DB:
With normal collection types, the map values get distributed across multiple SSTables on append, which can lead to read-latency issues.
To handle this, we will use frozen collection types, which help avoid tombstones and multi-SSTable reads.
Current vs New (Viewer-Service) APIs:
We need to continue supporting the current APIs before deprecating and deleting them. So both sets of APIs have to work side by side, with backward compatibility.
Enhance the current APIs to read the summary from the aggregate table.
Enhance the below APIs to read the progress and score metrics from the user_activity_agg table:
Enrolment List API.
Content State Read API.
One time Data migration:
The content status and score metrics data should be migrated to the user_activity_agg table from the user_enrolments and assessment_agg tables for all existing enrolment records.
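The one-time migration can be sketched as a fold of the two source tables into aggregate rows. The row shapes, field names, and the score-key convention below are simplified assumptions for illustration, not the actual migration job.

```python
# Sketch of the one-time migration: for each existing enrolment, carry the
# content status map into the user_activity_agg row (the frozen map) and
# fold the best score per content from the assessment aggregate rows.
# All record shapes and names are illustrative.

def migrate(enrolments, assessments):
    agg_rows = []
    for e in enrolments:
        # Best score per content across all attempts for this enrolment.
        scores = {}
        for a in assessments:
            if a["user_id"] == e["user_id"] and a["collection_id"] == e["course_id"]:
                cid = a["content_id"]
                scores[cid] = max(scores.get(cid, 0.0), a["total_score"])
        agg_rows.append({
            "activity_type": "Course",
            "activity_id": e["course_id"],
            "user_id": e["user_id"],
            "context_id": e["batch_id"],
            "content_status": dict(e["content_status"]),   # becomes the frozen map
            "agg": {f"score:{cid}": s for cid, s in scores.items()},
        })
    return agg_rows

enrolments = [{"user_id": "u1", "course_id": "c1", "batch_id": "b1",
               "content_status": {"do_1": 2}}]
assessments = [{"user_id": "u1", "collection_id": "c1", "content_id": "do_1",
                "total_score": 6.0},
               {"user_id": "u1", "collection_id": "c1", "content_id": "do_1",
                "total_score": 8.0}]
out = migrate(enrolments, assessments)
```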
API Spec
Content View Start
Content View Update
Content Submit Assess
Content View End
Content View Read
Content Assessment Read
Viewer Summary - All enrolments
Viewer Summary - Specific enrolment
Viewer Summary Delete
Viewer Summary Download
Conclusion:
<TODO>