Enable dynamic updates of learners' course progress

Introduction

This wiki explains the current design of course batches, the problems with it, and the proposed changes to the design and implementation to resolve these issues and handle scale.

Background & Problem statement:

In the current state, the course data stored in the LMS service is not reactive to changes in the course content. This creates issues when the content changes but Sunbird keeps using its locally stored copy. For example, once a user has completed the full course, if any unit is added to or removed from the course, the change is not dynamically reflected in the user's progress. There are use cases that require this behaviour.

SB-12089



Key Design Problems:

1. An easy way to maintain course details for a batch.
2. Better handling of course progress computation.
3. Easy retrieval of data for a given course, batch, or user.


Design:

Managing Course Details:

Batches are created for a course. Course details are mostly constant, but the data changes whenever the creator republishes the course. So copying the course details and saving them along with course_batch and user_courses is not a good approach.

Option 1: Maintain a cache with a TTL (sketched below)

Option 2: Always fetch it from the source
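
A minimal sketch of Option 1, assuming a hypothetical CourseServiceClient.readCourse(courseId) call that fetches course metadata from the content service; the class name, TTL handling, and map-based payload are illustrative only, not the actual implementation.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of Option 1: keep course details in an in-memory cache with a TTL so
// that a republished course is picked up after at most ttlMillis.
public class CourseDetailCache {

    // Hypothetical client; in the real service this would call the course read API.
    public interface CourseServiceClient {
        Map<String, Object> readCourse(String courseId);
    }

    private static final class Entry {
        final Map<String, Object> details;
        final long loadedAt;
        Entry(Map<String, Object> details, long loadedAt) {
            this.details = details;
            this.loadedAt = loadedAt;
        }
    }

    private final ConcurrentHashMap<String, Entry> cache = new ConcurrentHashMap<>();
    private final CourseServiceClient client;
    private final long ttlMillis;

    public CourseDetailCache(CourseServiceClient client, long ttlMillis) {
        this.client = client;
        this.ttlMillis = ttlMillis;
    }

    public Map<String, Object> getCourseDetails(String courseId) {
        long now = System.currentTimeMillis();
        Entry entry = cache.get(courseId);
        if (entry == null || now - entry.loadedAt > ttlMillis) {
            // Cache miss or expired entry: re-fetch from the source of truth.
            entry = new Entry(client.readCourse(courseId), now);
            cache.put(courseId, entry);
        }
        return entry.details;
    }
}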

Solution Approach

A complete solution can be divided into the points below:

  1. Remove the course data stored in the DB (e.g. leafNodeCounts, courseName, tocUrl, etc.)
  2. Restructure the DB tables and the corresponding ES mappings.
  3. Migrate the old data to the new structure.
  4. Make code changes in accordance with the restructured data model.
  5. Change the sync process of user progress updates to an async process with the help of Kafka messaging (a sketch follows this list).
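
A minimal sketch of point 5, assuming a hypothetical topic name "lms.user.course.progress" and a hand-built JSON payload; the real topic, event schema, and serializers would be defined during implementation.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Sketch: instead of updating user progress synchronously in the API call,
// publish a progress event to Kafka and let a consumer job update the tables.
public class ProgressEventPublisher {

    private static final String TOPIC = "lms.user.course.progress"; // hypothetical topic name

    private final KafkaProducer<String, String> producer;

    public ProgressEventPublisher(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        this.producer = new KafkaProducer<>(props);
    }

    public void publishProgress(String userId, String batchId, String courseId,
                                String contentId, int status) {
        // Key by userId:batchId so all events for one enrolment land on the same
        // partition and are consumed in order.
        String key = userId + ":" + batchId;
        String event = String.format(
            "{\"userId\":\"%s\",\"batchId\":\"%s\",\"courseId\":\"%s\",\"contentId\":\"%s\",\"status\":%d}",
            userId, batchId, courseId, contentId, status);
        producer.send(new ProducerRecord<>(TOPIC, key, event));
    }
}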


Current DB structure related to course batch progress

course_batch  (ES → cbatch)

| column | type | description | used | indexed | can be removed |
|---|---|---|---|---|---|
| id | text | batch id | | PK | |
| countdecrementdate | text | | | | |
| countdecrementstatus | boolean | | | | |
| countincrementdate | text | | | | |
| countincrementstatus | boolean | | | | |
| coursecreator | text | | | Y | |
| courseid | text | | | Y | |
| createdby | text | | | | |
| createddate | text | | | | |
| createdfor | list<text> | | | | |
| description | text | | | | |
| enddate | text | | | | |
| enrollmenttype | text | | | Y | |
| hashtagid | text | | | | |
| mentors | list<text> | | | | |
| name | text | | | | |
| participant | map<text, boolean> | | | | |
| startdate | text | | | | |
| status | int | | | Y | |
| updateddate | text | | | | |




content_consumption

| column | type | description | used | indexed | can be removed |
|---|---|---|---|---|---|
| id | text | hash of userid, batchid, courseid and contentid | | PK | |
| batchid | text | | | | |
| completedcount | int | how many times the content was completed | | | |
| contentid | text | | | Y | |
| contentversion | text | | | | |
| courseid | text | | | Y | |
| datetime | timestamp | | | | |
| grade | text | | N | | Y |
| lastaccesstime | text | | | | |
| lastcompletedtime | text | | | | |
| lastupdatedtime | text | | | | |
| progress | int | progress pushed | | | |
| result | text | | N | | Y |
| score | text | | N | | Y |
| status | int | 1 for progress, 2 for completed | | Y | |
| userid | text | | | Y | |
| viewcount | int | how many times the user viewed the content | | | |


user_courses (ES → usercourses)

| column | type | description | used | indexed | can be removed |
|---|---|---|---|---|---|
| id | text | hash of userid, batchid and courseid | | PK | |
| active | boolean | turns false when the user unenrolls | | | |
| addedby | text | | | | |
| batchid | text | | | Y | |
| completedon | timestamp | | | | |
| contentid | text | | | | |
| courseid | text | | | Y | |
| courselogourl | text | | | | Y |
| coursename | text | | | Y | Y |
| datetime | timestamp | | | | |
| delta | text | | N | | Y |
| description | text | | | | Y |
| enrolleddate | text | | | | |
| grade | text | | N | | Y |
| lastreadcontentid | text | | | | |
| lastreadcontentstatus | int | | | | |
| leafnodescount | int | | | | Y |
| progress | int | count of content completed | | | |
| status | int | 1 for progress, 2 for completed | | Y | |
| tocurl | text | | | | Y |
| userid | text | | | | |



DB Changes approach 1

Remove the unused columns from user_courses and move content_consumption as a map into the user_courses table.

Steps involved

  1. We cannot change the primary key of a table in Cassandra, hence we need to copy the data out, drop the table, recreate it with the new schema, and load the old data back:
-- Export the columns we keep from the existing table
COPY sunbird.user_courses (batchid, userid, courseid, contentid, active, addedby, datetime, enrolleddate, lastreadcontentid, lastreadcontentstatus, status, completedon) TO '../files/usercourses.csv' WITH HEADER = TRUE;

-- Drop and recreate the table with the new primary key and the contentconsumption map
DROP TABLE sunbird.user_courses;

CREATE TABLE IF NOT EXISTS sunbird.user_courses (
    batchid text, userid text, courseid text, contentid text, active boolean,
    addedby text, datetime timestamp, enrolleddate text, lastreadcontentid text,
    lastreadcontentstatus int, status int, completedon timestamp,
    contentconsumption map<text, frozen<map<text, text>>>,
    PRIMARY KEY (batchid, userid));

-- Secondary indexes for the common lookup paths
CREATE INDEX inx_ucs_userid ON sunbird.user_courses (userid);
CREATE INDEX inx_ucs_courseid ON sunbird.user_courses (courseid);
CREATE INDEX inx_ucs_status ON sunbird.user_courses (status);
CREATE INDEX inx_ucs_content_consumption ON sunbird.user_courses (contentconsumption);

-- Load the exported data back into the recreated table
COPY sunbird.user_courses (batchid, userid, courseid, contentid, active, addedby, datetime, enrolleddate, lastreadcontentid, lastreadcontentstatus, status, completedon) FROM '../files/usercourses.csv' WITH HEADER = TRUE;

2. Run a migration to copy data from content_consumption into the contentconsumption column of user_courses as a map of maps (this can be a Talend ETL job).

3. Make the code changes corresponding to the new DB structure.

4. Calculate progress from the contentconsumption column of user_courses (a sketch follows these steps).

5. Make the corresponding mapping changes in ES.
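
A minimal sketch of how step 4 could work once content_consumption lives inside user_courses as a map of maps (content id → {field, value}); the "status" field name and the completed value 2 follow the status convention in the tables above, everything else is illustrative.

import java.util.Map;

// Sketch for approach 1: derive the user's progress from the contentconsumption
// column of user_courses, i.e. a map<contentId, map<field, value>>.
public class ProgressCalculator {

    private static final String STATUS_FIELD = "status";
    private static final int STATUS_COMPLETED = 2; // 1 = in progress, 2 = completed

    // Returns the number of contents the user has completed in this enrolment.
    public static int completedCount(Map<String, Map<String, String>> contentConsumption) {
        int completed = 0;
        for (Map<String, String> content : contentConsumption.values()) {
            String status = content.get(STATUS_FIELD);
            if (status != null && Integer.parseInt(status) == STATUS_COMPLETED) {
                completed++;
            }
        }
        return completed;
    }
}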

Pros

  1. All the user progress and content consumption data is in the same row, which removes ambiguity.
  2. Easier to fetch data by batchid (the partition key), or by batchid and userid.
  3. A migration of data from content_consumption to user_courses will be needed.

Cons

  1. The content/state/read call is hampered, as there is no way to select individual map entries in a query; the filtering of contents has to be done in Java, which is an overhead.
  2. Related to the above point, secondary indexes and data centralization are anti-patterns in Cassandra.

DB Changes approach 2

This approach requires minimal DB changes and no migration: all user progress information is held in user_courses itself, with progress stored as a map<text, int> that records whether a particular content is in progress or completed. Hence there is no reliance on content_consumption for progress computation.

Any call that updates the user progress will be forwarded to both user_courses and content_consumption. All user course progress would be read from user_courses, whereas content_consumption would hold the data as an audit trail and serve the content/state/read call.

Currently we first put data into content_consumption and then, based on it, make the entry in user_courses, which is causing problems.
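
A minimal sketch of the dual write described above, assuming the DataStax Java driver (3.x), a user_courses table keyed by userid with batchid/courseid as clustering keys, and a progress map<text, int> column; the exact DDL is not defined here, and the content_consumption id shown is a placeholder for the real hash of userid, batchid, courseid and contentid.

import java.util.Date;
import com.datastax.driver.core.Session;

// Sketch for approach 2: every progress update is written to both tables.
// user_courses keeps the per-content status in the progress map for reads,
// content_consumption keeps the full record for audit and content/state/read.
public class ProgressUpdater {

    private final Session session;

    public ProgressUpdater(Session session) {
        this.session = session;
    }

    public void updateProgress(String userId, String courseId, String batchId,
                               String contentId, int status) {
        // 1. Update the per-content status map in user_courses.
        session.execute(
            "UPDATE sunbird.user_courses SET progress[?] = ? " +
            "WHERE userid = ? AND courseid = ? AND batchid = ?",
            contentId, status, userId, courseId, batchId);

        // 2. Record the same update in content_consumption as audit data.
        //    The id below is a placeholder; the real id is a hash of the four fields.
        session.execute(
            "INSERT INTO sunbird.content_consumption " +
            "(id, userid, courseid, batchid, contentid, status, lastupdatedtime) " +
            "VALUES (?, ?, ?, ?, ?, ?, ?)",
            userId + ":" + courseId + ":" + batchId + ":" + contentId,
            userId, courseId, batchId, contentId, status, new Date().toString());
    }
}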

Pros

  1. content/state/read would be efficient with userid as the partition key and batchid/courseid as clustering keys.
  2. No migration needed.
  3. user_courses will hold sufficient information to show a user's current progress.


Notes

→ Since Sunbird will no longer store course info, all course info would be fetched at runtime from KP.

→ A user's current progress and updates will be calculated whenever the API call happens, hence the response always reflects the updated state.

→ Since Sunbird would no longer hold course details (like leafNodesCount), all the background processes configured by the analytics team would need to fetch the latest course details and calculate the progress based on the stored progress (a sketch follows).
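
A minimal sketch of this read-time calculation, assuming the count of completed contents is stored with the enrolment and leafNodesCount is fetched from KP on every call; the method and constant names are illustrative.

// Sketch: since Sunbird no longer stores leafNodesCount, the completion
// percentage and status are derived at read time from the stored count of
// completed contents and the latest leafNodesCount fetched from KP.
public class ProgressView {

    public static final int STATUS_IN_PROGRESS = 1;
    public static final int STATUS_COMPLETED = 2;

    public static double completionPercent(int completedContents, int leafNodesCount) {
        if (leafNodesCount <= 0) {
            return 0.0;
        }
        // Cap at 100%: contents may have been removed from the course after completion.
        return Math.min(100.0, completedContents * 100.0 / leafNodesCount);
    }

    public static int status(int completedContents, int leafNodesCount) {
        return (leafNodesCount > 0 && completedContents >= leafNodesCount)
            ? STATUS_COMPLETED : STATUS_IN_PROGRESS;
    }
}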