Introduction

This wiki explains current design of course batches, problems with it and proposed changes to the design and implementation to resolve the current issues and able to handle scale.

Background & Problem statement:

In current state changes stored in lms service are not reactive to any changes in the course content. This structure creates the issue when some of the contents changes and Sunbird uses locally stored copy of the content. For example, Once the user has completed the full course, if any unit is added or removed from the course, the changes are not dynamically reflected. There are use cases to enable this.

Sunbird Design > Enable dynamic updation of learners' course progress > Untitled drawing (1).png

Key Design Problems:

1. Easy way to maintain course details for a batch.
2. Better handling of course progress computation.
3. Easy to fetch data for given course or batch or user.

Design:

Managing Course Details:

Batches are created for a course. Course details are most of the time constant but, when the creator publishes it the data will be updated. So, copying the course details and saving along with course batch and user_courses is not good.

Option 1: Maintain cache with ttl

Option 2: Always fetch it from source

Solution Approach

A complete solution can be divided into below points

remove the course data stored into db (ex. leafNodeCounts, courseName, tocUrl etc.)
Restructure the DB tables and corresponding ES
migrate the old data according to new structure
Code changes in accordance with the restructured data model
Change the sync process of user progress update to an async process with the help of kafka messaging.

Current DB structure related to course batch progress

course_batch (ES → cbatch)

column	type	description	indexed
id	text	batchid	PK
countdecrementdate	text
countdecrementstatus	boolean
countincrementdate	text
countincrementstatus	boolean
coursecreator	text		Y
courseid	text		Y
createdby	text
createddate	text
createdfor	list<text>
description	text
enddate	text
enrollmenttype	text		Y
hashtagid	text
mentors	list<text>
name	text
participant	map<text, boolean>
startdate	text
status	int		Y
updateddate	text

content_consumption

column	type	description	used	indexed	can be removed
id	text	hash of userid, batchid, courseid and contentid		PK
batchid	text
completedcount	int	how many time the content was completed
contentid	text			Y
contentversion	text
courseid	text			Y
datetime	timestamp
grade	text		N		Y
lastaccesstime	text
lastcompletedtime	text
lastupdatedtime	text
progress	int	progress pushed
result	text		N		Y
score	text		N		Y
status	int	1 for progress 2 for completed		Y
userid	text			Y
viewcount	int	how many times user viewed the content

user_courses (ES → usercourses)

column	type	description	used	indexed	can be removed
id	text	hash of userid, batchid and courseid		PK
active	boolean	turns false when user unenroll
addedby	text
batchid	text			Y
completedon	timestamp
contentid	text
courseid	text			Y
courselogourl	text				Y
coursename	text			Y	Y
datetime	timestamp
delta	text		N		Y
description	text				Y
enrolleddate	text
grade	text		N		Y
lastreadcontentid	text
lastreadcontentstatus	int
leafnodescount	int				Y
progress	int	count of content completed
status	int	1 for progress 2 for completed		Y
tocurl	text				Y
userid	text

DB Changes approach 1

Remove unused column from user_courses and move the content_consumption as a map into the user_courses table.

Steps involved

We cannot change the primary key of the table in cassandra, hence we would need to copy the data, drop the table, recreate the table with old data

COPY sunbird.user_courses (batchid,userid,courseid,contentid,active,addedby,datetime,enrolleddate,lastreadcontentid,lastreadcontentstatus,status,completedon) TO '../files/usercourses.csv' WITH HEADER = TRUE;

DROP TABLE sunbird.user_courses;

CREATE TABLE IF NOT EXISTS sunbird.user_courses(batchid text, userid text, courseId text, contentid text, active boolean, addedBy text, dateTime timestamp, enrolledDate text,lastReadContentId text,
lastReadContentStatus int,status int, completedon timestamp, contentconsumption map<text,frozen<map<text,text>>>, PRIMARY KEY (batchid,userid));

CREATE INDEX inx_ucs_userid ON sunbird.user_courses (userid);
CREATE INDEX inx_ucs_courseid ON sunbird.user_courses (courseid);
CREATE INDEX inx_ucs_status ON sunbird.user_courses (status);
CREATE INDEX inx_ucs_content_consumption ON sunbird.user_courses (contentconsumption);

COPY sunbird.user_courses (batchid,userid,courseid,contentid,active,addedby,datetime,enrolleddate,lastreadcontentid,lastreadcontentstatus,status,completedon) FROM '../files/usercourses.csv' WITH HEADER = TRUE;

2. Run a migration to copy from content_consumption to user_courses contentconsumption column as a map of map (can be a talend ETL job)

3. Code changes corresponsing to new DB structure

4. Progress can be calculated from contentconsumption column of user_courses

5. Corresponsing mapping changes in ES

Pros

all the user progress and content consumption is in the same row eradicates ambiguity.
easier to get the data based on batchid as it's the partition key, or batchid and userid.
A migration of data from content_consumption to user_courses will be needed.

Cons

the content/state/read is hampered as there is no way to output certain map component, hence the filtering of contents needs to be done in java, which is an overhead.
Related to above point, secondary indexes and data centralization is anti-pattern in cassandra.

DB Changes approach 2

This approach requires minimum db changes and no migration as we holds all the user_courses related info in user_courses as progress as map<text,int> which defines if a particular content is read or completed. Hence there is no reliance on content_consumption.

Any call to update the user progress will be forwarded to both the table user_courses and content_consumption. All the user course progress would be read from user_courses whereas content_consumption would be holding data as audit and for content/state/read call.

Currently we first put data into content_consumption and then based on it the entry is made into user_courses which is creating problems.

Pros

content/state/read would be efficient with userid as partition key and batchid/courseid as cluster key
no migration needed
user_courses will hold sufficient info to show a user's current progress.

Notes

→ Since sunbird will no longer store course info, all the course info would be fetched at runtime from KP.

→ A user's current progress and updates will be calculated whenever the api call happens, hence would provide updated state

→ Since sunbird would not be holding course details (like leafnodecount), all the background processes as configured in analytics team would need to fetch latest course details and would need to calculate the progress based on stored progress.