Collections - redefining data model for scale

Introduction:

The platform lets us create Collections by adding edges between content nodes. However, handling reads and writes became difficult once a collection graph grew to hundreds of nodes. This wiki covers the design and implementation details of the data model changes made to support scale.

Problem statement:

When a textbook has hundreds of children, constructing and returning the hierarchy structure, or updating it, takes too long. If the textbook is already published, identifying and fetching the hierarchy structure is even more complex, because the graph contains both the published nodes and the current edit-copy nodes.

A textbook with ~400 children takes ~8.5 sec to fetch its hierarchy.
A textbook with ~400 children takes 10 min to publish.

Key Design Problems:

Creating unit nodes in the graph that are always used under the root.

When a collection is created with multiple levels, the graph holds unit content nodes with visibility: Parent. All of these content nodes are accessible only through their root content, so as standalone graph nodes they add little functional value.

Publishing all the unit nodes.

Because unit content nodes are part of the collection graph, every unit content node must be published, from the leaves up to the top. As a result, the more unit nodes a collection has, the longer it takes to publish.

Editing a published collection creates more unit nodes.

On a published collection, any change to a content node of the collection creates an image of that node and applies the change to the image node. This adds more nodes and relations to the collection hierarchy structure.

Design:

When the collection content is published, the publish process should also add a relation, with metadata, between the resources and the root collection object.

Graph unit nodes as body

Storing the collection structure as a body in Cassandra will improve the performance of fetching the hierarchy and reduce the complexity of handling the collection graph structure.

The ECML content body has multiple stages and a manifest. Similarly, the Collection body will have the metadata of each node in a nodes map, plus a hierarchy.

ECML Body:

{
    "manifest": {
        "media": [{...}]
    },
    "stages": [{...}],
    ...
}

Collection Body:

{
    "nodes": {
        "tb_1": {
            "name": "TextBook 1",
            "contentType": "TextBook",
            "visibility": "Default",
            "status": "Draft",
            .....
        },
        "tbu_1": {
            "name": "Unit -1",
            "contentType":"TextBookUnit",
            "visibility": "Parent",
            ....
        },
        ...
    },
    "hierarchy": {
       "tb_1": ["tbu_1", "tbu_2", ...],
       "tbu_1": ["tbu_1_1",...],
       ...
    }
}
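
As a minimal sketch of how a reader could expand this flat body into a nested tree, the Python below walks hierarchy from a root identifier and attaches each node's metadata from nodes. The function name build_tree, the output keys identifier and children, and the sample values are assumptions made for illustration; they are not part of the platform API.

```python
from typing import Any, Dict, List


def build_tree(body: Dict[str, Any], root_id: str) -> Dict[str, Any]:
    """Expand the flat collection body into a nested tree rooted at root_id."""
    nodes: Dict[str, Dict[str, Any]] = body["nodes"]
    hierarchy: Dict[str, List[str]] = body["hierarchy"]

    def expand(node_id: str, path: frozenset) -> Dict[str, Any]:
        # Guard against a node that appears among its own ancestors.
        if node_id in path:
            raise ValueError(f"cycle detected at {node_id}")
        node = dict(nodes.get(node_id, {}))  # copy so the body is not mutated
        node["identifier"] = node_id
        node["children"] = [
            expand(child_id, path | {node_id})
            for child_id in hierarchy.get(node_id, [])
        ]
        return node

    return expand(root_id, frozenset())


# Sample body in the shape shown above (values are illustrative only).
body = {
    "nodes": {
        "tb_1": {"name": "TextBook 1", "contentType": "TextBook", "visibility": "Default"},
        "tbu_1": {"name": "Unit -1", "contentType": "TextBookUnit", "visibility": "Parent"},
        "tbu_1_1": {"name": "Unit -1.1", "contentType": "TextBookUnit", "visibility": "Parent"},
    },
    "hierarchy": {"tb_1": ["tbu_1"], "tbu_1": ["tbu_1_1"]},
}
tree = build_tree(body, "tb_1")
```

Because hierarchy stores only next-level children, the whole tree is rebuilt with one dictionary lookup per node and no graph traversal queries.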

Structure for collection body

The body structure of the collection should need minimal transformation on both read and write.

Before publishing, content mostly receives writes and updates; after publishing, it mostly receives reads. To support this, the body will use a different data structure before and after publishing.

After publishing, we always read either the textbook hierarchy or the hierarchy of one of its children (units with visibility: Parent). So, we save the full hierarchy of the textbook in one column, and the full hierarchy of each of its children in another column, in hierarchy_store.content_hierarchy.
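
A rough sketch of that split, assuming the published hierarchy is already available as a nested tree with identifier and children keys: the function below returns the full tree plus a map from each unit with visibility: Parent to its own full subtree. The function name published_columns and the keys hierarchy and unit_hierarchies are hypothetical; the actual column names in hierarchy_store.content_hierarchy are not given here.

```python
from typing import Any, Dict


def published_columns(tree: Dict[str, Any]) -> Dict[str, Any]:
    """Split a nested published tree into two read-optimised values.

    'hierarchy' holds the whole textbook tree; 'unit_hierarchies' maps every
    descendant with visibility Parent to its own full subtree.
    Both key names are hypothetical, not the real column names.
    """
    unit_hierarchies: Dict[str, Any] = {}

    def visit(node: Dict[str, Any]) -> None:
        for child in node.get("children", []):
            if child.get("visibility") == "Parent":
                unit_hierarchies[child["identifier"]] = child
            visit(child)

    visit(tree)
    return {"hierarchy": tree, "unit_hierarchies": unit_hierarchies}


# Illustrative published tree; "res_1" is a made-up resource identifier.
tree = {
    "identifier": "tb_1",
    "visibility": "Default",
    "children": [
        {
            "identifier": "tbu_1",
            "visibility": "Parent",
            "children": [
                {"identifier": "tbu_1_1", "visibility": "Parent", "children": []},
                {"identifier": "res_1", "visibility": "Default", "children": []},
            ],
        }
    ],
}
columns = published_columns(tree)
```

With this layout, a read for the textbook or for any single unit becomes one lookup, at the cost of storing each subtree more than once.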

Before publishing, or when the collection is in edit mode, writes dominate. So the body has two keys, nodes and hierarchy: nodes is a map (keyed by identifier) that holds only the metadata of each node, and hierarchy lists, for each identifier, only its next-level children by identifier. This makes updates easy.

Note: When deleting a node from the hierarchy structure, apply the logic below to decide whether to remove it from the nodes map.

If visibility: Parent - remove it.

If visibility: Default - check all the children lists and remove it only if no other list still references it, that is, if it was present only in the collection from which it is being removed.
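
The deletion rule can be sketched as follows, assuming the edit-mode body shape with nodes and hierarchy keys. The function name remove_child and the sample identifiers are assumptions made for illustration. The sketch only handles the node being removed; cleaning up the descendants of a removed unit is not specified above and is left out.

```python
from typing import Any, Dict


def remove_child(body: Dict[str, Any], parent_id: str, child_id: str) -> None:
    """Detach child_id from parent_id and clean up the nodes map in place."""
    nodes = body["nodes"]
    hierarchy = body["hierarchy"]

    # Drop the edge from the parent's children list.
    if parent_id in hierarchy:
        hierarchy[parent_id] = [c for c in hierarchy[parent_id] if c != child_id]

    visibility = nodes.get(child_id, {}).get("visibility")
    if visibility == "Parent":
        # Units exist only under this collection, so the metadata goes too.
        nodes.pop(child_id, None)
    elif visibility == "Default":
        # Keep the metadata while any other children list still references it.
        still_used = any(child_id in children for children in hierarchy.values())
        if not still_used:
            nodes.pop(child_id, None)


# Illustrative body: "res_1" is linked under two units.
body = {
    "nodes": {
        "tb_1": {"visibility": "Default"},
        "tbu_1": {"visibility": "Parent"},
        "tbu_2": {"visibility": "Parent"},
        "res_1": {"visibility": "Default"},
    },
    "hierarchy": {"tb_1": ["tbu_1", "tbu_2"], "tbu_1": ["res_1"], "tbu_2": ["res_1"]},
}

remove_child(body, "tbu_1", "res_1")
kept_after_first = "res_1" in body["nodes"]   # still referenced by tbu_2
remove_child(body, "tbu_2", "res_1")
kept_after_second = "res_1" in body["nodes"]  # no references remain
remove_child(body, "tb_1", "tbu_1")
unit_kept = "tbu_1" in body["nodes"]          # Parent visibility: removed
```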

ES indexing of the collection and its children

Until now, an ES document was created or updated for each node in the collection hierarchy because each node was saved in the Neo4j graph. With this design, the child units of a collection are saved in Cassandra instead. In all existing use cases, the child units are not required before publishing. So, the publish process will apply changes for the child nodes of a collection as follows.

Search ES for all documents where visibility: Parent and parent: current collection id, then compare the Cassandra data with the ES data:

- If a node is available only in ES - delete the document from ES.

- If a node is available only in Cassandra - create a new document in ES.

- If a node is available in both - update the ES document using the Cassandra metadata.
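
The comparison above amounts to a set difference on node identifiers. The sketch below assumes both sides have already been loaded into dictionaries keyed by identifier; it only plans the actions and does not call any ES or Cassandra client. The function name plan_es_sync and the sample identifiers are hypothetical.

```python
from typing import Any, Dict, List, Optional, Tuple

Action = Tuple[str, str, Optional[Dict[str, Any]]]


def plan_es_sync(
    es_docs: Dict[str, Dict[str, Any]],
    cassandra_nodes: Dict[str, Dict[str, Any]],
) -> List[Action]:
    """Return (action, identifier, metadata) for each child unit."""
    actions: List[Action] = []
    # Present only in ES: the unit no longer exists in the collection.
    for node_id in sorted(es_docs.keys() - cassandra_nodes.keys()):
        actions.append(("delete", node_id, None))
    # Present only in Cassandra: the unit was added since the last publish.
    for node_id in sorted(cassandra_nodes.keys() - es_docs.keys()):
        actions.append(("create", node_id, cassandra_nodes[node_id]))
    # Present in both: Cassandra metadata is the source of truth.
    for node_id in sorted(es_docs.keys() & cassandra_nodes.keys()):
        actions.append(("update", node_id, cassandra_nodes[node_id]))
    return actions


# Illustrative inputs (identifiers and names are made up).
es_docs = {"tbu_1": {"name": "Unit 1"}, "tbu_old": {"name": "Old unit"}}
cassandra_nodes = {"tbu_1": {"name": "Unit 1 (edited)"}, "tbu_2": {"name": "Unit 2"}}
plan = plan_es_sync(es_docs, cassandra_nodes)
```

Separating the planning step from the writes keeps the comparison easy to test and lets the resulting actions be sent to ES in a single batch if desired.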
