Indexing Content Model to Druid


Introduction:

In this wiki, we discuss methods to index the Content model into Druid and the challenges involved. Since the content model is not time-series data, and Druid does not support updating indexed records in place, querying the data becomes challenging. We discuss this in detail in the following sections.


  1. Content Data Model
  2. Index using Transaction Logs from Kafka
  3. Index using Elasticsearch Snapshot

1. Content Data Model:

The table below lists the content model fields that are indexed. (The content model has more fields than listed here; for simplicity, we have chosen a subset of the required fields.)


| Sno. | Field | Data Type | Field in Druid |
| --- | --- | --- | --- |
| 1 | author | String | author |
| 2 | badgeAssertions.assertionId | String | badgeAssertions_assertionId |
| 3 | badgeAssertions.badgeClassId | String | badgeAssertions_badgeClassId |
| 4 | badgeAssertions.badgeClassImage | String | badgeAssertions_badgeClassImage |
| 5 | badgeAssertions.badgeClassName | String | badgeAssertions_badgeClassName |
| 6 | badgeAssertions.badgeId | String | badgeAssertions_badgeId |
| 7 | badgeAssertions.createdTS | String | badgeAssertions_createdTS |
| 8 | badgeAssertions.issuerId | String | badgeAssertions_issuerId |
| 9 | badgeAssertions.status | String | badgeAssertions_status |
| 10 | board | String | board |
| 11 | channel | String | channel |
| 12 | compatibilityLevel | String | compatibilityLevel |
| 13 | contentType | String | contentType |
| 14 | createdBy | String | createdBy |
| 15 | createdFor | String | createdFor |
| 16 | createdOn | String | createdOn |
| 17 | creator | String | creator |
| 18 | dialcodes | String | dialcodes |
| 19 | framework | String | framework |
| 20 | gradeLevel | String | gradeLevel |
| 21 | identifier | String | identifier |
| 22 | keywords | String | keywords |
| 23 | language | String | language |
| 24 | lastPublishedBy | String | lastPublishedBy |
| 25 | lastPublishedOn | String | lastPublishedOn |
| 26 | lastSubmittedOn | String | lastSubmittedOn |
| 27 | lastUpdatedBy | String | lastUpdatedBy |
| 28 | lastUpdatedOn | String | lastUpdatedOn |
| 29 | license | String | license |
| 30 | mediaType | String | mediaType |
| 31 | medium | String | medium |
| 32 | mimeType | String | mimeType |
| 33 | name | String | name |
| 34 | objectType | String | objectType |
| 35 | organisation | String | organisation |
| 36 | origin | String | origin |
| 37 | owner | String | owner |
| 38 | pkgVersion | Long | pkgVersion |
| 39 | resourceType | String | resourceType |
| 40 | status | String | status |
| 41 | subject | String | subject |
| 42 | topic | String | topic |
| 43 | me_audiosCount | longSum | me_audiosCount |
| 44 | me_averageInteractionsPerMin | doubleSum | me_averageInteractionsPerMin |
| 45 | me_averageRating | doubleSum | me_averageRating |
| 46 | me_averageSessionsPerDevice | doubleSum | me_averageSessionsPerDevice |
| 47 | me_averageTimespentPerSession | doubleSum | me_averageTimespentPerSession |
| 48 | me_avgCreationTsPerSession | doubleSum | me_avgCreationTsPerSession |
| 49 | me_creationSessions | longSum | me_creationSessions |
| 50 | me_creationTimespent | doubleSum | me_creationTimespent |
| 51 | me_hierarchyLevel | longSum | me_hierarchyLevel |
| 52 | me_imagesCount | longSum | me_imagesCount |
| 53 | me_timespentDraft | doubleSum | me_timespentDraft |
| 54 | me_timespentReview | doubleSum | me_timespentReview |
| 55 | me_totalComments | longSum | me_totalComments |
| 56 | me_totalDevices | longSum | me_totalDevices |
| 57 | me_totalDialcodeAttached | longSum | me_totalDialcodeAttached |
| 58 | me_totalDialcodeLinkedToContent | longSum | me_totalDialcodeLinkedToContent |
| 59 | me_totalDownloads | longSum | me_totalDownloads |
| 60 | me_totalInteractions | longSum | me_totalInteractions |
| 61 | me_totalRatings | longSum | me_totalRatings |
| 62 | me_totalSessionsCount | longSum | me_totalSessionsCount |
| 63 | me_totalSideloads | longSum | me_totalSideloads |
| 64 | me_totalTimespent | doubleSum | me_totalTimespent |
| 65 | me_videosCount | longSum | me_videosCount |
| 66 | timestamp | Long | timestamp |
| 67 | version | Long | version |
| 68 | programId | String | programId |
| 69 | type | String | type |
| 70 | category | String | category |
| 71 | learningOutcome | | learningOutcome |
| 72 | qumlVersion | Long | qumlVersion |
| 73 | bloomsLevel | | bloomsLevel |
| 74 | rejectComment | String | rejectComment |

2. Index using Transaction Logs from Kafka:

The idea here is to index a snapshot of the Content model initially and then append updated records using the transaction logs (learning graph events) that the Learning platform publishes to Kafka whenever content is created, modified, or removed. Here, the index field would be `createdOn`. With this approach, querying becomes very difficult because every update operation creates a new version of the same record.
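
As a rough sketch, a Kafka supervisor spec along these lines could drive the streaming ingestion. The Overlord endpoint, the topic name, and the reduced dimension/metric lists below are placeholders for illustration, and the exact spec format depends on the Druid version:

```python
import requests

# Placeholder Overlord endpoint; adjust to the actual deployment.
OVERLORD_URL = "http://localhost:8090/druid/indexer/v1/supervisor"

# Minimal Kafka supervisor spec (Druid Kafka indexing service).
supervisor_spec = {
    "type": "kafka",
    "dataSchema": {
        "dataSource": "content-model",
        "parser": {
            "type": "string",
            "parseSpec": {
                "format": "json",
                # createdOn is used as the index (timestamp) field.
                "timestampSpec": {"column": "createdOn", "format": "auto"},
                "dimensionsSpec": {
                    "dimensions": ["identifier", "status", "contentType", "channel"]
                },
            },
        },
        "metricsSpec": [
            {"type": "longSum", "name": "me_totalDownloads",
             "fieldName": "me_totalDownloads"}
        ],
        "granularitySpec": {"segmentGranularity": "DAY", "queryGranularity": "NONE"},
    },
    "ioConfig": {
        "topic": "learning.graph.events",  # placeholder topic name
        "consumerProperties": {"bootstrap.servers": "localhost:9092"},
        "taskCount": 1,
    },
}

resp = requests.post(OVERLORD_URL, json=supervisor_spec)
resp.raise_for_status()
print(resp.json())  # returns the supervisor id on success
```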

We can overcome the query problem for updated records using Druid's First/Last aggregators, but this requires populating an additional metric field (for instance, transforming the `status` field into a `status_metric` field) so that the most recently updated record can be selected by timestamp. It also requires intercepting every query to add the first/last aggregation so that it works only on the most recent records; otherwise the query would fetch all versions of the same record. Because of this, the approach is tedious and error-prone.
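
For instance, a groupBy query with a last aggregator along these lines could return only the latest `status` per `identifier`. This is a sketch, not the exact query we run; the broker endpoint is a placeholder and `status_metric` is the transformed field described above:

```python
import requests

BROKER_URL = "http://localhost:8082/druid/v2"  # placeholder broker endpoint

# Group by the record identifier and keep only the value from the most
# recent row (by __time) via a last aggregator, so that older versions
# of the same record are collapsed away.
query = {
    "queryType": "groupBy",
    "dataSource": "content-model",
    "granularity": "all",
    "intervals": ["1901-01-01/2101-01-01"],
    "dimensions": ["identifier"],
    "aggregations": [
        {"type": "stringLast", "name": "latest_status", "fieldName": "status_metric"}
    ],
}

resp = requests.post(BROKER_URL, json=query)
resp.raise_for_status()
for row in resp.json():
    print(row["event"]["identifier"], row["event"]["latest_status"])
```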

Pros:

  1. Low maintenance, since a streaming job indexes the data continuously.
  2. The indexed data stays in a consistent state.

Cons:

  1. Multiple versions of the same record accumulate and become less useful as the data grows.
  2. Creates many partitions within a segment due to modifications.
  3. Difficult to query due to multiple versions of the same record.


3. Index using Elasticsearch Snapshot:

Here the approach is to take a snapshot of the content model index from Elasticsearch and load it into Druid. Each run is a one-time batch activity, repeated periodically (for instance, once a day) to keep the data consistent. Here the index field would be `timestamp`, captured when the script starts to execute. During indexing, we can delete the older snapshots previously indexed into Druid, since they are no longer useful.
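
A minimal sketch of such a snapshot script, assuming the official Elasticsearch Python client and Druid's native batch (`index`) task; the hostnames, file paths, and spec format (which varies across Druid versions) are placeholders:

```python
import json
import time

import requests
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

ES_HOST = "http://localhost:9200"                              # placeholder
OVERLORD_URL = "http://localhost:8090/druid/indexer/v1/task"   # placeholder
SNAPSHOT_DIR, SNAPSHOT_FILE = "/tmp", "content_snapshot.json"

# 1. Export the content model index from Elasticsearch, stamping every
#    record with the script start time; this becomes the Druid index field.
start_ts = int(time.time() * 1000)
es = Elasticsearch(ES_HOST)
with open(f"{SNAPSHOT_DIR}/{SNAPSHOT_FILE}", "w") as out:
    for doc in scan(es, index="compositesearch",
                    query={"query": {"match_all": {}}}):
        record = doc["_source"]
        record["timestamp"] = start_ts
        out.write(json.dumps(record) + "\n")

# 2. Submit a native batch ingestion task pointing at the snapshot file.
task_spec = {
    "type": "index",
    "spec": {
        "dataSchema": {
            "dataSource": "content-model-snapshot",
            "parser": {
                "type": "string",
                "parseSpec": {
                    "format": "json",
                    "timestampSpec": {"column": "timestamp", "format": "millis"},
                    "dimensionsSpec": {"dimensions": []},  # discover dimensions
                },
            },
            "granularitySpec": {"segmentGranularity": "DAY",
                                "queryGranularity": "NONE"},
        },
        "ioConfig": {
            "type": "index",
            "firehose": {"type": "local", "baseDir": SNAPSHOT_DIR,
                         "filter": SNAPSHOT_FILE},
        },
    },
}
resp = requests.post(OVERLORD_URL, json=task_spec)
resp.raise_for_status()
print(resp.json())  # task id
```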

Time taken to create a snapshot from the Elasticsearch index: < 1 min

Time taken to ingest 493k records (compositesearch index): < 5 min

Re-indexing: Periodically, we have to replace the Druid datasource with a new content snapshot. We can delete the old datasource from Druid and create a new one with a fresh copy of the content model snapshot.
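
Sketching this with Druid's Coordinator API (host and datasource name are placeholders): marking the datasource unused disables it for queries, and deleting an interval issues a kill task that removes the old segments permanently, after which the batch ingestion can run again with a fresh snapshot.

```python
import requests

COORDINATOR = "http://localhost:8081"      # placeholder coordinator host
DATASOURCE = "content-model-snapshot"

# Mark all segments of the old datasource as unused (hides it from queries).
resp = requests.delete(
    f"{COORDINATOR}/druid/coordinator/v1/datasources/{DATASOURCE}")
resp.raise_for_status()

# Kill the unused segments over the whole interval (permanent deletion).
resp = requests.delete(
    f"{COORDINATOR}/druid/coordinator/v1/datasources/{DATASOURCE}"
    "/intervals/1901-01-01_2101-01-01")
resp.raise_for_status()
```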

Pros:

  1. Creates only one segment, since it is a batch index.
  2. No multiple versions of the same record are created.
  3. Querying is easy and performant.

Cons:

  1. It is a batch operation, so the data is only as fresh as the last snapshot.

Conclusion:

We tried out both approaches (2 and 3) and found that indexing the transaction logs (2) requires more work to make the data easy to query from Superset or the Druid API, and it also creates many segments and partitions, which can slow down query responses. We are going ahead with the third approach, indexing the content model snapshot via the script, because of the advantages mentioned above.