This document describes the design for generating the data for the portal dashboards from the Druid OLAP data store and exporting the report data to cloud storage. It mainly consists of the following modules:
- Configure Report API - This API will be used to submit a request to configure a new report.
- Job Scheduler Engine - This Scheduler will submit the reports for execution based on execution frequency.
- Disable Report API - This API will mark an existing report as disabled so that it is excluded from the list of reports to be executed.
- Report Data Generator - The report data generator will be a Spark job that generates the report data file by executing the query configured in the report configuration against the Druid data store. The report data file will then be exported to cloud storage to complete the report execution.
Configure Report API:
- Input:
  - ReportName
  - QueryEngine - druid/cassandra/ES
  - ExecutionFrequency - Daily/Weekly/Monthly
  - ChannelId
  - ReportInterval
  - Query
  - OutputFormat - json/csv
Request Object
```json
{
    "id":"sunbird.analytics.report.submit",
    "ver":"1.0",
    "ts":"2019-03-07T12:40:40+05:30",
    "params":{
        "msgid":"4406df37-cd54-4d8a-ab8d-3939e0223580",
        "client_key":"analytics-team"
    },
    "request":{
        "channel_id":"in.ekstep",
        "report_name":"avg_collection_downloads",
        "query_engine":"druid",
        "execution_frequency":"Daily",
        "report_interval":"LAST_7_DAYS",
        "output_format":"json",
        "query_json":{
            "queryType":"groupBy",
            "dataSource":"telemetry-events",
            "granularity":"day",
            "dimensions":[ "eid" ],
            "aggregations":[
                { "type":"count", "name":"context_did", "fieldName":"context_did" }
            ],
            "filter":{
                "type":"and",
                "fields":[
                    { "type":"selector", "dimension":"eid", "value":"IMPRESSION" },
                    { "type":"selector", "dimension":"edata_type", "value":"detail" },
                    { "type":"selector", "dimension":"edata_pageid", "value":"collection-detail" },
                    { "type":"selector", "dimension":"context_pdata_id", "value":"prod.diksha.app" }
                ]
            },
            "postAggregations":[
                {
                    "type":"arithmetic",
                    "name":"avg__edata_value",
                    "fn":"/",
                    "fields":[
                        { "type":"fieldAccess", "name":"total_edata_value", "fieldName":"total_edata_value" },
                        { "type":"fieldAccess", "name":"rows", "fieldName":"rows" }
                    ]
                }
            ],
            "intervals":[
                "2019-02-20T00:00:00.000/2019-02-27T23:59:59.000"
            ]
        }
    }
}
```
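As a sketch of how the API might validate an incoming request, the following checks the request fields against the allowed values listed above. The function name, error format, and the set of required fields are illustrative assumptions, not part of the actual API.

```python
# Sketch of server-side validation for the Configure Report API request.
# Allowed values come from the field list above; names are illustrative.

VALID_ENGINES = {"druid", "cassandra", "ES"}
VALID_FREQUENCIES = {"Daily", "Weekly", "Monthly"}
VALID_FORMATS = {"json", "csv"}

REQUIRED_FIELDS = [
    "channel_id", "report_name", "query_engine",
    "execution_frequency", "report_interval", "output_format", "query_json",
]

def validate_report_request(request: dict) -> list:
    """Return a list of validation errors; an empty list means valid."""
    errors = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in request]
    if not errors:
        if request["query_engine"] not in VALID_ENGINES:
            errors.append("query_engine must be one of druid/cassandra/ES")
        if request["execution_frequency"] not in VALID_FREQUENCIES:
            errors.append("execution_frequency must be Daily/Weekly/Monthly")
        if request["output_format"] not in VALID_FORMATS:
            errors.append("output_format must be json/csv")
    return errors
```

A valid request (such as the sample above) yields an empty error list; missing or out-of-range fields are reported individually.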
- Output:
The individual report configurations can be saved to a Cassandra table. The Druid query JSON will be saved to Azure blob storage, and the following will be the fields in the report configuration table:
- report_id
- report_name
- is_enabled
- report_interval
- report_configuration_location
- report_output_filename
Pros:
- Easier to edit the report query as it doesn't require any deployments.
- Managing individual report definition is easier as they are separated out.
Cons:
- As the number of reports increases, there will be too many Druid query JSON files in blob storage.
- The report configurations need to be managed in a separate Cassandra table, and new fields like execution_frequency will require changes to the Cassandra table.
An alternative is a single report configuration XML file that stores the individual report configurations. The XML file is read and new reports are appended to the existing list. Each report section is self-sufficient, with all the parameters required for the report being part of the XML configuration file.
```xml
<?xml version="1.0" encoding="utf-8"?>
<reports>
    <report name="report1" id="report_1" enabled="true" output_filename="report1_output.json">
        <query>
            <![CDATA[
            {
                "queryType":"groupBy",
                "dataSource":"telemetry-events",
                "granularity":"day",
                "dimensions":[ "eid" ],
                "aggregations":[
                    { "type":"count", "name":"context_did", "fieldName":"context_did" }
                ],
                "intervals":[ "$report_interval" ],
                "filter":{
                    "type":"and",
                    "fields":[
                        { "type":"selector", "dimension":"eid", "value":"IMPRESSION" },
                        { "type":"selector", "dimension":"edata_type", "value":"detail" },
                        { "type":"selector", "dimension":"edata_pageid", "value":"collection-detail" },
                        { "type":"selector", "dimension":"context_pdata_id", "value":"prod.diksha.app" }
                    ]
                },
                "postAggregations":[
                    {
                        "type":"arithmetic",
                        "name":"avg__edata_value",
                        "fn":"/",
                        "fields":[
                            { "type":"fieldAccess", "name":"total_edata_value", "fieldName":"total_edata_value" },
                            { "type":"fieldAccess", "name":"rows", "fieldName":"rows" }
                        ]
                    }
                ]
            }
            ]]>
        </query>
        <query_params>
            <param name="datasource" value="telemetry-events" optional="false"/>
            <param name="granularity" value="day" optional="false"/>
            <param name="report_interval" value="LAST_7_DAYS"/>
            <param name="start_date" value="2019-03-01" optional="true"/>
            <param name="end_date" value="2019-03-07" optional="true"/>
        </query_params>
    </report>
</reports>
```

Schema of the platform_db.druid_report_config table:

```sql
CREATE TABLE platform_db.druid_report_config (
    report_id text,
    report_name text,
    report_interval text,
    query_engine text,
    execution_frequency text,
    is_enabled boolean,
    report_query_location text,
    report_output_format text,
    report_output_location text,
    report_output_filename text,
    PRIMARY KEY (report_id)
);
```
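The substitution of query parameters such as $report_interval into the CDATA query body can be sketched with Python's standard library. The helper name and return shape are illustrative assumptions; the element and attribute names follow the sample XML.

```python
import xml.etree.ElementTree as ET
from string import Template

# Sketch: load the report configuration XML and substitute query_params
# values (e.g. $report_interval) into each report's CDATA query body.

def load_enabled_reports(xml_text: str) -> list:
    """Return (report_id, resolved_query) pairs for all enabled reports."""
    reports = []
    root = ET.fromstring(xml_text)
    for report in root.findall("report"):
        if report.get("enabled") != "true":
            continue  # disabled reports are skipped
        query_template = Template(report.findtext("query"))
        params = {
            p.get("name"): p.get("value")
            for p in report.find("query_params").findall("param")
        }
        # safe_substitute leaves unknown placeholders intact
        reports.append((report.get("id"), query_template.safe_substitute(params)))
    return reports
```

This keeps the whole configuration in memory only while resolving it, which matches the trade-off noted in the cons below.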
Pros:
- All report configurations are in one place and hence easier to maintain.
- It is easier to apply changes to report parameters or add new parameters to an existing report.
Cons:
- Readability becomes an issue for large XML files.
- The entire report configuration needs to be read into memory when the job scheduler executes the report every day.
Report Data Generator Data Product
- Input
- Cassandra table where the reports are configured or the XML file with the report configurations.
- Algorithm
- Obtain a list of reports which are enabled and extract the druid query from the configuration.
- Execute the druid query for each of the reports within the Spark job.
- Output
- Uploads the generated report with the configured filename to Azure blob storage.
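The algorithm above can be sketched as a plain loop. The query executor and upload callables stand in for the Druid HTTP call and the Azure blob upload, both of which are assumptions here; in the actual design this logic runs inside the Spark job.

```python
import json

# Sketch of the Report Data Generator loop: run each enabled report's
# query and upload the result file. Callables are injected so the Druid
# and Azure specifics stay outside this sketch.

def generate_reports(report_configs, execute_query, upload):
    """report_configs: dicts with report_id, is_enabled, query_json,
                       report_output_filename.
    execute_query:  callable(query_json) -> list of result rows.
    upload:         callable(path, content) -> None.
    Returns the list of uploaded paths."""
    generated = []
    for config in report_configs:
        if not config.get("is_enabled"):
            continue  # disabled reports are excluded from execution
        rows = execute_query(config["query_json"])
        path = "/druid-reports/{}/{}".format(
            config["report_id"], config["report_output_filename"])
        upload(path, json.dumps(rows))
        generated.append(path)
    return generated
```

Injecting the two callables also makes the loop easy to unit test without a live Druid cluster or storage account.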
Output location and file format of query:

```
/druid-reports/query/druid/report-id.json
/druid-reports/query/cassandra/report-id.cql
```
Report Intervals:

```
YESTERDAY
LAST_7_DAYS
LAST_30_DAYS
LAST_MONTH
LAST_6_MONTHS
LAST_YEAR
StartDate(DD/MM/YYYY)-EndDate(DD/MM/YYYY)
```
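A sketch of resolving these interval tokens into concrete date ranges. Only a subset of the tokens is shown, and the inclusive-end boundary semantics are an assumption, not part of the specification above.

```python
from datetime import date, timedelta

# Sketch: map report interval tokens to (start_date, end_date) pairs.
# LAST_MONTH / LAST_6_MONTHS / LAST_YEAR are omitted for brevity.

def resolve_interval(token: str, today: date) -> tuple:
    """Return (start_date, end_date) for a report interval token."""
    yesterday = today - timedelta(days=1)
    if token == "YESTERDAY":
        return (yesterday, yesterday)
    if token == "LAST_7_DAYS":
        return (today - timedelta(days=7), yesterday)
    if token == "LAST_30_DAYS":
        return (today - timedelta(days=30), yesterday)
    if "-" in token:
        # Explicit StartDate(DD/MM/YYYY)-EndDate(DD/MM/YYYY) range
        start, end = token.split("-")
        to_date = lambda s: date(*reversed([int(x) for x in s.split("/")]))
        return (to_date(start), to_date(end))
    raise ValueError("unsupported interval token: " + token)
```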
Job Scheduler Engine:
- Input:
  - The platform_db.druid_report_config Cassandra table.
- Output:
  - Based on the execution_frequency in platform_db.druid_report_config, submits a request for execution by inserting a record into the platform_db.job_request Cassandra table.
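The scheduler's frequency check can be sketched as follows. The choice of Monday for Weekly runs and the 1st of the month for Monthly runs is an assumption; the design only specifies the three frequency values.

```python
from datetime import date

# Sketch of the Job Scheduler Engine: decide which enabled reports are
# due today and insert a job_request record for each.

def is_due(execution_frequency: str, today: date) -> bool:
    """Return True if a report with this frequency should run today."""
    if execution_frequency == "Daily":
        return True
    if execution_frequency == "Weekly":
        return today.weekday() == 0   # assumed: Mondays
    if execution_frequency == "Monthly":
        return today.day == 1         # assumed: first day of the month
    return False

def submit_due_reports(configs, today, insert_job_request):
    """Insert a SUBMITTED job_request record for each due, enabled report."""
    submitted = []
    for config in configs:
        if config.get("is_enabled") and is_due(config["execution_frequency"], today):
            insert_job_request({"job_name": "druid-reports",
                                "status": "SUBMITTED",
                                "report_id": config["report_id"]})
            submitted.append(config["report_id"])
    return submitted
```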
Disable Report API:
- Input:
  - report-id
- Output:
  - The report will be disabled in the platform_db.druid_report_config Cassandra table.
Report Data Generator Data Product:
- Input:
  - Set of requests, i.e. all records in platform_db.job_request where status=SUBMITTED and job_name=druid-reports.
- Output:
  - The report data file will be saved to Azure in the specified format. The platform_db.job_request table will be updated with the job status, and the output file details will be updated in platform_db.druid_report_config.
Output location and file format in Azure:

Once a request has been submitted and processing is complete, the report data file, with the file name being the report name suffixed with the generation date, is saved under:

```
/druid-reports/report-id/report-name-mmddyyyy.csv
/druid-reports/report-id/report-name-mmddyyyy.json
```
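A sketch of building this output path from a report's configuration. The mmddyyyy suffix follows the format above; the function name is illustrative.

```python
from datetime import date

# Sketch: construct the Azure output path, with the file name being the
# report name suffixed with the generation date in mmddyyyy form.

def report_output_path(report_id: str, report_name: str,
                       output_format: str, generated_on: date) -> str:
    """Return /druid-reports/report-id/report-name-mmddyyyy.<ext>."""
    suffix = generated_on.strftime("%m%d%Y")
    return "/druid-reports/{}/{}-{}.{}".format(
        report_id, report_name, suffix, output_format)
```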