Introduction:

This document describes the design for generating the data for the portal dashboards from the Druid OLAP data store and exporting the report data to cloud storage. It mainly consists of the following modules:

  1. Configure Report API - This API will be used to submit a request for the configuration of a new report.
  2. Disable Report API - This API will mark an existing report as disabled so that it is excluded from the list of reports to be executed.
  3. Report Data Generator - The report data generator will be a spark job that runs on a schedule and generates a report data file for each enabled report by executing the query configured in the report configuration against the druid data store. The report data file will then be exported to cloud storage to complete the report execution.

Configure Report API:

Input

  1. ReportName
  2. ChannelId
  3. ReportInterval (e.g. LAST_7_DAYS, LAST_MONTH, CUSTOM_DATES etc.; see the interval resolution sketch after this list)
  4. JsonQuery
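
The symbolic ReportInterval values would need to be resolved into concrete druid intervals at execution time. A minimal sketch of that resolution, assuming UTC day boundaries (the helper name ReportIntervalResolver and the exact boundary semantics are assumptions, not part of this design):

Code Block
languagescala
themeDJango
borderStylesolid
import java.time.{LocalDate, ZoneOffset}

object ReportIntervalResolver {

  // Resolves a symbolic ReportInterval value into the ISO-8601 interval
  // string that druid expects in the "intervals" field of a query.
  def resolve(reportInterval: String,
              endDate: LocalDate = LocalDate.now(ZoneOffset.UTC)): String =
    reportInterval match {
      case "LAST_7_DAYS" => s"${endDate.minusDays(7)}T00:00:00.000/${endDate}T00:00:00.000"
      case "LAST_MONTH"  => s"${endDate.minusMonths(1)}T00:00:00.000/${endDate}T00:00:00.000"
      // CUSTOM_DATES would be resolved from explicit start/end dates
      // submitted with the request instead of a rolling window.
      case other         => throw new IllegalArgumentException(s"Unsupported interval: $other")
    }
}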

Request Object

Code Block
languagejs
themeDJango
borderStylesolid
linenumberstrue
collapsefalse
{
    "id": "sunbird.analytics.report.submit",
    "ver": "1.0",
    "ts": "2019-03-07T12:40:40+05:30",
    "params": {
        "msgid": "4406df37-cd54-4d8a-ab8d-3939e0223580",
        "client_key": "analytics-team"
    },
    "request": {
        "channel_id": "in.ekstep",
        "report_name": "avg_collection_downloads",
        "report_interval": "LAST_7_DAYS",
        "query_json": {
            "queryType": "groupBy",
            "dataSource": "telemetry-events",
            "granularity": "day",
            "dimensions": [
                "eid"
            ],
            "aggregations": [
                { "type": "count", "name": "context_did", "fieldName": "context_did" }
            ],
            "filter": {
                "type": "and",
                "fields": [
                    { "type": "selector", "dimension": "eid", "value": "IMPRESSION" },
                    { "type": "selector", "dimension": "edata_type", "value": "detail" },
                    { "type": "selector", "dimension": "edata_pageid", "value": "collection-detail" },
                    { "type": "selector", "dimension": "context_pdata_id", "value": "prod.diksha.app" }
                ]
            },
            "postAggregations": [
                {
                    "type": "arithmetic",
                    "name": "avg__edata_value",
                    "fn": "/",
                    "fields": [
                        { "type": "fieldAccess", "name": "total_edata_value", "fieldName": "total_edata_value" },
                        { "type": "fieldAccess", "name": "rows", "fieldName": "rows" }
                    ]
                }
            ],
            "intervals": [
                "2019-02-20T00:00:00.000/2019-02-27T23:59:59.000"
            ]
        }
    }
}
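
Before the configuration is persisted, the API can sanity-check the submitted query_json. A minimal sketch, assuming Jackson is available on the classpath (the helper name and the exact checks are assumptions):

Code Block
languagescala
themeDJango
borderStylesolid
import com.fasterxml.jackson.databind.ObjectMapper

object QueryJsonValidator {

  private val mapper = new ObjectMapper()

  // Returns true only if query_json is well-formed JSON and carries the
  // keys every druid groupBy query must have.
  def isValid(queryJson: String): Boolean =
    try {
      val node = mapper.readTree(queryJson)
      node.has("queryType") && node.has("dataSource") && node.has("granularity")
    } catch {
      case _: Exception => false // malformed JSON is rejected up front
    }
}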

Approach 1:

The individual report configurations can be saved to a Cassandra table, and the druid query JSON for each report can be saved separately in Azure blob storage, e.g.:

druid/queries/report-id-1.txt
druid/queries/report-id-2.txt

(Alternatively, all of the queries can be kept in a single xml file, as described in Approach 2 below.)

The report generator spark job will iterate through the list of enabled reports, insert a record into the platform_db.job_request table for each, generate the report from druid and update the status of the execution once the execution is complete. We can also include the report execution frequency if necessary. The following will be the fields in the report configuration table (modelled in the sketch after the list):

  1. report_id
  2. report_name
  3. is_enabled
  4. report_interval
  5. report_configuration_location
  6. report_output_filename
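
For illustration, the rows of this table could be modelled in the spark job roughly as follows (a sketch; the class name is an assumption, the fields mirror the columns listed above):

Code Block
languagescala
themeDJango
borderStylesolid
// One row of the report configuration table in Approach 1.
case class ReportConfig(
  reportId: String,                    // report_id (primary key)
  reportName: String,                  // report_name
  isEnabled: Boolean,                  // is_enabled
  reportInterval: String,              // report_interval, e.g. LAST_7_DAYS
  reportConfigurationLocation: String, // report_configuration_location: blob path of the druid query
  reportOutputFilename: String         // report_output_filename
)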

Pros:

  1. Easier to edit the report query as it doesn't require any deployments.
  2. Managing individual report definition is easier as they are separated out.

Cons:

  1. As the number of reports increases, there will be too many druid query json files in the blob storage.
  2. The report configurations need to be managed in a separate Cassandra table. New fields like execution_frequency will require changes to the Cassandra table.

Approach 2:

A single report configuration xml file will store the individual report configurations. When a new report is configured, the xml file is read and the new report is appended to the existing list. Each report section will be self-sufficient, with all the required parameters for the report being part of the xml configuration file (a parsing sketch follows the example below).

Code Block
themeDJango
borderStylesolid
linenumberstrue

<?xml version = "1.0" encoding = "utf-8"?>
<reports>
   <report name="report1" id="report_1" enabled="true" output_filename="report1_output.json">
      <query>
         <![CDATA[
         {
            "queryType": "groupBy",
            "dataSource": "telemetry-events",
            "granularity": "day",
            "dimensions": [
               "eid"
            ],
            "aggregations": [
               { "type": "count", "name": "context_did", "fieldName": "context_did" }
            ],
            "intervals": [
               "$report_interval"
            ],
            "filter": {
               "type": "and",
               "fields": [
                  { "type": "selector", "dimension": "eid", "value": "IMPRESSION" },
                  { "type": "selector", "dimension": "edata_type", "value": "detail" },
                  { "type": "selector", "dimension": "edata_pageid", "value": "collection-detail" },
                  { "type": "selector", "dimension": "context_pdata_id", "value": "prod.diksha.app" }
               ]
            },
            "postAggregations": [{
               "type": "arithmetic",
               "name": "avg__edata_value",
               "fn": "/",
               "fields": [
                  { "type": "fieldAccess", "name": "total_edata_value", "fieldName": "total_edata_value" },
                  { "type": "fieldAccess", "name": "rows", "fieldName": "rows" }
               ]
            }]
         }
         ]]>
      </query>
      <query_params>
         <param name="datasource" value="telemetry-events" optional="false"/>
         <param name="granularity" value="day" optional="false"/>
         <param name="report_interval" value="LAST_7_DAYS"/>
         <param name="start_date" value="2019-03-01" optional="true"/>
         <param name="end_date" value="2019-03-07" optional="true"/>
      </query_params>
   </report>
</reports>
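
For illustration, a minimal sketch of reading this file back in the spark job and substituting the $-placeholders declared under query_params into the query body (assumes the scala-xml library; element and attribute names come from the example above, everything else is an assumption):

Code Block
languagescala
themeDJango
borderStylesolid
import scala.xml.XML

object ReportConfigReader {

  case class ReportDef(id: String, outputFilename: String, queryJson: String)

  // Loads the configuration file, keeps only enabled reports, and returns
  // each report's druid query with its query_params substituted in.
  def enabledReports(path: String): Seq[ReportDef] = {
    val root = XML.loadFile(path)
    (root \ "report")
      .filter(r => (r \ "@enabled").text == "true")
      .map { r =>
        val params = (r \ "query_params" \ "param")
          .map(p => (p \ "@name").text -> (p \ "@value").text)
          .toMap
        // CDATA body of <query>, with each $name placeholder replaced
        // by the corresponding param value.
        val query = params.foldLeft((r \ "query").text.trim) {
          case (q, (name, value)) => q.replace("$" + name, value)
        }
        ReportDef((r \ "@id").text, (r \ "@output_filename").text, query)
      }
  }
}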

Pros:

  1. All report configurations are in one place and hence easier to maintain.
  2. It is easier to apply changes to report parameters or add new parameters to an existing report.

Cons:

  1. Readability becomes an issue in the case of huge xml files.
  2. The entire report configuration needs to be read into memory when the job scheduler executes the report every day. 

Report Data Generator Data Product:

  1. Input
     • Cassandra table where the reports are configured or the XML file with the report configurations.
  2. Algorithm
     • Obtain a list of reports which are enabled and extract the druid query from the configuration.
     • Execute the druid query for each of the reports within the Spark job (a sketch follows this list).
  3. Output
     • Uploads the generated report with the configured filename to Azure blob storage.
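
A simplified sketch of a single report execution, assuming druid's native JSON-over-HTTP query endpoint on the broker (the object name, the URL and the error handling are simplified assumptions); the returned response would then be written to the configured output file and uploaded to Azure blob storage:

Code Block
languagescala
themeDJango
borderStylesolid
import java.io.OutputStreamWriter
import java.net.{HttpURLConnection, URL}
import scala.io.Source

object ReportExecutor {

  // POSTs the configured query json to the druid broker
  // (e.g. http://<broker-host>:8082/druid/v2/) and returns the raw
  // response body. Error handling is omitted for brevity.
  def runQuery(brokerUrl: String, queryJson: String): String = {
    val conn = new URL(brokerUrl).openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    val writer = new OutputStreamWriter(conn.getOutputStream)
    try { writer.write(queryJson) } finally { writer.close() }
    val body = Source.fromInputStream(conn.getInputStream).mkString
    conn.disconnect()
    body
  }
}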

Disable Report API:

  1. Input
     • report_id
  2. Output
     • The report will be disabled either in the Cassandra table or the XML report configuration file (a sketch follows this list).
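
For the Cassandra variant, the disable operation reduces to a single update on the configuration row. A sketch assuming the DataStax java driver and the table from Approach 1 (the keyspace/table name platform_db.report_config is an assumption); the XML variant would instead rewrite the enabled attribute of the matching <report> element:

Code Block
languagescala
themeDJango
borderStylesolid
import com.datastax.driver.core.Cluster

object DisableReport {

  // Marks the given report as disabled so the generator skips it.
  def disable(host: String, reportId: String): Unit = {
    val cluster = Cluster.builder().addContactPoint(host).build()
    val session = cluster.connect()
    try {
      session.execute(
        "UPDATE platform_db.report_config SET is_enabled = false WHERE report_id = ?",
        reportId)
    } finally {
      session.close()
      cluster.close()
    }
  }
}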