Introduction:

This document describes the design for generating the data for the portal dashboards from the Druid OLAP data store and exporting the report data to cloud storage. The design consists mainly of the following modules:

  1. Configure Report API - This API will be used to submit a request to configure a new report.
  2. Job Scheduler Engine - This scheduler will submit the reports for execution based on their execution frequency.
  3. Disable Report API - This API will mark an existing report as disabled, so that it is excluded from the list of reports to be executed.
  4. Report Data Generator - The report data generator will be a Spark job that executes the query configured in the report configuration against the Druid data store and generates a report data file. The report data file will then be exported to cloud storage to complete the report execution.


Configure Report API:

  • Input:

- ReportName

- QueryEngine - druid/cassandra/ES

- ExecutionFrequency - DAILY/WEEKLY/MONTHLY

- ChannelId

- ReportInterval

- Query

- OutputFormat - json/csv

- OutputFilePattern - list of placeholders in filename creation

  • Request Object

     {
      "id":"sunbird.analytics.report.submit",
      "ver":"1.0",
      "ts":"2019-03-07T12:40:40+05:30",
      "params":{
         "msgid":"4406df37-cd54-4d8a-ab8d-3939e0223580",
         "client_key":"analytics-team"
      },
      "request":{
         "channel_id":"in.ekstep",
         "report_name":"avg_collection_downloads",
         "query_engine": "druid",
         "execution_frequency": "DAILY",
         "report_interval":"LAST_7_DAYS",
         "output_format": "json",
         "query_json":{
            "queryType":"groupBy",
            "dataSource":"telemetry-events",
            "granularity":"day",
            "dimensions":[
               "eid"
            ],
            "aggregations":[
               { "type":"count", "name":"context_did", "fieldName":"context_did" }
            ],
            "filter":{
               "type":"and",
               "fields":[
                  { "type":"selector", "dimension":"eid", "value":"IMPRESSION" },
                  { "type":"selector", "dimension":"edata_type", "value":"detail" },
                  { "type":"selector", "dimension":"edata_pageid", "value":"collection-detail" },
                  { "type":"selector", "dimension":"context_pdata_id", "value":"prod.diksha.app" }
               ]
            },
            "postAggregations":[
               {
                  "type":"arithmetic",
                  "name":"avg__edata_value",
                  "fn":"/",
                  "fields":[
                     { "type":"fieldAccess", "name":"total_edata_value", "fieldName":"total_edata_value" },
                     { "type":"fieldAccess", "name":"rows", "fieldName":"rows" }
                  ]
               }
            ],
            "intervals":[
               "2019-02-20T00:00:00.000/2019-01-27T23:59:59.000"
            ]
         }
      }
     }
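
A request like the one above could be checked before it is stored. The following is a minimal sketch, not the actual API implementation: the function name `validate_report_request` is hypothetical, the allowed values come from the input list, and treating every field of the sample request as required is an assumption.

```python
# Field names follow the sample request object; treating all of them as
# required is an assumption made for this sketch.
REQUIRED_FIELDS = (
    "channel_id",
    "report_name",
    "query_engine",
    "execution_frequency",
    "report_interval",
    "output_format",
    "query_json",
)

# Allowed values as listed in the API input description.
ALLOWED_VALUES = {
    "query_engine": {"druid", "cassandra", "ES"},
    "execution_frequency": {"DAILY", "WEEKLY", "MONTHLY"},
    "output_format": {"json", "csv"},
}


def validate_report_request(request):
    """Return a list of problems found; an empty list means the request passed."""
    errors = []
    for field in REQUIRED_FIELDS:
        if field not in request:
            errors.append(f"missing field: {field}")
    for field, allowed in ALLOWED_VALUES.items():
        value = request.get(field)
        if value is not None and value not in allowed:
            errors.append(f"invalid {field}: {value!r}")
    return errors
```

The function receives the content of the "request" key and collects all problems instead of stopping at the first one, so a caller can report them together.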
     
  • Output:

Each report configuration will be saved to a Cassandra table, and the query itself will be saved to Azure blob storage. The report configuration table will have the following fields.


   -- Schema of table
   CREATE TABLE platform_db.druid_reports_configuration (
     report_id text,
     report_name text,
     report_interval text,
     query_engine text,
     cron_expression text,
     status text,
     report_query_location text,
     report_output_format text,
     report_output_location text,
     report_output_filename text,
     report_output_file_pattern list<text>,
     PRIMARY KEY (report_id)
   );
  • Location and file format of query in azure:

    /druid-reports/query/druid/report-id.json
    /druid-reports/query/cassandra/report-id.cql
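
The value stored in report_query_location could be derived from the query engine and the report id. This is a small sketch under that assumption; `query_location` is a hypothetical helper, and only the two engines whose file formats are listed above are handled.

```python
# File extensions for the two engines whose query locations are documented.
QUERY_FILE_EXTENSIONS = {"druid": "json", "cassandra": "cql"}


def query_location(query_engine, report_id):
    """Build the blob path of the stored query for a report."""
    try:
        extension = QUERY_FILE_EXTENSIONS[query_engine]
    except KeyError:
        # No file format is documented for other engines, so refuse to guess.
        raise ValueError(f"no query file format known for {query_engine!r}") from None
    return f"/druid-reports/query/{query_engine}/{report_id}.{extension}"
```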
    
  • Report Intervals:

       YESTERDAY
       LAST_7_DAYS
       LAST_30_DAYS
       LAST_MONTH
       LAST_6_MONTHS
       LAST_YEAR 
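
Each report interval name has to be turned into a concrete Druid interval at execution time. The sketch below shows one possible mapping. The exact boundaries are assumptions, not part of this design: day-based intervals end at the start of the execution day, month-based intervals cover whole calendar months before the current month, and LAST_YEAR covers the previous calendar year. `resolve_interval` is a hypothetical name.

```python
from datetime import date, timedelta


def _shift_months(first_of_month, months):
    """Move a first-of-month date back by the given number of calendar months."""
    index = first_of_month.year * 12 + (first_of_month.month - 1) - months
    return date(index // 12, index % 12 + 1, 1)


def resolve_interval(name, today):
    """Map a report interval name to an ISO-8601 'start/end' interval string."""
    day_counts = {"YESTERDAY": 1, "LAST_7_DAYS": 7, "LAST_30_DAYS": 30}
    month_counts = {"LAST_MONTH": 1, "LAST_6_MONTHS": 6}
    if name in day_counts:
        start, end = today - timedelta(days=day_counts[name]), today
    elif name in month_counts:
        end = today.replace(day=1)
        start = _shift_months(end, month_counts[name])
    elif name == "LAST_YEAR":
        start, end = date(today.year - 1, 1, 1), date(today.year, 1, 1)
    else:
        raise ValueError(f"unknown report interval: {name!r}")
    # The end of the interval is treated as exclusive.
    return f"{start.isoformat()}T00:00:00.000/{end.isoformat()}T00:00:00.000"
```

For example, `resolve_interval("LAST_7_DAYS", date(2019, 3, 7))` yields `2019-02-28T00:00:00.000/2019-03-07T00:00:00.000` under these assumptions.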
  • Available placeholders for output file pattern:

- ReportName

- Date

- Timestamp
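
The output file pattern is a list of placeholders, so the file name could be produced by substituting each placeholder and joining the parts. The sketch below is illustrative only: `render_filename` is a hypothetical helper, and the hyphen separator, the mmddyyyy date format and the yyyymmddHHMMSS timestamp format are assumptions.

```python
from datetime import datetime


def render_filename(pattern, report_name, now, extension):
    """Build an output file name from a list of placeholders."""
    values = {
        "ReportName": report_name,
        "Date": now.strftime("%m%d%Y"),             # assumed mmddyyyy
        "Timestamp": now.strftime("%Y%m%d%H%M%S"),  # assumed format
    }
    unknown = [p for p in pattern if p not in values]
    if unknown:
        raise ValueError(f"unknown placeholders: {unknown}")
    return "-".join(values[p] for p in pattern) + "." + extension
```

A pattern of `["ReportName", "Date"]` for the report avg_collection_downloads generated on 7 March 2019 would give `avg_collection_downloads-03072019.csv`.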



Job Scheduler Engine:



  • Input:

- The list of reports in the druid_reports_configuration Cassandra table whose cron_expression falls within the current day of execution.

  • Algorithm:

- The data availability check has the following 2 criteria:

    1. Kafka indexing lag: the lag in Druid ingestion should be 0.

    2. Druid segments count: a threshold range will be set based on the trend of previous segment counts. The segment count for the previous day should fall within that range.

- Reports based on telemetry-events will be submitted for execution once both criteria are satisfied.

- Reports based on summary-events will be submitted for execution once the 2nd criterion is satisfied, or after checking for the files in Azure.
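
The decision logic above can be sketched as a pure function. This is an illustration only: `is_data_available` and its parameters are hypothetical, the lag, segment count and threshold range are assumed to be fetched elsewhere, and treating the Azure file check as an alternative to the segment count check for summary-events is one reading of the rule.

```python
def is_data_available(data_source, kafka_lag, segment_count, segment_range,
                      files_present=False):
    """Decide whether reports on a data source can be submitted for execution."""
    low, high = segment_range
    # Criterion 2: the previous day's segment count lies in the expected range.
    segments_ok = low <= segment_count <= high
    if data_source == "telemetry-events":
        # Both criteria are required: zero ingestion lag and a normal segment count.
        return kafka_lag == 0 and segments_ok
    if data_source == "summary-events":
        # Only criterion 2 is required; files in Azure are accepted as an alternative.
        return segments_ok or files_present
    raise ValueError(f"unknown data source: {data_source!r}")
```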

  • Output:

- The reports are submitted for execution by adding them to the platform_db.job_request Cassandra table with status=SUBMITTED and job_name=druid-reports-<report-id>.

Disable Report API:

  • Input:

- report-id

  • Output:

- The report will be disabled in the platform_db.druid_reports_configuration Cassandra table

Report Data Generator Data Product:

  • Input:

- Set of requests, i.e. all records in platform_db.job_request where status=SUBMITTED and job_name starts with druid-reports
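
Selecting the input requests amounts to a filter on status and job_name. The sketch below works on plain dictionaries standing in for rows of platform_db.job_request. The helper name `pending_report_ids` is hypothetical, and the report id is recovered from the job_name convention druid-reports-<report-id>.

```python
JOB_NAME_PREFIX = "druid-reports-"


def pending_report_ids(job_requests):
    """Return the report ids of all submitted druid report job requests."""
    return [
        row["job_name"][len(JOB_NAME_PREFIX):]
        for row in job_requests
        if row.get("status") == "SUBMITTED"
        and row.get("job_name", "").startswith(JOB_NAME_PREFIX)
    ]
```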

  • Output:

- The report data file will be saved in Azure in the specified format.

- The platform_db.job_request table will be updated with the job status, and the output file details will be updated in platform_db.druid_reports_configuration.

  • Output location and file format in Azure:

Once a submitted request has been processed, the report data file is saved under the following location. The file name is the report name suffixed with the generation date:

   /druid-reports/report-id/report-name-mmddyyyy.csv
   /druid-reports/report-id/report-name-mmddyyyy.json

