Problem Statement:

...

Use the Elasticsearch Scroll API. The Scroll API can be used to retrieve large numbers of results (or even all results) from a single search request; it works in the same way as a cursor on a traditional database.


Pros
  1. We can retrieve large data sets.
  2. We can slice the data based on shards.

Cons
  1. We cannot use the Scroll API for real-time user requests.
  2. Performance issues.
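The slicing mentioned in the pros can be illustrated with a small sketch. It assumes the Elasticsearch sliced-scroll request shape, where each search body carries a "slice" object with an "id" and a "max"; the helper name sliced_scroll_bodies and its parameters are hypothetical and only build the request bodies, they do not send anything.

```python
import copy


def sliced_scroll_bodies(query, num_slices, size=1000):
    """Build one scroll search body per slice (hypothetical helper).

    Each body is meant to be sent to the same search path with ?scroll=...,
    so that the slices can be scrolled independently, for example in parallel.
    """
    # Assumption of this sketch: slicing only makes sense with 2 or more slices.
    if num_slices < 2:
        raise ValueError("num_slices must be at least 2")
    bodies = []
    for slice_id in range(num_slices):
        bodies.append({
            "slice": {"id": slice_id, "max": num_slices},
            "query": copy.deepcopy(query),  # avoid sharing one dict across bodies
            "size": size,
        })
    return bodies
```

Each returned body would then be scrolled separately until its own slice is exhausted.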


Example:

...

Code Block
Path: /{{IndexName}}/{{type}}/_search?scroll=1m
Request Data:
{
  "query": {
    // contains the query required to fetch the data
  },
  "size": 1000
}
Returns → {"_scroll_id": "SCROLL ID", "hits": ["data"]}

After receiving the scroll id, send the following request repeatedly until all the results have been retrieved.

Path: /_search/scroll
Request Data:
{
  "scroll": "1m",
  "scroll_id": "Scroll id" // received in the previous request
}
Returns:
{
  "_scroll_id": "Scroll Id",
  "hits": {
    "total": 263,
    "max_score": 0.11207403,
    "hits": [ ]
  }
}
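The request loop above could be driven by a small client-side helper. The following is a minimal sketch, not a definitive implementation: scroll_all, search_path and the send callable are hypothetical names, and send(method, path, body) is assumed to perform the HTTP call and return the parsed JSON response.

```python
def scroll_all(send, search_path, query, size=1000, keep_alive="1m"):
    """Yield every hit by following the scroll cursor until a page is empty.

    `send(method, path, body)` is a hypothetical transport callable; plug in
    whatever HTTP client the service already uses.
    """
    # First request opens the scroll context and returns the first page.
    page = send("POST", f"{search_path}?scroll={keep_alive}",
                {"query": query, "size": size})
    while True:
        hits = page["hits"]["hits"]
        if not hits:
            break  # an empty page means all results have been retrieved
        yield from hits
        # Follow-up requests pass the scroll id received in the previous response.
        page = send("POST", "/_search/scroll",
                    {"scroll": keep_alive, "scroll_id": page["_scroll_id"]})
```

Because the helper is a generator, the caller can write hits to the report file page by page instead of holding the whole result set in memory.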
Approach 1:

The report cannot be produced instantly, so it should be generated by an asynchronous process whose process id is tracked. The process generates a file, uploads it to some storage, and shares the link with the user by email. A later request for the same date range may reuse that file: for example, if a user requests stats for a course batch, a report for that batch has already been generated, and its validity period has not expired, we can reuse it instead of regenerating it.


Pros
  1. No extra effort is required to handle the request.

Cons
  1. Duplicate requests will be processed multiple times.
  2. The same calculations will be done multiple times.
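The flow described for Approach 1 (asynchronous generation, a trackable process id, and reuse of a still-valid report) might look roughly like the sketch below. All names here (ReportService, request_report, validity_seconds, the generate callable) are hypothetical; generate(batch_id) is assumed to build the file, upload it and return its link, and the email step is left out.

```python
import time
import uuid
from concurrent.futures import ThreadPoolExecutor


class ReportService:
    """Runs report generation in the background and reuses valid reports."""

    def __init__(self, generate, validity_seconds=3600, clock=time.time):
        self._generate = generate            # hypothetical: batch_id -> file link
        self._validity = validity_seconds
        self._clock = clock
        self._pool = ThreadPoolExecutor(max_workers=2)
        self._processes = {}                 # process id -> Future
        self._reports = {}                   # batch id -> (link, created_at)

    def request_report(self, batch_id):
        """Start (or reuse) a report and return a process id for tracking."""
        process_id = str(uuid.uuid4())
        self._processes[process_id] = self._pool.submit(self._get_or_build, batch_id)
        return process_id

    def _get_or_build(self, batch_id):
        cached = self._reports.get(batch_id)
        if cached and self._clock() - cached[1] < self._validity:
            return cached[0]                 # report still valid: reuse it
        # Note: two overlapping requests for one batch may both reach this point.
        link = self._generate(batch_id)
        self._reports[batch_id] = (link, self._clock())
        return link

    def result(self, process_id, timeout=None):
        """Block until the tracked process finishes and return the file link."""
        return self._processes[process_id].result(timeout=timeout)
```

In this sketch the reuse check is keyed by batch id only; a real service would probably include the date range in the key as well.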


Approach 2:

Start the service at a specified time of day.

We can queue all the requests that come in for batch stats. Our scheduler will check the queue and process the requests one by one.

Pros
  1. We can avoid duplicate requests.

...

Cons
  1. We have to maintain a queue to handle all the requests.
  2. We have to set up and run a scheduler.
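A rough sketch of the queue in Approach 2 is given below, assuming requests are keyed by batch id so that duplicates collapse into a single job. BatchStatsQueue, enqueue, drain and the process callable are hypothetical names; the scheduler itself (cron or similar) is assumed to call drain at the configured time of day.

```python
from collections import OrderedDict


class BatchStatsQueue:
    """In-memory queue that merges duplicate requests for the same batch."""

    def __init__(self):
        self._pending = OrderedDict()   # batch id -> list of requester emails

    def enqueue(self, batch_id, email):
        """Add a request; a repeated batch id only adds a new recipient."""
        recipients = self._pending.setdefault(batch_id, [])
        if email not in recipients:
            recipients.append(email)

    def drain(self, process):
        """Process queued batches one by one, in arrival order.

        `process(batch_id, emails)` is a hypothetical callable that generates
        the report once and notifies every requester.
        """
        handled = 0
        while self._pending:
            batch_id, emails = self._pending.popitem(last=False)
            process(batch_id, emails)
            handled += 1
        return handled
```

With this shape, three requests for the same batch before the scheduled run result in one report generation and three notifications.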