Request traceability across multiple sub-systems
Introduction:
This document describes how to trace/map a request across the multiple sub-systems involved in serving it.
Background:
Jira Issue https://project-sunbird.atlassian.net/browse/SB-17028
Problem Statement:
How to analyze the time spent serving a request?
How to identify the sub-system in which a request is failing within the workflow?
Key design problems:
How to associate/relate the request across multiple sub-systems?
Workflow of QR-Code scan:
Solution 1:
Use existing telemetry events
{
...
"cdata": [
...
{
"type": "Trace",
"id": "2d4fd89a-5ed9-44ed-8efd-9fabedbe0f03" //UUID
}
],
...
}
or
{
...
"eid": "IMPRESSION",
"traceid": "2d4fd89a-5ed9-44ed-8efd-9fabedbe0f03", //UUID
...
}
Pros:
Easy to implement, because each sub-system already logs telemetry events
With minimal effort, we achieve visualization in Druid/Superset
Cons:
Timestamps must be synchronized across all sub-systems (epoch time)
Finding the origin/root module of a trace depends on timestamps alone; any mismatch results in data loss
Can't send additional information required for the request trace
Traceability is lost if any sub-system fails to send the trace unique id
Code changes may impact the existing traceability
Workflows must be defined to determine where a trace ends
Solution 2:
Define a new telemetry event for trace information (similar to the LOG telemetry event), so that along with the traceId we can log additional information to track the request.
The trace object should contain the following (the span object is modeled on the OpenTracing spec):
"edata": {
  "id": {UUID}, // traceId, common for all sub-systems (tracks the entire workflow, front-end to back-end)
  "name": "qr-scan", // service/module name to identify
  "span": {
    "traceID": {UUID},
    "operationName": {string}, // ex: "qr-scan" for the QR scan workflow; use a unique string for each workflow/span
    "spanID": {workflowId}, // unique id for this workflow; remains the same across multiple scans
    "parentSpanId": {span.id}, // optional: tracks the parent of the request and helps build the tree structure of the trace
    "context": [{ "type": "", "key": "", "value": "" }], // context of the workflow, ex: "dialcode": "4XJG3F" when the workflow starts on scanning this dial code
    "tags": [{ "key": "", "type": "", "value": "" }] // https://github.com/opentracing/specification/blob/master/semantic_conventions.md#span-tags-table
  }
}
OpenTracing API "Span" spec:
Use the OpenTracing API to generate Trace & Span objects. Zipkin & Jaeger are built on top of these APIs.
Sample Data to Analyze: https://docs.google.com/spreadsheets/d/1v96dZ0m21OVHswfZU84frOcymYd1kVqD7SqYLCyguFE/edit?usp=sharing
Front-end implementation:
Use Javascript NPM module for front-end:
https://github.com/opentracing/opentracing-javascript
A singleton service/module that holds only one trace workflow at any time. Angular components use this service to start and end the trace by passing an action/workflow name. A trace can contain multiple span objects (an enhancement over the single-span implementation below).
When the user scans/searches with a dial code, call TraceService.startTrace('qr-scan') from the respective Angular component. This starts a new trace object.
After getting the response from the API, call the service's end method to end the trace.
Use the decorator design pattern to add trace information to outgoing requests.
Back-end implementation:
Refer to the OpenTracing back-end implementation best practices: https://opentracing.io/docs/best-practices/
When any API is called from the front-end, pass the details below in the request headers.
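Propagation could be done by copying the active trace's identifiers into the outgoing request headers, e.g. from an HTTP interceptor. The header names below (`X-Trace-Id`, `X-Span-Id`) are illustrative assumptions, not a Sunbird convention:

```javascript
// Hypothetical request decorator: copies the active trace's identifiers
// into HTTP headers so the back-end can continue the same trace.
// Header names are illustrative, not a fixed convention.
function withTraceHeaders(headers, trace) {
  if (!trace) return headers; // no active trace: pass headers through unchanged
  return {
    ...headers,
    'X-Trace-Id': trace.traceId,
    'X-Span-Id': trace.spanId,
  };
}

const headers = withTraceHeaders(
  { 'Content-Type': 'application/json' },
  { traceId: 'abc-123', spanId: 'def-456' }
);
```

With an OpenTracing client, the equivalent step would be `tracer.inject(span.context(), FORMAT_HTTP_HEADERS, headers)`.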
OpenTracing API implementation in Java:
https://github.com/opentracing/opentracing-java
Tracing Server Endpoints
When a server wants to trace execution of a request, it generally needs to go through these steps:
Attempt to extract a SpanContext that’s been propagated alongside the incoming request (in case the trace has already been started by the client), or start a new trace if no such propagated SpanContext could be found.
Store the newly created Span in some request context that is propagated throughout the application, either by application code, or by the RPC framework.
Finally, close the Span using span.finish() when the server has finished processing the request.
Extracting a SpanContext from an Incoming Request
Let’s assume that we have an HTTP server, and the SpanContext is propagated from the client via HTTP headers, accessible via request.headers:
Here we use the headers map as the carrier. The Tracer object knows which headers it needs to read in order to reconstruct the tracer state and any Baggage.
Pros:
The time taken in each sub-module can easily be found based on the spanId
The time spent in each sub-system is available directly in the event itself
Sub-systems can add multiple spans
Cons:
More effort is required compared to Solution 1.
Multiple telemetry events will be logged for the same workflow (e.g. impression & trace)
Solution 3:
Integrate with external distributed tracing tools such as Zipkin, Jaeger, etc.
JAEGER:
Architecture diagram:
Integration with spring application: Video- https://www.youtube.com/watch?v=hpnLUFRY4_Y
Integration with NodeJs application: https://blog.risingstack.com/distributed-tracing-opentracing-node-js/
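For a Node.js service, the Jaeger integration is mostly a tracer-initialization config. A sketch using the jaeger-client package is shown below; the service name, sampler, and agent location are illustrative values, not project settings:

```javascript
// Sketch of initializing a Jaeger tracer in a Node.js service using the
// jaeger-client package. All configuration values are illustrative.
const { initTracer } = require('jaeger-client');

const config = {
  serviceName: 'content-service',       // assumed service name
  sampler: { type: 'const', param: 1 }, // sample every request (dev-only setting)
  reporter: {
    logSpans: true,
    agentHost: 'localhost',             // assumed Jaeger agent location
    agentPort: 6832,
  },
};

const tracer = initTracer(config, {});
```

The returned tracer implements the OpenTracing API, so the extract/inject code from the sections above works unchanged against it.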
Pros:
Simple configuration/code changes handle the request traceability.
The Jaeger UI can be used to visualize the trace events.
Cons:
Workflows can’t be traced; each API call is considered a separate trace.
Connecting multiple applications/servers to a single Jaeger instance to trace an API workflow still has to be explored.
Reference links:
Opentracing specifications:
https://github.com/opentracing/specification/blob/master/specification.md
Span: Opentracing definition
Open tracing Javascript implementation sample
https://github.com/opentracing/opentracing-javascript
Best practices: https://opentracing.io/docs/best-practices/
Jaeger related links
Jaeger UI setup: https://www.scalyr.com/blog/jaeger-tracing-tutorial/
Jaeger Client node: https://github.com/jaegertracing/jaeger-client-node