Introduction:
This document describes how to trace a request across the multiple sub-systems involved in serving and displaying the requested details.
Background:
Jira Issue https://project-sunbird.atlassian.net/browse/SB-17028
Problem Statement:
How to analyze the time spent to serve the request?
How to identify the sub-system in which a request fails within the workflow?
Key design problems:
How to associate/relate the request across multiple sub-systems?
Workflow of QR-Code scan:
...
Solution 1:
Use existing telemetry events
...
Code Block |
---|
{ ... "eid": "IMPRESSION", "traceid": "2d4fd89a-5ed9-44ed-8efd-9fabedbe0f03" // UUID ... } |
Pros:
Easy to implement, because each sub-system is already logging the telemetry events
With minimal efforts, we achieve visualization in Druid/Superset
Cons:
All sub-systems must report timestamps consistently (epoch time).
Identifying the origin/root module of a trace depends on the timestamp alone; any timestamp mismatch loses that information.
Additional information required for the request trace cannot be sent.
Traceability may be lost if any sub-system fails to send the unique trace ID.
Code changes may affect the existing traceability.
Workflows need to be defined, including where each trace ends.
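To illustrate why Solution 1 depends so heavily on timestamps, here is a minimal TypeScript sketch of how events sharing a traceid might be analyzed. The `ets` (epoch time in milliseconds) and `subsystem` fields are assumptions made for this example; only `eid` and `traceid` appear in the event shown above.

```typescript
// Hypothetical, simplified shape of a telemetry event carrying a trace id.
interface TracedEvent {
  eid: string;       // event id, e.g. "IMPRESSION"
  traceid: string;   // UUID shared by all events of one request
  ets: number;       // assumed field: epoch time in milliseconds
  subsystem: string; // assumed field: sub-system that logged the event
}

interface TraceSummary {
  traceid: string;
  root: string;    // sub-system with the earliest timestamp
  path: string[];  // sub-systems in timestamp order
  totalMs: number; // last timestamp minus first timestamp
}

// Group events by traceid, then order each group by timestamp.
// The root and the total time are derived from timestamps only,
// so a skewed clock in one sub-system changes the result.
function summarizeTraces(events: TracedEvent[]): TraceSummary[] {
  const groups: Record<string, TracedEvent[]> = {};
  for (const e of events) {
    const group = groups[e.traceid] || (groups[e.traceid] = []);
    group.push(e);
  }
  return Object.keys(groups).map((traceid) => {
    const ordered = groups[traceid].slice().sort((a, b) => a.ets - b.ets);
    const first = ordered[0];
    const last = ordered[ordered.length - 1];
    return {
      traceid,
      root: first.subsystem,
      path: ordered.map((e) => e.subsystem),
      totalMs: last.ets - first.ets,
    };
  });
}
```

Because the ordering comes only from `ets`, an event with a wrong timestamp would silently be reported as the root or move within the path, which is the risk listed in the cons above.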
Solution 2:
Define a new telemetry event for trace information (similar to the LOG telemetry event), so that additional information needed to track the request can be logged along with the traceId.
...
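As a rough illustration of the trace information such an event could carry, the TypeScript sketch below builds an object with the `traceId`, `name`, and `context` fields used in the back-end section of this document. The helper names `newTraceId` and `startTrace`, and the mapping of a key/value object onto `context` entries, are assumptions for this example and not part of any existing telemetry SDK.

```typescript
// One entry of additional information attached to the trace.
interface TraceContextEntry {
  type: string;
  value: string;
}

interface TraceInfo {
  traceId: string;              // UUID shared across sub-systems
  name: string;                 // workflow being traced, e.g. "qr-code"
  context: TraceContextEntry[]; // additional information for the trace
}

// Illustrative version-4-style UUID generator based on Math.random.
// Not cryptographically strong; a real implementation would likely
// rely on a platform or library UUID function instead.
function newTraceId(): string {
  return "xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx".replace(/[xy]/g, (c) => {
    const r = Math.floor(Math.random() * 16);
    const v = c === "x" ? r : (r & 0x3) | 0x8;
    return v.toString(16);
  });
}

// Hypothetical helper: starts a trace for a workflow. Later sub-systems
// would reuse the same traceId instead of creating a new one.
function startTrace(
  name: string,
  extra: Record<string, string> = {}
): TraceInfo {
  return {
    traceId: newTraceId(),
    name,
    context: Object.keys(extra).map((type) => ({ type, value: extra[type] })),
  };
}
```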
Sample Data to Analyze: https://docs.google.com/spreadsheets/d/1v96dZ0m21OVHswfZU84frOcymYd1kVqD7SqYLCyguFE/edit?usp=sharing
Front-end implementation:
Use a JavaScript NPM module for the front-end:
...
Use the decorator design pattern to add trace information to the requests. Call the
...
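One way the decorator idea could look is sketched below in TypeScript. The `HttpRequest` shape, the `withTrace` name, and the `X-Trace-Id` header are hypothetical; the actual header name would have to be agreed across sub-systems.

```typescript
// Hypothetical request shape and sender signature, for illustration only.
interface HttpRequest {
  url: string;
  headers: Record<string, string>;
}
type Send<T> = (req: HttpRequest) => T;

// Assumed header name used to carry the trace id.
const TRACE_HEADER = "X-Trace-Id";

// Decorator: wraps any sender and returns one with the same signature
// that adds the trace id header before delegating to the original.
function withTrace<T>(send: Send<T>, traceId: string): Send<T> {
  return (req) => {
    // Copy the headers so the caller's request object is not mutated.
    const headers = { ...req.headers };
    // Keep an existing trace id so an already started trace continues.
    if (!(TRACE_HEADER in headers)) {
      headers[TRACE_HEADER] = traceId;
    }
    return send({ ...req, headers });
  };
}
```

Since the decorated function keeps the signature of the original sender, existing call sites would not need to change; only the place where the sender is created would wrap it with `withTrace`.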
Back-end implementation:
Refer to the OpenTracing best practices for back-end implementation: https://opentracing.io/docs/best-practices/
...
Code Block |
---|
"traceId": { UUID }, "name": "qr-code", "context": [{ "type": "key", "value": "" }] // "name" is the name of the workflow to trace, here "qr-code" for a QR code scan |
Tracing Server Endpoints
When a server wants to trace execution of a request, it generally needs to go through these steps:
Attempt to extract a SpanContext that’s been propagated alongside the incoming request (in case the trace has already been started by the client), or start a new trace if no such propagated SpanContext could be found.
Store the newly created Span in some request context that is propagated throughout the application, either by application code, or by the RPC framework.
Finally, close the Span using span.finish() when the server has finished processing the request.
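The three steps above can be sketched in TypeScript as follows. This is a hand-rolled illustration that does not use the OpenTracing library; `Span`, `RequestContext`, `traceRequest`, and `nextId` are hypothetical names, and the handler is assumed to be synchronous.

```typescript
interface SpanContext {
  traceId: string;
  spanId: string;
}

// Minimal span that records its own duration when finished.
class Span {
  readonly startMs: number;
  durationMs: number | null = null;

  constructor(
    readonly name: string,
    readonly context: SpanContext,
    readonly parentSpanId: string | null,
    private readonly now: () => number
  ) {
    this.startMs = now();
  }

  finish(): void {
    // Only the first call records the duration.
    if (this.durationMs === null) {
      this.durationMs = this.now() - this.startMs;
    }
  }
}

// Request-scoped object that application code passes along.
interface RequestContext {
  span: Span;
}

// Hypothetical id source; a real system would use UUIDs or similar.
let idCounter = 0;
function nextId(): string {
  idCounter += 1;
  return idCounter.toString(16);
}

function traceRequest<T>(
  name: string,
  propagated: SpanContext | null,
  handler: (ctx: RequestContext) => T,
  now: () => number = Date.now
): T {
  // Step 1: continue the propagated trace, or start a new one.
  const span = new Span(
    name,
    { traceId: propagated ? propagated.traceId : nextId(), spanId: nextId() },
    propagated ? propagated.spanId : null,
    now
  );
  // Step 2: store the span in the request context.
  const ctx: RequestContext = { span };
  try {
    return handler(ctx);
  } finally {
    // Step 3: finish the span, even if the handler throws.
    span.finish();
  }
}
```

For an asynchronous handler, the span would need to be finished only after the returned promise settles, which this sketch does not cover.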
Extracting a SpanContext from an Incoming Request
Let’s assume that we have an HTTP server, and the SpanContext is propagated from the client via HTTP headers, accessible via request.headers:
...
Here we use the headers map as the carrier. The Tracer object knows which headers it needs to read in order to reconstruct the tracer state and any Baggage.
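To make the carrier idea concrete, the following TypeScript sketch shows what an extract step might do internally. It is not the OpenTracing or Jaeger implementation; the header names `x-trace-id`, `x-span-id`, and the `x-baggage-` prefix are assumptions chosen for this example.

```typescript
interface SpanContextWithBaggage {
  traceId: string;
  spanId: string;
  baggage: Record<string, string>;
}

// Assumed header names; a real tracer defines its own.
const TRACE_ID_HEADER = "x-trace-id";
const SPAN_ID_HEADER = "x-span-id";
const BAGGAGE_PREFIX = "x-baggage-";

// Reads the propagated context from a headers map (the carrier).
// Returns null when no complete context was propagated, in which
// case the server would start a new trace.
function extractSpanContext(
  headers: Record<string, string | undefined>
): SpanContextWithBaggage | null {
  // HTTP header names are case-insensitive, so normalize them first.
  const lower: Record<string, string> = {};
  for (const key of Object.keys(headers)) {
    const value = headers[key];
    if (value !== undefined) {
      lower[key.toLowerCase()] = value;
    }
  }

  const traceId = lower[TRACE_ID_HEADER];
  const spanId = lower[SPAN_ID_HEADER];
  if (!traceId || !spanId) {
    return null;
  }

  // Collect baggage items, stripping the prefix from each key.
  const baggage: Record<string, string> = {};
  for (const key of Object.keys(lower)) {
    if (key.indexOf(BAGGAGE_PREFIX) === 0) {
      baggage[key.slice(BAGGAGE_PREFIX.length)] = lower[key];
    }
  }
  return { traceId, spanId, baggage };
}
```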
Pros:
The time taken in each sub-module can be found easily from the SpanId.
The time spent in each sub-system is available directly in the event itself.
Multiple spans can be added within a sub-system.
Cons:
More effort is required than for Solution 1.
Multiple telemetry events are logged for the same workflow (for example, impression and trace).
Solution 3:
Integrate with an external distributed tracing tool such as Zipkin or Jaeger.
JAEGER:
Architecture diagram:
...
Integration with a Spring application (video): https://www.youtube.com/watch?v=hpnLUFRY4_Y
Integration with a Node.js application: https://blog.risingstack.com/distributed-tracing-opentracing-node-js/
Pros:
Simple configuration and code changes are enough to enable request traceability.
The Jaeger UI can be used to visualize the trace events.
Cons:
Workflows cannot be traced as a whole; each API call is treated as a separate trace.
Connecting multiple applications/servers to a single Jaeger instance to trace the workflow of an API still needs to be explored.
Reference links:
OpenTracing specification:
https://github.com/opentracing/specification/blob/master/specification.md
Span: OpenTracing definition
OpenTracing JavaScript implementation sample:
https://github.com/opentracing/opentracing-javascript
Best practices:
...
Jaeger Client node: https://github.com/jaegertracing/jaeger-client-node
...