Experiments @Scale
Introduction
As we move to the next phase of Sunbird development, we need to roll out changes and new features efficiently and in a controlled manner. We can leverage the existing user base to gather feedback, monitor usage, and take decisions based on the usage data.
Let us assume we are spending $100 to build a feature that will affect thousands of existing users.
To decide which feature or button placement works better for the users, we need to take the new design to them and gather their feedback. Asking existing users whether they like a new design over the old one is very subjective; it requires a lot of energy and resources to get quality feedback. It also takes a long time and risks making an existing stable feature unstable.
A better way to do the same exercise is to choose an unbiased but relevant set of users of the same feature and create two groups of these users for observation.
One group (the control group) will be shown the old design and the other group (the test group) will be shown the new design.
Telemetry data is collected for both groups.
Based on the telemetry analysis, user behaviour should tell which design/feature was used by what percentage of the users.
It also ensures that a stable branch of the feature remains stable and changes for the experiments are kept separate from the stable branch.
It should answer:
- Does the new feature/design improve the feature metrics?
- Which persona is benefiting from the new design?
- Is the hypothesis holding good?
- Does the new design improve the usage metrics of the feature, or is there a drop in the usage metrics relative to the control group?
JTBD
- Jobs To Be Done:
- PMs want to roll out new features and changes to a set of users/devices selected based on criteria.
- PMs want to measure success based on the telemetry data.
- PMs want to know what experiments are planned, active and running now.
- PMs want to know the users/devices under an experiment group and control group.
- PMs want to know the users/devices selected for/running more than one experiment.
- PMs want to be able to make users exit (force pull-out of) an experiment.
- Users should be agnostic of the experiment.
- Users should be able to opt out of an experiment.
- User Personas: PM and users.
- System or Environment: Portal and Mobile (or both)
Requirement Specifications
Following are the major stories/features required for experiments.
- Experiment definition
- Lifecycle and rollout of an experiment
- Separate code experiment (Mobile) SC-1075
- Separate code experiment (Portal)
- Rollback of the experiment
- Process and deployment of an experiment at scale
- Separate deployment of an experiment
- Mobile code push for an experiment: SB-13158, SC-1076
- Mobile code push via notification from the server
- Portal changes for an experiment
- API changes for existing services
- Offline Desktop experimentation
- Telemetry changes
- Telemetry to capture the start, end and expiry of the experiment.
- Telemetry to capture user opt-out (self), forced removal (server), and move-out on expiry of the experiment.
- Telemetry change in Mobile (ad-hoc data analysis) SC-1104
- Telemetry change in the portal (ad-hoc data analysis)
- Analytics, data product and dashboard
- User selection for an experiment
- Manual selection of device/user for experiment SC-1098
- Random selection of users in 2.2.0 -- SC-1074
- Criteria-based selection of devices/users for experiments
- Switch user to Original path (out of experiment)
- Opt-out from client
- Opt-out from the backend (force pull out) SC-1106
- Auto opt-out from the client on expiry of experiment
- Governance changes (schema)
- Module-based experiment to allow multiple experiments per user or device
Experiment definition Overview
This story captures the experiment object that needs to be stored in the system.
An experiment is an alternative to an existing feature, workflow or design; user actions/events on it are stored as telemetry to process later.
The experiment needs to be defined so as to capture all the details required to build, deploy and analyse it later.
Experiment definition details
- SC-1103
This story covers the experiment definition.
An experiment should be an alternative to the existing feature/workflow in the production system.
An experiment should have the following attributes:
- Identifier: a unique ID to recognise the experiment
- Channel: Mobile | Desktop | Portal
- Tag: autogenerated
Two groups should be created: one experiment group (modified path) and one control group (original path).
Both groups should have the same percentage (or number) of users, where each group sees the modified vs the original workflow/feature.
The control group is especially important for mobile (and the offline desktop app), as there is extra download size involved for both groups.
The experiment group will get the new code/feature downloaded from the server.
The control group will download the same size of code (the modified path) but not use it. This is especially required for feature/UX experiments on the mobile app, where the extra download can impact usage (internet bandwidth, extra disk space on the mobile).
Experiment definition
- Experiment definition
- Define the experiment.
- Define the modified design for Mobile/Portal
- Define the tags for telemetry
- Define the user selection criteria
- Define start and end date
- Define the metrics to be captured.
- Define whether this experiment can run in parallel or along with another experiment (multivariant); this allows small experiments to be executed simultaneously for a user or device.
- An experiment run should have (see the sketch after this list):
- Dates: start, expiry/end dates.
- User/device selection criteria.
- User-to-experiment mapping:
- User/device ID.
- Experiment ID.
- Experiment joined date.
- Experiment exit date.
- Along with it, record any tech object/attribute, experiment feature URL, etc.
- The system should allow change/modification of experiments:
- Change the user selection criteria.
- Version the experiment with each change to it.
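A minimal sketch of what the definition and mapping objects could look like follows. All names and types here (`ExperimentDefinition`, `ExperimentMapping`, the individual fields) are illustrative assumptions, not a finalised schema:

```typescript
// A minimal sketch of the experiment definition described above.
// Names and types are illustrative assumptions, not a finalised schema.
interface ExperimentDefinition {
  id: string;                                 // unique identifier of the experiment
  channel: 'Mobile' | 'Desktop' | 'Portal';
  tag: string;                                // autogenerated tag carried in telemetry
  startDate: string;                          // ISO date on which the experiment starts
  endDate: string;                            // ISO date on which the experiment expires
  selectionCriteria: Record<string, unknown>; // stored query over user/device attributes
  metrics: string[];                          // metrics to be captured for analysis
  allowParallel: boolean;                     // can it run alongside others (multivariant)?
  featureUrl?: string;                        // experiment feature URL or other tech attributes
  version: number;                            // incremented on every modification
}

// Mapping of a user/device to an experiment run.
interface ExperimentMapping {
  userOrDeviceId: string;
  experimentId: string;
  group: 'experiment' | 'control';
  joinedDate: string;
  exitDate?: string; // set on opt-out, forced removal, or expiry
}
```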
Deployment Lifecycle and rollout of an experiment Overview
This story details the lifecycle of an experiment. To roll out an experiment to users, it needs to go through the following steps from start to finish.
- Experiment code branch for mobile
- There should be a separate code branch for mobile experiments. This is a code hygiene expectation: experiment code and production code are kept separate.
- No change in the experiment branch should adversely affect the Production branch.
- No extra testing effort on the Production branch.
- Deployment of the experiment branch is separate from the main branch.
- Experiment code branch for portal
- There should be a separate code branch for portal experiments. This is a code hygiene expectation: experiment code and production code are kept separate.
- No change in the experiment branch should adversely affect the Production branch.
- No extra testing effort on the Production branch.
- Deployment of the experiment branch is separate from the main branch.
- Rollback of the experiment
- If a flaky experiment is deployed, rolling it back should also undo the data changes and migrate the users back to the original path.
- Roll over the experiment with a new code base: the user remains in the experiment and sees the new experiment interface.
- Process and deployment of an experiment at scale
- Define a process with a playbook and scripts to run an experiment. This should be used to define, create and deploy an experiment, and the playbook should become a guideline for executing further experiments.
- Merge or discard the experiment branch.
- After the experiment is executed, based on the metrics the code should either be merged into the production branch or discarded.
Deployment of Experiments
Experiments should be built separately from the production codebases, as feature branches, and deployed separately. A hypothetical sketch of the server-triggered code push follows the list below.
- Mobile code push for the experiment: SB-13158, SC-1076
- Mobile code push via notification from the server
- Portal changes for the experiment
- API changes for existing services
- Offline Desktop experimentation
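As a sketch of the code push via server notification, the mobile client might handle a data message carrying experiment metadata and download the experiment bundle in the background. The payload shape and the helpers (`logTelemetry`, `saveBundleLocally`, `saveExpiry`) are hypothetical, not the actual Sunbird implementation:

```typescript
// Stubs for hypothetical client services used in the sketch below.
declare function logTelemetry(event: string, data: Record<string, string>): void;
declare function saveBundleLocally(experimentId: string, bytes: ArrayBuffer): Promise<void>;
declare function saveExpiry(experimentId: string, expiryDate: string): void;

// Hypothetical shape of the data message the server sends to trigger a code push.
interface CodePushNotification {
  experimentId: string;
  bundleUrl: string;  // where the experiment bundle can be downloaded
  expiryDate: string; // pushed along so the client can switch back on expiry
}

// Download the bundle in the background and record start/end telemetry,
// as the telemetry requirements below ask for.
async function onCodePushNotification(msg: CodePushNotification): Promise<void> {
  logTelemetry('experiment-code-download-start', { experimentId: msg.experimentId });
  const response = await fetch(msg.bundleUrl);  // background download
  await saveBundleLocally(msg.experimentId, await response.arrayBuffer());
  saveExpiry(msg.experimentId, msg.expiryDate); // used later for auto opt-out
  logTelemetry('experiment-code-download-end', { experimentId: msg.experimentId });
}
```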
Telemetry changes
This story is about the need to have the experiment info in the telemetry.
Being agnostic to the experiment or changed workflow, the user will perform the intended task as usual. The user actions will be recorded as telemetry and synced to the telemetry server.
Data products should be able to include these events, and also to provide a separate analysis of the users/devices in the experiment.
Product managers and the business should be able to look at the metrics from both groups and take a decision based on them.
- Telemetry change in Mobile (ad-hoc data analysis) SC-1104
- Telemetry change in the portal (ad-hoc data analysis)
- Analytics, data product and dashboard
User selection for an experiment
When a user/device (registered or anonymous) visits Sunbird, selection logic either adds the user to the experiment group or puts them in the control group (the original path); by default, all users are in the control group. User selection is based on various attributes of the user profile, the device, or a combination of both. The user selection criteria or query can be stored for later reference. A sketch of the selection logic follows the list below.
In the case of multiple (multivariant) experiments running for a user or device, the selection logic should take this into account and store this value.
- Manual selection of device/user for experiment SC-1098
- Random selection of users in 2.2.0 -- SC-1074
- Criteria-based selection of devices/users for experiments
- Ageing of the device should be updated daily.
- This check should be executed to initiate/trigger the experiment for the user/device.
- The same check should be executed daily for exit.
- At expiry, the user should move back to the original path.
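A minimal sketch of the selection step, assuming a deterministic hash-based split so that a given user/device always lands in the same group; the names (`assignGroup`, `trafficFraction`) are illustrative, and the actual selection may be manual, random or criteria-based as listed above:

```typescript
import { createHash } from 'crypto';

// Deterministically assign a user/device to the experiment or control group.
function assignGroup(
  userOrDeviceId: string,
  experimentId: string,
  trafficFraction: number // e.g. 0.5 puts half the eligible users in the experiment
): 'experiment' | 'control' {
  // Hash the (user, experiment) pair so the assignment is stable across
  // sessions and independent across experiments.
  const digest = createHash('sha256')
    .update(`${userOrDeviceId}:${experimentId}`)
    .digest();
  const bucket = digest.readUInt32BE(0) / 0xffffffff; // uniform value in [0, 1]
  // By default everyone stays in the control group; only the sampled
  // fraction that also matches the criteria moves to the experiment group.
  return bucket < trafficFraction ? 'experiment' : 'control';
}
```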
Switch user to Original path (out of experiment)
This story details the scenarios and requirements for opting out of an experiment.
- Opt-out from client
- A user should be able to opt in to experiments which are planned and/or active, by providing consent via a setting in the user profile/device.
- A user should be able to see the experiments running for them, and should be able to leave a planned or an active (running) experiment.
- Opt-out from the backend (force pull out) SC-1106
- Server-side, we should have the ability to move a user out of an experiment:
- On request of the user.
- The user is reporting issues with the changed workflow.
- We have enough users enrolled in the experiment.
- Enough data is captured for the experiment.
- The PM wants to close the experiment early.
- The experiment is flaky (showing errors, unexpected behaviour) and needs to be closed early.
- Auto opt-out from the client on expiry of experiment
- When the experiment duration has reached the expiry date, users/devices should switch back to the control group/original path.
- For offline devices and the offline desktop app, this needs to happen automatically from the client side. All the telemetry related to the experiment should be sent to the server.
- An event capturing that the user moved back to the original path should be recorded as telemetry and sent to the server, as sketched below.
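A minimal client-side sketch of the auto opt-out on expiry, assuming the expiry date was stored when the experiment code was pushed; the helper names (`getStoredExpiry`, `switchToOriginalPath`, `syncTelemetryToServer`) are hypothetical:

```typescript
// Stubs for hypothetical client services used in the sketch below.
declare function logTelemetry(event: string, data: Record<string, string>): void;
declare function getStoredExpiry(experimentId: string): string | undefined;
declare function switchToOriginalPath(experimentId: string): void;
declare function syncTelemetryToServer(): Promise<void>;

// Run on app start (and periodically), so offline devices can exit on their own.
async function checkExperimentExpiry(activeExperimentIds: string[]): Promise<void> {
  const now = Date.now();
  for (const id of activeExperimentIds) {
    const expiry = getStoredExpiry(id);
    if (expiry !== undefined && Date.parse(expiry) <= now) {
      switchToOriginalPath(id); // the user/device moves back to the original path
      logTelemetry('experiment-exit', { experimentId: id, trigger: 'expiry' });
    }
  }
  await syncTelemetryToServer(); // flush experiment telemetry once connectivity allows
}
```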
Governance changes (schema) - Tech item
Module-based experiments to allow multiple experiments per user or device (see the sketch below).
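A minimal sketch of how module-based scoping could allow multiple experiments per user or device, assuming at most one experiment per module; the names here are illustrative, not a schema decision:

```typescript
// Experiments are scoped to named modules so a user/device can be in several
// experiments at once, as long as the modules differ. Names are illustrative.
type ModuleName = string;

interface ActiveExperiments {
  byModule: Map<ModuleName, string>; // module -> experimentId (at most one per module)
}

// A user/device can join a new experiment only if its module is not already
// occupied by another running experiment.
function canJoin(active: ActiveExperiments, moduleName: ModuleName): boolean {
  return !active.byModule.has(moduleName);
}
```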
<Use Case 1 - User Story 1> Overview
<Main Scenario>
Srl. No. | User Action | Expected Result |
---|---|---|
<Alternate Scenario 1>
Srl. No. | User Action | Expected Result |
---|---|---|
Exception Scenarios
Srl. No. | Error / Exception | Expected Handling |
---|---|---|
Wireframes
For Future Release
JIRA Ticket ID
<Use Case 1 - User Story 2> Overview
Localization Requirements
UI Element | Description | Language(s)/ Locales Required |
---|---|---|
Telemetry Requirements
Telemetry should carry info about the experiments active on the client side. A sample event sketch follows the table.
Event Name | Description | Purpose |
---|---|---|
Telemetry for user selection into the experiment | This telemetry should capture that the experiment was added to a user/device. | |
Telemetry for the download of the code | Every experiment will need an extra code push to the client; this telemetry event should capture the start and end of the code download to the client. The push should also carry the expiry date to the client, so that the client can switch back to the original path. | |
Telemetry for the start of the experiment | This telemetry should record the start of the experiment with experiment info: active experiments, start timestamps. | The purpose is to capture at what time the experiment starts on the client. In the case of the Portal it would be immediate, but on the mobile or offline desktop app it will be delayed. |
Telemetry for the end of the experiment | This telemetry should capture the details when an experiment has reached its end on the client and the user/device has moved back to the original path. It should also capture how the user's exit from the experiment was triggered: client opt-out, forced from the server, or expiry of the experiment. | This is required to know whether users are facing issues: how many users are opting out, how many are forced out, and how many have completed the experiment. |
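As an illustration of how the experiment context and exit trigger might be carried in such an event, a hypothetical payload (field names are assumptions, not the Sunbird telemetry spec):

```typescript
// Hypothetical telemetry event for an experiment exit; field names are
// assumptions for illustration, not the actual Sunbird telemetry spec.
interface ExperimentExitEvent {
  eid: string;                              // event id, e.g. 'EXPERIMENT_EXIT'
  ets: number;                              // event timestamp (epoch ms)
  experimentId: string;
  group: 'experiment' | 'control';
  trigger: 'opt-out' | 'forced' | 'expiry'; // how the exit was initiated
}

const sampleEvent: ExperimentExitEvent = {
  eid: 'EXPERIMENT_EXIT',
  ets: Date.now(),
  experimentId: 'exp-001',
  group: 'experiment',
  trigger: 'expiry',
};
```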
Non-Functional Requirements
Performance / Responsiveness Requirements | Load/Volume Requirements | Security / Privacy Requirements |
---|---|---|
All users/devices should be checked for selection into an experiment. | Potentially every user/device from the Portal or mobile app can be in either an experiment or the control group, so the selection check will happen for all users/devices coming to Sunbird. The user check response should be under a second. | |
Code push | Code push should happen in the background and take under 2 minutes for all connection types. | |
Impact on other Products/Solutions
Product/Solution Impacted | Impact Description |
---|---|
Impact on Existing Users/Data
User/Data Impacted | Impact Description |
---|---|
Key Metrics
Srl. No. | Metric | Purpose of Metric |
---|---|---|