
POC for Reinforcement learning in Content query

Description

The current query models do not learn from user feedback. Implement a neural-net-based information retrieval system that uses user actions (accepting or rejecting a tag/recommendation) to improve Content search in future queries.

  • Start with a Siamese biLSTM architecture. Create training data of Content-query pairs based on MVC Content that is tagged to the taxonomy. Queries are constructed from the chapter, topic, and subtopic names and the text fields (description) of Content tagged to the topic/subtopic by a curator. Content that is already tagged is given a reward of 1. Use this data to train the model (a small construction sketch follows this list).

  • Create a UI to crowd-source approval of recommended Content.

  • Create a scheduled task on MLWB to retrain the model on any new query-Content pairs generated through user engagement.
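A minimal sketch of how the Content-query training pairs could be assembled is shown below. The field names (chapter, topic, subtopic, description) and the tagged_content structure are illustrative assumptions, not the actual MVC schema.

```python
# Sketch: build (query, document, reward) triples from curator-tagged Content.
# Field names and the tagged_content structure are illustrative assumptions.

def build_training_pairs(tagged_content):
    """tagged_content: iterable of dicts with 'chapter', 'topic',
    'subtopic' and 'description' keys (assumed schema)."""
    pairs = []
    for item in tagged_content:
        # The query is built from the taxonomy names the Content is tagged to.
        query = " ".join([item["chapter"], item["topic"], item["subtopic"]])
        document = item["description"]
        # Content already tagged by a curator gets a reward of 1 (positive pair).
        pairs.append((query, document, 1))
    return pairs
```

Negative pairs (reward 0) would then come from user rejections collected through the crowd-sourcing UI or from sampled untagged Content.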

Design Document

None

is blocked by

Activity


Soma Dhavala April 24, 2019 at 4:06 AM

It was noted that, at evaluation time, a query was not scored against all available documents; it was evaluated only against the two negative examples paired with each positive example. When a query is scored against all available documents, accuracy drops to only slightly better than random. In a separate experiment, ELMo embeddings were used. An ELMo-based model still has to be tried to remove one unverifiable factor (the embeddings used in this model). Since memory issues were observed while using ELMo embeddings, generators have to be used to load one batch at a time during training.
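A minimal sketch of the batch-wise loading idea, assuming a Keras-style training loop; elmo_embed is a placeholder for whatever function maps a list of strings to ELMo embeddings, not a real API.

```python
import numpy as np

def batch_generator(query_texts, doc_texts, labels, embed_fn, batch_size=32):
    """Yield one embedded batch at a time so the full ELMo embedding
    matrix never has to be held in memory at once."""
    n = len(labels)
    while True:  # Keras-style generators are expected to loop indefinitely
        for start in range(0, n, batch_size):
            end = start + batch_size
            q = embed_fn(query_texts[start:end])   # embed only this batch
            d = embed_fn(doc_texts[start:end])
            y = np.asarray(labels[start:end])
            yield [q, d], y

# Illustrative usage:
# model.fit_generator(batch_generator(train_q, train_d, train_y, elmo_embed),
#                     steps_per_epoch=len(train_y) // 32, epochs=5)
```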

Anjana A G March 22, 2019 at 11:10 AM
Edited

  • An initial Siamese LSTM model experiment was done using the following data set to understand the entire architecture:

https://github.com/amansrivastava17/lstm-siamese-text-similarity/blob/master/sample_data.csv

Notes :

Inputs to the LSTM model are the query text, the document text, and the feedback (user's response). The user's response is ‘0’ or ‘1’ depending on whether the user accepts or rejects the recommendation.

The output is a similarity score between the query and the document.

Parameters used for the LSTM model:

EMBEDDING_DIM = 50

MAX_SEQUENCE_LENGTH = 10

VALIDATION_SPLIT = 0.1

RATE_DROP_LSTM = 0.17

NUMBER_LSTM = 50

NUMBER_DENSE_UNITS = 50

ACTIVATION_FUNCTION = ‘relu’
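For reference, a minimal Keras sketch of a Siamese biLSTM wired up with the parameters above. It follows the structure of the linked repository but is not the exact experiment code; the vocabulary size and embedding weights are placeholders.

```python
from keras.layers import Input, Embedding, LSTM, Bidirectional, Dense, concatenate
from keras.models import Model

EMBEDDING_DIM = 50
MAX_SEQUENCE_LENGTH = 10
RATE_DROP_LSTM = 0.17
NUMBER_LSTM = 50
NUMBER_DENSE_UNITS = 50
ACTIVATION_FUNCTION = 'relu'
VOCAB_SIZE = 20000  # placeholder; depends on the tokenizer fitted on the corpus

# Shared layers: both branches reuse the same embedding and LSTM weights.
embedding = Embedding(VOCAB_SIZE, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH)
shared_lstm = Bidirectional(LSTM(NUMBER_LSTM, dropout=RATE_DROP_LSTM,
                                 recurrent_dropout=RATE_DROP_LSTM))

query_in = Input(shape=(MAX_SEQUENCE_LENGTH,))
doc_in = Input(shape=(MAX_SEQUENCE_LENGTH,))

query_vec = shared_lstm(embedding(query_in))
doc_vec = shared_lstm(embedding(doc_in))

# Merge the two encodings and score similarity with a small dense head.
merged = concatenate([query_vec, doc_vec])
merged = Dense(NUMBER_DENSE_UNITS, activation=ACTIVATION_FUNCTION)(merged)
similarity = Dense(1, activation='sigmoid')(merged)

model = Model(inputs=[query_in, doc_in], outputs=similarity)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
```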

  • Based on the query-document features, the user response data on Content recommendations, and the above parameters, the LSTM model's performance was very poor. Possible reasons:

  1. There was an imbalance between the number of negative and positive responses; the data set was not balanced.

  2. There were very few common words between query-document pairs.

  3. The loss function used in the base model was binary cross-entropy. For an imbalanced data set, the categorical hinge loss could have been chosen instead.

  4. The Word2Vec representation of the text may not be a good vector representation for this data set.

  • Based on these hypotheses, the model was revised. Important findings/thoughts:

  1. Instead of binary cross-entropy, the categorical hinge loss was tried. The model's prediction scores were almost the same for both loss functions, so binary cross-entropy, which was already in the model, was retained.

  2. The input for the LSTM model was the vector representation of document and query keywords. Instead of considering only keywords, more features such as titles, subtitles, and descriptions were added to each document to increase the number of common words between query-document pairs.

  3. To reduce imbalance, the data set was restructured so that each query has 1 positive response and 2 negative responses (a sampling sketch follows this list).
    A sample data set and LSTM results are shared here: https://docs.google.com/spreadsheets/d/1yOnYG5hmJpi9LFSA20Z6YDNb3tUbH8ivwXcZqJTp4nE/edit#gid=0
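A small sketch of the restructuring described in point 3, assuming negatives are sampled at random from documents other than the positive one; the actual sampling used in the experiment is not recorded in this ticket.

```python
import random

def restructure(queries, positive_doc, all_docs, n_negatives=2, seed=0):
    """For each query, keep its 1 positive document and sample
    n_negatives negatives from the rest of the corpus (illustrative only)."""
    rng = random.Random(seed)
    rows = []
    for q in queries:
        pos = positive_doc[q]
        rows.append((q, pos, 1))                       # positive response
        candidates = [d for d in all_docs if d != pos]
        for neg in rng.sample(candidates, n_negatives):
            rows.append((q, neg, 0))                   # negative responses
    return rows
```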

Findings: After revising the model based on the above hypotheses, the prediction score improved.

True Positive Rate = 99.82 %

True Negative Rate = 99.77 %

Accuracy = 99.79 %

( True positives = 1129, True negatives = 2248, False positives = 5, False negatives = 2 )
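As a sanity check, these rates follow directly (up to rounding) from the confusion counts:

```python
tp, tn, fp, fn = 1129, 2248, 5, 2

tpr = tp / (tp + fn)                          # 1129 / 1131 ≈ 0.9982
tnr = tn / (tn + fp)                          # 2248 / 2253 ≈ 0.9978
accuracy = (tp + tn) / (tp + tn + fp + fn)    # 3377 / 3384 ≈ 0.9979

print(f"TPR={tpr:.2%}  TNR={tnr:.2%}  Accuracy={accuracy:.2%}")
```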

Thoughts: The balanced distribution of positive and negative responses in the data set may be the main reason for the model's improved performance. Since such high accuracy is itself suspicious, the entire Siamese LSTM procedure needs to be revisited.


Details

Assignee

Reporter

Module

Data Science

Original estimate

Time tracking

No time logged, 4w remaining

Components

Sprint

Due date

Priority

Invision for Jira

Created January 18, 2019 at 5:09 AM
Updated April 7, 2020 at 2:07 PM