Monorail Machine Learning Classifiers

Monorail has two machine learning classifiers running in ML Engine: a spam classifier and a component predictor.

Whenever a user creates a new issue (or comments on an issue without an assigned component), Monorail's component predictor suggests components based on the text the user types.

Monorail also runs each new issue and comment through a spam classifier model.

In order to train a new model locally or in the cloud, follow the instructions below.

Note: you must be logged into the correct GCP project with gcloud to run the commands below.

New model in trainer2/

The new code in trainer2/ is used for local training and for exporting a model using Python 3 and TensorFlow 2.0. Future predictors should also be migrated to use the training files in trainer2/.

Trainer

Both trainers are Python modules that do the following:

  1. Download all (spam or component) exported training data from GCS
  2. Define a TensorFlow Estimator and Experiment

ML Engine uses the high-level learn_runner API (see trainer/task.py), which allows it to train, evaluate, and predict against a model saved in GCS.

Monorail Spam Classifier

Run locally

To run any training jobs locally, you'll need Python 2 and TensorFlow 1.2:

pip install -r requirements.txt

Run a local training job with placeholder data:

make TRAIN_FILE=./sample_spam_training_data.csv train_local_spam

To have the local trainer download and train on the real training data, you'll need to be logged into gcloud and have access to the monorail-prod project.

make train_from_prod_data_spam

Submit a local prediction

./spam.py local-predict
gcloud ml-engine local predict --model-dir $OUTPUT_DIR/export/Servo/{TIMESTAMP}/ --json-instances /tmp/instances.json

Submitting a training job to ML Engine

This will run a job and output a trained model to GCS. Job names must be unique.

First verify you're in the monorail-prod GCP project.

gcloud init

To submit a training job manually, run:

TIMESTAMP=$(date +%s)
JOB_NAME=spam_trainer_$TIMESTAMP
gcloud ml-engine jobs submit training $JOB_NAME \
    --package-path trainer/ \
    --module-name trainer.task \
    --runtime-version 1.2 \
    --job-dir gs://monorail-prod-mlengine/$JOB_NAME \
    --region us-central1 \
    -- \
    --train-steps 1000 \
    --verbosity DEBUG \
    --gcs-bucket monorail-prod.appspot.com \
    --gcs-prefix spam_training_data \
    --trainer-type spam
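
If the submission needs to be scripted, the command above can be assembled in Python. This is only a sketch: the function name build_spam_submit_command and its now parameter are hypothetical, and the flags are copied verbatim from the gcloud command above.

```python
import time


def build_spam_submit_command(now=None):
    """Return (job_name, argv) for submitting a spam training job.

    Hypothetical helper: it only mirrors the manual gcloud command above.
    """
    # A Unix timestamp suffix keeps job names unique, as ML Engine requires.
    timestamp = int(time.time()) if now is None else int(now)
    job_name = "spam_trainer_%d" % timestamp
    argv = [
        "gcloud", "ml-engine", "jobs", "submit", "training", job_name,
        "--package-path", "trainer/",
        "--module-name", "trainer.task",
        "--runtime-version", "1.2",
        "--job-dir", "gs://monorail-prod-mlengine/%s" % job_name,
        "--region", "us-central1",
        # Everything after "--" is passed through to trainer.task.
        "--",
        "--train-steps", "1000",
        "--verbosity", "DEBUG",
        "--gcs-bucket", "monorail-prod.appspot.com",
        "--gcs-prefix", "spam_training_data",
        "--trainer-type", "spam",
    ]
    return job_name, argv
```

The returned argv list can be passed to subprocess.check_call. Keep the returned job name, since later steps use it to locate the exported model in GCS.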

Uploading a model and promoting it to production

To upload a model you'll need to locate the exported model directory in GCS. To do that, run:

gsutil ls -r gs://monorail-prod-mlengine/$JOB_NAME

# Look for a directory that matches the below structure and assign it.
# It should have the structure $GCS_OUTPUT_LOCATION/export/Servo/$TIMESTAMP/.
MODEL_BINARIES=gs://monorail-prod-mlengine/spam_trainer_1507059720/export/Servo/1507060043/

VERSION=v_$TIMESTAMP
gcloud ml-engine versions create $VERSION \
    --model spam_only_words \
    --origin $MODEL_BINARIES \
    --runtime-version 1.2
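
Picking the export directory out of the gsutil listing by eye is error-prone, so the lookup can be sketched in Python. The helper name find_model_binaries is hypothetical, and the sketch assumes the listing text contains one path per line, with directory lines optionally ending in a colon. If a job exported more than once, it picks the newest timestamp.

```python
import re

# Matches a directory line of the form .../export/Servo/<timestamp>/
# with an optional trailing colon (assumed listing format).
_EXPORT_DIR = re.compile(r"^(gs://\S+/export/Servo/(\d+)/):?$")


def find_model_binaries(listing):
    """Return the newest export/Servo/<timestamp>/ path, or None."""
    candidates = []
    for line in listing.splitlines():
        match = _EXPORT_DIR.match(line.strip())
        if match:
            # Sort key is the numeric export timestamp.
            candidates.append((int(match.group(2)), match.group(1)))
    if not candidates:
        return None
    return max(candidates)[1]
```

Feed it the captured output of the gsutil ls -r command and assign the result to MODEL_BINARIES. A None result means no exported model was found under that job directory.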

To promote to production, set that model as default.

gcloud ml-engine versions set-default $VERSION --model spam_only_words

Submit a prediction

Use the script spam.py to make predictions from the command line. Files containing text for classification must be provided as summary and content arguments.

$ ./spam.py predict --summary summary.txt --content content.txt
{u'predictions': [{u'classes': [u'0', u'1'], u'scores': [0.4986788034439087, 0.5013211965560913]}]}

A higher probability for class 1 indicates that the text was classified as spam.
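
To read the response programmatically, pair each class label with its score. The sketch below assumes the response has the shape shown above. The helper names are hypothetical, and comparing the two scores directly is an illustrative choice, not necessarily the threshold Monorail applies in production.

```python
def spam_scores(response):
    """Map each class label to its score for the first prediction."""
    prediction = response["predictions"][0]
    return dict(zip(prediction["classes"], prediction["scores"]))


def looks_like_spam(response):
    """Return True when class '1' (spam) outscores class '0'."""
    scores = spam_scores(response)
    # Illustrative decision rule only; the production threshold may differ.
    return scores["1"] > scores["0"]
```

For the sample response above, class 1 scores slightly higher than class 0, so looks_like_spam returns True.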

Compare model accuracy

After submitting a job to ML Engine, you can compare the accuracy of two submitted jobs using their trainer names.

$ ./spam.py --project monorail-prod compare-accuracy --model1 spam_trainer_1521756634 --model2 spam_trainer_1516759200
spam_trainer_1521756634:
AUC: 0.996436  AUC Precision/Recall: 0.997456

spam_trainer_1516759200:
AUC: 0.982159  AUC Precision/Recall: 0.985069

By default, model1 is the default model running in the specified project. An error is thrown if a trainer does not contain an eval_data.json file.

Monorail Component Predictor

Run locally

To kick off a local training job, run:

OUTPUT_DIR=/tmp/monospam-local-training
rm -rf $OUTPUT_DIR
gcloud ml-engine local train \
    --package-path trainer/ \
    --module-name trainer.task \
    --job-dir $OUTPUT_DIR \
    -- \
    --train-steps 10000 \
    --eval-steps 1000 \
    --verbosity DEBUG \
    --gcs-bucket monorail-prod.appspot.com \
    --gcs-prefix component_training_data \
    --trainer-type component

Submitting a training job to ML Engine

This will run a job and output a trained model to GCS. Job names must be unique.

First verify you're in the monorail-prod GCP project.

gcloud init

To submit a training job manually, run:

TIMESTAMP=$(date +%s)
JOB_NAME=component_trainer_$TIMESTAMP
gcloud ml-engine jobs submit training $JOB_NAME \
    --package-path trainer/ \
    --module-name trainer.task \
    --runtime-version 1.2 \
    --job-dir gs://monorail-prod-mlengine/$JOB_NAME \
    --region us-central1 \
    --scale-tier custom \
    --config config.json \
    -- \
    --train-steps 10000 \
    --eval-steps 1000 \
    --verbosity DEBUG \
    --gcs-bucket monorail-prod.appspot.com \
    --gcs-prefix component_training_data \
    --trainer-type component

Uploading a model and promoting it to production

To upload a model you'll need to locate the exported model directory in GCS. To do that, run:

gsutil ls -r gs://monorail-prod-mlengine/$JOB_NAME

# Look for a directory that matches the below structure and assign it.
# It should have the structure $GCS_OUTPUT_LOCATION/export/Servo/$TIMESTAMP/.
MODEL_BINARIES=gs://monorail-prod-mlengine/component_trainer_1507059720/export/Servo/1507060043/

VERSION=v_$TIMESTAMP
gcloud ml-engine versions create $VERSION \
    --model component_top_words \
    --origin $MODEL_BINARIES \
    --runtime-version 1.2

To promote to production, set that model as default.

gcloud ml-engine versions set-default $VERSION --model component_top_words

Submit a prediction

Use the script component.py to make predictions from the command line. A file containing text for classification must be provided as the content argument.

$ ./component.py --project monorail-prod --content content.txt
Most likely component: index 108, component id 36250211