Monorail has two machine learning classifiers running in ML Engine: a spam classifier and a component predictor.
Whenever a user creates a new issue (or comments on an issue without an assigned component), Monorail's component predictor suggests components based on the text the user typed.
Monorail also runs each new issue and comment through a spam classifier model.
In order to train a new model locally or in the cloud, follow the instructions below.

Note: you must be logged into the correct GCP project with `gcloud` in order to run the commands below.
The new code in `trainer2/` handles local training and model export using Python 3 and TensorFlow 2.0. Future predictors should also be migrated to use the training files in `trainer2/`.
Both trainers are Python modules. ML Engine uses the high-level `learn_runner` API (see `trainer/task.py`), which allows it to train, evaluate, and predict against a model saved in GCS.
To run any training jobs locally, you'll need Python 2 and TensorFlow 1.2:
```shell
pip install -r requirements.txt
```
Run a local training job with placeholder data:
```shell
make TRAIN_FILE=./sample_spam_training_data.csv train_local_spam
```
To have the local trainer download and train on the real training data, you'll need to be logged into `gcloud` and have access to the `monorail-prod` project.

```shell
make train_from_prod_data_spam
```
To make predictions with the locally trained model, use the helper script or call `gcloud` directly:

```shell
./spam.py local-predict
gcloud ml-engine local predict --model-dir $OUTPUT_DIR/export/Servo/{TIMESTAMP}/ --json-instances /tmp/instances.json
```
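The `--json-instances` file contains one JSON object per line, one per instance to classify. The exact field names must match the exported model's serving signature; the `subject` and `content` fields below are assumptions used only to illustrate the file's shape:

```python
import json

# Hypothetical instances; the real field names must match the exported
# model's serving input signature.
instances = [
    {"subject": "win a free prize", "content": "click this link now"},
]

# Write one JSON object per line, as `gcloud ml-engine local predict` expects.
with open("/tmp/instances.json", "w") as f:
    for instance in instances:
        f.write(json.dumps(instance) + "\n")
```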
Submitting a training job to ML Engine will run the job and output a trained model to GCS. Job names must be unique.
First verify you're in the `monorail-prod` GCP project.

```shell
gcloud init
```
To submit a training job manually, run:
```shell
TIMESTAMP=$(date +%s)
JOB_NAME=spam_trainer_$TIMESTAMP
gcloud ml-engine jobs submit training $JOB_NAME \
    --package-path trainer/ \
    --module-name trainer.task \
    --runtime-version 1.2 \
    --job-dir gs://monorail-prod-mlengine/$JOB_NAME \
    --region us-central1 \
    -- \
    --train-steps 1000 \
    --verbosity DEBUG \
    --gcs-bucket monorail-prod.appspot.com \
    --gcs-prefix spam_training_data \
    --trainer-type spam
```
To upload a model you'll need to locate the exported model directory in GCS. To do that, run:
```shell
gsutil ls -r gs://monorail-prod-mlengine/$JOB_NAME

# Look for a directory that matches the below structure and assign it.
# It should have the structure $GCS_OUTPUT_LOCATION/export/Servo/$TIMESTAMP/.
MODEL_BINARIES=gs://monorail-prod-mlengine/spam_trainer_1507059720/export/Servo/1507060043/

VERSION=v_$TIMESTAMP
gcloud ml-engine versions create $VERSION \
    --model spam_only_words \
    --origin $MODEL_BINARIES \
    --runtime-version 1.2
```
To promote to production, set that model as default.
```shell
gcloud ml-engine versions set-default $VERSION --model spam_only_words
```
Use the script `spam.py` to make predictions from the command line. Files containing the text to classify must be provided as the summary and content arguments.
```shell
$ ./spam.py predict --summary summary.txt --content content.txt
{u'predictions': [{u'classes': [u'0', u'1'], u'scores': [0.4986788034439087, 0.5013211965560913]}]}
```
A higher probability for class 1 indicates that the text was classified as spam.
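Programmatically, a response like the one above can be interpreted by pairing each class label with its score. A minimal sketch (the 0.5 threshold is an assumption for illustration, not necessarily what Monorail uses):

```python
def is_spam(prediction, threshold=0.5):
    """Return True if the score for class '1' (spam) meets the threshold."""
    scores = dict(zip(prediction["classes"], prediction["scores"]))
    return scores["1"] >= threshold

# The response shown above:
response = {"predictions": [{"classes": ["0", "1"],
                             "scores": [0.4986788034439087, 0.5013211965560913]}]}
print([is_spam(p) for p in response["predictions"]])  # [True]
```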
After submitting a job to ML Engine, you can compare the accuracy of two submitted jobs using their trainer names.
```shell
$ ./spam.py --project monorail-prod compare-accuracy --model1 spam_trainer_1521756634 \
    --model2 spam_trainer_1516759200

spam_trainer_1521756634:
AUC: 0.996436  AUC Precision/Recall: 0.997456

spam_trainer_1516759200:
AUC: 0.982159  AUC Precision/Recall: 0.985069
```
By default, `model1` is the model currently serving as the default in the specified project. Note that an error will be thrown if the trainer does not contain an `eval_data.json` file.
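For reference, the reported ROC AUC is the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. A self-contained sketch of that statistic (a standard Mann-Whitney formulation, not Monorail's own implementation in `spam.py`):

```python
def roc_auc(labels, scores):
    """Probability a random positive outranks a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# One positive (0.35) loses to one negative (0.4); the other (0.8) beats both.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```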
The component predictor is trained in the same way. To kick off a local training job, run:
```shell
OUTPUT_DIR=/tmp/monospam-local-training
rm -rf $OUTPUT_DIR
gcloud ml-engine local train \
    --package-path trainer/ \
    --module-name trainer.task \
    --job-dir $OUTPUT_DIR \
    -- \
    --train-steps 10000 \
    --eval-steps 1000 \
    --verbosity DEBUG \
    --gcs-bucket monorail-prod.appspot.com \
    --gcs-prefix component_training_data \
    --trainer-type component
```
Submitting a training job to ML Engine will run the job and output a trained model to GCS. Job names must be unique.
First verify you're in the `monorail-prod` GCP project.

```shell
gcloud init
```
To submit a training job manually, run:
```shell
TIMESTAMP=$(date +%s)
JOB_NAME=component_trainer_$TIMESTAMP
gcloud ml-engine jobs submit training $JOB_NAME \
    --package-path trainer/ \
    --module-name trainer.task \
    --runtime-version 1.2 \
    --job-dir gs://monorail-prod-mlengine/$JOB_NAME \
    --region us-central1 \
    --scale-tier custom \
    --config config.json \
    -- \
    --train-steps 10000 \
    --eval-steps 1000 \
    --verbosity DEBUG \
    --gcs-bucket monorail-prod.appspot.com \
    --gcs-prefix component_training_data \
    --trainer-type component
```
To upload a model you'll need to locate the exported model directory in GCS. To do that, run:
```shell
gsutil ls -r gs://monorail-prod-mlengine/$JOB_NAME

# Look for a directory that matches the below structure and assign it.
# It should have the structure $GCS_OUTPUT_LOCATION/export/Servo/$TIMESTAMP/.
MODEL_BINARIES=gs://monorail-prod-mlengine/component_trainer_1507059720/export/Servo/1507060043/

VERSION=v_$TIMESTAMP
gcloud ml-engine versions create $VERSION \
    --model component_top_words \
    --origin $MODEL_BINARIES \
    --runtime-version 1.2
```
To promote to production, set that model as default.
```shell
gcloud ml-engine versions set-default $VERSION --model component_top_words
```
Use the script `component.py` to make predictions from the command line. A file containing the text to classify must be provided as the content argument.
```shell
$ ./component.py --project monorail-prod --content content.txt
Most likely component: index 108, component id 36250211
```
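The printed index is the model's highest-scoring class; a separate index-to-component-ID mapping translates it into a Monorail component. A minimal sketch of that last step (the mapping values below are hypothetical, and the real mapping comes from the training data, not from this snippet):

```python
def best_component(scores, index_to_component_id):
    """Return (index, component_id) for the highest-scoring class."""
    idx = max(range(len(scores)), key=scores.__getitem__)
    return idx, index_to_component_id.get(idx)

# Hypothetical data: three component classes, index 2 scores highest.
scores = [0.1, 0.2, 0.7]
mapping = {0: 36250210, 1: 36250215, 2: 36250211}
print(best_component(scores, mapping))  # (2, 36250211)
```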