Engineering TensorFlow Models

This section of the course covers two components:

Feature engineering
Creating a data pipeline for feeding data to a machine learning model

Module 4a: Feature Engineering

Basic Feature Engineering

Feature engineering and pre-processing using Cloud ML - ways of making our data set better

At this point: we have a way of submitting models to train them in the cloud, so we can train models faster, but we STILL don't have a model that is better than our heuristic

Still have the original TensorFlow model - improve it using feature engineering and hyperparameter tuning

Good features:

Related to objective - reduce arbitrary data
Known a priori
Numeric
Enough examples
Bring human insight (domain expertise) into problem

Features Related To Objective

Related to objective:

Need a reasonable hypothesis for why it matters
For a given domain, different problems require different features

Stupid quiz: related or not?

Objective: predict total number of customers who will use a discount coupon. Which of the following features are important?

Font of the text with which the discount is advertised on partner websites (TRUE)
Price of the item the coupon applies to (TRUE)
Number of items in stock (FALSE)

Objective: predict whether a credit-card transaction is fraudulent

Whether cardholder has purchased items at store before (TRUE)
Credit card chip reader speed (FALSE)
Category of item being purchased (TRUE)
Expiry date of credit card (FALSE)

Values Known A Priori

Mainly important for training on old data/predicting on new data

Suppose your data warehouse holds all sorts of information, and you threw all of it into your model as inputs

Sales information might have sales data - but it might be stale.

Some information known immediately
Some information is not known at prediction time
If you train a model on data that you don't have at prediction time, your entire model will be useless
Ensure every feature/every input will be available AT PREDICTION TIME
There may be some ethical issues with gathering data IMMEDIATELY from user

Quiz: is the value knowable or not?

Objective: predict total number of customers who will use a discount coupon

Total number of discountable items sold (DEPENDS - too vague)
Number of discountable items sold previous month (YES - you will most likely have this data in real time, but it does depend on your system)
Number of customers who viewed ads about item (YES - but question of time... how long does ad analysis take to get back)

Objective: predict whether credit card transaction is fraudulent

Whether cardholder has purchased items at this store before (DEPENDS - may take 3-5 days to get transaction data... train with data AS OF 3 DAYS AGO... if stale data is what you have in real time, then stale data is what you need to use to TRAIN your model)
Whether item is new at store (and cannot have been purchased before) (YES - should know from catalog)
Category of item being purchased (YES - will definitely know the type of the item)
Online or in-person purchase (YES - ditto... will def. know type of purchase)
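
A minimal sketch of the "train on data as stale as what you will have at prediction time" idea from the quiz above. The transaction log, the names, and the 3-day lag are hypothetical; the point is only that the feature is computed from data older than the lag, for training and for prediction alike.

```python
from datetime import date, timedelta

# Hypothetical transaction log: (cardholder, store, purchase_date).
TRANSACTIONS = [
    ("alice", "store_1", date(2017, 6, 1)),
    ("alice", "store_1", date(2017, 6, 9)),
    ("bob", "store_2", date(2017, 6, 8)),
]

def purchased_before(cardholder, store, as_of, lag_days=3):
    """True if a purchase is visible, given that data arrives lag_days late."""
    cutoff = as_of - timedelta(days=lag_days)
    return any(
        who == cardholder and where == store and when <= cutoff
        for who, where, when in TRANSACTIONS
    )

# On June 10, alice's June 1 purchase is visible (her June 9 one is not yet).
print(purchased_before("alice", "store_1", date(2017, 6, 10)))
# bob's only purchase (June 8) has not reached the warehouse on June 10.
print(purchased_before("bob", "store_2", date(2017, 6, 10)))
```

Using the same function to build training rows and prediction inputs keeps the two consistent.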

Numerical

NN carries out simple mathematical functions on your inputs, so the inputs need to be numeric. Non-numeric features can be used, but we need a way to convert them to numerical form.

Stupid quiz: which is numeric?

Feature of discount coupon to predict number of coupons that will be used:

Percent value of discount - TRUE
Size of coupon (4 cm2, 24 cm2, 48 cm2, etc.) - TRUE (but not really a meaningful magnitude)
Font of advertisement (Arial 18, Times New Roman 24, etc.) - FALSE (no meaningful magnitude)
Color of coupon (red, black, blue) - FALSE (no meaningful magnitude)
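Item category (1 for dairy, 2 for deli, 3 for canned goods, etc.) - FALSE as coded (the integers are arbitrary labels with no meaningful magnitude); TRUE only once the category is converted to a vector, e.g., using word2vec, which DOES give a meaningful magnitude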

Example questions: what if you subtract two values (e.g., subtract two colors)? Does that have a representative effect on the prediction?

If you do an arbitrary item categorization, you lose meaning and qualities of words (e.g., male/female, soft/hard, positive/negative, etc.). So, word2vec greatly improves the ability of word inputs to help improve the prediction.
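
For categories with no meaningful magnitude (such as coupon color), one common conversion to numeric form is one hot encoding. The sketch below is plain Python, not a TensorFlow API; the vocabulary is the one from the quiz.

```python
COLORS = ["red", "black", "blue"]  # vocabulary from the quiz above

def one_hot(value, vocabulary):
    """Return a list with a single 1 at the position of value (all 0 if unknown)."""
    return [1 if value == key else 0 for key in vocabulary]

print(one_hot("black", COLORS))
print(one_hot("green", COLORS))  # not in the vocabulary
```

Unlike coding red=1, black=2, blue=3, this does not suggest that blue minus red means anything.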

Enough Examples

Each feature needs enough examples to be understandable in context

Rule of thumb - need AT LEAST five examples of a category value for it to be usable in a model

Quiz: which is difficult to have enough examples of?

Predicting total number of customers who will use a coupon:

Percent discount of coupon - DEPENDS (find five examples each, or throw it out; if you have continuous numbers, bin them into discrete groups)
Date that promotional offer starts - DEPENDS (bin them up again, e.g., promotional offers starting in January or in Q1)
Number of customers who opened advertising emails - TRUE (should have a number of different emails, and know how many customers opened each)

Predict whether CC transaction is fraudulent:

Whether cardholder has purchased this item at this store - TRUE (should have this, unless it is too specific, e.g., bought diapers between 8 and 9 pm)
Distance between cardholder and store - DEPENDS (again, should bin these up; may not have 5 examples of cardholders who bought something from a store more than 100 miles from their house)
Category of item being purchased - TRUE
Online or in-person purchase - TRUE

How to check? Plot histograms of data.
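
A text-only version of the histogram check, as a rough sketch: bin a continuous value, count the examples per bin, and flag bins that fall under the five-example rule of thumb. The bin edges and the distance data are made up for illustration.

```python
from collections import Counter

MIN_EXAMPLES = 5  # rule of thumb from the notes

def bin_distance(miles):
    """Map a continuous distance to a coarse, hypothetical bin label."""
    if miles < 10:
        return "0-10"
    if miles < 100:
        return "10-100"
    return "100+"

def rare_categories(values, min_examples=MIN_EXAMPLES):
    """Return the categories that have fewer than min_examples occurrences."""
    counts = Counter(values)
    return sorted(cat for cat, n in counts.items() if n < min_examples)

distances = [1, 2, 3, 4, 5, 6, 12, 20, 35, 50, 80, 250]
bins = [bin_distance(d) for d in distances]
print(Counter(bins))          # examples per bin
print(rare_categories(bins))  # bins with too few examples
```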

Turning Raw Data into Numeric Features

Example: running ice cream store, want to predict the rating a customer will give based on how long they've been waiting and what they bought

Raw data to TensorFlow feature column.

Raw data:

{
    "transactionId" : 42,
    "name" : "Ice Cream",
    "price" : 2.50,
    "tags" : ["cold", "dessert"],
    "servedBy" : {
        "employeeId" : 45042,
        "waitTime" : 1.4,
        "customerRating" : 4
    },
    "storeLocation" : {
        "latitude" : 35.3,
        "longitude" : -98.7
    }
}

(This data comes from a web app, goes into a data warehouse, and is pulled out as JSON data)

Creating a feature column

To turn this into a feature column:

Some fields can be used directly (e.g., customer rating, price, and wait time)
Others (e.g., transactionId) should be ignored (every value is unique, so no value will ever have 5 examples)
Some (e.g., employeeId) should be transformed - no meaningful magnitude... use one hot encoding

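from tensorflow.contrib import layers  # TensorFlow 1.x contrib API

INPUT_COLUMNS = [
       ...
       layers.real_valued_column('price'),     # continuous, meaningful magnitude
       layers.real_valued_column('waitTime'),  # continuous, meaningful magnitude
       ...
]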

This calls the TensorFlow function real_valued_column because these columns are continuous and their magnitude is meaningful.
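
Before any feature column sees the data, the nested JSON record has to be flattened. The following is a sketch of that step in plain Python. The function name to_example is hypothetical, and treating customerRating as the label is an assumption based on the stated goal of predicting the rating.

```python
import json

RAW = """{
    "transactionId": 42, "name": "Ice Cream", "price": 2.50,
    "tags": ["cold", "dessert"],
    "servedBy": {"employeeId": 45042, "waitTime": 1.4, "customerRating": 4},
    "storeLocation": {"latitude": 35.3, "longitude": -98.7}
}"""

def to_example(raw_json):
    """Split one record into (features, label); transactionId is dropped."""
    record = json.loads(raw_json)
    served = record["servedBy"]
    features = {
        "price": float(record["price"]),
        "waitTime": float(served["waitTime"]),
        # Kept as a string: it is a category, not a magnitude.
        "employeeId": str(served["employeeId"]),
    }
    label = served["customerRating"]
    return features, label

features, label = to_example(RAW)
print(features, label)
```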

Preprocessing and Data Vocabulary

Preprocessing the data creates a new "vocabulary" of keys - and it needs to be available for BOTH training AND prediction steps (e.g., prediction sends employeeId 75534, and model needs to know how to convert that to a one hot encoding)

Three example scenarios:

First scenario: you already know the keys beforehand (e.g., employeeId and one hot encoding):

layers.sparse_column_with_keys('employeeId', keys=['12345', '48506', '28488', '23456'])

Second scenario: your data is already indexed 0 to N, but does not have a meaningful magnitude (e.g., hour of the day):

layers.sparse_column_with_integerized_feature('employeeId', 5) # 5 is the bucket size: values must be integers in [0, 5)

Third scenario: you don't have a vocabulary of all possible values:

layers.sparse_column_with_hash_bucket('employeeId', 500) # Hash the employee ID, and break it into 500 buckets

Hash bucket is similar to one hot encoding, but without having to explicitly build the encoding scheme.

All three of these use sparse_column_* methods in TensorFlow, because they create sparse columns (columns of booleans).
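
What the three scenarios do to a single value can be sketched without TensorFlow. The helper names below are hypothetical, and SHA-256 is only a stand-in for whatever hash TensorFlow uses internally; what matters is that the same mapping is applied at training and at prediction time.

```python
import hashlib

def index_from_keys(value, keys):
    """Scenario 1: known vocabulary; unknown values map to -1."""
    return keys.index(value) if value in keys else -1

def index_from_integer(value, bucket_size):
    """Scenario 2: value is already an integer in [0, bucket_size)."""
    if not 0 <= value < bucket_size:
        raise ValueError("value outside [0, bucket_size)")
    return value

def index_from_hash(value, num_buckets):
    """Scenario 3: no vocabulary; hash the string into a fixed bucket count."""
    digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

keys = ["12345", "48506", "28488", "23456"]
print(index_from_keys("28488", keys))
print(index_from_integer(3, 5))
print(index_from_hash("75534", 500))  # some stable index in [0, 500)
```

With hashing, two different IDs can land in the same bucket; that is the price of not building the vocabulary.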

Columns leading to choices

Some columns lead you to choices you have to make.

Two questions:

Question 1 - what to do with customer rating?
Question 2 - what to do with missing data?

What approach should we use for customer rating?

You have a choice.
One hot encoding if you decide 1 and 2 and 3 and 4 are VERY different
Continuous if you decide sliding scale is okay

What if you have missing data - if customer didn't provide a rating?

You have options
Can use a column to indicate whether the customer left a rating (0 or 1), and another column for the rating (0 if no rating)
Can use one hot encoding (one column would indicate a rating of 4), so customers who don't leave a rating are just 0, 0, 0, 0, 0
Be careful not to mix "magic (categorical) numbers" with "real (meaningful) numbers"
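
The two options for a missing rating, sketched in plain Python. The 1-to-5 rating scale is an assumption (it matches the five zeros above), and None stands for "no rating given".

```python
def rating_with_indicator(rating):
    """Option 1: [has_rating, rating]; the rating slot is 0 when missing."""
    if rating is None:
        return [0, 0]
    return [1, rating]

def rating_one_hot(rating, levels=(1, 2, 3, 4, 5)):
    """Option 2: one column per level; a missing rating is all zeros."""
    return [1 if rating == level else 0 for level in levels]

print(rating_with_indicator(4))
print(rating_with_indicator(None))
print(rating_one_hot(4))
print(rating_one_hot(None))
```

In option 1 the indicator column is what keeps the placeholder 0 from being read as a real (very low) rating.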

This also leads us to the difference between statistics and machine learning.

Statistics - imputation means filling in missing values, often with the average of the rest of the values (we want to preserve the information we have about the entire population as much as we can).

Machine learning - we want to separate out the cases where we have data from the cases where we don't have data, and build SEPARATE models (model behaviors) for those two cases.

Machine learning can build separate models for the data/no data case, because we have enough examples that we don't need to try to stretch our existing data set as far as we can. We just train the model to have different behavior in the data/no data case. (The same argument is true of outliers - in statistics, we throw out outliers because they contaminate the data we do have; in machine learning, we leave the outliers in because we have enough data that the outliers form their own separate model behavior.)

Creating New Features

What else can we do to go beyond the raw data?

Feature cross

Example to illustrate why feature crossing is important:

Deciding whether a picture of a vehicle is a taxi
Input columns: car color, and city
Output prediction: is it a taxi
Suppose we use a linear model - one input variable is color, another input variable is city - and output is whether the car is a taxi
Then if we give it examples from New York, where all taxis are yellow, model will learn that all yellow vehicles are taxis (yellow gets high weight)
If we train it on data from Rome, where most taxis are white, model will learn that white vehicles are taxis (white gets high weight)
Linear model cannot "learn" that different cities have different color taxis

Solution:

One solution is to add more layers - this will "mix" the inputs. But this creates more parameters, more complexity; especially if there are many inputs, many potential variable interactions, but only one or two variable interactions that are actually significant
Better solution - take a combination of the two and add it as a new column
If inputs are strings, one-hot-encode (example, Red Rome becomes RR, White Rome becomes WR, Yellow NYC becomes YN, etc.)
This makes it EASY for the machine learning model to learn that this combination is important
Use human insight to make it easier for the machine learning model
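
A rough sketch of the taxi feature cross in plain Python: every (color, city) pair becomes its own category, so a linear model can give "yellow in NYC" and "white in Rome" their own weights. The vocabularies and the "_x_" naming are illustrative choices.

```python
from itertools import product

COLORS = ["red", "white", "yellow"]
CITIES = ["Rome", "NYC"]

# Every (color, city) pair becomes its own category.
CROSSED_VOCAB = [f"{color}_x_{city}" for color, city in product(COLORS, CITIES)]

def cross_one_hot(color, city):
    """One-hot encode the combination rather than each input separately."""
    key = f"{color}_x_{city}"
    return [1 if key == entry else 0 for entry in CROSSED_VOCAB]

print(CROSSED_VOCAB)
print(cross_one_hot("yellow", "NYC"))
```

The crossed column has len(COLORS) * len(CITIES) entries, which is why crosses of large vocabularies are usually hashed into a fixed number of buckets.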

Feature cross with categorical columns

To do this in TensorFlow, create a crossed column with two sparse columns:

day_hr = layers.crossed_column([dayofweek, hourofday], 24*7)

24*7 is the number of buckets (if we choose fewer buckets, we get some grouping)

In the taxi model, this will help us capture "rush hour" (Thursday 5PM is different from Friday 5PM is different from Saturday 5PM)
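
To see what the bucket count does, here is a simplified stand-in for the cross. It combines day and hour arithmetically rather than by hashing (which is what a crossed column would do), so treat it as an illustration of the bucket count only. Days are assumed to be numbered 0-6 and hours 0-23.

```python
def day_hour_bucket(day_of_week, hour_of_day, num_buckets=24 * 7):
    """Combine day (0-6) and hour (0-23) into one index, folded into num_buckets."""
    combined = day_of_week * 24 + hour_of_day  # unique for each (day, hour) pair
    return combined % num_buckets

# With 24*7 buckets, Thursday 5PM and Friday 5PM stay distinct.
print(day_hour_bucket(3, 17), day_hour_bucket(4, 17))
# With only 24 buckets, they are grouped together (here: by hour of day).
print(day_hour_bucket(3, 17, 24), day_hour_bucket(4, 17, 24))
```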

Feature cross with real valued columns

How to do feature crosses with real-valued columns?

Need to discretize/bucketize floating points - this prevents overfitting (which is what happens if every distinct floating-point value is treated as its own category)

Example: predicting the price of a home in California

House price vs Latitude: see two spikes (LA and SF)
If we discretize too much, 34.001 and 34.002 will be considered "different"
Group everything into bins

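In TensorFlow, can bucketize a real valued column (here, latitude) as follows:

import numpy as np  # nbuckets and lat (a real valued column) are assumed to be defined elsewhere
latbuckets = np.linspace(32.0, 42.0, nbuckets).tolist()  # evenly spaced latitude boundaries
discrete_lat = layers.bucketized_column(lat, latbuckets)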
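
What bucketizing does to individual values, as a standard-library sketch. The linspace helper is a pure-Python stand-in for np.linspace, and 11 boundaries (one per degree) is an arbitrary choice for illustration.

```python
from bisect import bisect_right

def linspace(start, stop, num):
    """Pure-Python stand-in for np.linspace(start, stop, num).tolist()."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]

def bucketize(value, boundaries):
    """Return the index of the bucket that value falls into.

    With k boundaries there are k + 1 buckets: below the first boundary,
    between each pair of neighbors, and at or above the last one.
    """
    return bisect_right(boundaries, value)

lat_boundaries = linspace(32.0, 42.0, 11)  # 32.0, 33.0, ..., 42.0
print(bucketize(34.001, lat_boundaries))
print(bucketize(34.002, lat_boundaries))
print(bucketize(37.77, lat_boundaries))
```

34.001 and 34.002 land in the same bucket, so the model no longer treats them as different.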

Pipeline for Processing ML Data

Here's what the pipeline looks like now:

Inputs
Pre-process inputs & create a model vocabulary (scaling, transforms, bucketizing, labeling, categorical features like states/zip codes/employee IDs)
Feature creation (feature crossing)
Train model
TensorFlow model

Model Architectures

Two kinds of features: dense and sparse

Price: represented by just one real-valued column

Dense feature

Employee ID: if you have N employees, need N columns with one hot encoding (one per employee)

Sparse feature

Why dense features are easier:

Suppose we are doing image processing - every pixel of the image is a dense feature
This is easy for a neural network to deal with
Images are perfect for doing the operations that neural networks are good at - multiplying, adding, crossing, etc.

Why sparse features are harder:

Sparse features look very different - lots and lots of zeros, most rows are nearly all zeros
When you add/subtract rows, the result is still going to be a row with almost all zeros
This is difficult for a neural network to deal with - many weights in the network will have zero impact
Training is more likely to get stuck in a local region and be unable to get out of it

Sparse features = linear models

Linear models do very well with sparse features and sparse representations
More likely for a small number of neurons to get high weights

Observation:

If we have many dense features (e.g., images), we want to have lots of layers, lots of neurons, lots of hidden layers, lots of dense embeddings, all leading to sparse features
If we have sparse features, we want WIDER models - that is, models where there are fewer layers between the inputs and the outputs (neural network equivalent of a linear model - single layer of neurons)

Wide models vs. Deep models

Sparse data requires wide models
Wide models have fewer layers, behave more like linear models
Dense data requires deep models
Deep models have more layers, denser embeddings, more hidden layers

How to mix these?

Real models have a mixture of both dense and sparse features
Take your inputs and divide them into dense and sparse
The dense inputs go into a deep model, the sparse inputs go into a wide model

Wide and Deep Networks in TensorFlow

To have your cake and eat it too, you can use the DNNLinearCombinedClassifier class in TensorFlow:

model = tf.contrib.learn.DNNLinearCombinedClassifier(
    model_dir = ...,
    linear_feature_columns = wide_columns,
    dnn_feature_columns = deep_columns,
    dnn_hidden_units = [100, 50])

Specify the sparse columns as "wide_columns" by passing to "linear_feature_columns" argument.

Specify the dense columns as "deep_columns" by passing to "dnn_feature_columns" argument.

dnn_hidden_units specifies the number of hidden layers and the number of nodes in each (here, two hidden layers with 100 and 50 nodes).
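
Conceptually, a wide and deep model adds a linear score over the sparse inputs to a neural network score over the dense inputs. The toy forward pass below sketches that idea in plain Python; it is not the TensorFlow implementation, and the weights and the single hidden layer are hand-picked for illustration.

```python
import math

def relu(values):
    return [max(0.0, v) for v in values]

def dense(inputs, weights, biases):
    """Fully connected layer: weights[j][i] connects input i to unit j."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]

def wide_and_deep(sparse_x, dense_x, params):
    """Sum a linear logit over sparse inputs and a DNN logit over dense inputs."""
    wide_logit = sum(w * x for w, x in zip(params["wide_w"], sparse_x))
    hidden = relu(dense(dense_x, params["hidden_w"], params["hidden_b"]))
    deep_logit = dense(hidden, params["out_w"], params["out_b"])[0]
    logit = wide_logit + deep_logit
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid -> probability

# Hypothetical hand-picked weights, for illustration only.
params = {
    "wide_w": [0.5, -0.25, 0.0],
    "hidden_w": [[1.0, 0.0], [0.0, 1.0]],
    "hidden_b": [0.0, 0.0],
    "out_w": [[0.5, 0.5]],
    "out_b": [0.0],
}
print(wide_and_deep([1, 0, 0], [1.0, -2.0], params))
```

The sparse (one hot) input only ever touches the linear weights, while the dense input passes through the hidden layer.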

Module 4b: Data Pipeline Engineering

Resources


GCDEC/Engineering Tensorflow/Notes

From charlesreid1
