From charlesreid1

=Pick Up Here=
See the three sub-pages:
* [[GCDEC/Building Tensorflow/Notes]]
* [[GCDEC/Deploying Tensorflow/Notes]]
* [[GCDEC/Engineering Tensorflow/Notes]]


===TensorFlow Architecture for Out of Memory Learning===


Back to the middle layer:
* Reminder, these are the components that are useful when building custom NN models
* tf.layers, tf.losses, tf.metrics
Recap of terminology:
* We will store our data in multiple files
* One step = going through single batch of training data once
* One epoch = going through entire training data once
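The relationship between steps and epochs is just arithmetic. A minimal sketch, assuming made-up numbers (the example count of 10000 is hypothetical, not from the course; 512 and 50 echo the batch size and epoch count used elsewhere in these notes):

```python
import math

def steps_per_epoch(num_examples, batch_size):
    """Number of steps (batches) needed to see every example once."""
    return math.ceil(num_examples / batch_size)

def total_steps(num_examples, batch_size, num_epochs):
    """Number of steps needed to go through the data num_epochs times."""
    return steps_per_epoch(num_examples, batch_size) * num_epochs

# Hypothetical data set size, for illustration only.
print(steps_per_epoch(10000, 512))   # 20
print(total_steps(10000, 512, 50))   # 1000
```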
Reading data from out of memory:
* To go through our data for 50 epochs, we just need to create a filename queue (from randomly shuffled filenames) that contains our file names 50 times each
* Example: dealing with three files A, B, C: our filename queue should be B B C A ... (enqueue_many function)
* Then we dequeue each file, one at a time, using a Reader (dequeue function)
* The reader decodes the file contents and turns them into examples
* The examples then go into an Example Queue (using the enqueue function)
* Why shuffle filenames and add them in random order? When doing distributed learning, we don't want to bias our learning process, or have one file cause a slowdown on the exact same machine each time
* Each Reader will be on a different machine; each Reader takes filenames from the queue, and creates an example queue (an example is an input plus a label)
* Then, TensorFlow model reads data from the Example Queue
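The queue pipeline described above can be imitated in plain Python. This is only a sketch of the idea, not TensorFlow's implementation: the in-memory FILES dictionary, the two-column line format, and both function names are assumptions made for illustration. It shuffles the whole repeated filename list, which allows orderings like B B C A.

```python
import queue
import random

# Hypothetical in-memory "files": filename -> list of "label,feature" lines.
FILES = {
    "A": ["1.0,a1", "2.0,a2"],
    "B": ["3.0,b1"],
    "C": ["4.0,c1", "5.0,c2", "6.0,c3"],
}

def build_filename_queue(filenames, num_epochs, seed=None):
    """Enqueue every filename num_epochs times, in shuffled order."""
    names = list(filenames) * num_epochs
    random.Random(seed).shuffle(names)
    q = queue.Queue()
    for name in names:
        q.put(name)
    return q

def run_reader(filename_queue, example_queue):
    """Dequeue filenames one at a time and enqueue decoded examples."""
    while not filename_queue.empty():
        name = filename_queue.get()
        for line in FILES[name]:
            label, feature = line.split(",")
            # An example is an input plus a label.
            example_queue.put((feature, float(label)))

filename_queue = build_filename_queue(sorted(FILES), num_epochs=2, seed=0)
example_queue = queue.Queue()
run_reader(filename_queue, example_queue)
print(example_queue.qsize())  # 12 (6 examples, seen twice)
```

In the distributed setting each Reader would run on its own machine and pull from the shared filename queue; here a single reader drains it.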
Reading a CSV file num_epochs times:
Start by setting labels for the columns in the files being read:
<pre>
CSV_COLUMNS = ['fare_amount', 'pickuplon', 'pickuplat', ...]
LABEL_COLUMN = 'fare_amount'  # a plain string, so it can be used as a dictionary key
# Now define default values that each value will take on
# (This keeps the ML model from choking if there are one or two missing pieces of data)
DEFAULTS = [[0.0], [-74.0], [40.0], [-74.0], [40.7], [1.0], ['nokey']]
</pre>
Next, define an input function that will do a wildcard match, and assemble each filename and put it into the Filename Queue:
<pre>
def input_fn():
    input_file_names = tf.train.match_filenames_once(filename)
    filename_queue = tf.train.string_input_producer( input_file_names, num_epochs=num_epochs, shuffle=True)
    # now make the Reader
    reader = tf.TextLineReader()
    _, value = reader.read_up_to(filename_queue, num_records = batch_size)
    value_column = tf.expand_dims(value, -1)
    columns = tf.decode_csv(value_column, record_defaults = DEFAULTS)
    features = dict(zip(CSV_COLUMNS, columns))
    # Take the one label item and pop it from features.
    # Assign the result to label: label is the tensor for that column,
    # and features no longer contains it.
    label = features.pop(LABEL_COLUMN)
    return features, label
</pre>
===Reading CSV Files num_epochs times===
In the input function:
* Match all filenames (can have a wildcard, like train.*) or sharded files (train-00001-36, train-00002-36, etc)
* Then, take those input files and repeat them (in a shuffled way) num_epochs times
* Now, create the reader with TextLineReader() to read CSV files
* Tell the reader to read a batch of records from the filename queue
* What comes back is a 1-D batch of lines, so we use expand_dims() to add a trailing dimension and make each line its own row
* Then we do decode_csv to decode this as a comma-separated string
* We need to tell TF what the datatypes are, and what to do if the value of the field is missing
* We now have our values
* But our features have to be a dictionary - where each column is an entry in the dict, with the key being the name of the column
* Associate the field names with the tensor values to make it a dictionary (that's features). One key is fare_amount, next key is pickuplon, next is pickuplat, etc.
* Each key has a tensor associated with it
* Those are our features - except that fare_amount is the label column: it is what we are trying to predict, so it must not be an input
* Calling features.pop(LABEL_COLUMN) removes the quantity we're trying to predict (the output) from the dictionary of inputs and hands it back to us
* We then return the features (the dictionary of name:tensor for each feature) and the label tensor
TextLineReader() can read from local files, or from Google Cloud Storage (GCS).
What the input function as a whole is doing is:
* Decoding CSV
* Creating a dictionary of features
* Extracting the label tensor (via features.pop(LABEL_COLUMN))
* Returning features and labels
This decode_csv can be fed a CSV from a local disk, or from Google Cloud Storage
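The decode, zip, and pop steps can be mimicked without TensorFlow. This is a minimal sketch using the standard csv module: the three-column schema is a shortened, hypothetical stand-in for the real column list, decode_batch is an invented name, and plain Python lists stand in for tensors.

```python
import csv

# Hypothetical three-column schema; the real CSV_COLUMNS list is longer.
COLUMNS = ["fare_amount", "pickuplon", "pickuplat"]
LABEL = "fare_amount"
DEFAULT_VALUES = [0.0, -74.0, 40.0]

def decode_batch(lines):
    """Turn CSV lines into (features, label), one list of values per column."""
    columns = {name: [] for name in COLUMNS}
    for row in csv.reader(lines):
        for name, raw, default in zip(COLUMNS, row, DEFAULT_VALUES):
            # An empty field falls back to the column default;
            # the default's type also sets the column's datatype.
            columns[name].append(type(default)(raw) if raw != "" else default)
    # Remove the label column from the inputs and return it separately.
    label = columns.pop(LABEL)
    return columns, label

features, label = decode_batch(["12.5,-73.9,40.7", "8.0,,40.8"])
print(sorted(features))        # ['pickuplat', 'pickuplon']
print(features["pickuplon"])   # [-73.9, -74.0]
print(label)                   # [12.5, 8.0]
```

The second line has a missing pickuplon, which is filled from the defaults, the same role record_defaults plays in decode_csv.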
Next lab:
* Refactor TensorFlow model
* Read from a potentially large data set/file in batches
* Do a wildcard match on filenames and feed them to a filename queue
* Break up the one-to-one relationships between inputs and features (unclear what this means, exactly)
This will smooth the way to running this TensorFlow model at scale.
===Refactoring the ML Model for Big Data===
Link to lab: https://codelabs.developers.google.com/codelabs/dataeng-machine-learning/index.html?index=#7
Link to notebook: https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/courses/machine_learning/tensorflow/c_batched.ipynb
====First refactoring: reading input data in batch====
The first refactoring addresses how the input data are being read. Instead of reading the file directly into a Pandas dataframe, a filename queue is added to the TensorFlow graph. We pass the filename (or wildcard pattern) to tf.train.match_filenames_once(). We use a string input producer to generate the (one single) filename over and over, so that each file is repeated num_epochs times, and we shuffle the filename queue. Here's the whole mess:
<pre>
def read_dataset(filename, num_epochs=None, batch_size=512, mode=tf.contrib.learn.ModeKeys.TRAIN):
  def _input_fn():
    input_file_names = tf.train.match_filenames_once(filename)
    filename_queue = tf.train.string_input_producer(
        input_file_names, num_epochs=num_epochs, shuffle=True)
    reader = tf.TextLineReader()
    _, value = reader.read_up_to(filename_queue, num_records=batch_size)
    value_column = tf.expand_dims(value, -1)
    columns = tf.decode_csv(value_column, record_defaults=DEFAULTS)
    features = dict(zip(CSV_COLUMNS, columns))
    label = features.pop(LABEL_COLUMN)
    return features, label
  return _input_fn
</pre>
====Second refactoring: treat input data and features as different====
The second refactoring addresses the way we turn input data into features. They refactor this so that they are specifically extracting the input variables in one step, then explicitly specifying the model features in another, separate step. What they mean by "break the one-to-one relationship between inputs and features" is, we aren't forced to use the input data and only the input data as our model features. Once we change the way the input data is loaded (i.e., if we don't read data straight from the input file into the model), we can transform input variables, leave certain input variables out of the model, normalize them, combine them together, etc.
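One way to picture the separation between inputs and features is sketched below. Everything here is an assumption made for illustration: the function name make_features, the derived feature names, and the input names dropofflon, dropofflat, passengers and key do not come from the notebook (only pickuplon and pickuplat appear in these notes), and plain lists stand in for tensors.

```python
def make_features(inputs):
    """Derive model features from raw inputs (hypothetical transformations)."""
    features = {}
    # Combine inputs: how far the trip moved in longitude and latitude.
    features["londiff"] = [
        d - p for p, d in zip(inputs["pickuplon"], inputs["dropofflon"])
    ]
    features["latdiff"] = [
        d - p for p, d in zip(inputs["pickuplat"], inputs["dropofflat"])
    ]
    # Pass one input through unchanged.
    features["passengers"] = list(inputs["passengers"])
    # Leave an input out of the model entirely: 'key' is not copied.
    return features

raw_inputs = {
    "pickuplon": [-74.0], "pickuplat": [40.5],
    "dropofflon": [-73.5], "dropofflat": [40.75],
    "passengers": [2.0], "key": ["nokey"],
}
print(sorted(make_features(raw_inputs)))  # ['latdiff', 'londiff', 'passengers']
```

The model then only ever sees the output of this step, so inputs can be added, dropped, or combined without touching the code that reads the files.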
====Third refactoring: Move model evaluation into training loop====
The problem with the notebook, as is, is that we're specifying a fixed number of epochs. Instead, we want to evaluate the model as we go, and stop when we reach a stopping criterion.
(This will happen in the next lab.)
There is also a checkpointing problem - we save checkpoints during the training, and use the final checkpoint as the final model. (Discussion of overfitting - we may not want the last step, because it may be overfit.) This will also be improved by stopping the model training when we reach some error criterion.
Train the model on the training data set, and every few steps, stop and assess RMSE on the validation data set. Stop when the RMSE on the validation data set starts to increase (indicates we're overfitting).
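The stopping rule can be sketched independently of any training code. The RMSE values below are simulated, and the function name and patience parameter are assumptions for illustration, not something from the lab:

```python
def early_stopping(rmse_history, patience=1):
    """Return (step, rmse) of the best evaluation seen before stopping.

    rmse_history is a sequence of (step, validation_rmse) pairs.
    Training stops once the RMSE has failed to improve for
    `patience` evaluations in a row.
    """
    best_step, best_rmse = None, float("inf")
    bad_evals = 0
    for step, rmse in rmse_history:
        if rmse < best_rmse:
            best_step, best_rmse = step, rmse
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                break
    return best_step, best_rmse

# Simulated validation RMSE, evaluated every 100 steps.
history = [(100, 12.0), (200, 9.5), (300, 9.1), (400, 9.4), (500, 8.0)]
print(early_stopping(history))              # (300, 9.1)
print(early_stopping(history, patience=2))  # (500, 8.0)
```

With patience=1 the loop stops at the first increase and keeps the step-300 model; a larger patience tolerates a temporary bump, at the cost of training longer.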
===What to improve further?===
Handle machine failure in distributed training - what if something goes wrong? Want to be able to pick up training wherever we left off.
Monitor training - especially useful if training is expected to take a very long time. Answer questions like, which epoch are we on, what is the current RMSE, etc.
Choose a model based on the validation data set - use a smarter stopping criterion than number of epochs.
===How much does a reasonably realistic machine learning model cost?===
'''It will cost a few thousand dollars for a ''reasonably realistic'' model'''
==Module 4: Feature Engineering==
=References=
=Flags=


[[Category:Google Cloud]]
[[Category:Data Engineering]]
[[Category:Tensorflow]]
[[Category:ML]]
[[Category:NN]]

Latest revision as of 10:21, 7 January 2018