From charlesreid1


Revision as of 20:59, 14 October 2017

Basics

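Fuel is a library for building machine learning data pipelines. Its main conveniences, covered below, are dataset objects that provide a uniform interface to data and iteration schemes that control the order in which examples are requested.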

Find Fuel on GitHub here: https://github.com/mila-udem/fuel

Overview of how it works: https://fuel.readthedocs.io/en/latest/overview.html

Prerequisites

Fuel uses HDF5, so you will need a copy of HDF5 header files installed locally. Use your package manager, or follow HDF5 installation instructions. On a Mac:

$ brew install hdf5

Now you can install Fuel.

Install

$ git clone git@github.com:mila-udem/fuel.git
$ cd fuel
$ python setup.py build && python setup.py install



Basic Usage

Datasets

Datasets are the principal interface to data. A DataStream object sits on top of a dataset and uses it to create and request iterators.

IterableDataset Example

Code: https://gist.github.com/charlesreid1/eefc22defc8c6bd07c6bd0ac222c9781

Suppose we create eight (8) different 2x2 greyscale images and put them in the variable "features", then assign each image a label drawn from 4 target classes and put the labels in "targets":

In [1]: import numpy

In [2]: seed = 1234

In [3]: rng = numpy.random.RandomState(seed)

In [4]: features = rng.randint(256, size=(8, 2, 2))

In [5]: targets = rng.randint(4, size=(8, 1))

Now we can create a Dataset to iterate over the data:

In [6]: from collections import OrderedDict

In [7]: from fuel.datasets import IterableDataset

In [8]: dataset = IterableDataset(
   ...: iterables=OrderedDict([('features', features), ('targets', targets)]),
   ...: axis_labels=OrderedDict([('features', ('batch', 'height', 'width')),
   ...: ('targets', ('batch', 'index'))]))

We can then inspect the dataset's attributes through the dataset object:

In [9]: print('Provided sources are {}.'.format(dataset.provides_sources))
Provided sources are ('features', 'targets').

In [10]: print('Sources are {}.'.format(dataset.sources))
Sources are ('features', 'targets').

In [11]: print('Axis labels are {}.'.format(dataset.axis_labels))
Axis labels are OrderedDict([('features', ('batch', 'height', 'width')), ('targets', ('batch', 'index'))]).

In [12]: print('Dataset contains {} examples.'.format(dataset.num_examples))
Dataset contains 8 examples.

In [14]: from pprint import pprint

In [15]: pprint(dir(dataset))
[

...snip...

 'apply_default_transformers',
 'axis_labels',
 'close',
 'default_transformers',
 'example_iteration_scheme',
 'filter_sources',
 'get_data',
 'get_example_stream',
 'iterables',
 'next_epoch',
 'num_examples',
 'open',
 'provides_sources',
 'reset',
 'sources']

Note that the dataset is stateless, so we need to create an external object to represent the state, then pass that into the dataset when we want to iterate over/access the data:

In [17]: state = dataset.open()

In [18]: while True:
    ...:     try:
    ...:         print(dataset.get_data(state=state))
    ...:     except StopIteration:
    ...:         print('Iterator finished')
    ...:         break
    ...:
(array([[ 47, 211],
       [ 38,  53]]), array([0]))
(array([[204, 116],
       [152, 249]]), array([3]))
(array([[143, 177],
       [ 23, 233]]), array([0]))
(array([[154,  30],
       [171, 158]]), array([1]))
(array([[236, 124],
       [ 26, 118]]), array([2]))
(array([[186, 120],
       [112, 220]]), array([2]))
(array([[ 69,  80],
       [201, 127]]), array([2]))
(array([[246, 254],
       [175,  50]]), array([3]))
Iterator finished

To reset the state, use the Dataset object's reset() function. To finish, use the close() function.

In [19]: state = dataset.reset(state=state)

In [20]: print(dataset.get_data(state=state))
(array([[ 47, 211],
       [ 38,  53]]), array([0]))

In [21]: dataset.close(state=state)
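
The open/get_data/reset/close cycle above can be mimicked in plain Python to show why the state lives outside the dataset. This is only a sketch: TinyIterableDataset is a hypothetical stand-in, not Fuel's implementation, and the only part taken from Fuel is the guard in get_data(), which is copied from the traceback shown further down this page.

```python
class TinyIterableDataset:
    """Hypothetical stand-in for an iterable dataset; not Fuel's implementation."""

    def __init__(self, features, targets):
        self.features = features
        self.targets = targets

    def open(self):
        # The state is just an iterator over (feature, target) pairs.
        return iter(zip(self.features, self.targets))

    def get_data(self, state=None, request=None):
        # Same guard as in the Fuel traceback on this page: a state is
        # required, and custom requests are not accepted.
        if state is None or request is not None:
            raise ValueError
        return next(state)

    def reset(self, state):
        # Discard the old iterator and start again from the first example.
        return self.open()

    def close(self, state):
        # Nothing to release for in-memory data.
        pass


dataset = TinyIterableDataset(features=[10, 20, 30], targets=[0, 1, 0])
state = dataset.open()
first = dataset.get_data(state=state)
second = dataset.get_data(state=state)
state = dataset.reset(state=state)
after_reset = dataset.get_data(state=state)
dataset.close(state=state)
print(first, second, after_reset)
```

Because the dataset object itself holds no position, two independent states obtained from open() could walk the same data without interfering with each other. Once a state is exhausted, next(state) raises StopIteration, which is what the while loop above catches.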

IndexableDataset Example

Code: https://gist.github.com/charlesreid1/eefc22defc8c6bd07c6bd0ac222c9781

IndexableDataset objects do not work the same way as IterableDataset objects - there is no need to store a persistent state because all the data can be accessed randomly, in any order you please.


In [1]: from fuel.datasets import IndexableDataset
   ...: from collections import OrderedDict

In [2]: import numpy
   ...: seed = 1234
   ...: rng = numpy.random.RandomState(seed)

In [3]: features = rng.randint(256, size=(8, 2, 2))
   ...: targets = rng.randint(4, size=(8, 1))

In [4]: dataset = IndexableDataset(
   ...:     indexables=OrderedDict([('features', features), ('targets', targets)]),
   ...:     axis_labels=OrderedDict([('features', ('batch', 'height', 'width')),
   ...:                              ('targets', ('batch', 'index'))]))

In [5]: state = dataset.open()

In [6]: print("State is {}".format(state))
   ...: print("NOTE: None state returned, because there is no state to maintain!")

State is None
NOTE: None state returned, because there is no state to maintain!

In [7]: print(dataset.get_data(state=state, request=[3,1,0]))
(array([[[154,  30],
        [171, 158]],

       [[204, 116],
        [152, 249]],

       [[ 47, 211],
        [ 38,  53]]]), array([[1],
       [3],
       [0]]))

In [8]: print(dataset.get_data(state=state, request=[1,2,4,7]))
(array([[[204, 116],
        [152, 249]],

       [[143, 177],
        [ 23, 233]],

       [[236, 124],
        [ 26, 118]],

       [[246, 254],
        [175,  50]]]), array([[3],
       [0],
       [2],
       [3]]))

In [9]: dataset.close(state=state)

No need to reset any iterator.
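
The index-based access pattern can be sketched in plain Python as well. TinyIndexableDataset below is a hypothetical stand-in (not Fuel's implementation) that keeps the same open/get_data/close call shape: open() returns None because there is nothing to track, and get_data() serves whatever list of indices it is handed.

```python
class TinyIndexableDataset:
    """Hypothetical stand-in for an indexable dataset; not Fuel's implementation."""

    def __init__(self, features, targets):
        self.features = features
        self.targets = targets
        self.num_examples = len(features)

    def open(self):
        # No iterator to keep track of, so there is no state.
        return None

    def get_data(self, state=None, request=None):
        # The request is a list of example indices, served in the order given.
        if request is None:
            raise ValueError("an indexable dataset needs a request")
        return ([self.features[i] for i in request],
                [self.targets[i] for i in request])

    def close(self, state):
        # Nothing to release for in-memory data.
        pass


dataset = TinyIndexableDataset(features=["a", "b", "c", "d"],
                               targets=[0, 1, 2, 3])
state = dataset.open()
picked = dataset.get_data(state=state, request=[3, 1, 0])
picked_again = dataset.get_data(state=state, request=[3, 1, 0])
dataset.close(state=state)
print(picked)
```

Asking for the same request twice returns the same examples, since no position is consumed between calls.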

Iteration Schemes

Iteration Scheme Examples

Let's illustrate how to use iteration schemes - but first, how NOT to use iteration schemes.

Incorrect Usage

Recall that above we created a dummy data set of random integers with shape (8, 2, 2) and built a dataset from it:

~~~*~*~*~*~*~*~~~ flashback ~~~*~*~*~*~*~*~~~

In [1]: import numpy

In [2]: seed = 1234

In [3]: rng = numpy.random.RandomState(seed)

In [4]: features = rng.randint(256, size=(8, 2, 2))

In [5]: targets = rng.randint(4, size=(8, 1))

In [6]: from collections import OrderedDict

In [7]: from fuel.datasets import IterableDataset

In [8]: dataset = IterableDataset(
   ...: iterables=OrderedDict([('features', features), ('targets', targets)]),
   ...: axis_labels=OrderedDict([('features', ('batch', 'height', 'width')),
   ...: ('targets', ('batch', 'index'))]))

~~~*~*~*~*~*~*~~~ end flashback ~~~*~*~*~*~*~*~~~

However, we created an IterableDataset, not an IndexableDataset.

This matters because we are about to pass an extra request argument to get_data(). An IterableDataset returns examples in a predefined order, so its get_data() does not accept a request.

If we ignore that fact and try to iterate over the IterableDataset in a custom order, we get a ValueError:

In [23]: from fuel.schemes import ShuffledScheme

In [24]: state = dataset.open()

In [25]: scheme = ShuffledScheme(examples=dataset.num_examples, batch_size=4)

In [26]: for request in scheme.get_request_iterator():
    ...:     data = dataset.get_data(state=state, request=request)
    ...:     print(data[0].shape, data[1].shape)
    ...:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-27-24827dafdaa8> in <module>()
      1 for request in scheme.get_request_iterator():
----> 2     data = dataset.get_data(state=state, request=request)
      3     print(data[0].shape, data[1].shape)
      4

/usr/local/lib/python3.6/site-packages/fuel-0.2.0-py3.6-macosx-10.12-x86_64.egg/fuel/datasets/base.py in get_data(self, state, request)
    310     def get_data(self, state=None, request=None):
    311         if state is None or request is not None:
--> 312             raise ValueError
    313         return next(state)
    314

ValueError:

Correct Usage

We'll need to re-create our dataset, this time using an IndexableDataset object.
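
The idea is that a shuffled scheme hands out batches of indices, and an indexable dataset can serve any such batch. The sketch below shows that interplay in plain Python so it runs without Fuel installed. Both shuffled_requests() and get_data() are hypothetical stand-ins, and the seed and toy data are assumptions made for illustration. In Fuel itself, the loop from the Incorrect Usage section should work unchanged once the dataset is built with IndexableDataset as in the IndexableDataset Example above.

```python
import random


def shuffled_requests(num_examples, batch_size, seed=1234):
    """Yield lists of example indices: one shuffled pass over the data, in batches.

    Hypothetical stand-in for the request iterator of a shuffled scheme.
    """
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    for start in range(0, num_examples, batch_size):
        yield indices[start:start + batch_size]


def get_data(features, targets, request):
    """Index-based access, standing in for an indexable dataset's get_data()."""
    return [features[i] for i in request], [targets[i] for i in request]


features = [[i, i + 1] for i in range(8)]   # eight toy "images"
targets = [i % 4 for i in range(8)]         # labels drawn from 4 classes

requests = list(shuffled_requests(num_examples=len(features), batch_size=4))
for request in requests:
    batch_features, batch_targets = get_data(features, targets, request)
    print(request, len(batch_features), len(batch_targets))
```

Each pass visits every example exactly once, in two batches of four, and the order depends only on the seed.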

Wrapping Custom Datasets with Fuel

A repo by GitHub user dribnet illustrates how to wrap a new dataset using Fuel: https://github.com/dribnet/lfw_fuel

Advantages:

  • Only takes one command to download the data and import it into Fuel
  • One more command imports the library that wraps the data, which then returns it as training/testing X and Y

Disadvantages:

  • One-size-fits-all; importing data using load_data() can take a REALLY long time, and must be done every time you run the script (not persistent in memory)
  • Complicated to extend
  • Removes some of the nicer options of Fuel

Here is what the final payoff looks like:

from keras.models import Sequential
from lfw_fuel import lfw

# the data, shuffled and split between train and test sets
(X_train, y_train), (X_test, y_test) = lfw.load_data(format="deepfunneled")

# (build the perfect model here)
model = Sequential()

# NOTE: show_accuracy is an argument from early Keras releases
model.fit(X_train, y_train, show_accuracy=True, validation_data=(X_test, y_test))
score = model.evaluate(X_test, y_test, show_accuracy=True, verbose=0)
