Revision as of 20:59, 14 October 2017

Basics

Fuel is a library for building machine learning data pipelines. It separates the data itself (datasets) from the order in which the data is visited (iteration schemes), which makes pipelines convenient to assemble.

Find Fuel on GitHub here: https://github.com/mila-udem/fuel

Overview of how it works: https://fuel.readthedocs.io/en/latest/overview.html

Prerequisites

Fuel uses HDF5, so you will need a copy of HDF5 header files installed locally. Use your package manager, or follow HDF5 installation instructions. On a Mac:

$ brew install hdf5

Now you can install Fuel.

Install

$ git clone git@github.com:mila-udem/fuel.git
$ cd fuel
$ python setup.py build && python setup.py install

Basic Usage

Datasets

Datasets are the principal interface to data. A DataStream object wraps a dataset to create iterators and request data from it.

IterableDataset Example

Code: https://gist.github.com/charlesreid1/eefc22defc8c6bd07c6bd0ac222c9781

Suppose we create eight 2x2 greyscale images and store them in the variable "features", then draw one target label per image from 4 classes and store the labels in "targets":

In [1]: import numpy

In [2]: seed = 1234

In [3]: rng = numpy.random.RandomState(seed)

In [4]: features = rng.randint(256, size=(8, 2, 2))

In [5]: targets = rng.randint(4, size=(8, 1))

Now we can create a Dataset to iterate over the data:

In [6]: from collections import OrderedDict

In [7]: from fuel.datasets import IterableDataset

In [8]: dataset = IterableDataset(
   ...:     iterables=OrderedDict([('features', features), ('targets', targets)]),
   ...:     axis_labels=OrderedDict([('features', ('batch', 'height', 'width')),
   ...:                              ('targets', ('batch', 'index'))]))

We can then inspect the dataset's attributes:

In [9]: print('Provided sources are {}.'.format(dataset.provides_sources))
Provided sources are ('features', 'targets').

In [10]: print('Sources are {}.'.format(dataset.sources))
Sources are ('features', 'targets').

In [11]: print('Axis labels are {}.'.format(dataset.axis_labels))
Axis labels are OrderedDict([('features', ('batch', 'height', 'width')), ('targets', ('batch', 'index'))]).

In [12]: print('Dataset contains {} examples.'.format(dataset.num_examples))
Dataset contains 8 examples.

In [14]: from pprint import pprint

In [15]: pprint(dir(dataset))
[

...snip...

 'apply_default_transformers',
 'axis_labels',
 'close',
 'default_transformers',
 'example_iteration_scheme',
 'filter_sources',
 'get_data',
 'get_example_stream',
 'iterables',
 'next_epoch',
 'num_examples',
 'open',
 'provides_sources',
 'reset',
 'sources']

Note that the dataset is stateless: open() returns a separate state object, and we pass that state back to the dataset each time we want to iterate over or access the data:

In [17]: state = dataset.open()

In [18]: while True:
    ...:     try:
    ...:         print(dataset.get_data(state=state))
    ...:     except StopIteration:
    ...:         print('Iterator finished')
    ...:         break
    ...:
(array([[ 47, 211],
       [ 38,  53]]), array([0]))
(array([[204, 116],
       [152, 249]]), array([3]))
(array([[143, 177],
       [ 23, 233]]), array([0]))
(array([[154,  30],
       [171, 158]]), array([1]))
(array([[236, 124],
       [ 26, 118]]), array([2]))
(array([[186, 120],
       [112, 220]]), array([2]))
(array([[ 69,  80],
       [201, 127]]), array([2]))
(array([[246, 254],
       [175,  50]]), array([3]))
Iterator finished

To reset the state, use the dataset's reset() method, which returns a fresh state. To finish, call close().

In [19]: state = dataset.reset(state=state)

In [20]: print(dataset.get_data(state=state))
(array([[ 47, 211],
       [ 38,  53]]), array([0]))

In [21]: dataset.close(state=state)
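
The stateless pattern above can be sketched in plain Python. This is a minimal illustration, not Fuel's implementation: the class name TinyIterableDataset and the toy data are hypothetical, and only the guard inside get_data() mirrors the Fuel source visible in the traceback further down this page.

```python
class TinyIterableDataset:
    """Hypothetical stand-in: the dataset itself keeps no iteration position."""

    def __init__(self, features, targets):
        self.features = features
        self.targets = targets

    def open(self):
        # The state is simply a fresh iterator over (feature, target) pairs.
        return zip(self.features, self.targets)

    def get_data(self, state=None, request=None):
        # Same guard as in the Fuel traceback: a state is required
        # and any request is rejected.
        if state is None or request is not None:
            raise ValueError
        return next(state)

    def reset(self, state):
        # Discard the old state and hand back a new one.
        return self.open()

    def close(self, state):
        pass  # nothing to release in this toy version


dataset = TinyIterableDataset(features=[10, 20, 30], targets=[0, 1, 0])
state = dataset.open()
first = dataset.get_data(state=state)
second = dataset.get_data(state=state)
state = dataset.reset(state=state)
after_reset = dataset.get_data(state=state)
print(first, second, after_reset)
```

Because the position lives in the state object rather than in the dataset, resetting just means asking for a new state.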

IndexableDataset Example

Code: https://gist.github.com/charlesreid1/eefc22defc8c6bd07c6bd0ac222c9781

IndexableDataset objects do not work the same way as IterableDataset objects - there is no need to store a persistent state because all the data can be accessed randomly, in any order you please.


In [1]: from fuel.datasets import IndexableDataset
   ...: from collections import OrderedDict

In [2]: import numpy
   ...: seed = 1234
   ...: rng = numpy.random.RandomState(seed)

In [3]: features = rng.randint(256, size=(8, 2, 2))
   ...: targets = rng.randint(4, size=(8, 1))

In [4]: dataset = IndexableDataset(
   ...:     indexables=OrderedDict([('features', features), ('targets', targets)]),
   ...:     axis_labels=OrderedDict([('features', ('batch', 'height', 'width')),
   ...:                              ('targets', ('batch', 'index'))]))

In [5]: state = dataset.open()

In [6]: print("State is {}".format(state))
   ...: print("NOTE: None state returned, because there is no state to maintain!")

State is None
NOTE: None state returned, because there is no state to maintain!

In [7]: print(dataset.get_data(state=state, request=[3,1,0]))
(array([[[154,  30],
        [171, 158]],

       [[204, 116],
        [152, 249]],

       [[ 47, 211],
        [ 38,  53]]]), array([[1],
       [3],
       [0]]))

In [8]: print(dataset.get_data(state=state, request=[1,2,4,7]))
(array([[[204, 116],
        [152, 249]],

       [[143, 177],
        [ 23, 233]],

       [[236, 124],
        [ 26, 118]],

       [[246, 254],
        [175,  50]]]), array([[3],
       [0],
       [2],
       [3]]))

In [9]: dataset.close(state=state)

Because no state is kept, there is no iterator to reset.
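
Judging by the output above, a request such as [3, 1, 0] behaves like NumPy integer-array indexing applied to every source. A minimal sketch under that assumption, using plain NumPy rather than Fuel (the helper name get_rows is hypothetical):

```python
import numpy

rng = numpy.random.RandomState(1234)
features = rng.randint(256, size=(8, 2, 2))
targets = rng.randint(4, size=(8, 1))


def get_rows(sources, request):
    # Index every source with the same list of example indices,
    # preserving the order in which the indices were requested.
    return tuple(source[request] for source in sources)


batch_features, batch_targets = get_rows((features, targets), [3, 1, 0])
print(batch_features.shape, batch_targets.shape)
```

The first axis of each returned array follows the request, so row 0 of the batch is example 3, row 1 is example 1, and so on.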

Iteration Schemes

Iteration Scheme Examples

Let's illustrate how to use iteration schemes - but first, how NOT to use iteration schemes.

Incorrect Usage

Recall that above we created a dummy data set of random integers with shape (8, 2, 2) and built a dataset from it:

~~~*~*~*~*~*~*~~~ flashback ~~~*~*~*~*~*~*~~~

In [1]: import numpy

In [2]: seed = 1234

In [3]: rng = numpy.random.RandomState(seed)

In [4]: features = rng.randint(256, size=(8, 2, 2))

In [5]: targets = rng.randint(4, size=(8, 1))

In [6]: from collections import OrderedDict

In [7]: from fuel.datasets import IterableDataset

In [8]: dataset = IterableDataset(
   ...:     iterables=OrderedDict([('features', features), ('targets', targets)]),
   ...:     axis_labels=OrderedDict([('features', ('batch', 'height', 'width')),
   ...:                              ('targets', ('batch', 'index'))]))

~~~*~*~*~*~*~*~~~ end flashback ~~~*~*~*~*~*~*~~~

Note, however, that we created an IterableDataset, not an IndexableDataset.

This matters because we are about to pass a request argument to get_data(). An IterableDataset returns examples in a predefined order, so its get_data() rejects any request.

If we ignore that fact and try to iterate over the IterableDataset in a custom order, we get a ValueError:

In [23]: from fuel.schemes import ShuffledScheme

In [24]: state = dataset.open()

In [25]: scheme = ShuffledScheme(examples=dataset.num_examples, batch_size=4)

In [26]: for request in scheme.get_request_iterator():
    ...:     data = dataset.get_data(state=state, request=request)
    ...:     print(data[0].shape, data[1].shape)
    ...:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-27-24827dafdaa8> in <module>()
      1 for request in scheme.get_request_iterator():
----> 2     data = dataset.get_data(state=state, request=request)
      3     print(data[0].shape, data[1].shape)
      4

/usr/local/lib/python3.6/site-packages/fuel-0.2.0-py3.6-macosx-10.12-x86_64.egg/fuel/datasets/base.py in get_data(self, state, request)
    310     def get_data(self, state=None, request=None):
    311         if state is None or request is not None:
--> 312             raise ValueError
    313         return next(state)
    314

ValueError:

Correct Usage

We'll need to re-create our dataset, this time using an IndexableDataset object.
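
Since IndexableDataset.get_data() accepts a request (as in the IndexableDataset example above), rebuilding the dataset as an IndexableDataset should let the same ShuffledScheme loop run without the ValueError. As a library-free sketch of what that loop amounts to, assuming a shuffled scheme yields batches of randomly ordered example indices (the function shuffled_batches is a hypothetical stand-in for the scheme, not Fuel's API):

```python
import numpy


def shuffled_batches(num_examples, batch_size, rng):
    # Visit every example index once, in random order, batch_size at a time.
    order = rng.permutation(num_examples)
    for start in range(0, num_examples, batch_size):
        yield order[start:start + batch_size]


rng = numpy.random.RandomState(1234)
features = rng.randint(256, size=(8, 2, 2))
targets = rng.randint(4, size=(8, 1))

shapes = []
seen = []
for request in shuffled_batches(len(features), batch_size=4, rng=rng):
    # Random access: index both sources with the same request.
    data = (features[request], targets[request])
    shapes.append((data[0].shape, data[1].shape))
    seen.extend(request.tolist())
    print(data[0].shape, data[1].shape)
```

With 8 examples and a batch size of 4, the loop produces two batches, each with features of shape (4, 2, 2) and targets of shape (4, 1), and every example is visited exactly once per pass.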

Wrapping Custom Datasets with Fuel

A repo by GitHub user dribnet illustrates how to wrap a new dataset using Fuel: https://github.com/dribnet/lfw_fuel

Advantages:

It takes only one command to download the data and import it into Fuel
It then takes only one command to import the library that wraps the data and turn the data into training/testing X and Y

Disadvantages:

One-size-fits-all: importing data using load_data() can take a really long time, and it must be repeated every time you run the script (the data does not persist in memory)
Complicated to extend
Removes some of the nicer options of Fuel

Here is what the final payoff looks like:

from keras.models import Sequential
from lfw_fuel import lfw

# the data, shuffled and split between train and test sets
(X_train, y_train), (X_test, y_test) = lfw.load_data(format="deepfunneled")

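# (build the perfect model here and assign it to `model`)

# note: show_accuracy comes from early Keras releases; newer ones may not accept it
model.fit(X_train, y_train, show_accuracy=True, validation_data=(X_test, y_test))
score = model.evaluate(X_test, y_test, show_accuracy=True, verbose=0)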


Fuel: Difference between revisions

From charlesreid1