From charlesreid1

 
Latest revision as of 21:43, 15 October 2017

Basics

Fuel is a library for building machine learning data pipelines. It provides a common interface to datasets, iteration schemes that control the order in which examples are visited, and HDF5-backed storage.

Find Fuel on GitHub: https://github.com/mila-udem/fuel

Overview of how it works: https://fuel.readthedocs.io/en/latest/overview.html

Prerequisites

Fuel uses HDF5, so you will need the HDF5 header files installed locally. Use your package manager, or follow the HDF5 installation instructions. On a Mac with Homebrew:

$ brew install hdf5

Now you can install Fuel.

Install Fuel from Source

$ git clone git@github.com:mila-udem/fuel.git
$ cd fuel
$ python setup.py build
$ python setup.py install
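
To confirm the install worked, a quick check from Python can help. This is a minimal sketch using only the standard library; the helper name is_installed is made up for illustration, and checking for h5py assumes that Fuel's HDF5 support goes through the h5py bindings.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Assumption: "h5py" is the HDF5 binding Fuel relies on; "fuel" is the package itself.
status = {name: is_installed(name) for name in ("h5py", "fuel")}
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")
```

If either module is reported missing, revisit the prerequisite or build step before going further.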

Basic Usage

Summary:

  • Datasets are the principal interface to data, but the base class is abstract, so you work with one of its subclasses
  • IterableDatasets (less powerful) allow sequential access to data, in the order the data is stored
  • IndexableDatasets (more powerful) allow random access to data
  • Schemes allow iterating through IndexableDatasets in various orders (batch, sequential, shuffled, etc.)
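
The difference between the two kinds of dataset, and what a scheme adds, can be sketched in plain Python. Note that this is not Fuel's API: the class and function names below (IterableOnly, Indexable, sequential_scheme, shuffled_scheme) are hypothetical stand-ins that only illustrate the idea.

```python
import random


class IterableOnly:
    """Sequential access only: examples come out in the stored order."""

    def __init__(self, examples):
        self._examples = list(examples)

    def __iter__(self):
        return iter(self._examples)


class Indexable:
    """Random access: any batch of examples can be requested by index."""

    def __init__(self, examples):
        self._examples = list(examples)

    def __len__(self):
        return len(self._examples)

    def get(self, indices):
        return [self._examples[i] for i in indices]


def sequential_scheme(num_examples, batch_size):
    """Yield batches of indices in their natural order."""
    for start in range(0, num_examples, batch_size):
        yield list(range(start, min(start + batch_size, num_examples)))


def shuffled_scheme(num_examples, batch_size, seed=0):
    """Yield batches of indices after shuffling the whole index range once."""
    order = list(range(num_examples))
    random.Random(seed).shuffle(order)
    for start in range(0, num_examples, batch_size):
        yield order[start:start + batch_size]


examples = [f"example-{i}" for i in range(8)]

# An iterable-only dataset can just be walked front to back.
in_order = list(IterableOnly(examples))

# An indexable dataset can be driven by any scheme that produces indices.
dataset = Indexable(examples)
sequential_batches = [dataset.get(idx) for idx in sequential_scheme(len(dataset), 3)]
shuffled_batches = [dataset.get(idx) for idx in shuffled_scheme(len(dataset), 3, seed=1234)]
```

The scheme only decides which indices to ask for; the dataset only knows how to answer a request. That separation is why schemes need an indexable dataset to work with.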

Wrapping Custom Datasets with Fuel

Wrapping a custom data set with Fuel takes two steps:

  • Specify how the original data should be downloaded, processed, and turned into a Fuel data set
  • Specify how the Fuel data set should be loaded

The first step, defining how to turn the original data into Fuel data:

  • Create a download wrapper, which tells Fuel how to download the original data ("briq" download?)
  • Define a way to load a single piece of data (e.g., parameterized by name) and, optionally, paired/related pieces of data (e.g., two related images)
  • Write a convert function that extracts all the data and assembles it into a single HDF5 file (removing the original data when finished)
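
The load-one-item and convert pieces above can be sketched roughly as follows. This is not Fuel's converter interface: the function names (load_item, convert), the one-text-file-per-item layout, and the file names are all made up for illustration, and a JSON file stands in for the HDF5 file so the sketch runs with the standard library only.

```python
import json
import tempfile
from pathlib import Path


def load_item(raw_dir, name):
    """Load one raw item by name (hypothetical layout: one text file per item)."""
    text = (Path(raw_dir) / f"{name}.txt").read_text()
    return [int(value) for value in text.split()]


def convert(raw_dir, out_path, remove_raw=False):
    """Collect every raw item into one file, optionally deleting the raw files."""
    raw_dir = Path(raw_dir)
    names = sorted(path.stem for path in raw_dir.glob("*.txt"))
    records = {name: load_item(raw_dir, name) for name in names}
    # Stand-in for writing an HDF5 file: everything ends up in one place.
    Path(out_path).write_text(json.dumps(records))
    if remove_raw:
        for name in names:
            (raw_dir / f"{name}.txt").unlink()
    return names


# Build a tiny fake "download" so the sketch runs without network access.
workdir = Path(tempfile.mkdtemp())
raw = workdir / "raw"
raw.mkdir()
(raw / "a.txt").write_text("1 2 3")
(raw / "b.txt").write_text("4 5 6")

converted = convert(raw, workdir / "dataset.json", remove_raw=True)
stored = json.loads((workdir / "dataset.json").read_text())
```

In a real wrapper the single output file would be HDF5, and the download step would fetch the raw files instead of writing them by hand.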

The second step, specifying how the Fuel data set should be loaded:

  • Create a Fuel dataset class (inheriting from, e.g., H5PYDataset)
  • Define a way for that data to be loaded (example: provide a load_data function in a package specific to your data set, as in lfw_fuel)
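
A load_data function of the kind mentioned above might look something like this. It is a minimal sketch in plain Python, not lfw_fuel's implementation: the signature, the test_fraction parameter, and the in-memory list of (features, label) pairs are assumptions made for illustration. A real version would read from the dataset class built on the HDF5 file.

```python
def load_data(records, test_fraction=0.25):
    """Split (features, label) pairs into train and test sets."""
    # Hold out the last test_fraction of the records, and at least one.
    n_test = max(1, int(len(records) * test_fraction))
    train, test = records[:-n_test], records[-n_test:]

    def unzip(pairs):
        # Separate the features from the labels, keeping their order.
        return [f for f, _ in pairs], [label for _, label in pairs]

    return unzip(train), unzip(test)


# Eight toy examples: two numbers as features, one binary label.
records = [([i, i + 1], i % 2) for i in range(8)]
(X_train, y_train), (X_test, y_test) = load_data(records)
```

Keeping this in one function means a training script only needs a single import and a single call to get its data.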

Flags