Fuel: Difference between revisions
From charlesreid1
Revision as of 20:33, 14 October 2017
Basics
Fuel is a library for creating machine learning data pipelines. There are multiple features that make it really convenient.
Find fuel on Github here: https://github.com/mila-udem/fuel
Overview of how it works: https://fuel.readthedocs.io/en/latest/overview.html
Prerequisites
Fuel uses HDF5, so you will need a copy of HDF5 header files installed locally. Use your package manager, or follow HDF5 installation instructions. On a Mac:
$ brew install hdf5
Now you can install Fuel.
Install
$ git clone git@github.com:/mila-udem/fuel.git
$ cd fuel
$ python setup.py build && python setup.py install
Basic Usage
Datasets
Datasets are the principal interface to data. A DataStream object sits on top of a dataset and requests data from it through iterators.
Example
Suppose we create eight (8) different 2x2 greyscale images and put them in the variable "features", then create eight target labels (one per image, drawn from 4 classes) and put them in "targets":
In [1]: import numpy
In [2]: seed = 1234
In [3]: rng = numpy.random.RandomState(seed)
In [4]: features = rng.randint(256, size=(8, 2, 2))
In [5]: targets = rng.randint(4, size=(8, 1))
Now we can create a Dataset to iterate over the data:
In [6]: from collections import OrderedDict
In [7]: from fuel.datasets import IterableDataset
In [8]: dataset = IterableDataset(
   ...:     iterables=OrderedDict([('features', features), ('targets', targets)]),
   ...:     axis_labels=OrderedDict([('features', ('batch', 'height', 'width')),
   ...:                              ('targets', ('batch', 'index'))]))
We can then inspect the dataset's attributes, and list everything the object exposes with dir():
In [9]: print('Provided sources are {}.'.format(dataset.provides_sources))
Provided sources are ('features', 'targets').
In [10]: print('Sources are {}.'.format(dataset.sources))
Sources are ('features', 'targets').
In [11]: print('Axis labels are {}.'.format(dataset.axis_labels))
Axis labels are OrderedDict([('features', ('batch', 'height', 'width')), ('targets', ('batch', 'index'))]).
In [12]: print('Dataset contains {} examples.'.format(dataset.num_examples))
Dataset contains 8 examples.
In [14]: from pprint import pprint
In [15]: pprint(dir(dataset))
['__abstractmethods__',
'__class__',
'__delattr__',
'__dict__',
'__dir__',
'__doc__',
'__eq__',
'__format__',
'__ge__',
'__getattribute__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__le__',
'__lt__',
'__module__',
'__ne__',
'__new__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__setattr__',
'__sizeof__',
'__str__',
'__subclasshook__',
'__weakref__',
'_abc_cache',
'_abc_negative_cache',
'_abc_negative_cache_version',
'_abc_registry',
'apply_default_transformers',
'axis_labels',
'close',
'default_transformers',
'example_iteration_scheme',
'filter_sources',
'get_data',
'get_example_stream',
'iterables',
'next_epoch',
'num_examples',
'open',
'provides_sources',
'reset',
'sources']
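The axis labels printed above name each dimension of the underlying arrays. As a minimal sketch that needs only numpy (not Fuel), the labels can be paired with the array shapes to check that they line up. The helper name describe_axes is hypothetical and is not part of Fuel.

```python
from collections import OrderedDict

import numpy

# Same toy data as in the session above.
rng = numpy.random.RandomState(1234)
features = rng.randint(256, size=(8, 2, 2))
targets = rng.randint(4, size=(8, 1))

axis_labels = OrderedDict([
    ('features', ('batch', 'height', 'width')),
    ('targets', ('batch', 'index')),
])
arrays = {'features': features, 'targets': targets}


def describe_axes(arrays, axis_labels):
    """Pair each axis label with the size of the matching array dimension.

    Hypothetical helper for illustration; not a Fuel function.
    """
    return OrderedDict(
        (name, OrderedDict(zip(labels, arrays[name].shape)))
        for name, labels in axis_labels.items()
    )


axes = describe_axes(arrays, axis_labels)
for name, dims in axes.items():
    print(name, dict(dims))
```

Each source should report a 'batch' size of 8, matching num_examples above.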
Note that the dataset is stateless, so we need to create an external object to represent the state, then pass that into the dataset when we want to iterate over/access the data:
In [17]: state = dataset.open()
In [18]: while True:
    ...:     try:
    ...:         print(dataset.get_data(state=state))
    ...:     except StopIteration:
    ...:         print('Iterator finished')
    ...:         break
    ...:
(array([[ 47, 211],
[ 38, 53]]), array([0]))
(array([[204, 116],
[152, 249]]), array([3]))
(array([[143, 177],
[ 23, 233]]), array([0]))
(array([[154, 30],
[171, 158]]), array([1]))
(array([[236, 124],
[ 26, 118]]), array([2]))
(array([[186, 120],
[112, 220]]), array([2]))
(array([[ 69, 80],
[201, 127]]), array([2]))
(array([[246, 254],
[175, 50]]), array([3]))
Iterator finished
To reset the state, use the Dataset object's reset() method, which returns a fresh state. To finish, use the close() method.
In [19]: state = dataset.reset(state=state)
In [20]: print(dataset.get_data(state=state))
(array([[ 47, 211],
[ 38, 53]]), array([0]))
In [21]: dataset.close(state=state)
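The open/get_data/reset/close cycle above can be mimicked in a few lines of plain Python, which may make the "stateless dataset, external state" idea easier to see. This is a toy sketch under my own assumptions: the class ToyIterableDataset is hypothetical and is not Fuel's implementation; it only mirrors the method names used in the session above.

```python
from collections import OrderedDict


class ToyIterableDataset:
    """Hypothetical stand-in for illustration; not Fuel's implementation."""

    def __init__(self, iterables):
        self.sources = tuple(iterables.keys())
        self._columns = tuple(iterables.values())

    def open(self):
        # The state is just a fresh iterator over aligned examples.
        return zip(*self._columns)

    def get_data(self, state):
        # Raises StopIteration once the state is exhausted.
        return next(state)

    def reset(self, state):
        # Discard the old state and hand back a new one.
        return self.open()

    def close(self, state):
        # Nothing to release for in-memory data.
        pass


# Made-up toy data: three examples with two sources each.
toy = ToyIterableDataset(OrderedDict([
    ('features', [[47, 211], [204, 116], [143, 177]]),
    ('targets', [0, 3, 0]),
]))

state_a = toy.open()
state_b = toy.open()
first_a = toy.get_data(state=state_a)
second_a = toy.get_data(state=state_a)
first_b = toy.get_data(state=state_b)  # unaffected by reads through state_a

state_a = toy.reset(state=state_a)
after_reset = toy.get_data(state=state_a)
toy.close(state=state_a)
toy.close(state=state_b)
print(first_a, second_a, first_b, after_reset)
```

Because the dataset object holds no position of its own, two states can walk the same data independently, and resetting one does not disturb the other.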
Wrapping Custom Datasets with Fuel
A repo by GitHub user dribnet illustrates how to wrap a new dataset using Fuel: https://github.com/dribnet/lfw_fuel
Advantages:
- Only takes one command to download the data and import it into Fuel
- Only takes one more command to import the library that wraps the data and load it as training/testing X and Y
Disadvantages:
- One-size-fits-all; importing data using load_data() can take a REALLY long time, and must be repeated every time you run the script (the loaded data does not persist between runs)
- Complicated to extend
- Removes some of the nicer options of Fuel
Here is what the final payoff looks like:
from keras.models import Sequential
from lfw_fuel import lfw

# the data, shuffled and split between train and test sets
(X_train, y_train), (X_test, y_test) = lfw.load_data(format="deepfunneled")

# (build the perfect model here)

# variable names match the tuples unpacked above
model.fit(X_train, y_train, show_accuracy=True, validation_data=(X_test, y_test))
score = model.evaluate(X_test, y_test, show_accuracy=True, verbose=0)
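One possible way to soften the slow load_data() disadvantage listed above is to cache the loaded arrays on disk after the first run. The following is only a sketch under assumptions: it assumes the loader returns numpy arrays in the ((X_train, y_train), (X_test, y_test)) layout shown above, the helper cached_load and the stand-in fake_load_data are hypothetical, and it has not been tried against lfw_fuel itself.

```python
import os
import tempfile

import numpy


def cached_load(loader, cache_path):
    """Call loader() once, then reuse the arrays saved at cache_path.

    Hypothetical helper; cache_path should end in '.npz'.
    """
    if os.path.exists(cache_path):
        with numpy.load(cache_path) as data:
            return ((data['X_train'], data['y_train']),
                    (data['X_test'], data['y_test']))
    (X_train, y_train), (X_test, y_test) = loader()
    numpy.savez(cache_path, X_train=X_train, y_train=y_train,
                X_test=X_test, y_test=y_test)
    return (X_train, y_train), (X_test, y_test)


calls = []


def fake_load_data():
    # Stand-in for a slow loader such as lfw.load_data (assumption).
    calls.append(1)
    rng = numpy.random.RandomState(0)
    X = rng.randint(256, size=(10, 2, 2))
    y = rng.randint(4, size=(10,))
    return (X[:8], y[:8]), (X[8:], y[8:])


cache_path = os.path.join(tempfile.mkdtemp(), 'lfw_cache.npz')
first = cached_load(fake_load_data, cache_path)
second = cached_load(fake_load_data, cache_path)  # served from the cache file
print('loader calls:', len(calls))
```

With a real loader, one would pass something like a lambda wrapping the load_data call in place of fake_load_data; whether the saved arrays are small enough to cache this way depends on the dataset.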