From charlesreid1

Revision as of 23:08, 14 October 2017 by Admin

Basic Usage

Datasets

Datasets are the principal interface to data. Internally, they use a DataStream object to create and request iterators.

IterableDataset Example

Code: https://gist.github.com/charlesreid1/eefc22defc8c6bd07c6bd0ac222c9781

Suppose we create eight 2x2 greyscale images and store them in the variable "features", then assign each image a label drawn from 4 target classes and store those labels in "targets":

In [1]: import numpy

In [2]: seed = 1234

In [3]: rng = numpy.random.RandomState(seed)

In [4]: features = rng.randint(256, size=(8, 2, 2))

In [5]: targets = rng.randint(4, size=(8, 1))
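
As a quick sanity check on the arrays above, the following sketch confirms their shapes and value ranges. It relies on the fact that randint(high, size=...) draws integers from 0 up to, but not including, high. The checks are illustrative and are not part of the original session.

```python
import numpy

seed = 1234
rng = numpy.random.RandomState(seed)

# randint(high, size=...) samples integers in the half-open range [0, high)
features = rng.randint(256, size=(8, 2, 2))   # 8 images, each 2x2 pixels
targets = rng.randint(4, size=(8, 1))         # 8 labels, one per image

# Shapes: the first axis is the example (batch) axis in both arrays
print(features.shape, targets.shape)

# Pixel values stay within 0..255 and labels within 0..3
print(features.min() >= 0 and features.max() <= 255)
print(set(targets.ravel().tolist()) <= {0, 1, 2, 3})
```

Because the upper bound is exclusive, 256 gives valid 8-bit pixel values and 4 gives class labels 0 through 3.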

Now we can create a Dataset to iterate over the data:

In [6]: from collections import OrderedDict

In [7]: from fuel.datasets import IterableDataset

In [8]: dataset = IterableDataset(
   ...: iterables=OrderedDict([('features', features), ('targets', targets)]),
   ...: axis_labels=OrderedDict([('features', ('batch', 'height', 'width')),
   ...: ('targets', ('batch', 'index'))]))

We can then inspect the dataset's attributes:

In [9]: print('Provided sources are {}.'.format(dataset.provides_sources))
Provided sources are ('features', 'targets').

In [10]: print('Sources are {}.'.format(dataset.sources))
Sources are ('features', 'targets').

In [11]: print('Axis labels are {}.'.format(dataset.axis_labels))
Axis labels are OrderedDict([('features', ('batch', 'height', 'width')), ('targets', ('batch', 'index'))]).

In [12]: print('Dataset contains {} examples.'.format(dataset.num_examples))
Dataset contains 8 examples.

In [14]: from pprint import pprint

In [15]: pprint(dir(dataset))
[

...snip...

 'apply_default_transformers',
 'axis_labels',
 'close',
 'default_transformers',
 'example_iteration_scheme',
 'filter_sources',
 'get_data',
 'get_example_stream',
 'iterables',
 'next_epoch',
 'num_examples',
 'open',
 'provides_sources',
 'reset',
 'sources']

Note that the dataset is stateless: open() returns an external state object, and we pass that state to get_data() each time we want the next example:

In [17]: state = dataset.open()

In [18]: while True:
    ...:     try:
    ...:         print(dataset.get_data(state=state))
    ...:     except StopIteration:
    ...:         print('Iterator finished')
    ...:         break
    ...:
(array([[ 47, 211],
       [ 38,  53]]), array([0]))
(array([[204, 116],
       [152, 249]]), array([3]))
(array([[143, 177],
       [ 23, 233]]), array([0]))
(array([[154,  30],
       [171, 158]]), array([1]))
(array([[236, 124],
       [ 26, 118]]), array([2]))
(array([[186, 120],
       [112, 220]]), array([2]))
(array([[ 69,  80],
       [201, 127]]), array([2]))
(array([[246, 254],
       [175,  50]]), array([3]))
Iterator finished

To reset the state, call the dataset's reset() method, which returns a fresh state. When finished, call close().

In [19]: state = dataset.reset(state=state)

In [20]: print(dataset.get_data(state=state))
(array([[ 47, 211],
       [ 38,  53]]), array([0]))

In [21]: dataset.close(state=state)
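
The open / get_data / reset / close lifecycle can be sketched in plain Python. TinyIterableDataset is a hypothetical stand-in written for this illustration, not Fuel's implementation. It assumes the state is simply an iterator over the examples, and that resetting means closing the old state and opening a new one.

```python
class TinyIterableDataset:
    """Hypothetical stand-in for a stateless, iterable-backed dataset."""

    def __init__(self, features, targets):
        self.features = features
        self.targets = targets

    def open(self):
        # The state is a fresh iterator over (feature, target) pairs.
        return zip(self.features, self.targets)

    def get_data(self, state):
        # All progress lives in `state`; the dataset object never changes.
        return next(state)

    def reset(self, state):
        # Discard the old state and start again from the first example.
        self.close(state)
        return self.open()

    def close(self, state):
        # Nothing to release for in-memory data.
        pass


dataset = TinyIterableDataset(features=[10, 20, 30], targets=[0, 1, 0])

# Drain the dataset, as in the session above
state = dataset.open()
seen = []
while True:
    try:
        seen.append(dataset.get_data(state))
    except StopIteration:
        break

# After a reset, the first example comes back again
state = dataset.reset(state)
first_again = dataset.get_data(state)
dataset.close(state)

print(seen)         # [(10, 0), (20, 1), (30, 0)]
print(first_again)  # (10, 0)
```

Keeping the iterator outside the dataset means several independent states can be opened on the same dataset object without interfering with each other.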

IndexableDataset Example

Code: https://gist.github.com/charlesreid1/eefc22defc8c6bd07c6bd0ac222c9781

IndexableDataset objects work differently from IterableDataset objects: there is no persistent state to store, because the data can be accessed at random, in any order.


In [1]: from fuel.datasets import IndexableDataset
   ...: from collections import OrderedDict

In [2]: import numpy
   ...: seed = 1234
   ...: rng = numpy.random.RandomState(seed)

In [3]: features = rng.randint(256, size=(8, 2, 2))
   ...: targets = rng.randint(4, size=(8, 1))

In [4]: dataset = IndexableDataset(
   ...:     indexables=OrderedDict([('features', features), ('targets', targets)]),
   ...:     axis_labels=OrderedDict([('features', ('batch', 'height', 'width')),
   ...:                              ('targets', ('batch', 'index'))]))

In [5]: state = dataset.open()

In [6]: print("State is {}".format(state))
   ...: print("NOTE: None state returned, because there is no state to maintain!")

State is None
NOTE: None state returned, because there is no state to maintain!

In [7]: print(dataset.get_data(state=state, request=[3,1,0]))
(array([[[154,  30],
        [171, 158]],

       [[204, 116],
        [152, 249]],

       [[ 47, 211],
        [ 38,  53]]]), array([[1],
       [3],
       [0]]))

In [8]: print(dataset.get_data(state=state, request=[1,2,4,7]))
(array([[[204, 116],
        [152, 249]],

       [[143, 177],
        [ 23, 233]],

       [[236, 124],
        [ 26, 118]],

       [[246, 254],
        [175,  50]]]), array([[3],
       [0],
       [2],
       [3]]))

In [9]: dataset.close(state=state)

No need to reset any iterator.
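
The request mechanism can be pictured as NumPy integer-array indexing along the first (batch) axis. The sketch below is a minimal illustration under that assumption. TinyIndexableDataset is a hypothetical class, not Fuel's own code, and the data is a simple arange rather than the random images above.

```python
import numpy


class TinyIndexableDataset:
    """Hypothetical stand-in for an indexable dataset."""

    def __init__(self, features, targets):
        self.features = numpy.asarray(features)
        self.targets = numpy.asarray(targets)

    def open(self):
        # There is no iteration state to maintain.
        return None

    def get_data(self, state=None, request=None):
        if request is None:
            raise ValueError("a request (list of indices) is required")
        # Indexing with a list of integers gathers those rows, in that order.
        return self.features[request], self.targets[request]


features = numpy.arange(32).reshape(8, 2, 2)   # 8 "images", each 2x2
targets = numpy.arange(8).reshape(8, 1) % 4    # labels 0..3

dataset = TinyIndexableDataset(features, targets)
state = dataset.open()

batch_features, batch_targets = dataset.get_data(state=state, request=[3, 1, 0])
print(batch_features.shape, batch_targets.shape)  # (3, 2, 2) (3, 1)
print(batch_targets.ravel().tolist())             # [3, 1, 0]
```

The order of the indices in the request determines the order of the rows in the returned batch, which matches the behaviour seen in the session above.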


Note the main difference between the constructor arguments: IndexableDataset takes an indexables dict, while IterableDataset takes an iterables dict:

dataset = IndexableDataset(
     indexables=OrderedDict([('features', features), ('targets', targets)]),
     axis_labels=OrderedDict([('features', ('batch', 'height', 'width')),
                              ('targets', ('batch', 'index'))]))

dataset = IterableDataset(
            iterables=OrderedDict([('features', features), ('targets', targets)]),
            axis_labels=OrderedDict([('features', ('batch', 'height', 'width')),
                                     ('targets', ('batch', 'index'))]))

Iteration Schemes

ShuffledScheme Example

Let's illustrate how to use iteration schemes - but first, how NOT to use iteration schemes.

Incorrect Usage

Suppose we created an IterableDataset, as in the first example, and tried to iterate over it in arbitrary order:

In [8]: dataset = IterableDataset(
   ...: iterables=OrderedDict([('features', features), ('targets', targets)]),
   ...: axis_labels=OrderedDict([('features', ('batch', 'height', 'width')),
   ...: ('targets', ('batch', 'index'))]))

The problem is that get_data() on an IterableDataset does not support a request (the argument must be left as None), so we cannot ask for data outside the standard iteration order. If we try, we get a ValueError:

In [23]: from fuel.schemes import ShuffledScheme

In [24]: state = dataset.open()

In [25]: scheme = ShuffledScheme(examples=dataset.num_examples, batch_size=4)

In [26]: for request in scheme.get_request_iterator():
    ...:     data = dataset.get_data(state=state, request=request)
    ...:     print(data[0].shape, data[1].shape)
    ...:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-27-24827dafdaa8> in <module>()
      1 for request in scheme.get_request_iterator():
----> 2     data = dataset.get_data(state=state, request=request)
      3     print(data[0].shape, data[1].shape)
      4

/usr/local/lib/python3.6/site-packages/fuel-0.2.0-py3.6-macosx-10.12-x86_64.egg/fuel/datasets/base.py in get_data(self, state, request)
    310     def get_data(self, state=None, request=None):
    311         if state is None or request is not None:
--> 312             raise ValueError
    313         return next(state)
    314

ValueError:
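
The guard visible in the traceback can be reproduced in a few lines. The class below is a hypothetical stand-in that copies only the check shown in the traceback (a state is required, a request is refused). It is a sketch of the behaviour, not Fuel's class.

```python
class TinyIterableDataset:
    """Hypothetical stand-in that mirrors the guard in the traceback."""

    def __init__(self, examples):
        self.examples = examples

    def open(self):
        return iter(self.examples)

    def get_data(self, state=None, request=None):
        # Same condition as in the traceback: a state must be given,
        # and any request is rejected.
        if state is None or request is not None:
            raise ValueError
        return next(state)


dataset = TinyIterableDataset(["a", "b", "c"])
state = dataset.open()

# Asking for specific indices fails ...
try:
    dataset.get_data(state=state, request=[2, 0])
    outcome = "returned data"
except ValueError:
    outcome = "ValueError"

# ... while plain in-order access still works, starting from the first example
in_order = dataset.get_data(state=state)

print(outcome)   # ValueError
print(in_order)  # a
```

The failed call does not consume anything from the state, so the next in-order call still returns the first example.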

Correct Usage

Code: https://gist.github.com/charlesreid1/eefc22defc8c6bd07c6bd0ac222c9781

If we instead create the dataset as an IndexableDataset, requests from the scheme are supported and everything goes smoothly:

from fuel.datasets import IndexableDataset
from fuel.schemes import ShuffledScheme
from collections import OrderedDict

import numpy
seed = 1234
rng = numpy.random.RandomState(seed)

# Make some fake data
features = rng.randint(256, size=(8, 2, 2))
targets = rng.randint(4, size=(8, 1))

# Make a Dataset - in particular, an IndexableDataset
dataset = IndexableDataset(
            indexables=OrderedDict([('features', features), ('targets', targets)]),
            axis_labels=OrderedDict([('features', ('batch', 'height', 'width')),
                                     ('targets', ('batch', 'index'))]))

state = dataset.open()
scheme = ShuffledScheme(examples=dataset.num_examples, batch_size=4)

# Use get_request_iterator() to generate requests
# in shuffled order using the ShuffledScheme.

for request in scheme.get_request_iterator():
    print(request)

print("\n")

for request in scheme.get_request_iterator():
    data = dataset.get_data(state=state, request=request)
    print(data[0].shape, data[1].shape)

Here is the corresponding output:

$ py iterator_example.py
[7, 2, 1, 6]
[0, 4, 3, 5]


(4, 2, 2) (4, 1)
(4, 2, 2) (4, 1)

Note that the first two lines of output are the requests produced by get_request_iterator(). The batch axis is the first of the three dimensions of the (8, 2, 2) "fake" data set, and we asked the scheme for batches of 4 examples using batch_size=4:

scheme = ShuffledScheme(examples=dataset.num_examples, batch_size=4)

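This means each request grabs 4 examples, each of shape (2, 2). Sure enough, the last two lines of output show the shapes of the returned data. Let's examine what that data actually contains. If we print data[0] instead of the shapes, we see the actual data from the "fake" greyscale images (the inputs). The rows below are examples 2, 3, 4, 7 and then 1, 6, 0, 5 rather than the requests printed above, which suggests that each call to get_request_iterator() produces a new shuffle: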

[[[143 177]
  [ 23 233]]

 [[154  30]
  [171 158]]

 [[236 124]
  [ 26 118]]

 [[246 254]
  [175  50]]]

--- --- --- --- --- --- ---

[[[204 116]
  [152 249]]

 [[ 69  80]
  [201 127]]

 [[ 47 211]
  [ 38  53]]

 [[186 120]
  [112 220]]]

Now, if we print data[1], we see the target class (0 through 3) of each image (the outputs):

[[0]
 [1]
 [2]
 [3]]

--- --- --- --- --- --- ---

[[3]
 [2]
 [0]
 [2]]
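
The behaviour of a shuffled, batched scheme can be approximated in plain Python: shuffle the example indices, then cut them into chunks of batch_size. The function shuffled_batches and its seed are hypothetical names chosen for this sketch. It illustrates the idea only and makes no claim about how ShuffledScheme is implemented.

```python
import random


def shuffled_batches(num_examples, batch_size, rng):
    """Return a list of index batches covering every example exactly once."""
    indices = list(range(num_examples))
    rng.shuffle(indices)  # new permutation on every call
    return [indices[i:i + batch_size]
            for i in range(0, num_examples, batch_size)]


rng = random.Random(1234)  # arbitrary seed, for reproducibility of the sketch

# Two passes over the same 8 examples, in batches of 4
first_pass = shuffled_batches(8, 4, rng)
second_pass = shuffled_batches(8, 4, rng)

for request in first_pass:
    print(request)
print()
for request in second_pass:
    print(request)
```

Each pass yields two requests of four indices that together cover 0 through 7. Because the generator advances between calls, the two passes will generally differ, so requests should be printed and used within the same loop if the two need to match.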

Flags