From charlesreid1


Latest revision as of 07:17, 16 April 2017

Installing

Installing Pandas can be thorny on a Mac, mainly because a Python that you download and install yourself can conflict with the Python that ships with the Mac. (I recommend leaving the Mac's Python alone.) The Mac's version does NOT have pip. This means that when you use pip to install Pandas, it is installed for one Python only, not for every Python on the machine. If you then run a different Python, Pandas will not be available.

When you install your own version of Python, check that it is the first python on your path by typing:

which -a python

This lists every python on your path in order; the first entry is the one that runs when you type python. The pip on your path should belong to that same installation.
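Another way to confirm which Python you are actually running, and whether it can see Pandas, is to ask the interpreter itself. The following is a minimal sketch using only the standard library; the function name describe_environment is made up for illustration.

```python
import importlib.util
import shutil
import sys


def describe_environment(package="pandas"):
    """Report which interpreter is running and whether it can see a package."""
    # find_spec returns None when the package is not importable
    spec = importlib.util.find_spec(package)
    return {
        "interpreter": sys.executable,
        "first_python_on_path": shutil.which("python"),
        "first_pip_on_path": shutil.which("pip"),
        "package_location": spec.origin if spec is not None else None,
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If package_location is None, the interpreter you are running is not the one Pandas was installed for.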

First, I downloaded and installed easy_install from source.

Then blast your PYTHONPATH (keep things simple):

$ unset PYTHONPATH

Then, I ran the following commands:

$ sudo easy_install pip
$ sudo pip install numpy
$ sudo pip install numexpr
$ sudo pip install cython
$ sudo pip install tables
$ sudo pip install pandas

Or to upgrade:

$ sudo pip install --upgrade pandas

Data

Creating a Table of Arbitrary Data Types

Let's say you're trying to create a data table where you store the result of a simulation. This simulation has a set of inputs and outputs, each with a different data type. For example, the following inputs are scalars:

  • Flowrate_in (float)
  • Temperature_in (float)
  • Pressure_in (float)

But temperature and species profiles are vectors, not scalars:

  • Temperature_profile (numpy array)
  • Oxygen_profile (numpy array)

Two ways of populating a Pandas data object (a DataFrame, in this case) are:

  • Create arbitrary, concrete data with the type you are interested in storing
  • Grab the types of the data you are interested in storing

Initializing with Data

A simple illustration of the first technique:

In[99]: reactors = [ { "flowrate_in" : 0.0, "temperature_in" : 0.0, "pressure_in" : 0.0, "temperature_profile" : numpy.zeros(5), "oxygen_profile" : numpy.zeros(5) } for i in range(10) ]

This creates a list of 10 dicts containing the same initial values, which can then be used to initialize a DataFrame object:

In[100]: pandas.DataFrame(reactors)
Out[100]:

   flowrate_in             oxygen_profile  pressure_in  temperature_in  \
0            0  [0.0, 0.0, 0.0, 0.0, 0.0]            0               0
1            0  [0.0, 0.0, 0.0, 0.0, 0.0]            0               0
2            0  [0.0, 0.0, 0.0, 0.0, 0.0]            0               0
3            0  [0.0, 0.0, 0.0, 0.0, 0.0]            0               0
4            0  [0.0, 0.0, 0.0, 0.0, 0.0]            0               0
5            0  [0.0, 0.0, 0.0, 0.0, 0.0]            0               0
6            0  [0.0, 0.0, 0.0, 0.0, 0.0]            0               0
7            0  [0.0, 0.0, 0.0, 0.0, 0.0]            0               0
8            0  [0.0, 0.0, 0.0, 0.0, 0.0]            0               0
9            0  [0.0, 0.0, 0.0, 0.0, 0.0]            0               0

         temperature_profile
0  [0.0, 0.0, 0.0, 0.0, 0.0]
1  [0.0, 0.0, 0.0, 0.0, 0.0]
2  [0.0, 0.0, 0.0, 0.0, 0.0]
3  [0.0, 0.0, 0.0, 0.0, 0.0]
4  [0.0, 0.0, 0.0, 0.0, 0.0]
5  [0.0, 0.0, 0.0, 0.0, 0.0]
6  [0.0, 0.0, 0.0, 0.0, 0.0]
7  [0.0, 0.0, 0.0, 0.0, 0.0]
8  [0.0, 0.0, 0.0, 0.0, 0.0]
9  [0.0, 0.0, 0.0, 0.0, 0.0]
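The same technique can be written as a standalone script with explicit imports. This is a sketch, not the original session; the names n_reactors and n_points are assumptions added for readability. It also prints the column dtypes, which shows that the scalar columns are stored as floats while the profile columns are stored as generic Python objects.

```python
import numpy
import pandas

n_reactors = 10  # number of rows (assumed, matches the table above)
n_points = 5     # profile length (assumed, matches the 5-element profiles shown)

# One dict per reactor; each dict gets its own freshly allocated arrays
reactors = [
    {
        "flowrate_in": 0.0,
        "temperature_in": 0.0,
        "pressure_in": 0.0,
        "temperature_profile": numpy.zeros(n_points),
        "oxygen_profile": numpy.zeros(n_points),
    }
    for _ in range(n_reactors)
]

df = pandas.DataFrame(reactors)

# Scalar columns are float64; array-valued columns have dtype object
print(df.dtypes)
```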

Initializing with Types

A simple illustration of the second technique:

In[101]: reactors = [ { "flowrate_in" : numpy.float32, "temperature_in" : numpy.float32, "pressure_in" : numpy.float32, "temperature_profile" : numpy.ndarray, "oxygen_profile" : numpy.ndarray } for i in range(10) ]

This creates a list of 10 dicts whose values are type objects rather than data, so every cell of the resulting DataFrame holds a type as a placeholder:

In[102]: df = pandas.DataFrame(reactors)
Out[102]:

              flowrate_in          oxygen_profile             pressure_in  \
0  <type 'numpy.float32'>  <type 'numpy.ndarray'>  <type 'numpy.float32'>
1  <type 'numpy.float32'>  <type 'numpy.ndarray'>  <type 'numpy.float32'>
2  <type 'numpy.float32'>  <type 'numpy.ndarray'>  <type 'numpy.float32'>
3  <type 'numpy.float32'>  <type 'numpy.ndarray'>  <type 'numpy.float32'>
4  <type 'numpy.float32'>  <type 'numpy.ndarray'>  <type 'numpy.float32'>
5  <type 'numpy.float32'>  <type 'numpy.ndarray'>  <type 'numpy.float32'>
6  <type 'numpy.float32'>  <type 'numpy.ndarray'>  <type 'numpy.float32'>
7  <type 'numpy.float32'>  <type 'numpy.ndarray'>  <type 'numpy.float32'>
8  <type 'numpy.float32'>  <type 'numpy.ndarray'>  <type 'numpy.float32'>
9  <type 'numpy.float32'>  <type 'numpy.ndarray'>  <type 'numpy.float32'>

           temperature_in     temperature_profile
0  <type 'numpy.float32'>  <type 'numpy.ndarray'>
1  <type 'numpy.float32'>  <type 'numpy.ndarray'>
2  <type 'numpy.float32'>  <type 'numpy.ndarray'>
3  <type 'numpy.float32'>  <type 'numpy.ndarray'>
4  <type 'numpy.float32'>  <type 'numpy.ndarray'>
5  <type 'numpy.float32'>  <type 'numpy.ndarray'>
6  <type 'numpy.float32'>  <type 'numpy.ndarray'>
7  <type 'numpy.float32'>  <type 'numpy.ndarray'>
8  <type 'numpy.float32'>  <type 'numpy.ndarray'>
9  <type 'numpy.float32'>  <type 'numpy.ndarray'>
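Note that storing type objects as cell values does not give the columns those dtypes; the cells simply hold the type objects. If the goal is to declare column types before any data exists, one alternative is to build an empty DataFrame from empty typed Series. This is a sketch of a different technique, offered on the assumption that typed, empty columns are what is wanted; it is not the method used in the session above.

```python
import numpy
import pandas

# Empty columns with explicit dtypes: float32 for scalars,
# object for columns that will later hold numpy arrays
empty = pandas.DataFrame(
    {
        "flowrate_in": pandas.Series(dtype=numpy.float32),
        "temperature_in": pandas.Series(dtype=numpy.float32),
        "pressure_in": pandas.Series(dtype=numpy.float32),
        "temperature_profile": pandas.Series(dtype=object),
        "oxygen_profile": pandas.Series(dtype=object),
    }
)

print(empty.dtypes)
print(len(empty))
```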

Modifying a Table with Data

Columns that hold arrays have dtype object: each cell stores a reference to an arbitrary Python object, so each numpy.ndarray can be whatever size it wants. Pandas does not check the shape or size of the arrays in a column.

This means that, in practice, you could have temperature or oxygen profiles of entirely different sizes. Start with df holding the DataFrame of zero-filled profiles built in Initializing with Data:

In [117]: df
Out[117]:
   flowrate_in             oxygen_profile  pressure_in  temperature_in  \
0            0  [0.0, 0.0, 0.0, 0.0, 0.0]            0               0
1            0  [0.0, 0.0, 0.0, 0.0, 0.0]            0               0
2            0  [0.0, 0.0, 0.0, 0.0, 0.0]            0               0
3            0  [0.0, 0.0, 0.0, 0.0, 0.0]            0               0
4            0  [0.0, 0.0, 0.0, 0.0, 0.0]            0               0
5            0  [0.0, 0.0, 0.0, 0.0, 0.0]            0               0
6            0  [0.0, 0.0, 0.0, 0.0, 0.0]            0               0
7            0  [0.0, 0.0, 0.0, 0.0, 0.0]            0               0
8            0  [0.0, 0.0, 0.0, 0.0, 0.0]            0               0
9            0  [0.0, 0.0, 0.0, 0.0, 0.0]            0               0

         temperature_profile
0  [0.0, 0.0, 0.0, 0.0, 0.0]
1  [0.0, 0.0, 0.0, 0.0, 0.0]
2  [0.0, 0.0, 0.0, 0.0, 0.0]
3  [0.0, 0.0, 0.0, 0.0, 0.0]
4  [0.0, 0.0, 0.0, 0.0, 0.0]
5  [0.0, 0.0, 0.0, 0.0, 0.0]
6  [0.0, 0.0, 0.0, 0.0, 0.0]
7  [0.0, 0.0, 0.0, 0.0, 0.0]
8  [0.0, 0.0, 0.0, 0.0, 0.0]
9  [0.0, 0.0, 0.0, 0.0, 0.0]

Now set the first three temperature profiles to arrays of different lengths, writing one cell at a time with df.at:

In [122]: df.at[0, 'temperature_profile'] = 25*numpy.ones(3)

In [123]: df.at[1, 'temperature_profile'] = 28*numpy.ones(5)

In [124]: df.at[2, 'temperature_profile'] = 30*numpy.ones(8)

In [125]: df
Out[125]:
   flowrate_in             oxygen_profile  pressure_in  temperature_in  \
0            0  [0.0, 0.0, 0.0, 0.0, 0.0]            0               0
1            0  [0.0, 0.0, 0.0, 0.0, 0.0]            0               0
2            0  [0.0, 0.0, 0.0, 0.0, 0.0]            0               0
3            0  [0.0, 0.0, 0.0, 0.0, 0.0]            0               0
4            0  [0.0, 0.0, 0.0, 0.0, 0.0]            0               0
5            0  [0.0, 0.0, 0.0, 0.0, 0.0]            0               0
6            0  [0.0, 0.0, 0.0, 0.0, 0.0]            0               0
7            0  [0.0, 0.0, 0.0, 0.0, 0.0]            0               0
8            0  [0.0, 0.0, 0.0, 0.0, 0.0]            0               0
9            0  [0.0, 0.0, 0.0, 0.0, 0.0]            0               0

                                temperature_profile
0                                [25.0, 25.0, 25.0]
1                    [28.0, 28.0, 28.0, 28.0, 28.0]
2  [30.0, 30.0, 30.0, 30.0, 30.0, 30.0, 30.0, 30.0]
3                         [0.0, 0.0, 0.0, 0.0, 0.0]
4                         [0.0, 0.0, 0.0, 0.0, 0.0]
5                         [0.0, 0.0, 0.0, 0.0, 0.0]
6                         [0.0, 0.0, 0.0, 0.0, 0.0]
7                         [0.0, 0.0, 0.0, 0.0, 0.0]
8                         [0.0, 0.0, 0.0, 0.0, 0.0]
9                         [0.0, 0.0, 0.0, 0.0, 0.0]
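Because each cell is just a Python object, per-row quantities such as the profile length have to be computed element by element. The sketch below builds a small three-row table directly from arrays of different lengths (rather than modifying an existing one) and maps a function over the column; the variable names are illustrative.

```python
import numpy
import pandas

# Three profiles of different lengths, one per row
profiles = [25 * numpy.ones(3), 28 * numpy.ones(5), 30 * numpy.ones(8)]
df = pandas.DataFrame(
    [{"temperature_in": 0.0, "temperature_profile": p} for p in profiles]
)

# Series.map applies the function to each cell (each array) in turn
lengths = df["temperature_profile"].map(len)
means = df["temperature_profile"].map(lambda a: float(a.mean()))

print(lengths.tolist())
print(means.tolist())
```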

Saving a Table with Data

H5

To save a DataFrame using HDF5:

df.to_hdf('dummy.h5', key='name_of_array', append=False)
df_h5 = pandas.read_hdf('dummy.h5', key='name_of_array')

CSV

df.to_csv('dummy.csv')
df_csv = pandas.read_csv('dummy.csv')
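One caveat with CSV: a cell holding a numpy array is written as the array's string form, so it comes back from read_csv as a string, not an array. The sketch below shows this with an in-memory buffer in place of a file, and one possible way to parse a short profile back. The parsing step is an assumption that only suits short 1D arrays, since numpy abbreviates the printed form of large arrays with "...".

```python
import io

import numpy
import pandas

df = pandas.DataFrame(
    [{"temperature_in": 0.0, "temperature_profile": 25 * numpy.ones(3)}]
)

# Round trip through CSV text (a StringIO buffer stands in for dummy.csv)
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
df_csv = pandas.read_csv(buffer)

# The profile is now a string such as "[25. 25. 25.]"
raw = df_csv["temperature_profile"].iloc[0]
print(type(raw))

# Strip the brackets and convert the whitespace-separated numbers
restored = numpy.array(raw.strip("[]").split(), dtype=float)
print(restored)
```

For long profiles, the HDF5 route above, or storing each profile in its own file, avoids this loss.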