Saturday, October 18, 2014

Efficient conversion of Python readable binary file format to Avro (Pig, Avro, Python)

This entry covers converting a binary format to the Avro file format for follow-on transformation and analysis in Pig. As mentioned in prior entries on using a local database (H2, SQL transformations), this can be efficient because a subsample is taken and easily transformed using Pig's data flow language for local analysis. Pig's -x local mode is used.

Here is an example where we make up a binary format from within Python, write it out to Avro and do extract, transform, load (ETL) on it from Pig. The result can be written out as a set of files in a target binary format using a Python UDF (e.g., write a set of Matlab arrays). Without further ado,

pip install avro-python3

Then create test_avro_write.py

#!/usr/bin/env python
# Adapted from http://www.harshj.com/2010/04/25/writing-and-reading-avro-data-files-using-python/
from random import randint
from avro import schema, datafile, io

OUTFILE_NAME = 'mydata.avro'

# written to disk
SCHEMA_STR = """{
    "type": "record",
    "name": "data",
    "namespace": "AVRO",
    "fields": [
        {   "name": "name"   , "type": "string"   },
        {   "name": "age"    , "type": "int"      },
        {   "name": "address", "type": "string"   },
        {   "name": "value"  , "type": "long"     }
    ]
            }"""

SCHEMA = schema.Parse(SCHEMA_STR)

def write_avro_file(OUTFILE_NAME):
    # Let's generate our data
    data = {}
    data['name']    = ''
    data['age']     = 0
    data['address'] = '10, Bar Eggs Spam'
    data['value']   = 0

    # Create a 'record' (datum) writer
    rec_writer = io.DatumWriter(SCHEMA)

    # Create a 'data file' (avro file) writer
    df_writer = datafile.DataFileWriter(
                    open(OUTFILE_NAME, 'wb'),
                    rec_writer,
                    writer_schema = SCHEMA,
                    codec = 'deflate'
                )

    # Write our data, made up binary format
    for char in range(45):
        data['name'] = chr(char) + '_foo'
        data['age'] = randint(13,99)
        data['value'] = randint(47,800)
        df_writer.append(data)

    # Close to ensure writing is complete
    df_writer.close()

if __name__ == '__main__':
    # Write an AVRO file first
    write_avro_file(OUTFILE_NAME)
Once test_avro_write.py is run, you'll have an Avro file with randomized data. Use Pig 0.12+ to do some basic ETL on the data.

data = LOAD 'mydata.avro' USING AvroStorage();

age_filter = FILTER data BY ((age < 50) and (value > 300));
age_filter = FOREACH age_filter GENERATE name, address;

age_group = GROUP age_filter ALL;
age_count = FOREACH age_group GENERATE COUNT(age_filter);
DUMP age_count; -- Replace DUMP with Python UDF to target binary format

While this example is quite trivial, it shows how a Python library can be used to import (or generate) binary data, how that data can be efficiently written to disk (as opposed to .csv), and how it can be transformed using Pig. Using a Pig Python UDF to write out to a target binary format then allows the transformed data to be analyzed in any package of choice.
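
To make that last step concrete, here is a rough sketch of what such a UDF could look like. It assumes Pig 0.12+'s CPython (streaming_python) UDF support so that scipy is importable from the UDF; the file name, function name and schema are illustrative, not a tested implementation.

# matlab_udf.py -- hypothetical CPython UDF for Pig's streaming_python support.
# Registered and called from the Pig script with something like:
#   REGISTER 'matlab_udf.py' USING streaming_python AS matlab_udf;
#   mat_out = FOREACH age_group GENERATE matlab_udf.to_mat(age_filter);
from pig_util import outputSchema
import numpy as np
import scipy.io

@outputSchema('written:int')
def to_mat(bag):
    # The bag arrives as a list of (name, address) tuples
    names = np.array([t[0] for t in bag])
    addresses = np.array([t[1] for t in bag])
    scipy.io.savemat('age_filter.mat', {'name': names, 'address': addresses})
    return len(bag)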

Note: To dump an array to matlab format, see scipy.io.savemat()

Wednesday, October 8, 2014

Setting up H2 (MySQL alternative) database for Ubuntu 14, Part 4

EDIT - Note there is now a Python 3 version of JayDeBeApi at https://pypi.python.org/pypi/JayDeBeApi3, so the below is no longer an issue.

I was able to get Python to connect to the H2 database and insert a couple of rows. There is more troubleshooting to do on why curs.fetchall() doesn't return results. It could be that the insert statements weren't committed.

To get to where I got, do the following. Note I will fork a Python 3 branch of JayDeBeApi so the modifications in this post will be irrelevant in the future.

# start up your Python 3 environment
echo "JayDeBeApi" > requirements.txt
pip install -d . -r requirements.txt
tar -zxvf JayDeBeApi-0.1.4.tar.gz
2to3 -f all -w JayDeBeApi-0.1.4

Make the following source changes:
# In setup.py:33, change "file" to "open"
# In dbapi2.py: 21, comment out "exceptions" # not needed
# In dbapi2.py:185, 188, remove "exceptions."
# In dbapi2.py: 382, remove "next", should just read "if not self._rs"

Then,
pip install -e JayDeBeApi-0.1.4
python
import jaydebeapi
conn = jaydebeapi.connect('org.h2.Driver', ['jdbc:h2://home/your/path/var/h2demodb', 'user', 'pw'], '/home/kwame/H2/h2-2014-08-06.jar')
curs = conn.cursor()

curs.execute('CREATE TABLE MYTEST(ID INT PRIMARY KEY, NAME VARCHAR(255));')
curs.execute('INSERT INTO MYTEST VALUES(1, \'Hello World\');')
curs.execute('SELECT * FROM MYTEST ORDER BY ID;')
curs.fetchall() # SQLExceptionPyRaisable: org.h2.jdbc.JdbcSQLException: No data is available [2000-181]

I will follow up when I figure out why there is no data in MYTEST. The table itself persists across close(). EDIT - PyODBC (and the DB-API generally) turns auto-commit off by default, so the inserts need an explicit commit.
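
Assuming that is indeed the cause, a minimal sketch of the fix (continuing the session above) is to commit before re-running the query:

# Commit the pending inserts so they become visible, then re-run the SELECT
conn.commit()

curs.execute('SELECT * FROM MYTEST ORDER BY ID;')
print(curs.fetchall())

curs.close()
conn.close()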

Once I have that working I'll have the ability to take data from a variety of scientific packages and formats (Matlab, pcap sessions, etc.) and dump them into an H2 database for follow-on munging/wrangling.

Tuesday, October 7, 2014

Setting up H2 (MySQL alternative) database for Ubuntu 14, Part 3

So I've done some investigating and it occurred to me that the simplest solution would be to call into the database h2.jar file from Python itself. This avoids installing other outside software/database drivers and provides a verifiable setup, since the database is already accessible from Squirrel-SQL (great for troubleshooting).

It turns out that this is possible with the JPype and JayDeBeApi modules; however, JayDeBeApi only supports Python 2. I'm looking into the best way to port it to Python 3 (it may be as simple as using distribute to automatically convert it).

Anyway, so for today:

py34 # start python 3 virtual environment
sudo apt-get install g++ python3-dev # required for Jpype installation
# make sure the jdk headers are available from your $JAVA_HOME
pip install Jpype1-py3

To test Jpype1-py3,
import jpype
jpype.startJVM(jpype.getDefaultJVMPath())

# We can now access basic JVM functions. This took like 3 minutes to load on my system
jpype.java.lang.System.out.println("Hello from JPype!")

# since we're done we shut down the JVM
jpype.shutdownJVM()
JPype uses JNI to communicate with the JVM. See [1] for more info.
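
Since the end goal is to reach the H2 jar itself, the next step should look roughly like the following once the jar is on the JVM classpath. This is only a sketch under that assumption; the paths and table are illustrative and I haven't run it yet.

# Hypothetical sketch: put the H2 jar on the classpath and talk JDBC through JPype
import jpype

H2_JAR = '/home/you/H2/h2-1.4.181.jar'
jpype.startJVM(jpype.getDefaultJVMPath(), '-Djava.class.path=%s' % H2_JAR)

DriverManager = jpype.java.sql.DriverManager
conn = DriverManager.getConnection('jdbc:h2:/home/you/var/h2demodb', 'sa', '')
stmt = conn.createStatement()
stmt.execute('CREATE TABLE IF NOT EXISTS DEMO(ID INT PRIMARY KEY)')
stmt.close()
conn.close()

jpype.shutdownJVM()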

To install JayDeBeApi I will investigate using,

from __future__ import (absolute_import, division,
                        print_function, unicode_literals)

or putting use_2to3=True in the setup/distribute config.
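
For reference, the second option would amount to a setup.py along these lines. This is a hypothetical fragment showing the setuptools/distribute use_2to3 flag, not JayDeBeApi's actual packaging.

# Hypothetical setup.py fragment: let setuptools/distribute run 2to3 over
# the Python 2 sources at build time.
from setuptools import setup

setup(
    name='JayDeBeApi3',      # illustrative name for a Python 3 fork
    version='0.1.4',
    packages=['jaydebeapi'],
    use_2to3=True,           # convert sources with 2to3 during build
)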

[1] http://jpype.sourceforge.net/doc/user-guide/userguide.html

Monday, October 6, 2014

Setting up H2 (MySQL alternative) database for Ubuntu 14, Part 2

With squirrel-sql started as given in the last post, we will create an H2 database via the GUI as follows:

1. Click on the 'Drivers' side tab
2. Select H2 Embedded
3. Click on the Extra Class Path tab
4. Click on Add
5. Select the H2 driver jar that you extracted into ~/H2.
6. Click OK

To make a new Alias to the database, do the following

1. In the Name field enter 'testdb'
2. In the Driver drop down select H2 Embedded
3. For the User and Password fields enter 'sa' and leave the password blank
4. *For the URL field replace <db-name> with /home/<your home dir>/var

* Before doing this, go to the terminal and type

mkdir -p ~/var/


Now, here's the kicker. There are easy ways to load text-formatted data into the H2 database, but it gets a little more difficult if you're working with binary files. You could write a converter in Java, but the advantage of using Python is that you can use a variety of Python libraries to interface with scientific formats and dump the data into the database. Our goal is to efficiently create a table and dump data into it, to the degree possible (I'm just one guy, not a professional sys admin).

There are several options for doing this: A) install a PostgreSQL driver along with a Python PostgreSQL module [1], B) use the H2 TCP server to dump data over a socket, and C) attempt to interact with the H2 Shell via subprocess.

At first glance, option A provides compiled library access to the H2 database, which should be the fastest option. Option B requires extra sockets and overhead from socket setup/teardown; it is probably the slowest option. Option C is a hack, but it avoids the socket overhead and might work with H2 Embedded directly (H2 Embedded is really fast), I'm not sure. Reasons for using H2 include an unbounded number of rows/columns, parallel access, and its being a lightweight database. Because H2 is written in Java I can't reach it through an ODBC driver (as far as I know).
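
To illustrate what I mean by option C, a minimal sketch would be something like the following. It assumes the H2 Shell tool accepts SQL via its -sql option; the jar path, database URL and table are illustrative.

# Hedged sketch of option C: push SQL at the H2 Shell tool via subprocess
import subprocess

H2_JAR = '/home/you/H2/h2-1.4.181.jar'   # illustrative path to the extracted jar
DB_URL = 'jdbc:h2:/home/you/var/testdb'  # the embedded database from the alias above

sql = "CREATE TABLE IF NOT EXISTS T(ID INT, V DOUBLE); INSERT INTO T VALUES(1, 3.14);"
out = subprocess.check_output(
    ['java', '-cp', H2_JAR, 'org.h2.tools.Shell',
     '-url', DB_URL, '-user', 'sa', '-password', '', '-sql', sql])
print(out.decode())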

So, my next entry will cover what I come up with for accessing the H2 database. If options A, B and C are too complex I will investigate alternatives.

[1] https://wiki.postgresql.org/wiki/Python




Sunday, October 5, 2014

Setting up H2 (MySQL alternative) database for Ubuntu 14, Part I

Oftentimes in exploratory data analysis, and when developing estimation models, it makes sense to work with sampled subsets. Assuming the sample is representative of the population, this allows a processing speedup relative to your local processing power (running an ML algorithm on 20 MB of data is a lot faster than on 20 GB). The sampling insights may even inform how your estimator samples at scale and/or in real time.

Data wrangling can be tedious even in R or Pandas, and aggregation can be prone to errors. For this brief two-part (or maybe four-part, if I take a large value of two) series I will take after Win-Vector's medium-scale data technique to demonstrate how to set up an embedded H2 database that is manipulated via a simple Python ODBC library (instead of SQL Screwdriver). This is done to script the creation of a fast, local and efficient database containing a representative sample of a much larger data set.

I'm planning on using this technique against a medium-scale data set of human brain MEG scans (see Kaggle's Decoding the Human Brain) and verifying the results in the current literature (namely that our responses to human faces are detectable in a few cortical regions at 100 ms and 170 ms after initially seeing a face). If I can reproduce the results, and if I have time, I will attempt to construct a simple estimator on top of them.

Okay. So,

1. We start by installing squirrel-sql; instructions are provided at the link. In Ubuntu, make sure you have a version of Java installed that has GUI libraries (it appears to install the headless version by default). You can do,
# Ubuntu jdk/jre (make sure not headless)
sudo apt-get install openjdk-7-jdk
java -jar squirrel-sql-<version>-install.jar
# install to ~/squirrel-sql
cd ~/squirrel-sql; ./squirrel-sql

2. Get the H2 driver. It supports parallel access and an unbounded number of columns and rows. Download the latest,

wget -P /tmp/H2 http://www.h2database.com/h2-2014-08-06.zip
mkdir ~/H2
cd ~/H2
unzip -p /tmp/H2/h2-2014-08-06.zip h2/bin/h2-1.4.181.jar > h2-1.4.181.jar

In the next post I'll demonstrate how to install ODBC drivers, how to create a new H2 database using Squirrel-SQL, and how to use Python to populate the database (table loading). I use Python over other alternatives because it allows us to easily import a wide variety of binary scientific file formats and dump them into an H2 table.

Saturday, October 4, 2014

mortar local:configure on stock Ubuntu Trusty

I'm using Ubuntu 14 at home, and while going through the process of setting up a MortarData local install I found the following helpful (enter at the shell):

# python3 is installed by default and includes a virtualenv implementation, pyvenv
# Ubuntu does not install javac by default
sudo apt-get install openjdk-7-jdk
sudo apt-get update
sudo apt-get install openjdk-7-jdk # second run installs javac

export JAVA_HOME=/bin # as on my system; more typically JAVA_HOME points at the JDK install, e.g. /usr/lib/jvm/java-7-openjdk-amd64

cd ~
pyvenv-3.4 py34
source py34/bin/activate
pip install virtualenv

mortar local:configure # everything should proceed without error