The Probabilistic Traveler

Tuesday, October 7, 2014

Setting up H2 (MySQL alternative) database for Ubuntu 14, Part 3

So I've done some investigating and it occurred to me that the simplest solution would be to call into the data base h2.jar file from Python itself. This avoids installing other outside software/database drivers and provides a verifiable situation since the database is already accessible from SqlSquirrel (great for troubleshooting).

It turns out that this is possible with the Jpype and JayDeBeApi modules, however JayDeBeApi only supports Python 2. I'm looking into the best way to port it to Python 3 (it may be as simple as using distribute to automatically convert it).

Anyway, so for today:

py34 # start python 3 virtual environment
sudo apt-get install g++ python3-dev # required for Jpype installation
# make sure the jdk headers are available from your $JAVA_HOME
pip install Jpype1-py3

To test Jpype1-py3,

import jpype
jpype.startJVM(jpype.getDefaultJVMPath())

# We can now access basic JVM funcations. This took like 3 minutes to load on my system
jpype.java.lang.System.out.println("Hello from JPype!")

# since we're done we shut down the JVM
jpype.shutdownJVM()

JPype uses Java JNI to communicate to the JVM. See [1] for more info.

To install JayDeBeApi I will investigate using,

from __future__ import (absolute_import, division,
                        print_function, unicode_literals)

or putting 2to3=true in the setup/distribute config.

[1] http://jpype.sourceforge.net/doc/user-guide/userguide.html

Monday, October 6, 2014

Setting up H2 (MySQL alternative) database for Ubuntu 14, Part 2

With squirrel-sql started as given in the last post we will create a H2 Database with the following via the GUI:

1. Click on the 'Drivers' side tab
2. Select H2 Embedded
3. Click on the Extra Class Path tab
4. Click on Add
5. Select the H2 driver jar that you extracted from ~/H2.
6. Click OK

To make a new Alias to the database, do the following

1. In the Name field enter 'testdb'
2. In the Driver drop down select H2 Embedded
3. For the User and Password fields enter 'sa' and leave the password blank
4. *For the URL field replace <db-name> with /home/<your home dir>/var

* before doing this go to the terminal and type

mkdir -p ~/var/

Now, here's the kicker. There are easy ways to load text formatted data in the H2 database it gets a little more difficult if you're working with binary files. You could write a converter in Java but the advantage of using Python is that you can use a variety of Python libraries to interface with scientific formats and dump them into the database. Our goal is efficiently create a table and dump data into it to the degree possible (I'm just one guy, not a professional sys admin).

There are several options for doing this: A) Install a PostGreSQL driver along with a PostGre Python module [1], B) Use the H2 TCP server to dump data over a socket and C) Attempt to interact with the H2 Shell via subprocess.

At first glance, option A provides compiled library access to the H2 database which should be the fastest option. Option B requires extra sockets and overhead from socket teardown/set up. It is probably the slowest option. Option C is a hack but avoids the socket overhead and might work with H2 Embedded directly (H2 Embedded is really fast), I'm not sure. Reasons for using H2 include unbounded number of rows/columns, parallel access and being a lightweight database. Because H2 is written in Java I can't pass it as an ODBC driver (as far as I know).

So, my next entry will cover what I come up with for accessing the H2 database. If options A, B and C are too complex I will investigate alternatives.

[1] https://wiki.postgresql.org/wiki/Python

Sunday, October 5, 2014

Setting up H2 (MySQL alternative) database for Ubuntu 14, Part I

Often times in exploratory data analysis and when developing estimation models it makes sense to work with sampled subsets. Assuming the sample is a good representative of the population it can allow for a processing speedup relative to your local processing power (running a ML algorithm on 20 Mb of data is a lot faster than 20 Gb). The sampling insights may even inform how your estimator samples at scale and/or in real time.

Data wrangling can be tedious even in R or Pandas and aggregation can be prone to errors. For this brief two part (or maybe 4 parts, if I take a large value of two) series I will take after Win-Vector's medium scale data technique to demonstrate how to set up an embedded H2 database that is manipulated via the simple Python ODBC library (instead of SQL Screwdriver). This is done to script the creation of a fast, local and efficient database containing a representative sample of a much larger data set.

I'm planning on using this technique against a medium scale data set of human brain MEG scans (see Kaggle's Decoding the Human Brain) and verify results in the current literature (namely that our responses to human faces are detectable in a few cortical regions at 100 ms and 170 ms of time after initially seeing a face). If I can reproduce the results and if I have time I will attempt to construct a simple estimator upon the results.

Okay. So,

1. We start by installing squirrel-sql; instructions are provided at the link. In Ubuntu, make sure you have a version of java installed that has gui libraries (it appears to install the headless version by default.) You can do,

# Ubuntu jdk/jre (make sure not headless)
sudo apt-get install openjdk-7-jdk
java -jar squirrel-sql-<version>-install.jar
# install to ~/squirrel-sql
cd ~/squirrel-sql; ./squirrel-sql

2. Get the H2 driver. It supports parallel access and an unbounded number of columns and rows. Download the latest,

wget -P /tmp/H2 http://www.h2database.com/h2-2014-08-06.zip
mkdir ~/H2
cd ~/H2
unzip -p /tmp/H2/h2-2014-08-06.zip h2/bin/h2-1.4.181.jar > h2-1.4.181.jar

In the next post I'll demonstrate how to install ODBC drivers, create a new H2 database using Squirrel-SQL and how to use Python to populate the database (table loading). I use Python over other alternatives because it allows us to easily import a wide variety of binary scientific file formats and dump them into an H2 table.

Saturday, October 4, 2014

mortar local:configure on stock Ubuntu Trusty

I'm using Ubuntu 14 at home and while going through the process of setting up a MortarData local install I found this to be helpful (enter at the shell):

# python3 is installed by default and includes a virtualenv implementation, pyvenv
# Ubuntu does not install javac by default
sudo apt-get install openjdk-7-jdk
sudo apt-get update
sudo apt-get install openjdk-7-jdk # second run installs javac

export JAVA_HOME=/bin

cd ~
pyvenv-3.4 py34
source py34/bin/activate
pip install virtualenv

mortar local:configure # everything should proceed without error