Sunday, October 5, 2014

Setting up H2 (MySQL alternative) database for Ubuntu 14, Part I

In exploratory data analysis and when developing estimation models, it often makes sense to work with sampled subsets. Assuming the sample is representative of the population, it lets you work much faster on local hardware (running an ML algorithm on 20 MB of data is a lot faster than on 20 GB). The insights gained from sampling may even inform how your estimator samples at scale and/or in real time.
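As a rough illustration of drawing such a subset, here is a minimal Python sketch of reservoir sampling (Algorithm R), which takes a uniform random sample of k rows in a single pass without loading the full data set into memory. This is my assumption about one reasonable way to sample, not necessarily the method used later in this series; the function name `reservoir_sample` and its parameters are made up for the example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k items drawn uniformly at random from rows, in one pass.

    Hypothetical helper for illustration only (classic Algorithm R).
    """
    rng = random.Random(seed)  # a private RNG so a seed gives repeatable samples
    sample: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1) by picking a
            # slot in [0, i] and replacing only if it lands in the reservoir.
            j = rng.randint(0, i)
            if j < k:
                sample[j] = row
    return sample


# Example: sample 5 values from a stream of one million integers.
subset = reservoir_sample(range(1_000_000), 5, seed=0)
```

Because the function accepts any iterable, an open file handle can be passed in place of `range(...)` to sample lines from a file that is too large to read at once.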

Data wrangling can be tedious even in R or Pandas, and aggregation can be prone to errors. For this brief two-part series (or maybe four parts, if I take a large value of two), I will follow Win-Vector's medium scale data technique to demonstrate how to set up an embedded H2 database that is manipulated via the simple Python ODBC library (instead of SQL Screwdriver). The goal is to script the creation of a fast, local, and efficient database containing a representative sample of a much larger data set.

I'm planning to use this technique against a medium scale data set of human brain MEG scans (see Kaggle's Decoding the Human Brain) and to verify results from the current literature (namely, that our responses to human faces are detectable in a few cortical regions at 100 ms and 170 ms after initially seeing a face). If I can reproduce the results, and if I have time, I will attempt to construct a simple estimator based on them.

Okay. So,

1. We start by installing squirrel-sql; instructions are provided at the link. On Ubuntu, make sure you have a version of Java installed that includes the GUI libraries (the headless version appears to be installed by default). You can do:
# Ubuntu jdk/jre (make sure not headless)
sudo apt-get install openjdk-7-jdk
java -jar squirrel-sql-<version>-install.jar
# install to ~/squirrel-sql
cd ~/squirrel-sql; ./squirrel-sql

2. Get the H2 jar, which bundles the database engine together with its JDBC driver. H2 supports parallel access and an effectively unbounded number of columns and rows. Download the release used here (1.4.181):

wget -P /tmp/H2 http://www.h2database.com/h2-2014-08-06.zip
mkdir -p ~/H2
cd ~/H2
unzip -p /tmp/H2/h2-2014-08-06.zip h2/bin/h2-1.4.181.jar > h2-1.4.181.jar
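Since the rest of this series scripts things in Python, the extraction step above could also be done with the standard-library `zipfile` module. This is only a sketch under that assumption; `extract_member` is a hypothetical helper name, and the paths in the comment at the bottom are the ones from the shell commands above.

```python
import shutil
import zipfile
from pathlib import Path


def extract_member(archive, member, dest_dir):
    """Copy a single file out of a zip archive into dest_dir.

    Hypothetical helper mirroring `unzip -p archive member > file`.
    Returns the path of the extracted file.
    """
    archive = Path(archive).expanduser()
    dest_dir = Path(dest_dir).expanduser()
    dest_dir.mkdir(parents=True, exist_ok=True)  # like `mkdir -p`

    # Drop the directory part of the member so the jar lands directly in dest_dir.
    target = dest_dir / Path(member).name
    with zipfile.ZipFile(archive) as zf, zf.open(member) as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target


# Usage with the paths from the commands above (run after the wget step):
# extract_member("/tmp/H2/h2-2014-08-06.zip", "h2/bin/h2-1.4.181.jar", "~/H2")
```

The helper streams the member to disk rather than reading it into memory, and it raises `KeyError` if the named member is not in the archive, which is a quick way to catch a mismatched version number.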

In the next post I'll demonstrate how to install the ODBC drivers, how to create a new H2 database using Squirrel-SQL, and how to use Python to populate the database (table loading). I use Python over the alternatives because it makes it easy to import a wide variety of binary scientific file formats and dump them into an H2 table.
