Friday, July 10, 2015

Some socket/application debugging tips (Linux, Python)

This is more of a note than a blog post. Recently, I was engaged in debugging a web application that used Flask (a Python web framework) and Moses, a statistical machine translation program from academia. At one point the front end, a GUI built on Flask, stopped responding, and I didn't want to stop the entire process.

To see what sockets were active,
sudo netstat -ltnp
From there, with the pid in hand,
sudo strace -s 3000 -f -p <PID>
Searching for socket-related system calls, I was able to get a sense of what was going on. If I had just wanted to look at network traffic, then -e trace=network would have been appropriate.

In this case my application was stuck on a recv(6, ...) call, where a socket was established but not receiving any data. Since I was using the basic Flask development server, everything was single threaded and the app could not return to serving static pages. The solution here is to shut down the hanging socket and let the app serve pages as normal...

The socket was shut down with gdb:
#sudo apt-get install python3-dbg gdb
sudo gdb python3 <PID>
With the symbol table loaded we can use the shutdown command with,
call shutdown(6, 2) # safely shuts down the socket, even for multithreaded applications
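
If you control the client code, a defensive option is to put a timeout on the socket so a stalled recv() raises instead of blocking forever. A minimal sketch (the host and port are placeholders, not part of the original application):

import socket

# Hypothetical connection; any blocking socket behaves the same way
s = socket.create_connection(('example.com', 80), timeout=10)
s.settimeout(10)                     # recv() now raises socket.timeout after 10s
try:
    data = s.recv(4096)
except socket.timeout:
    # Same effect as the gdb call above: 2 == socket.SHUT_RDWR
    s.shutdown(socket.SHUT_RDWR)
    s.close()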

Friday, May 22, 2015

Connecting to your google compute instance (Google Compute, Google Cloud, SSH)

I signed up for a Google Cloud account today. I have a $300 credit or 60 days, whichever is exhausted first.

I know Amazon AWS can use Docker, but I decided to go with Google Cloud because their management options for VMs seemed more complete and less shoehorned into their current offerings. In the end, I was looking for an easily manageable platform for hosting applied research backends and frontends, and Google Compute looks like a great start.

So, in the process of hosting a novel Bitcoin-related translation system, I realized I first needed to push a custom binary to my instance. First we'll make sure SSH is set up, and then we will use copy-files to push data to the instance. After some troubleshooting, these are the steps I came up with:

  • 0) Go to the Google Compute console. Create a project and an instance. Take note of the automatically generated project-ID (this is different from the project name and is found under the Overview link in the Developers Console).
  • 1) Click on API -> Enabled APIs (in the right hand frame), make sure Google Compute is enabled
     
  • 2) Click on Google Compute -> Metadata, then (in the right frame) SSH keys. Remove any SSH keys that are there (unless you can already ssh in and don't need to read this entry): click Edit, then the X next to each key. This forces everything to start fresh.
  • 3) Open up your favorite terminal,
  • 3a) Install the gcloud SDK here
  • 3b) gcloud auth login # to authenticate your host
  • 3c) rm ~/.ssh/google-compute*; gcloud compute instances list  # identify the instance name you want to push to
  • 3d) gcloud config set project <project-ID>
  • 3e) gcloud compute ssh <instance-name> # this will create a new ssh key pair (then exit the ssh session)
  • 3f) gcloud compute copy-files <your file path> <instance-name>:<remote path> --zone <your zone, refer to instance list above>
Your upload should complete and you can ssh back in to verify that the data exists.
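
If you end up pushing files repeatedly, the copy step can be scripted. A minimal sketch that shells out to the same gcloud command (the file path, instance name, and zone below are placeholders):

import subprocess

def push_file(local_path, instance, remote_path, zone):
    """Wrap 'gcloud compute copy-files' so uploads can be repeated easily."""
    cmd = ['gcloud', 'compute', 'copy-files',
           local_path, '{0}:{1}'.format(instance, remote_path),
           '--zone', zone]
    subprocess.check_call(cmd)

push_file('./my-binary', 'my-instance', '/home/me/', 'us-central1-a')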


Monday, February 9, 2015

Python (virtualenv), installing OpenCV, Ubuntu 14.04

I recently wanted to install OpenCV with Python 3.4 bindings on my Ubuntu 14.04 system.

I ran into several errors: cmake could not find Python.h even though python3-dev was installed, and, prior to that, cmake was not picking up the Python libraries path.

Through some extensive googling I discovered that cmake 2.8.10 has a known bug with correctly finding the python includes directory. Refer to this bug report.

This issue was resolved by updating Ubuntu's version of cmake (which was ancient) and judicious use of opencv compilation flags. Specifically, I installed a newer cmake with:

wget http://www.cmake.org/files/v3.1/cmake-3.1.2.tar.gz 
tar -zxvf cmake-3.1.2.tar.gz 
cd cmake-3.1.2
mkdir _build
cd _build 
cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr
make 
sudo make install 
sudo ldconfig 

and used the following opencv flags to build opencv for my python3 virtual env:

cmake -DBUILD_TESTS=OFF -DBUILD_PERF_TESTS=OFF -DBUILD_opencv_python2=OFF -DBUILD_opencv_python3=ON -DPYTHON3_EXECUTABLE=/home/kwame/py34/bin/python -DPYTHON_INCLUDE_DIRS=/usr/include/python3.4m -DPYTHON3_LIBRARY=/home/kwame/py34/lib/python3.4/config-3.4m-x86_64-linux-gnu/libpython3.4m.so -DPYTHON3_NUMPY_INCLUDE_DIRS=build/numpy-1.9.1/numpy/core/include/ -DPYTHON3_PACKAGES_PATH=/home/kwame/py34/lib/python3.4/site-packages -DWITH_TBB=ON .. 
make -j8 
sudo make install 
sudo ldconfig

I leave off the tests because they are statically linked and bloat the library sizes. I only build opencv for python 3.

The opencv installation was verified as working within Python3 by the following:

ipython
import cv2
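
A slightly more thorough sanity check (a quick sketch; the array below is arbitrary) confirms that both the bindings and the NumPy interop work:

import cv2
import numpy as np

print(cv2.__version__)                        # should report the version just built
img = np.zeros((100, 100, 3), dtype=np.uint8) # a blank BGR image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
print(gray.shape)                             # (100, 100)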

Thursday, December 11, 2014

Brief Python Nose/Mock Snippet

I've been using Nose and Mock recently for unit and integration testing. It's been quite fun.

Remember that @patch replaces the name in the namespace of the module you specify, so that the code under test finds the patched object first. This allows you to test with Mock objects and is called monkey patching.

Here's a quick example,
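The original snippet was embedded from elsewhere; here is a minimal sketch of the same idea, assuming a hypothetical worker.py module whose fetch_status() function we want to stub out (this uses unittest.mock, which ships with Python 3.3+):

# worker.py (hypothetical module under test)
import urllib.request

def fetch_status(url):
    return urllib.request.urlopen(url).status

def check_site(url):
    return fetch_status(url) == 200

# worker_test.py
from unittest.mock import patch
import nose
import worker

@patch('worker.fetch_status')           # patch the name where it is looked up
def test_check_site(mock_fetch):
    mock_fetch.return_value = 200        # the mock replaces the real HTTP call
    assert worker.check_site('http://example.com')
    mock_fetch.assert_called_once_with('http://example.com')

if __name__ == '__main__':
    nose.runmodule()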


run with debugging by typing,

python worker_test.py --pdb

Saturday, December 6, 2014

Matplotlib, TkAgg backend for interactive plotting under virtualenv, ipython

The easiest way I've found to use matplotlib and Tk for interactive plotting under ipython is to install them in the base Python 3 environment with apt-get and use the --system-site-packages flag when creating the new virtualenv. Other techniques require more work.

apt-get is able to build and install these packages while pip install, under a virtualenv, will not see the interactive (Tk) libraries when it builds matplotlib.

So, for the base environment,
deactivate # make sure you're in the base environment
sudo apt-get install python3-tk tk tk-dev
sudo apt-get install python3-matplotlib
python3
import tkinter
import matplotlib
matplotlib.use('agg') # the default, non-interactive backend
matplotlib.use('TkAgg') # interactive plotting backend
quit()
This shows you that you can set an interactive backend. Now create a new virtualenv with,
virtualenv -p /usr/bin/python3.4 --system-site-packages ~/mypy34
source ~/mypy34/bin/activate # enter the new virtualenv
pip install pyzmq # for ipython
pip install ipython
Now ipython will pick up your TkAgg backend by default (or you can set it directly with %matplotlib tk) and you can plot as expected from within a virtualenv.
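
A quick way to confirm the backend works from inside the virtualenv (a minimal sketch; the data is arbitrary):

import matplotlib
matplotlib.use('TkAgg')            # select the interactive backend before pyplot is imported
import matplotlib.pyplot as plt

plt.plot([1, 2, 3], [1, 4, 9])
plt.title('TkAgg test')
plt.show()                         # a Tk window should pop up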

Saturday, October 18, 2014

Efficient conversion of Python readable binary file format to Avro (Pig, Avro, Python)

This entry covers converting a binary format to the Avro file format for follow-on transformation and analysis in Pig. As mentioned in prior entries on using a local database (H2, SQL transformations), this can be efficient because a subsample is taken and easily transformed using Pig's data flow language for local analysis. Pig's -x local mode is used.

Here is an example where we make up a binary format from within Python, write it out to Avro, and do extract-transform-load (ETL) on it from Pig. The result can be written out as a set of files in a target binary format using a Python UDF (e.g., writing out a set of Matlab arrays). Without further ado,

pip install avro-python3

Then create test_avro_write.py

#!/usr/bin/env python
# Adapted from http://www.harshj.com/2010/04/25/writing-and-reading-avro-data-files-using-python/
from random import randint
from avro import schema, datafile, io

OUTFILE_NAME = 'mydata.avro'

# written to disk
SCHEMA_STR = """{
    "type": "record",
    "name": "data",
    "namespace": "AVRO",
    "fields": [
        {   "name": "name"   , "type": "string"   },
        {   "name": "age"    , "type": "int"      },
        {   "name": "address", "type": "string"   },
        {   "name": "value"  , "type": "long"     }
    ]
            }"""

SCHEMA = schema.Parse(SCHEMA_STR)

def write_avro_file(OUTFILE_NAME):
    # Lets generate our data
    data = {}
    data['name']    = ''
    data['age']     = 0
    data['address'] = '10, Bar Eggs Spam'
    data['value']   = 0

    # Create a 'record' (datum) writer
    rec_writer = io.DatumWriter(SCHEMA)

    # Create a 'data file' (avro file) writer
    df_writer = datafile.DataFileWriter(
                    open(OUTFILE_NAME, 'wb'),
                    rec_writer,
                    writer_schema = SCHEMA,
                    codec = 'deflate'
                )

    # Write our data, made up binary format
    for char in range(45):
        data['name'] = chr(char) + '_foo'
        data['age'] = randint(13,99)
        data['value'] = randint(47,800)
        df_writer.append(data)

    # Close to ensure writing is complete
    df_writer.close()

if __name__ == '__main__':
    # Write an AVRO file first
    write_avro_file(OUTFILE_NAME)
Once test_avro_write.py is run you'll have an Avro file with randomized data. Use Pig 0.12+ to do some basic ETL on the data.

data = LOAD 'mydata.avro' USING AvroStorage();

age_filter = FILTER data BY ((age < 50) and (value > 300));
age_filter = FOREACH age_filter GENERATE (name, address);

age_group = GROUP age_filter ALL;
age_count = FOREACH age_group GENERATE COUNT(age_filter);
DUMP age_count; -- Replace DUMP with Python UDF to target binary format
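
The Avro file can also be spot-checked directly from Python with the same avro-python3 package; a minimal sketch:

from avro import datafile, io

reader = datafile.DataFileReader(open('mydata.avro', 'rb'), io.DatumReader())
for record in reader:
    print(record['name'], record['age'], record['value'])
reader.close()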

While this example is quite trivial, it shows how a Python library can be used to import (or generate) binary data, how that data can be efficiently written to disk (as opposed to .csv), and how it can be transformed using Pig. Using a Pig Python UDF to write out to a target binary format allows the transformed data to be analyzed in any package of choice.

Note: To dump an array to matlab format, see scipy.io.savemat()
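
For example, a minimal sketch (the field and file names are arbitrary) of dumping a batch of records to a .mat file:

import numpy as np
import scipy.io

ages = np.array([23, 47, 61])
values = np.array([120, 512, 333])
scipy.io.savemat('mydata.mat', {'age': ages, 'value': values})  # readable from Matlab/Octave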

Wednesday, October 8, 2014

Setting up H2 (MySQL alternative) database for Ubuntu 14, Part 4

EDIT - Note there is now a Python 3 version of JayDeBeApi at https://pypi.python.org/pypi/JayDeBeApi3, so the below is no longer an issue.

I was able to get Python to connect to the H2 database and insert a couple of rows. There is more troubleshooting to do on why curs.fetchall() doesn't return results. It could be that the insert statements weren't committed.

To get to where I got, do the following. Note I will fork a Python 3 branch of JayDeBeApi so the modifications in this post will be irrelevant in the future.

# start up your Python 3 environment
echo "JayDeBeApi" > requirements.txt
pip install -d . -r requirements.txt
tar -zxvf JayDeBeApi-0.1.4.tar.gz
2to3 -f all -w JayDeBeApi-0.1.4

Make the following source changes:
# In setup.py:33, change "file" to "open"
# In dbapi2.py:21, comment out "exceptions" # not needed
# In dbapi2.py:185, 188, remove "exceptions."
# In dbapi2.py:382, remove "next"; it should just read "if not self._rs"

Then,
pip install -e JayDeBeApi-0.1.4
python
import jaydebeapi
conn = jaydebeapi.connect('org.h2.Driver', ['jdbc:h2://home/your/path/var/h2demodb', 'user', 'pw'], '/home/kwame/H2/h2-2014-08-06.jar')
curs = conn.cursor()

curs.execute('CREATE TABLE MYTEST(ID INT PRIMARY KEY, NAME VARCHAR(255));')
curs.execute('INSERT INTO MYTEST VALUES(1, \'Hello World\');')
curs.execute('SELECT * FROM MYTEST ORDER BY ID;')
curs.fetchall() # SQLExceptionPyRaisable: org.h2.jdbc.JdbcSQLException: No data is available [2000-181]

I will follow up when I figure out why there is no data in MYTEST. The table persists between close() calls. EDIT - DB-API drivers (PyODBC, JayDeBeApi) turn auto-commit off by default, so the inserts need an explicit commit.
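
A minimal sketch of the fix, continuing from the connection above (conn and curs as defined earlier):

# DB-API connections start with auto-commit off, so persist the inserts explicitly
conn.commit()
curs.execute('SELECT * FROM MYTEST ORDER BY ID;')
print(curs.fetchall())   # the inserted rows should now come back
curs.close()
conn.close()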

Once I have that, I'll have the ability to take data from a variety of scientific packages and formats (Matlab, pcap sessions, etc.) and dump it into an H2 database for follow-on munging/wrangling.