Saturday, October 18, 2014

Efficient conversion of a Python-readable binary file format to Avro (Pig, Avro, Python)

This entry covers converting a binary format to the Avro file format for follow-on transformation and analysis in Pig. As mentioned in prior entries on using local databases (H2, SQL transformations), this can be efficient because a subsample is taken and easily transformed using Pig's data flow language for local analysis. Pig's -x local mode is used throughout.

Here is an example where we make up a binary format from within Python, write it out to Avro, and do extract, transform, load (ETL) on it from Pig. The result can be written out as a set of files in a target binary format using a Python UDF (e.g., writing a set of MATLAB arrays). Without further ado,

pip install avro-python3

Then create test_avro_write.py

#!/usr/bin/env python
# Adapted from http://www.harshj.com/2010/04/25/writing-and-reading-avro-data-files-using-python/
from random import randint
from avro import schema, datafile, io

OUTFILE_NAME = 'mydata.avro'

# Avro schema for the records written to disk
SCHEMA_STR = """{
    "type": "record",
    "name": "data",
    "namespace": "AVRO",
    "fields": [
        {   "name": "name"   , "type": "string"   },
        {   "name": "age"    , "type": "int"      },
        {   "name": "address", "type": "string"   },
        {   "name": "value"  , "type": "long"     }
    ]
            }"""

SCHEMA = schema.Parse(SCHEMA_STR)

def write_avro_file(outfile_name):
    # Let's generate our data; the address stays fixed for every record
    data = {}
    data['name']    = ''
    data['age']     = 0
    data['address'] = '10, Bar Eggs Spam'
    data['value']   = 0

    # Create a 'record' (datum) writer
    rec_writer = io.DatumWriter(SCHEMA)

    # Create a 'data file' (avro file) writer
    df_writer = datafile.DataFileWriter(
                    open(outfile_name, 'wb'),
                    rec_writer,
                    writer_schema = SCHEMA,
                    codec = 'deflate'
                )

    # Write our data: 45 made-up records with random ages and values
    for char in range(45):
        data['name'] = chr(65 + char) + '_foo'  # printable names: 'A_foo', 'B_foo', ...
        data['age'] = randint(13, 99)
        data['value'] = randint(47, 800)
        df_writer.append(data)

    # Close to ensure writing is complete
    df_writer.close()

if __name__ == '__main__':
    # Write an AVRO file first
    write_avro_file(OUTFILE_NAME)
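
A quick sanity check is to read the file back before handing it to Pig. Here is a minimal sketch using the same avro-python3 package (DataFileReader picks the schema up from the file header, so none needs to be supplied):

from avro.datafile import DataFileReader
from avro.io import DatumReader

# Iterate over the records we just wrote and print each one as a dict
reader = DataFileReader(open('mydata.avro', 'rb'), DatumReader())
for record in reader:
    print(record)
reader.close()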

Once test_avro_write.py has run, you'll have an Avro file with randomized data. Use Pig 0.12+ (whose built-in AvroStorage can load it directly) to do some basic ETL on the data:

data = LOAD 'mydata.avro' USING AvroStorage();

age_filter = FILTER data BY (age < 50) AND (value > 300);
age_filter = FOREACH age_filter GENERATE name, address;

age_group = GROUP age_filter ALL;
age_count = FOREACH age_group GENERATE COUNT(age_filter);
DUMP age_count; -- Replace DUMP with Python UDF to target binary format
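
Save the Pig script as, say, etl.pig (the name is arbitrary) and run it with pig -x local etl.pig; DUMP prints the single count tuple to the console.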

While this example is quite trivial, it shows how a Python library can be used to import (or generate) binary data, how that data can be written efficiently to disk as Avro (as opposed to .csv), and how it can then be transformed using Pig. Using a Pig Python UDF to write the result out to a target binary format allows the transformed data to be analyzed in any package of choice.

Note: To dump an array to MATLAB format, see scipy.io.savemat().
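
For instance, a CPython UDF along the following lines could replace the DUMP above. This is only a sketch: it assumes Pig 0.12+'s streaming_python UDF support (Jython UDFs cannot import C extensions such as scipy), and the file name save_udf.py, the alias matlib, and the output path are made up for illustration.

# save_udf.py -- hypothetical UDF, runs under CPython via streaming_python
from pig_util import outputSchema
import numpy as np
import scipy.io

@outputSchema('path:chararray')
def to_mat(name, value):
    # Write one .mat file per record and return its path
    path = '/tmp/%s.mat' % name
    scipy.io.savemat(path, {'value': np.array([value])})
    return path

# In the Pig script, register and call it with, e.g.:
#   REGISTER 'save_udf.py' USING streaming_python AS matlib;
#   paths = FOREACH data GENERATE matlib.to_mat(name, value);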
