Here is an example where we make up a binary format from within Python, write it out to Avro, and do extract-transform-load (ETL) on it from Pig. The result can be written out as a set of files in a target binary format using a Python UDF (e.g., writing a set of MATLAB arrays). Without further ado,
pip install avro-python3
Then create test_avro_write.py
#!/usr/bin/env python
# Adapted from http://www.harshj.com/2010/04/25/writing-and-reading-avro-data-files-using-python/
from random import randint
from avro import schema, datafile, io

OUTFILE_NAME = 'mydata.avro'

# Schema of the records written to disk
SCHEMA_STR = """{
    "type": "record",
    "name": "data",
    "namespace": "AVRO",
    "fields": [
        { "name": "name"   , "type": "string" },
        { "name": "age"    , "type": "int" },
        { "name": "address", "type": "string" },
        { "name": "value"  , "type": "long" }
    ]
}"""
SCHEMA = schema.Parse(SCHEMA_STR)

def write_avro_file(outfile_name):
    # Let's generate our data
    data = {}
    data['name'] = ''
    data['age'] = 0
    data['address'] = '10, Bar Eggs Spam'
    data['value'] = 0

    # Create a 'record' (datum) writer
    rec_writer = io.DatumWriter(SCHEMA)

    # Create a 'data file' (Avro file) writer
    df_writer = datafile.DataFileWriter(
        open(outfile_name, 'wb'),
        rec_writer,
        writer_schema=SCHEMA,
        codec='deflate'
    )

    # Write our data, a made-up binary format
    for char in range(45):
        data['name'] = chr(char) + '_foo'
        data['age'] = randint(13, 99)
        data['value'] = randint(47, 800)
        df_writer.append(data)

    # Close to ensure writing is complete
    df_writer.close()

if __name__ == '__main__':
    # Write an Avro file first
    write_avro_file(OUTFILE_NAME)

Once test_avro_write.py is run you'll have an Avro file with randomized data. Use Pig 0.12+ (where AvroStorage is built in) to do some basic ETL on the data:
data = LOAD 'mydata.avro' USING AvroStorage();
age_filter = FILTER data BY ((age < 50) AND (value > 300));
age_filter = FOREACH age_filter GENERATE name, address;
age_group = GROUP age_filter ALL;
age_count = FOREACH age_group GENERATE COUNT(age_filter);
DUMP age_count; -- Replace DUMP with Python UDF to target binary format
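The DUMP above is where a Python UDF would take over. A minimal sketch of the UDF side (the file name myudf.py and the function to_line are hypothetical; note that Pig executes Python UDFs under Jython, so C-backed libraries like scipy are not importable there — a practical route is to emit an intermediate format from the UDF and convert it to the target binary format in CPython afterwards):

```python
# myudf.py -- hypothetical Pig Python UDF (executed by Pig under Jython)
try:
    # Pig makes outputSchema available when it loads the script
    from pig_util import outputSchema
except ImportError:
    # No-op fallback so the module also imports outside Pig
    def outputSchema(schema_str):
        def wrap(func):
            return func
        return wrap

@outputSchema('line:chararray')
def to_line(name, address):
    # Flatten one record; a real UDF would stage rows here for a
    # target-format writer running in CPython (scipy is unavailable
    # under Jython)
    return '%s\t%s' % (name, address)
```

In the Pig script this would be hooked in with `REGISTER 'myudf.py' USING jython AS myudf;` and applied via `FOREACH age_filter GENERATE myudf.to_line(name, address);`.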
While this example is quite trivial, it shows how a Python library can be used to import (or generate) binary data, how that data can be written to disk efficiently (as opposed to .csv), and how it can be transformed using Pig. A Pig Python UDF that writes out a target binary format then allows the transformed data to be analyzed in any package of choice.
Note: to dump an array to MATLAB format, see scipy.io.savemat().
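A minimal sketch of that last step, assuming scipy and numpy are installed (the file and variable names are just examples):

```python
# Dump a dict of numpy arrays to a MATLAB-readable .mat file
import numpy as np
from scipy.io import savemat, loadmat

arrays = {
    'ages':   np.array([13, 47, 99]),
    'values': np.array([47, 300, 800]),
}
savemat('mydata.mat', arrays)

# Round-trip check: loadmat returns 2-D arrays, since MATLAB
# treats everything as a matrix
back = loadmat('mydata.mat')
print(back['ages'])  # [[13 47 99]]
```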