Fun with macros and parquet-avro

I recently had some fun building parquet-avro-extra, an add-on module for parquet-avro using Scala macros. I did it mainly to learn Scala macros but also to make it easier to use Parquet with Avro in a data pipeline.

Parquet and Avro

Parquet is a columnar storage system designed for HDFS. It offers some nice improvements over row-major systems including better compression and less I/O with column projection and predicate pushdown. Avro is a data serialization system that enables type-safe access to structured data with complex schema. The parquet-avro module makes it possible to store data in Parquet format on disk and process them as Avro objects inside a JVM data pipeline like Scalding or Spark.

Projection

Parquet allows reading only a subset of columns via projection. Here’s an Scalding example from Tapad.

Projection[Signal]("field1", "field2.field2a")

Note that fields specifications are strings even though the API has access to Avro type Signal which has strongly typed getter methods.

This is slightly counter-intuitive since most Scala developers are used to transformations like pipe.map(_.getField). It’s however can be easily solved with macro since the syntax tree of is accessible. A modified version has signature of …

more ...