Macros in Data Pipelines

Neville Li
Jan 2015
neville@spotify.com
@sinisa_lyh

About Me

  • Music recommendation @Spotify
  • Scala since early 2013
  • Scalding, Storm, Spark, etc.

Powerful data combo

  • Parquet - storage
  • Avro - representation
  • Scalding / Spark - pipeline

Schema

{
    "type": "record",
    "name": "Account",
    "namespace": "me.lyh.parquet.avro.schema",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "type", "type": "string"},
        {"name": "name", "type": "string"},
        {"name": "amount", "type": "float"}
    ]
}

Pipeline

ParquetAvroSource[Account]("input")
  .map(a => (a.getName, a.getAmount))
  .group
  .reduce(_+_)
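
For intuition, here is roughly what that pipeline computes, sketched with plain Scala collections instead of Scalding. The `Account` case class, `PipelineSketch` and `totals` are hypothetical stand-ins for illustration, not the Avro-generated class or any library API.

```scala
// Plain-collections sketch of the pipeline: total amount per account name.
// `Account` is a hypothetical stand-in for the Avro-generated class.
object PipelineSketch {
  final case class Account(id: Int, `type`: String, name: String, amount: Float)

  def totals(accounts: Seq[Account]): Map[String, Float] =
    accounts
      .map(a => (a.name, a.amount))   // keep only the two fields we need
      .groupBy(_._1)                  // corresponds to .group
      .map { case (name, rows) => name -> rows.map(_._2).sum } // corresponds to .reduce(_ + _)
}
```

Only `name` and `amount` are ever touched, which is what makes this job a good candidate for column projection.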

Projection and predicate

  • Column projection - skip columns
  • Filter predicate - skip row(group)s
  • Massive speedup - often 10+ times
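
A minimal sketch of why a filter predicate can skip whole row groups, assuming each group carries min/max statistics for the filtered column. `RowGroup` and `scan` are made-up names for illustration, not Parquet APIs.

```scala
// Sketch of predicate pushdown using per-row-group min/max statistics.
// `RowGroup` and `scan` are hypothetical names, not Parquet APIs.
object PushdownSketch {
  final case class RowGroup(min: Float, max: Float, amounts: Seq[Float])

  // Evaluate `amount > threshold`; returns the matches and how many groups were read.
  def scan(groups: Seq[RowGroup], threshold: Float): (Seq[Float], Int) = {
    // A group whose max is not above the threshold cannot contain a match,
    // so it is dropped without looking at its rows.
    val candidates = groups.filter(_.max > threshold)
    (candidates.flatMap(_.amounts.filter(_ > threshold)), candidates.size)
  }
}
```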

So what's the problem?

Projection

// native
pipe.map(a => (a.getName, a.getAmount))

// Parquet
Parquet.project[Account]("name", "amount")

  • STRINGS → unsafe and error prone
  • No IDE auto-complete → finger injury
  • my_fancy_field_name → .getMyFancyFieldName
  • Hard to migrate existing code
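
The field-name mismatch can be sketched as a pair of string conversions. This assumes snake_case field names as in the bullet above; `GetterName`, `fromField` and `toField` are hypothetical helpers, not part of Avro or Parquet.

```scala
// Sketch of the field name <-> getter name mapping, assuming snake_case fields.
// `GetterName` is a hypothetical helper, not an Avro or Parquet API.
object GetterName {
  // "my_fancy_field_name" -> "getMyFancyFieldName"
  def fromField(field: String): String =
    "get" + field.split('_').filter(_.nonEmpty).map(_.capitalize).mkString

  // "getMyFancyFieldName" -> "my_fancy_field_name"
  def toField(getter: String): String =
    getter.stripPrefix("get")
      .replaceAll("([a-z0-9])([A-Z])", "$1_$2") // underscore before each inner capital
      .toLowerCase
}
```

Doing this by hand for every projected column is exactly the kind of busywork that invites typos.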

Predicate

// native
pipe.filter(a => a.getName == "Neville" && a.getAmount > 100)

// Parquet
FilterApi.and(
  FilterApi.eq(FilterApi.binaryColumn("name"),
               Binary.fromString("Neville")),
  FilterApi.gt(FilterApi.floatColumn("amount"),
               100f.asInstanceOf[java.lang.Float])  // Java...
)

Like Clojure, but worse

Macros to the rescue

What's in an Expr

_.getAccounts.get(0).getAmount > 10

Internal

scala.this.Predef.Integer2int(x$1.getAccounts().get(0).getAmount()).>(10)

RAWRRR

Apply(Select(Apply(Select(Select(This(newTypeName("scala")), scala.Predef),
newTermName("Integer2int")), List(Apply(Select(Apply(Select(Apply(Select(
Ident(newTermName("x$1")), newTermName("getAccounts")), List()),
newTermName("get")), List(Literal(Constant(0)))), newTermName("getAmount")),
List()))), newTermName("$greater")), List(Literal(Constant(10))))

Don't worry, there's pattern matching and recursion
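
A toy version of that idea, assuming a simplified tree: the case classes below are hypothetical stand-ins that mirror the shape of the dump above, not the real `scala.reflect` types, and `getters` is an illustrative helper rather than the library's code. It recursively walks the tree and pulls out the getter chain.

```scala
// Toy AST walked with pattern matching and recursion. These case classes are
// hypothetical stand-ins for the compiler's trees, not scala.reflect types.
object TreeWalk {
  sealed trait Tree
  final case class Ident(name: String) extends Tree
  final case class Literal(value: Any) extends Tree
  final case class Select(qualifier: Tree, name: String) extends Tree
  final case class Apply(fun: Tree, args: List[Tree]) extends Tree

  // getXxx counts as a getter; a bare `get` (e.g. List.get(0)) does not.
  private def isGetter(name: String): Boolean =
    name.length > 3 && name.startsWith("get") && name.charAt(3).isUpper

  // Collect getter names in call order, innermost call first.
  def getters(tree: Tree): List[String] = tree match {
    case Apply(fun, args) =>
      getters(fun) ++ args.flatMap(getters)
    case Select(qualifier, name) =>
      val inner = getters(qualifier)
      if (isGetter(name)) inner :+ name else inner
    case Ident(_) | Literal(_) =>
      Nil
  }

  // Integer2int(x$1.getAccounts().get(0).getAmount()).>(10)
  val example: Tree = {
    val accounts = Apply(Select(Ident("x$1"), "getAccounts"), Nil)
    val first    = Apply(Select(accounts, "get"), List(Literal(0)))
    val amount   = Apply(Select(first, "getAmount"), Nil)
    val unboxed  = Apply(Select(Select(Ident("scala"), "Predef"), "Integer2int"), List(amount))
    Apply(Select(unboxed, "$greater"), List(Literal(10)))
  }
}
```

`TreeWalk.getters(TreeWalk.example)` yields `List("getAccounts", "getAmount")`: the implicit unboxing call and the list index are walked through but ignored.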

Projection improved

import scala.language.experimental.macros
import scala.reflect.macros.Context
import org.apache.avro.Schema
import org.apache.avro.specific.{ SpecificRecord => SR }

object Projection {
  def apply[T <: SR](gs: (T => Any)*): Schema = macro applyImpl[T]
  def applyImpl[T <: SR : c.WeakTypeTag]
               (c: Context)(gs: c.Expr[(T => Any)]*): c.Expr[Schema] = {
    // ...
  }
}

Projection[Account](_.getName, _.getAmount)

Predicate improved

import scala.language.experimental.macros
import scala.reflect.macros.Context
import _root_.parquet.filter2.predicate.FilterPredicate
import org.apache.avro.specific.{ SpecificRecord => SR }

object Predicate {
  def apply[T <: SR](p: T => Boolean): FilterPredicate = macro applyImpl[T]
  def applyImpl[T <: SR : c.WeakTypeTag]
               (c: Context)
               (p: c.Expr[T => Boolean]): c.Expr[FilterPredicate] = {
    // ...
  }
}

Predicate[Account](x => x.getName == "Neville" && x.getAmount > 100)

Things that the compiler does that I have to mimic

  • Flipped operands: (a > 10) === (10 < a)
  • Primitive vs boxed values (and NULLs!)
  • Numeric type coercion
  • Booleans
    (_.getBool) === (_.getBool == true)
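
Two of these rewrites, flipped operands and bare booleans, can be sketched on a tiny predicate model. `Normalize` and its case classes are hypothetical illustrations, not the macro's actual implementation or Parquet's `FilterApi`.

```scala
// Sketch of normalizing predicates so the column is always on the left.
// All names here are hypothetical; this is not the macro's real code.
object Normalize {
  sealed trait Operand
  final case class Column(name: String) extends Operand
  final case class Const(value: Any) extends Operand

  sealed trait Op
  case object Lt extends Op
  case object Gt extends Op
  case object LtEq extends Op
  case object GtEq extends Op
  case object Eq extends Op

  // Swapping the operands of a comparison mirrors the operator.
  def flip(op: Op): Op = op match {
    case Lt   => Gt
    case Gt   => Lt
    case LtEq => GtEq
    case GtEq => LtEq
    case Eq   => Eq
  }

  sealed trait Pred
  final case class Cmp(left: Operand, op: Op, right: Operand) extends Pred
  final case class Bare(column: Column) extends Pred // e.g. _.getBool on its own

  def normalize(p: Pred): Pred = p match {
    case Cmp(c: Const, op, col: Column) => Cmp(col, flip(op), c)      // (10 < a) becomes (a > 10)
    case Bare(col)                      => Cmp(col, Eq, Const(true))  // (_.getBool) becomes (_.getBool == true)
    case other                          => other
  }
}
```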

Code

https://github.com/nevillelyh/parquet-avro-extra

In production @Spotify

The End

Thank You