Scio, a Scala API for Google Cloud Dataflow

We recently open-sourced Scio, a Scala API for Google Cloud Dataflow. Here are the slides from our talk at GCPNEXT16 a few weeks ago.

The first half of the talk covers our experiments with Dataflow and Pub/Sub for streaming applications, while the second half covers Scio and BigQuery for batch analysis and machine learning.
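
For a flavor of the API, here’s a minimal word-count sketch of what a Scio pipeline looks like (this snippet is not from the talk, and the --input/--output arguments are made up):

import com.spotify.scio._

object WordCount {
  def main(cmdlineArgs: Array[String]): Unit = {
    // Parse pipeline options plus our own input/output arguments.
    val (sc, args) = ContextAndArgs(cmdlineArgs)
    sc.textFile(args("input"))
      .flatMap(_.split("\\W+").filter(_.nonEmpty))
      .countByValue
      .map { case (word, count) => word + ": " + count }
      .saveAsTextFile(args("output"))
    sc.close()
  }
}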

more ...

Fun with macros and parquet-avro

I recently had some fun building parquet-avro-extra, an add-on module for parquet-avro using Scala macros. I did it mainly to learn Scala macros but also to make it easier to use Parquet with Avro in a data pipeline.

Parquet and Avro

Parquet is a columnar storage system designed for HDFS. It offers some nice improvements over row-major systems, including better compression and less I/O thanks to column projection and predicate pushdown. Avro is a data serialization system that enables type-safe access to structured data with complex schemas. The parquet-avro module makes it possible to store data in Parquet format on disk and process it as Avro objects inside a JVM data pipeline like Scalding or Spark.

Projection

Parquet allows reading only a subset of columns via projection. Here’s a Scalding example from Tapad.

Projection[Signal]("field1", "field2.field2a")

Note that field specifications are strings, even though the API has access to the Avro type Signal, which has strongly typed getter methods.

This is slightly counter-intuitive, since most Scala developers are used to transformations like pipe.map(_.getField). It can, however, be easily solved with a macro, since the syntax tree of the expression is accessible. A modified version has the signature of …
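
To illustrate the technique (a rough, hypothetical sketch, not the actual parquet-avro-extra code): a def macro can pattern match on the syntax tree of a getter lambda like _.getField2.getField2a and turn it into the dotted field name "field2.field2a" at compile time.

import scala.language.experimental.macros
import scala.reflect.macros.blackbox

object FieldNames {
  // Hypothetical helper: FieldNames.of[Signal](_.getField2.getField2a)
  // expands to the string "field2.field2a" at compile time.
  def of[T](getter: T => Any): String = macro ofImpl[T]

  def ofImpl[T](c: blackbox.Context)(getter: c.Tree): c.Tree = {
    import c.universe._
    // Collect the chain of getXxx calls in the lambda body, innermost first.
    def walk(tree: Tree): List[String] = tree match {
      case Apply(fn, Nil) => walk(fn)  // Java-style nullary getters show up as Apply nodes
      case Select(qual, TermName(n)) if n.startsWith("get") =>
        val f = n.stripPrefix("get")
        walk(qual) :+ (f.head.toLower + f.tail)
      case _ => Nil
    }
    getter match {
      case Function(_, body) => Literal(Constant(walk(body).mkString(".")))
      case _ => c.abort(c.enclosingPosition, "expected a getter lambda like _.getField")
    }
  }
}

As usual with def macros, the macro definition has to live in a separate compilation unit from the code that calls it.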

more ...

Three Reasons a Data Engineer Should Learn Scala

This article was written in collaboration with Hakka Labs (original link)

There has been a lot of debate over Scala lately, including criticisms like this, this, this, and defenses like this and this. Most of the criticisms seem to focus on the language’s complexity, performance, and integration with existing tools and libraries, while some praise its elegant syntax, powerful type system, and good fit for domain-specific languages.

However, most of the discussions seem to be based on experiences building production backend or web systems, where there are already a lot of other options. There are mature, battle-tested options like Java, Erlang, or even PHP, and there are Go, node.js, or Python for those who are more adventurous or prefer agility over performance.

Here I want to argue that there’s a best tool for every job, and Scala shines for data processing and machine learning, for the following reasons:

  • good balance between productivity and performance
  • integration with big data ecosystem
  • functional paradigm

Productivity without sacrificing performance

In the big data and machine learning world, where most developers come from a Python/R/Matlab background, Scala’s syntax, or the subset needed for the domain, is a lot less intimidating than that …
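
To make that concrete, here’s a tiny (made-up) example of the collections-style subset that covers most day-to-day data work; the same shape of code carries over almost unchanged to Scalding or Spark pipelines:

// Count word occurrences with plain Scala collections.
val words = Seq("scala", "spark", "scalding", "scala")
val counts = words.groupBy(identity).map { case (w, ws) => w -> ws.size }
// counts: Map(scala -> 2, spark -> 1, scalding -> 1)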

more ...

Scala Workshop

While there are many Scala tutorials and books available, very few of them focus on big data. I did a couple of workshops at Spotify focusing on this area, and here are the slides.

more ...

Using CQL with legacy column families

We use Cassandra extensively at work, and until recently we had mostly been using Cassandra 1.2 with Astyanax and the Thrift protocol in Java applications. Very recently we started adopting Cassandra 2.0 with CQL, the DataStax Java Driver, and the binary protocol.

While one should move to a CQL schema to take full advantage of the new protocol and storage engine, it’s still possible to use CQL and the new driver on existing clusters. Say we have a legacy column family with UTF8Type for row/column keys and BytesType for values; it would look like this in cassandra-cli:

create column family data
  with column_type = 'Standard'
  and comparator = 'UTF8Type'
  and default_validation_class = 'BytesType'
  and key_validation_class = 'UTF8Type';

And like this in cqlsh, after setting start_native_transport: true in cassandra.yaml:

CREATE TABLE data (
  key text,
  column1 text,
  value blob,
  PRIMARY KEY (key, column1)
) WITH COMPACT STORAGE;

In this table, key and column1 correspond to the row and column keys in the legacy column family, and value corresponds to the column value.
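
Here’s a minimal sketch of reading from this table with the DataStax Java Driver from Scala (the contact point, keyspace, and keys below are made up):

import com.datastax.driver.core.Cluster

// Hypothetical contact point and keyspace; adjust for your cluster.
val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("mykeyspace")

// Run one of the SELECTs shown below and read the blob value as a ByteBuffer.
val row = session.execute(
  "SELECT value FROM data WHERE key = 'rowkey' AND column1 = 'colkey'").one()
val value = if (row != null) row.getBytes("value") else null

session.close()
cluster.close()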

Queries to look up a column value, an entire row, and selected columns in a row would look like this:

SELECT value FROM mykeyspace.data WHERE key = 'rowkey' AND column1 = 'colkey';
SELECT column1, value FROM mykeyspace …
more ...

dotfiles update

I’ve been using my current dotfiles setup for a while and felt it was time to freshen up. I focused on updating the look and feel of Vim and tmux this round.

First I switched to the molokai color theme for Vim, TextMate (monokai), and IntelliJ IDEA (using this). I guess I grew tired of the trusty old solarized; plus, with my new MacBook Pro 13” at its highest resolution, it just doesn’t feel sharp enough.

The vim-powerline plugin I was using is being deprecated and replaced by powerline, which supports vim, tmux, zsh, and many others. However, it requires Python, and I had trouble using it with some really old Vim versions at work. So instead I switched to a pure VimL plugin, vim-airline. Not surprisingly, there’s a companion plugin, tmuxline, for tmux as well. Neither has extra dependencies, which is a big plus for me since I use the same dotfiles on Mac, my Ubuntu Trusty desktop at work, and many Debian Squeeze servers.

I also updated a couple of other Vim plugins along the way, replacing vim-snipmate with ultisnips and vim-bad-whitespace with vim-better-whitespace (no pun intended), and adding vim-gutter. The biggest discovery though is vim-easymotion, perfect …

more ...