Scio, a Scala API for Google Cloud Dataflow

We recently open-sourced Scio, a Scala API for Google Cloud Dataflow. Here are the slides from our talk at GCPNEXT16 a few weeks ago.

The first half of the talk covers our experiments with Dataflow and Pub/Sub for streaming applications, while the second half covers Scio and BigQuery for batch analysis and machine learning.
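
For a flavor of the API, here’s a minimal word-count sketch of what a Scio pipeline looks like (this snippet is not from the talk, and the --input/--output arguments are made up):

import com.spotify.scio._

object WordCount {
  def main(cmdlineArgs: Array[String]): Unit = {
    // Parse pipeline options plus our own input/output arguments.
    val (sc, args) = ContextAndArgs(cmdlineArgs)
    sc.textFile(args("input"))
      .flatMap(_.split("\\W+").filter(_.nonEmpty))
      .countByValue
      .map { case (word, count) => word + ": " + count }
      .saveAsTextFile(args("output"))
    sc.close()
  }
}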

more ...

Fun with macros and parquet-avro

I recently had some fun building parquet-avro-extra, an add-on module for parquet-avro using Scala macros. I did it mainly to learn Scala macros but also to make it easier to use Parquet with Avro in a data pipeline.

Parquet and Avro

Parquet is a columnar storage system designed for HDFS. It offers some nice improvements over row-major systems, including better compression and less I/O thanks to column projection and predicate pushdown. Avro is a data serialization system that enables type-safe access to structured data with complex schemas. The parquet-avro module makes it possible to store data in Parquet format on disk and process it as Avro objects inside a JVM data pipeline like Scalding or Spark.

Projection

Parquet allows reading only a subset of columns via projection. Here’s a Scalding example from Tapad.

Projection[Signal]("field1", "field2.field2a")

Note that field specifications are strings, even though the API has access to the Avro type Signal, which has strongly typed getter methods.

This is slightly counter-intuitive, since most Scala developers are used to transformations like pipe.map(_.getField). It can, however, be easily solved with a macro, since the syntax tree of the expression is accessible. A modified version has the signature of …
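
To illustrate the technique (a rough, hypothetical sketch, not the actual parquet-avro-extra code): a def macro can pattern match on the syntax tree of a getter lambda like _.getField2.getField2a and turn it into the dotted field name "field2.field2a" at compile time.

import scala.language.experimental.macros
import scala.reflect.macros.blackbox

object FieldNames {
  // Hypothetical helper: FieldNames.of[Signal](_.getField2.getField2a)
  // expands to the string "field2.field2a" at compile time.
  def of[T](getter: T => Any): String = macro ofImpl[T]

  def ofImpl[T](c: blackbox.Context)(getter: c.Tree): c.Tree = {
    import c.universe._
    // Collect the chain of getXxx calls in the lambda body, innermost first.
    def walk(tree: Tree): List[String] = tree match {
      case Apply(fn, Nil) => walk(fn)  // Java-style nullary getters show up as Apply nodes
      case Select(qual, TermName(n)) if n.startsWith("get") =>
        val f = n.stripPrefix("get")
        walk(qual) :+ (f.head.toLower + f.tail)
      case _ => Nil
    }
    getter match {
      case Function(_, body) => Literal(Constant(walk(body).mkString(".")))
      case _ => c.abort(c.enclosingPosition, "expected a getter lambda like _.getField")
    }
  }
}

As usual with def macros, the macro definition has to live in a separate compilation unit from the code that calls it.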

more ...

Three Reasons a Data Engineer Should Learn Scala

This article was written in collaboration with Hakka Labs (original link)

There has been a lot of debate over Scala lately, including criticisms like this, this, this, and defenses like this and this. Most of the criticisms seem to focus on the language’s complexity, performance, and integration with existing tools and libraries, while some praise its elegant syntax, powerful type system, and good fit for domain-specific languages.

However, most of the discussions seem to be based on experiences building production backend or web systems, where there are already a lot of other options. There are mature, battle-tested options like Java, Erlang, or even PHP, and there are Go, node.js, or Python for those who are more adventurous or prefer agility over performance.

Here I want to argue that there’s a best tool for every job, and Scala shines for data processing and machine learning, for the following reasons:

  • good balance between productivity and performance
  • integration with big data ecosystem
  • functional paradigm

Productivity without sacrificing performance

In the big data and machine learning world, where most developers come from a Python/R/Matlab background, Scala’s syntax, or the subset needed for the domain, is a lot less intimidating than that …
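
To make that concrete, here’s a tiny (made-up) example of the collections-style subset that covers most day-to-day data work; the same shape of code carries over almost unchanged to Scalding or Spark pipelines:

// Count word occurrences with plain Scala collections.
val words = Seq("scala", "spark", "scalding", "scala")
val counts = words.groupBy(identity).map { case (w, ws) => w -> ws.size }
// counts: Map(scala -> 2, spark -> 1, scalding -> 1)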

more ...

Scala Workshop

While there are many Scala tutorials and books available, very few of them focus on big data. I did a couple of workshops at Spotify focusing on this area, and here are the slides.

more ...

Using CQL with legacy column families

We use Cassandra extensively at work, and until recently we had mostly been using Cassandra 1.2 with Astyanax and the Thrift protocol in Java applications. Very recently we started adopting Cassandra 2.0 with CQL, the DataStax Java Driver, and the binary protocol.

While one should move to a CQL schema to take full advantage of the new protocol and storage engine, it’s still possible to use CQL and the new driver on existing clusters. Say we have a legacy column family with UTF8Type for row/column keys and BytesType for values; it would look like this in cassandra-cli:

create column family data
  with column_type = 'Standard'
  and comparator = 'UTF8Type'
  and default_validation_class = 'BytesType'
  and key_validation_class = 'UTF8Type';

And like this in cqlsh, after setting start_native_transport: true in cassandra.yaml:

CREATE TABLE data (
  key text,
  column1 text,
  value blob,
  PRIMARY KEY (key, column1)
) WITH COMPACT STORAGE;

In this table, key and column1 correspond to the row and column keys in the legacy column family, and value corresponds to the column value.
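
Here’s a minimal sketch of reading from this table with the DataStax Java Driver from Scala (the contact point, keyspace, and keys below are made up):

import com.datastax.driver.core.Cluster

// Hypothetical contact point and keyspace; adjust for your cluster.
val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("mykeyspace")

// Run one of the SELECTs shown below and read the blob value as a ByteBuffer.
val row = session.execute(
  "SELECT value FROM data WHERE key = 'rowkey' AND column1 = 'colkey'").one()
val value = if (row != null) row.getBytes("value") else null

session.close()
cluster.close()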

Queries to look up a column value, an entire row, and selected columns in a row would look like this:

SELECT value FROM mykeyspace.data WHERE key = 'rowkey' AND column1 = 'colkey';
SELECT column1, value FROM mykeyspace …
more ...

dotfiles update

I’ve been using my current dotfiles setup for a while and felt it was time to freshen up. I focused on updating the look and feel of Vim and tmux this round.

First I switched to the molokai color theme for Vim, TextMate (monokai), and IntelliJ IDEA (using this). I guess I grew tired of the trusty old solarized; plus, with my new MacBook Pro 13” at its highest resolution, it just doesn’t feel sharp enough.

The vim-powerline plugin I was using is being deprecated and replaced by powerline, which supports vim, tmux, zsh, and many others. However, it requires Python, and I had trouble using it with some really old Vim versions at work. So instead I switched to a pure VimL plugin, vim-airline. Not surprisingly, there’s a companion plugin, tmuxline, for tmux as well. Neither has extra dependencies, which is a big plus for me since I use the same dotfiles on Mac, my Ubuntu Trusty desktop at work, and many Debian Squeeze servers.

I also updated a couple of other Vim plugins along the way, replacing vim-snipmate with ultisnips and vim-bad-whitespace with vim-better-whitespace (no pun intended), and adding vim-gutter. The biggest discovery though is vim-easymotion, perfect …

more ...