Scio at Philly ETE

It’s been another 6 months since my talk about Scio at Scala by the Bay, and we’ve seen huge adoption and many improvements since then. The number of production Scio pipelines within Spotify has grown from ~70 to 400+. A lot of other companies are using and contributing to it as well. In the most recent edition of the Spotify data university, an internal week-long big data training camp for non-data engineers, we revamped the curriculum to cover Scio, BigQuery and other Google Cloud Big Data products instead of Hadoop, Scalding and Hive.

Spotify data university round 3 & 1st time covering Scio, @ApacheBeam & @GCPBigData 👋 Hadoop, HDFS, M/R, YARN 🍾 batch + streaming pic.twitter.com/1gWIEbN0mW
— Neville Li (@sinisa_lyh) March 28, 2017

And here’s a list of some notable improvements in Scio.

Master branch is now based on Apache Beam
Graduate type-safe BigQuery API from experimental to stable
Sparkey side input support
TensorFlow TFRecord file IO
Cloud Pub/Sub attributes support
Named transformations for streaming update
Safeguard against malformed tests and better error messages
Flexible custom IO wiring
KryoRegistrar for custom Kryo serialization
Table description for type-safe BigQuery
Lots of performance improvements and bug fixes

I talked about Scio at Philly ETE last week and here are the slides.

Scio - Moving to Google Cloud, A Spotify Story from Neville Li
