Three Reasons a Data Engineer Should Learn Scala

This article was written in collaboration with Hakka Labs (original link)

There has been a lot of debate over Scala lately, including criticisms like this, this, this, and defenses like this and this. Most of the criticisms seem to focus on the language’s complexity, performance, and integration with existing tools and libraries, while some praise its elegant syntax, powerful type system, and good fit for domain-specific languages.

However, most of these discussions seem to be based on experience building production backend or web systems, where there are already plenty of other options. There are mature, battle-tested choices like Java, Erlang, or even PHP, and there are Go, node.js, or Python for those who are more adventurous or prefer agility over performance.

Here I want to argue that there’s a best tool for every job, and Scala shines for data processing and machine learning, for the following reasons:

  • good balance between productivity and performance
  • integration with big data ecosystem
  • functional paradigm

Productivity without sacrificing performance

In the big data & machine learning world, where most developers come from a Python/R/Matlab background, Scala’s syntax, or the subset needed for the domain, is a lot less intimidating than that …

more ...

Scala Workshop

While there are many Scala tutorials and books available, very few of them focus on big data. I did a couple of workshops at Spotify focusing on these areas; here are the slides.

more ...

How many copies

One topic that came up a lot when optimizing Scala data applications is the performance of the standard collections, or the hidden cost of temporary copies. The collections API is easy to learn and maps well to many Python concepts that data engineers are already familiar with. But the performance penalty can be pretty big when the operation is repeated over millions of records in a JVM with limited heap.

Mapping values

Let’s take a look at the most naive example first: mapping the values of a Map.

val m = Map("A" -> 1, "B" -> 2, "C" -> 3)
m.toList.map(t => (t._1, t._2 + 1)).toMap

It looks simple enough, but it is obviously not optimal. Two temporary List[(String, Int)] are created, one by toList and one by map, and map also creates three new (String, Int) tuples.

There are a few commonly seen variations. These don’t create temporary collections, but they still allocate key-value tuples.

for ((k, v) <- m) yield k -> (v + 1)
m.map { case (k, v) => k -> (v + 1) }

If one reads the ScalaDoc closely, there’s already a mapValues method, and it is probably the shortest and the most performant option.

m.mapValues(_ + 1)
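As a quick sanity check, all four variants produce the same result; the sketch below puts them side by side. One assumption worth flagging: in Scala 2.12 and earlier, mapValues returns a lazy view over the original map rather than a strict Map, so the sketch forces it with toMap where a strict result is needed (in Scala 2.13, mapValues is deprecated in favor of .view.mapValues(...).toMap).

```scala
object MapValuesDemo {
  def main(args: Array[String]): Unit = {
    val m = Map("A" -> 1, "B" -> 2, "C" -> 3)
    val expected = Map("A" -> 2, "B" -> 3, "C" -> 4)

    // Naive version: two temporary lists plus new tuples.
    val viaList = m.toList.map(t => (t._1, t._2 + 1)).toMap

    // For-comprehension: no temporary collection, but still builds tuples.
    val viaFor = for ((k, v) <- m) yield k -> (v + 1)

    // map with a pattern match: same cost profile as the for-comprehension.
    val viaMap = m.map { case (k, v) => k -> (v + 1) }

    // mapValues avoids rebuilding keys; it is lazy in Scala 2.12,
    // so force it with toMap when a strict Map is needed.
    val viaMapValues = m.mapValues(_ + 1).toMap

    assert(viaList == expected)
    assert(viaFor == expected)
    assert(viaMap == expected)
    assert(viaMapValues == expected)
  }
}
```

The laziness of mapValues cuts both ways: it skips the copy entirely when the result is consumed once, but recomputes the function on every lookup if the result is reused, so forcing it once can actually be the cheaper choice.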

Java conversion

A similar problem exists …

more ...