Scala Workshop

While there are many Scala tutorials and books available, very few of them focus on big data. I ran a couple of workshops at Spotify focusing on this area, and here are the slides.


On being a polyglot

I’m kind of known as a polyglot among coworkers. We would often argue that instead of hiring great Java/Python/C++ developers, we should rather strive to hire great engineers with strong CS fundamentals who can pick up any language easily. I come from a scientific computing background, doing mostly C/C++/Python many years ago. Over the course of the last three years at my current job I have coded in seven languages professionally, some out of interest and some out of necessity. I enjoyed learning all these different things and want to share my experience here: what I learned from each of them and how it has helped me become a better engineer.

C

The first language I used seriously, apart from LOGO & BASIC when I was a kid, of course. It’s probably the closest one can get to the operating system and bare metal without dropping down to assembly (which you can still do from C). It’s a simple language whose syntax served as the basis for many successors like C++ & Java. It doesn’t offer any fancy features like OOP or namespaces, but rather depends on the developer’s skill for organizing a large code base (think …


How many copies

One topic that comes up a lot when optimizing Scala data applications is the performance of the standard collections, or more precisely the hidden cost of temporary copies. The collections API is easy to learn and maps well to many Python concepts that a lot of data engineers are familiar with. But the performance penalty can be pretty big when it’s repeated over millions of records in a JVM with limited heap.

Mapping values

Let’s take a look at the most naive example first: mapping the values of a Map.

val m = Map("A" -> 1, "B" -> 2, "C" -> 3)
m.toList.map(t => (t._1, t._2 + 1)).toMap

It looks simple enough, but it is obviously not optimal. Two temporary List[(String, Int)] instances are created, one by toList and one by map. map also allocates 3 new (String, Int) tuples.
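To make the temporary copies easier to see, here is a minimal sketch that spells out each step of the one-liner as a named value. The object name NaiveMapValues and the intermediate names pairs and bumped are only illustrative and not part of the original snippet.

```scala
object NaiveMapValues {
  val m: Map[String, Int] = Map("A" -> 1, "B" -> 2, "C" -> 3)

  // Temporary collection #1: every entry of the Map is copied into a List.
  val pairs: List[(String, Int)] = m.toList

  // Temporary collection #2: a second List, holding one newly allocated
  // tuple per entry.
  val bumped: List[(String, Int)] = pairs.map(t => (t._1, t._2 + 1))

  // Only this final Map is actually needed; both Lists above become garbage.
  val result: Map[String, Int] = bumped.toMap
}
```

Both intermediate lists are discarded as soon as result is built, which is the kind of allocation that adds up when the same pattern runs once per record.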

There are a few commonly seen variations. These don’t create temporary collections, but they still allocate a key-value tuple per entry.

for ((k, v) <- m) yield k -> (v + 1)
m.map { case (k, v) => k -> (v + 1) }

If one reads the ScalaDoc closely, there’s already a mapValues method, and it is probably the shortest and most performant option.

m.mapValues(_ + 1)
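One caveat worth keeping in mind: mapValues does not build a new Map. It returns a lazy view over the original, so the function is re-applied on every lookup, and in Scala 2.13 the method is deprecated on Map in favor of m.view.mapValues(_ + 1).toMap. The sketch below counts how often the function runs to illustrate this; the object name MapValuesLaziness and the calls counter are hypothetical additions for the demonstration.

```scala
object MapValuesLaziness {
  val m: Map[String, Int] = Map("A" -> 1, "B" -> 2, "C" -> 3)

  // Counts how many times the mapping function is invoked.
  var calls = 0

  // Lazy: creating the view does not run the function.
  // (Emits a deprecation warning on Scala 2.13 and later.)
  val viewed = m.mapValues { v => calls += 1; v + 1 }
  val callsAfterCreate = calls

  // Each lookup runs the function again; nothing is cached.
  val first = viewed("A")
  val second = viewed("A")
  val callsAfterLookups = calls

  // Strict alternative: computes every value once, at the cost of
  // allocating one tuple per entry.
  val strict: Map[String, Int] = m.map { case (k, v) => k -> (v + 1) }
}
```

Whether the lazy view or a strict copy is cheaper depends on how often the result is read: a view that is traversed once avoids the copy, while a view that is looked up repeatedly pays for the function each time.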

Java conversion

A similar problem exists …
