How many copies

One topic that came up a lot when optimizing Scala data applications is the performance of standard collections, or the hidden cost of temporary copies. The collections API is easy to learn and maps well to many Python concepts where a lot of data engineers are familiar with. But the performance penalty can be pretty big when it’s repeated over millions of records in a JVM with limited heap.

Mapping values

Let’s take a look at one most naive example first, mapping the values of a Map.

val m = Map("A" -> 1, "B" -> 2, "C" -> 3)
m.toList.map(t => (t._1, t._2 + 1)).toMap

Looks simple enough but obviously not optimal. Two temporary List[(String, Int)] were created, one from toList and one from map. map also creates 3 copies of (String, Int).

There are a few commonly seen variations. These don’t create temporary collections but still key-value tuples.

for ((k, v) <- m) yield k -> (v + 1)
m.map { case (k, v) => k -> (v + 1) }

If one reads the ScalaDoc closely, there’s a mapValues method already and it probably is the shortest and most performant.

m.mapValues(_ + 1)

Java conversion

Similar problem exists …

more ...