class: center, middle

# Advanced Scala

## Semigroups

Neville Li @sinisa_lyh

Sep 2016

---

class: center, middle

## Examples in [Scio](https://github.com/spotify/scio)

## Concepts apply to [Scalding](https://github.com/twitter/scalding) and [Spark](https://spark.apache.org/)

---

# Count streams per artist

```scala
case class Stream(user: String, artist: String, timestamp: Long)

def countStreamsPerArtist(input: SCollection[Stream]) =
  input
    .groupBy(_.artist)
    .mapValues(_.size)
```

---

# Expanded

```scala
case class Stream(user: String, artist: String, timestamp: Long)

def countStreamsPerArtist(input: SCollection[Stream]) = {
  val group: SCollection[(String, Iterable[Stream])] =
    input.groupBy(_.artist)
  val count: SCollection[(String, Long)] =
    group.mapValues { values: Iterable[Stream] =>
      values.size
    }
  count
}
```

---

# Analyzed

```scala
case class Stream(user: String, artist: String, timestamp: Long)

def countStreamsPerArtist(input: SCollection[Stream]) = {
  // user: String and timestamp: Long are grouped even though they are never used
  val group: SCollection[(String, Iterable[Stream])] =
    input.groupBy(_.artist)

  // values can be huge and must be shipped to a single reducer
  val count: SCollection[(String, Long)] =
    group.mapValues { values: Iterable[Stream] =>
      // requires iterating through all elements
      // an O(n) operation in many frameworks
      values.size
    }
  count
}
```

---

# What's the problem?

- ### Network and disks are way slower than RAM
- ### Shuffling is expensive
- ### Serialization is expensive
- ### Uneven workload among workers

---

class: center, middle

# What's the #1 rule in data engineering?

---

class: center, middle

# Process as little data as possible!

---

# Doing it old school

Imagine a fake Scala M/R framework called Scuigi

```scala
trait ScuigiJob[I, K, V, O] {
  // Map from input elements to key-value pairs for shuffling
  def map(input: Iterator[I]): Iterator[(K, V)]

  // Invisible shuffle layer

  // Reduce values per key to output elements
  def reduce(kv: Iterator[(K, Iterable[V])]): Iterator[O]
}
```

--

```scala
class CountStreamsPerArtist extends ScuigiJob[Stream, String, Stream, (String, Long)] {
  def map(input: Iterator[Stream]): Iterator[(String, Stream)]
  def reduce(kv: Iterator[(String, Iterable[Stream])]): Iterator[(String, Long)]
}
```

---

# Implementation

```scala
class CountStreamsPerArtist extends ScuigiJob[Stream, String, Stream, (String, Long)] {
  def map(input: Iterator[Stream]): Iterator[(String, Stream)] =
    input.map(s => (s.artist, s))
  def reduce(kv: Iterator[(String, Iterable[Stream])]): Iterator[(String, Long)] =
    kv.map { case (k, vs) => (k, vs.size) }
}
```

---

# Can you tell the difference?

```scala
class CountStreamsPerArtist extends ScuigiJob[Stream, String, Long, (String, Long)] {
  def map(input: Iterator[Stream]): Iterator[(String, Long)] =
    input.map(s => (s.artist, 1L))
  def reduce(kv: Iterator[(String, Iterable[Long])]): Iterator[(String, Long)] =
    kv.map { case (k, vs) => (k, vs.size) }
}
```

---

# Let's cheat more

```scala
import scala.collection.mutable

class CountStreamsPerArtist extends ScuigiJob[Stream, String, Long, (String, Long)] {
  def map(input: Iterator[Stream]): Iterator[(String, Long)] = {
    val m = mutable.Map.empty[String, Long]
    input.foreach { s =>
      val count = m.getOrElse(s.artist, 0L)
      m(s.artist) = count + 1
    }
    m.iterator
  }
  def reduce(kv: Iterator[(String, Iterable[Long])]): Iterator[(String, Long)] =
    kv.map { case (k, vs) => (k, vs.sum) }
}
```

---

class: center, middle

# Congratulations

# You just implemented a combiner

---

class: center, middle

# All it does is 1 + 1 + 1 + ...

# On both mappers and reducers
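---

# Seeing the combiner at work

A minimal sketch with plain Scala collections standing in for one mapper (artist names are made up): after mapper-side pre-summing, only one partial count per artist crosses the shuffle, however many streams the mapper saw.

```scala
// Four mapped records, but only two distinct keys...
val mapped = Seq(("dylan", 1L), ("miles", 1L), ("dylan", 1L), ("dylan", 1L))

// ...so the combiner ships just two partial sums to the shuffle
val partial = mapped
  .groupBy { case (artist, _) => artist }
  .map { case (artist, counts) => (artist, counts.map(_._2).sum) }
// Map(dylan -> 3, miles -> 1)
```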
---

# Let's generalize this

```scala
import scala.collection.mutable

class CombineJob[K, V](f: (V, V) => V) extends ScuigiJob[(K, V), K, V, (K, V)] {
  def map(input: Iterator[(K, V)]): Iterator[(K, V)] = {
    val m = mutable.Map.empty[K, V]
    input.foreach { case (k, v) =>
      m(k) = if (m.contains(k)) f(m(k), v) else v
    }
    m.iterator
  }
  def reduce(kv: Iterator[(K, Iterable[V])]): Iterator[(K, V)] =
    kv.map { case (k, vs) => (k, vs.reduce(f)) }
}
```

---

class: center, middle

# Congratulations

# You just implemented reduceByKey

---

# Count with reduce

```scala
case class Stream(user: String, artist: String, timestamp: Long)

def countStreamsPerArtist(input: SCollection[Stream]) =
  input
    .map(s => (s.artist, 1L))
    .reduceByKey(_ + _)
```

---

# This works only because

- ### (1 + 1) + 1 = 1 + (1 + 1) → associative property
- ### 1 + 2 = 2 + 1 → commutative property

---

class: center, middle

# Congratulations

# You now know abstract algebra

---

# Semigroup

Given a set `\(S\)` and an operation `\(*\)`, we say that `\((S, *)\)` is a _semigroup_ if it satisfies the following properties for any `\(x, y, z \in S\)`:

- _Closure_: `\(x * y \in S\)`
- _Associativity_: `\((x * y) * z = x * (y * z)\)`

We also say that `\(S\)` _forms a semigroup under_ `\(*\)`.

---

# Examples of Semigroups

- ### Strings under concatenation (not commutative)
- ### Integers under plus (commutative)
- ### Sets under union (commutative)
- ### Bloom filters under bitwise OR (commutative)

---

# Implementing a Semigroup

```scala
trait Semigroup[T] {
  def plus(x: T, y: T): T
}
```

--

```scala
class LongSemigroup extends Semigroup[Long] {
  override def plus(x: Long, y: Long): Long = x + y
}
```

---

# Applying a Semigroup

```scala
// Simplified; Scio's real version lives in PairSCollectionFunctions[K, V]
implicit class PairSCollection[K, V](self: SCollection[(K, V)]) {
  def sumByKey(implicit sg: Semigroup[V]): SCollection[(K, V)] =
    self.reduceByKey(sg.plus)
}
```

--

```scala
implicit val longSemigroup = new LongSemigroup

def countStreamsPerArtist(input: SCollection[Stream]) =
  input
    .map(s => (s.artist, 1L))
    .sumByKey // calling with implicit argument
```

---

class: center, middle

# So all this work just to remove `(_+_)`?

---

# Hold on, what about

```scala
def sumColumns(input: SCollection[(String, (Int, Long, Float, Double))]) =
  input.reduceByKey { (x, y) =>
    (x._1 + y._1, x._2 + y._2, x._3 + y._3, x._4 + y._4)
  }
```

---

class: center, middle

# What's the #1 rule in data engineering?

---

class: center, middle

# Write as little code as possible!

---

# With Algebird

```scala
import com.twitter.algebird._

def sumColumns(input: SCollection[(String, (Int, Long, Float, Double))]) =
  input.sumByKey // implicit Semigroup[(Int, Long, Float, Double)]
```

--

```scala
implicit def tuple4Semigroup[A, B, C, D](implicit sgA: Semigroup[A],
                                         sgB: Semigroup[B],
                                         sgC: Semigroup[C],
                                         sgD: Semigroup[D]): Semigroup[(A, B, C, D)] =
  new Semigroup[(A, B, C, D)] {
    override def plus(x: (A, B, C, D), y: (A, B, C, D)): (A, B, C, D) =
      (sgA.plus(x._1, y._1), sgB.plus(x._2, y._2),
       sgC.plus(x._3, y._3), sgD.plus(x._4, y._4))
  }
```

---

# More Algebird

```scala
def sumMap(input: SCollection[(String, Map[String, Long])]) =
  input.sumByKey // implicit Semigroup[Map[String, Long]]
```

--

```scala
implicit def mapSemigroup[K, V](implicit sgV: Semigroup[V]): Semigroup[Map[K, V]] =
  new Semigroup[Map[K, V]] {
    override def plus(x: Map[K, V], y: Map[K, V]): Map[K, V] =
      x ++ y.map { case (k, v) =>
        k -> (if (x.contains(k)) sgV.plus(x(k), v) else v)
      }
  }
```

---

class: center, middle

# So no more artisan handcraft

# What else are you taking away from me!?
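---

# Semigroups compose

A quick sketch (values made up): Algebird ships instances equivalent to the hand-rolled ones above, and derives a `Semigroup` for the whole nested type, so structured values sum with no extra code.

```scala
import com.twitter.algebird._

val x = Map("a" -> (1, 1.0), "b" -> (2, 2.0))
val y = Map("a" -> (3, 3.0))

// Semigroup[Map[String, (Int, Double)]] is derived implicitly
Semigroup.plus(x, y)
// Map(a -> (4, 4.0), b -> (2, 2.0))
```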
---

# Reducing fat objects

```scala
class DoubleArraySemigroup extends Semigroup[Array[Double]] {
  override def plus(x: Array[Double], y: Array[Double]): Array[Double] =
    (x zip y).map { case (a, b) => a + b }
}
```

--

# One copy per pair of inputs

--

# Ninety-nine copies per hundred inputs

---

# Let's cheat again

```scala
class DoubleArraySemigroup extends Semigroup[Array[Double]] {
  override def plus(x: Array[Double], y: Array[Double]): Array[Double] = {
    var i = 0
    while (i < x.length) {
      x(i) += y(i)
      i += 1
    }
    x
  }
}
```

--

# Nice, but...

---

# Do you spot the problem?

```scala
val vectors: SCollection[(String, Array[Double])] = // ...
val sum = vectors.sumByKey

vectors
  .join(sum)
  .mapValues { case (vec, sumVec) =>
    (vec zip sumVec).map { case (v, s) => v / s }
  }
```

--

# `vec` may have been mutated by `sumByKey`!

---

class: center, middle

# What's the #1 rule in data engineering?

---

class: center, middle

# Mutate as little data as possible!

---

# Mutating in place

- ### Scalding - OK since it runs on M/R with no caching
- ### Spark - OK since cached data are serialized copies
- ### Scio - ERROR since the Dataflow runner enforces immutable `fn`s

---

# Cheat differently

```scala
class DoubleArraySemigroup extends Semigroup[Array[Double]] {
  override def plus(x: Array[Double], y: Array[Double]): Array[Double] =
    plusI(x.clone(), y)

  override def sumOption(iterator: TraversableOnce[Array[Double]]): Option[Array[Double]] = {
    var x: Array[Double] = null
    iterator.foreach { y =>
      if (x == null) x = y.clone() else plusI(x, y)
    }
    Option(x)
  }

  // In-place sum; safe only on arrays we own
  private def plusI(x: Array[Double], y: Array[Double]): Array[Double] = {
    var i = 0
    while (i < x.length) {
      x(i) += y(i)
      i += 1
    }
    x
  }
}
```

---

# Accumulate → buffer → sum

```scala
class ReduceFn[T](sg: Semigroup[T]) extends CombineFn[T, JList[T], T] {
  override def createAccumulator(): JList[T] = Lists.newArrayList()

  override def addInput(accumulator: JList[T], input: T): JList[T] = {
    accumulator.add(input)
    if (accumulator.size > BUFFER_SIZE) {
      val combined = sg.sumOption(accumulator.asScala)
      accumulator.clear()
      combined.foreach(accumulator.add)
    }
    accumulator
  }

  override def mergeAccumulators(accumulators: JIterable[JList[T]]): JList[T] = {
    val partial: Iterable[T] = accumulators.asScala.flatMap(a => sg.sumOption(a.asScala))
    sg.sumOption(partial).toList.asJava
  }

  override def extractOutput(accumulator: JList[T]): T =
    sg.sumOption(accumulator.asScala).get
}
```

---

# `sumByKey` simplified

```scala
def sumByKey(implicit sg: Semigroup[V]): SCollection[(K, V)] =
  this.applyPerKey(Combine.perKey(new ReduceFn(sg)))
```

---

# Further Reading

- ## Databricks [Avoid GroupByKey](https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html)
- ## Algebird [wiki](https://github.com/twitter/algebird/wiki)
- ## Scio [AlgebirdSpec.scala](https://github.com/spotify/scio/blob/master/scio-examples/src/test/scala/com/spotify/scio/examples/extra/AlgebirdSpec.scala)

---

class: center, middle

# The End

## Happy Summing
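---

# Bonus: exercising `sumOption`

A tiny check (values made up) that the copy-on-`plus` semigroup and its buffered `sumOption` from the "Cheat differently" slide agree, without mutating their inputs:

```scala
val sg = new DoubleArraySemigroup

val a = Array(1.0, 2.0)
val b = Array(3.0, 4.0)
val c = Array(5.0, 6.0)

sg.plus(a, sg.plus(b, c)).toSeq      // Seq(9.0, 12.0)
sg.sumOption(Seq(a, b, c)).get.toSeq // Seq(9.0, 12.0)
a.toSeq                              // Seq(1.0, 2.0) — inputs untouched
```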