Why Functional?

Why Scala?

Neville Li
@sinisa_lyh

Jul 2014

Monoid!

Actually it's a semigroup, monoid just sounds more interesting :)

A Little Teaser

PGroupedTable<K,V>::combineValues(CombineFn<K,V> combineFn,
                                  CombineFn<K,V> reduceFn) 
Crunch: CombineFns are used to represent the associative operations...
KeyedList[K, T]::reduce(fn: (T, T) => T)
Scalding: reduce with fn which must be associative and commutative
PairRDDFunctions[K, V]::reduceByKey(fn: (V, V) => V)
Spark: Merge the values for each key using an associative reduce function
All of them run on both the mapper side and the reducer side
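Why associativity matters, as a minimal sketch: partial results can be combined on each mapper and merged again on the reducer without changing the answer. `Semigroup`, `reduceByKey` and `combineThenReduce` below are hypothetical names written for this illustration on plain Scala collections; they are not the Crunch, Scalding or Spark APIs.

```scala
// Hypothetical names for illustration; not taken from Crunch, Scalding or Spark.
trait Semigroup[T] {
  def plus(a: T, b: T): T // must be associative
}

val intAdd: Semigroup[Int] = new Semigroup[Int] {
  def plus(a: Int, b: Int): Int = a + b
}

// Reduce all values per key in a single pass (reducer side only).
def reduceByKey[K, V](pairs: Seq[(K, V)], sg: Semigroup[V]): Map[K, V] =
  pairs.groupBy(_._1).map { case (k, kvs) =>
    (k, kvs.map(_._2).reduce((a, b) => sg.plus(a, b)))
  }

// Simulate mappers that pre-combine their own split, followed by a
// reducer that merges the partial results.
def combineThenReduce[K, V](splits: Seq[Seq[(K, V)]], sg: Semigroup[V]): Map[K, V] = {
  val partials = splits.flatMap(split => reduceByKey(split, sg).toSeq)
  reduceByKey(partials, sg)
}

val split1 = Seq("a" -> 1, "b" -> 2, "a" -> 3)
val split2 = Seq("b" -> 4, "a" -> 5)

val direct = reduceByKey(split1 ++ split2, intAdd)
val combined = combineThenReduce(Seq(split1, split2), intAdd)
```

Both paths give the same map because `plus` is associative; a non-associative function (subtraction, say) would not survive the extra combine step.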

My story

Before

  • Mostly Python/C++ (and PHP...)
  • No Java experience at all
  • Started using Scala early 2013

Now

  • Discovery's* Java backend/riemann guy
  • The Scalding/Spark/Storm guy
  • Contributor to Spark, chill, cascading.avro
* Spotify's machine learning and recommendation team

Why this talk?

  • Not a tutorial
  • Discovery's experience
  • Why FP matters
  • Why Scala matters
  • Common misconceptions

What we already use

  • Kafka
  • Scalding
  • Spark / MLLib
  • Stratosphere
  • Storm / Riemann (Clojure)

What we want to investigate

  • Summingbird (Scala for Storm + Hadoop)
  • Spark Streaming
  • Shark / SparkSQL
  • GraphX (Spark)
  • BIDMach (ML with GPU)

Discovery

  • Mid 2013: 100+ Python jobs
  • 10+ hires since (half since new year)
  • Few with Java experience, none with Scala
  • As of May 2014: ~100 Scalding jobs & 90 tests
  • More uncommitted ad-hoc jobs
  • 12+ committers, 4+ using Spark

Discovery

rec-sys-scalding.git

rec-sys-scalding

Discovery

Guess how many jobs
written by yours truly?

3

Why Functional

  • Immutable data
  • Copy and transform
  • Not mutate in place
  • HDFS with M/R jobs
  • Storm tuples, Riemann streams
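Copy and transform in miniature, as a sketch: the same idea as a job that reads one HDFS dataset and writes a new one instead of editing it in place. The `Play` record is a hypothetical type made up for this example.

```scala
// Hypothetical record type, for illustration only.
case class Play(user: String, track: String, count: Int)

val plays = List(Play("u1", "t1", 2), Play("u2", "t1", 5))

// `copy` returns a new record and `map` returns a new list;
// the original `plays` is left untouched.
val doubled = plays.map(p => p.copy(count = p.count * 2))
```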

Why Functional

  • Higher order functions
  • Expressions, not statements
  • Focus on problem solving
  • Not solving programming problems

Why Functional

Word count in Python

from collections import defaultdict

lyrics = ["We all live in Amerika", "Amerika ist wunderbar"]
wc = defaultdict(int)
for l in lyrics:
  for w in l.split():
    wc[w] += 1
Screen too small for the Java version

Why Functional

Map and reduce are key concepts in FP

// Scala
val lyrics = List("We all live in Amerika", "Amerika ist wunderbar")
lyrics.flatMap(_.split(" "))               // map
      .groupBy(identity)                   // shuffle
      .map { case (k, g) => (k, g.size) }  // reduce

;; Clojure
(def lyrics ["We all live in Amerika" "Amerika ist wunderbar"])
(->> lyrics (mapcat #(clojure.string/split % #"\s"))
            (group-by identity)
            (map (fn [[k g]] [k (count g)])))

-- Haskell
import Control.Arrow
import Data.List
let lyrics = ["We all live in Amerika", "Amerika ist wunderbar"]
map words >>> concat
          >>> sort >>> group
          >>> map (\x -> (head x, length x)) $ lyrics

Why Functional

Linear equation in ALS matrix factorization
\(x_u = (Y^T Y + Y^T (C^u - I) Y)^{-1} \, Y^T C^u p(u)\)
vectors.map { case (id, vec) => (id, vec * vec.T) }  // YtY
       .map(_._2).reduce(_ + _)
ratings.keyBy(fixedKey).join(outerProducts)  // YtCuIY
       .map { case (_, (r, op)) => (solveKey(r), op * (r.rating * alpha)) }
       .reduceByKey(_ + _)
ratings.keyBy(fixedKey).join(vectors)  // YtCupu
       .map { case (_, (r, vec)) =>
         val Cui = r.rating * alpha + 1
         val pui = if (Cui > 0.0) 1.0 else 0.0
         (solveKey(r), vec * (Cui * pui))
       }.reduceByKey(_ + _)
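One hedged reading of the three fragments above, based only on their inline comments (YtY, YtCuIY, YtCupu). The notation is an assumption made for this note: \(y_i\) is the vector of item \(i\), \(r_{ui}\) is `r.rating`, \(R(u)\) is the set of items rated by user \(u\), `fixedKey` identifies the item and `solveKey` identifies the user.

```latex
\begin{aligned}
Y^T Y           &= \sum_{i} y_i \, y_i^T \\
Y^T (C^u - I) Y &= \sum_{i \in R(u)} \alpha \, r_{ui} \; y_i \, y_i^T \\
Y^T C^u p(u)    &= \sum_{i \in R(u)} c_{ui} \, p_{ui} \; y_i ,
                   \qquad c_{ui} = 1 + \alpha \, r_{ui}
\end{aligned}
```

Each right-hand side is a sum of per-rating terms, which is why a `map` followed by `reduce` or `reduceByKey` with `_ + _` is enough to build it.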

Why Scala

  • JVM - libraries and tools
  • Pythonesque syntax
  • Static typing with inference
  • Transition from imperative to FP

Why Scala

Performance vs. agility

performance http://nicholassterling.wordpress.com/2012/11/16/scala-performance/

Why Scala

Type inference

class ComplexDecorationService {
  public List<ListenableFuture<Map<String, Metadata>>>
  lookupMetadata(List<String> keys) { /* ... */ }
}
val data = service.lookupMetadata(keys)

type DF = List[ListenableFuture[Map[String, Metadata]]]
def process(data: DF) = { /* ... */ }
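A self-contained sketch of the same point without the Guava dependency. `Metadata`, `Lookup` and `lookupMetadata` here are hypothetical stand-ins, with `Option` playing the role of `ListenableFuture` purely to keep the example runnable.

```scala
// Hypothetical stand-ins for the types on the slide (no Guava dependency).
case class Metadata(title: String)
type Lookup = List[Option[Map[String, Metadata]]]

def lookupMetadata(keys: List[String]): Lookup =
  keys.map(k => Some(Map(k -> Metadata(k.toUpperCase))))

// No annotations needed: the compiler infers Lookup for `data`
// and List[String] for `titles`.
val data = lookupMetadata(List("a", "b"))
val titles = data.flatten.flatMap(_.values).map(_.title)
```

The nested type is written once, in the alias; every use site stays short while still being checked at compile time.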

Why Scala

Higher order functions

List<Integer> list = Lists.newArrayList(1, 2, 3);
Lists.transform(list, new Function<Integer, Integer>() {
  @Override
  public Integer apply(Integer input) {
    return input + 1;
  }
});
val list = List(1, 2, 3)
list.map(_ + 1)  // List(2, 3, 4)
Now imagine having to chain or nest such functions

Why Scala

Collections API

val l = List(1, 2, 3, 4, 5)
l.map(_ + 1)                      // List(2, 3, 4, 5, 6)
l.filter(_ > 3)                   // List(4, 5)
l.zip(List("a", "b", "c")).toMap  // Map(1 -> a, 2 -> b, 3 -> c)
l.partition(_ % 2 == 0)           // (List(2, 4),List(1, 3, 5))
List(l, l.map(_ * 2)).flatten     // List(1, 2, 3, 4, 5, 2, 4, 6, 8, 10)

l.reduce(_ + _)                   // 15
l.fold(100)(_ + _)                // 115
"We all live in Amerika".split(" ").groupBy(_.size)
// Map(2 -> Array(We, in), 4 -> Array(live),
//     7 -> Array(Amerika), 3 -> Array(all))
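One difference between `reduce` and `fold` worth a sketch: `fold` carries a start value, so it is defined on an empty list, while `reduce` is not. This ties back to the teaser, since a semigroup plus such a start value (an identity) is a monoid.

```scala
val empty = List.empty[Int]

// fold takes an initial value, so it is defined for an empty list.
val folded = empty.fold(0)(_ + _)

// reduce has no initial value and throws on an empty list;
// reduceOption makes the empty case explicit instead.
val reduced = empty.reduceOption(_ + _)
val total = List(1, 2, 3, 4, 5).reduceOption(_ + _)
```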

Why Scala

Scalding field based word count

TextLine(path)
  .flatMap('line -> 'word) { line: String => line.split("""\W+""") }
  .groupBy('word) { _.size }

Scalding type-safe word count

TextLine(path).read.toTypedPipe[String](Fields.ALL)
  .flatMap(_.split("""\W+"""))
  .groupBy(identity).size

Scrunch word count

read(from.textFile(file))
  .flatMap(_.split("""\W+"""))
  .count

Why Scala

Summingbird word count

source
  .flatMap { line: String => line.split("""\W+""").map((_, 1)) }
  .sumByKey(store)

Spark word count

sc.textFile(path)
  .flatMap(_.split("""\W+"""))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

Stratosphere word count

TextFile(textInput)
  .flatMap(_.split("""\W+"""))
  .map(word => (word, 1))
  .groupBy(_._1)
  .reduce { (w1, w2) => (w1._1, w1._2 + w2._2) }

Why Scala

Many patterns also common in Java

  • Java 8 lambdas and streams
  • Guava, Crunch, etc.
  • Optional, Predicate
  • Collection transformations
  • ListenableFuture and transform
  • parallelDo, DoFn, MapFn, CombineFn

Common misconceptions

It's complex

  • True for language features
  • Not from a user's perspective
  • We only use 20% of the features
  • Not more than needed in Java

Common misconceptions

It's slow

  • No slower than Python
  • Depends on how pure your FP is
  • Trade-off with productivity
  • Drop down to Java or native libraries

Common misconceptions

I don't want to learn a new language

  • How about flatMap, reduce, fold, etc.?
  • Unnecessary overhead
    interfacing with Python or Java
  • You've used monoids, monads,
    or higher order functions already
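A small sketch of the last bullet, using only the Scala standard library: summing numbers or concatenating strings is already a monoid, and chaining `Option` with `flatMap` is already monadic. `parse` and `half` are hypothetical helpers written for this example.

```scala
// Monoid: an associative operation plus an identity element.
val sum = List(1, 2, 3).foldLeft(0)(_ + _)       // (Int, +, 0)
val joined = List("a", "b").foldLeft("")(_ + _)  // (String, +, "")

// Monad-style chaining: flatMap on Option skips the rest after a None.
def parse(s: String): Option[Int] = scala.util.Try(s.toInt).toOption
def half(n: Int): Option[Int] = if (n % 2 == 0) Some(n / 2) else None

val ok = parse("42").flatMap(half)   // parses, then halves
val odd = parse("7").flatMap(half)   // parses, but 7 is not even
val bad = parse("x").flatMap(half)   // fails to parse, half never runs
```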

The End

Thank You