Case Studies

  • Setup
  • Scalding
  • Spark
  • Exercises

Setup

Running locally

  • Both Scalding and Spark can run locally
  • No need for Hadoop, or even a network
  • Scalding - local distribution, run .scala/jar with scald.rb
  • Spark - build and run with sbt
  • Avro support - scp sample files from HDFS to test against locally
  • Unit tests!
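
For example, typical local invocations (flags and paths here are illustrative):

scald.rb --local WordCountJob.scala --input data.tsv --output counts.tsv
sbt "run local data.tsv counts"    (Spark word count below, master = local)
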
Scalding

Scalding

Introduction

  • Scala wrapper for Cascading*
  • Source & sink - input & output (local/HDFS)
  • Pipe - collection of tuples
  • Tuple - named fields
  • Flow - source → pipe → sink
  • Compiled to Map/Reduce jobs
  • Field-based and type-safe APIs
* Also see Cascalog (Clojure), PyCascading, Cascading.JRuby

Scalding

Ecosystem

  • Driven - commercial pipeline visualization tool
  • Modules for Avro, JSON, JDBC, Parquet
  • ML - Conjecture and Pattern
  • Reducer auto-tuning!
  • Cascading on Tez (3.0) out soon
  • Many other extensions

Scalding

Word count

import com.twitter.scalding._

class WordCountJob(args : Args) extends Job(args) {
  TextLine(args("input"))        // --input hdfs://... -> [line]
    .read                        // input -> pipe
    .flatMap('line -> 'word) {   // [line] -> [line, word]
      line: String => line.split("""\s+""")
    }
    .groupBy('word) { _.size }   // [word, size]
    .write(Tsv(args("output")))  // --output hdfs://...
}

Scalding

Sources and sinks

// TSV with 4 fields
val pipe = Tsv(args("input"), ('user, 'email, 'country, 'product)).read

pipe.write(Tsv(args("output")))  // fields inferred from input
pipe.write(Tsv(args("output"), ('user, 'email)))  // keep only 2

pipe1 ++ pipe2  // union 2 pipes that have the same fields

Scalding

Working with fields

val pipe = Tsv(args("input1"), ('user, 'track, 'artist, 'time)).read

pipe.project('user, 'track)   // keep these 2
pipe.discard('artist, 'time)  // throw away these 2

// insert 2 constants to every tuple
pipe.insert(('alpha, 'beta), (0.02, 0.01))

// useful before a self-join
pipe.rename(('user, 'track) -> ('userLHS, 'trackLHS))

Scalding

Map and flatMap

val pipe = Tsv(args("input"), ('user, 'time, 'uri, 'msg)).read

// new field "userGroup" from "user"
pipe.map('user -> 'userGroup) { user: String => user.hashCode }

pipe
  .project('uri, 'msg)
  .flatMap('msg -> 'token) { msg: String => msg.split("""\s+""") }
/*
["spotify:track:J3P53n", "so call me maybe"] ->
  ["spotify:track:J3P53n", "so call me maybe", "so"]
  ["spotify:track:J3P53n", "so call me maybe", "call"]
  ["spotify:track:J3P53n", "so call me maybe", "me"]
  ["spotify:track:J3P53n", "so call me maybe", "maybe"] */
Use mapTo and flatMapTo when only new fields are needed
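
A minimal sketch (same pipe and fields as above):

// flatMapTo keeps only the new field, dropping all input fields
pipe.flatMapTo('msg -> 'token) { msg: String => msg.split("""\s+""") }
// ["spotify:track:J3P53n", "so call me maybe"] -> ["so"], ["call"], ...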

Scalding

Filter

val pipe = Tsv(args("input"), ('user, 'time, 'uri, 'msg)).read

// keep tracks
pipe.filter('uri) { uri: String => uri.startsWith("spotify:track:") }

// get rid of spam
pipe.filterNot('user, 'msg) { fields: (String, String) =>
  val (user, msg) = fields
  user == "spammer.inc" || msg == "call me maybe"
}
See RichPipe ScalaDoc for more operations

Scalding

Group

  • Group by some fields
  • Rest of the fields go to GroupBuilder
  • GroupBuilder methods aggregate these fields
val pipe = Tsv(args("input"), ('user, 'item, 'rating)).read

// groupBy(f: Fields)(builder: (GroupBuilder) => GroupBuilder): Pipe
pipe.groupBy('user) {  // user -> [[item, rating], [item, rating], ...]
  _.size               // [[item, rating], ...] -> Int
}

Scalding

Group

val pipe = Tsv(args("input"), ('user, 'item, 'rating)).read
pipe.groupBy('user) {  // user -> [[item, rating], [item, rating], ...]
  _
    .reducers(1000)             // number of reducers, jbx will kill you!
    .size('n)                   // [user, n]
    .average('rating -> 'avgR)  // [user, n, avgR]
    .sum('rating -> 'sumR)      // [user, n, avgR, sumR]
    .max('rating -> 'maxR)      // [user, n, avgR, sumR, maxR]
    .min('rating -> 'minR)      // [user, n, avgR, sumR, maxR, minR]
}

// [user, n, avgR, stdevR]
pipe.groupBy('user) { _.sizeAveStdev('rating -> ('n, 'avgR, 'stdevR)) }

Scalding

Group reduce

val pipe = Tsv(args("input"), ('user, 'uri, 'duration)).read
pipe.groupBy('uri) {
  // the reduce function must be commutative;
  // it runs on both the mapper and reducer side
  _.reduce('duration -> 'totalDuration) { (d1: Int, d2: Int) => d1 + d2 }
}

pipe.groupBy('user) {
  // foldLeft(fields -> newFields)(init) { function }
  // done only on reducer side
  _.foldLeft('uri -> 'uniqueItems)(Set[Int]()) {
    (uris: Set[Int], uri: Int) => uris + uri
  }
}

Scalding

More group operations

def int2set(i: Int): Set[Int] = Set(i)
def setUnion(s1: Set[Int], s2: Set[Int]): Set[Int] = s1 ++ s2
def set2str(s: Set[Int]): String = s.mkString(":")

val pipe = Tsv(args("input"), ('user, 'uri, 'duration)).read
pipe.groupBy('user) {
  // mapReduceMap(fields -> newFields)(mapFn1)(reduceFn)(mapFn2)
  // T, X, U -> original, intermediate, result type
  // mapFn1: (T) => X, mapper side
  // reduceFn: (X, X) => X, both mapper and reducer side
  // mapFn2: (X) => U, reducer side
  _.mapReduceMap('uri -> 'uniqueItems)(int2set)(setUnion)(set2str)
}
See GroupBuilder ScalaDoc for more operations

Scalding

Joins

  • joinWithSmaller - preferred
  • joinWithLarger - reverse of joinWithSmaller
  • joinWithTiny - entirely mapper side (sketch below)
val ratings = Tsv(args("ratings"), ('user, 'item, 'rating)).read
val names = Tsv(args("names"), ('item, 'name)).read

ratings
  .groupBy('item) { _.average('rating -> 'avgRating) }  // [item, avgRating]
  // (LHS fields -> RHS fields, RHS pipe)
  .joinWithSmaller('item -> 'item, names)    // [item, avgRating, name]
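
If names is small enough to fit in mapper memory, the same join can run entirely mapper side (a sketch using the pipes above):

ratings
  .groupBy('item) { _.average('rating -> 'avgRating) }
  .joinWithTiny('item -> 'item, names)  // [item, avgRating, name], no reducers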

Scalding

Functions on multiple fields

Tsv(args("input"), ('user, 'item, 'rating)).read
  .map(('user, 'item) -> ('userGroup, 'itemType)) {
    // (Tuple2[String, String]) => Tuple2[Int, String]
    fields: (String, String) =>
    val (user, item) = fields
    (user.hashCode, item.split(":")(1))
}

import com.twitter.scalding.FunctionImplicits._

Tsv(args("input"), ('user, 'item, 'rating)).read
  .map(('user, 'item) -> ('userGroup, 'itemType)) {
    // (Tuple2[String, String]) => Tuple2[Int, String]
    // implicitly converted to
    // (String, String) => Tuple2[Int, String]
    (user: String, item: String) =>
    (user.hashCode, item.split(":")(1))
  }

Safe Scalding

Type safe API

  • Pipe with only 1 typed field: TypedPipe[T]
  • Key-value (1-to-n) during group/join: Grouped[K, V]
  • Easy mix & match with field-based API
  • Behaves just like standard collections
  • Better Avro support

Safe Scalding

Back-n-forth

Between Pipe and TypedPipe[T]

// field-based input, 3 untyped fields
Tsv(args("input"), ('username, 'trackGid, 'count)).read
  // convert to TypedPipe of Tuple3
  .toTypedPipe[(String, String, Int)]('username, 'trackGid, 'count)
unsafe in, safe out
PackedAvroSource[EndSongCleaned](args("input"))
  .map(e => (e.getUsername.toString, e.getTrackid.toString, e.getMsPlayed))
  .toPipe('username, 'trackId, 'msPlayed)
safe in, unsafe out

Safe Scalding

TypedPipe[T]

  • map, flatMap & filter work just like standard collection
  • groupBy(fn: T => K) → Grouped[K, T]
    TypedPipe[EndSongCleaned].groupBy(_.getUsername.toString)
    → Grouped[String, EndSongCleaned]
  • group when T == (K, V) → Grouped[K, V]
    TypedPipe[(String, Int)].group → Grouped[String, Int]
  • groupAll → Grouped[Unit, T]
    TypedPipe[T].groupAll → Grouped[Unit, T]
    Forces everything to 1 reducer, slow!

Safe Scalding

Grouped[K, V]

  • Key → many values
  • Can join with another Grouped[K, W]
    → CoGrouped[K, (V, W)]
  • Tune with withReducers(reds: Int)
  • Either reduce values n-to-1:
    \(k \rightarrow [v_1, v_2 \dots v_n]\) to \(k \rightarrow v'\)
  • Or n-to-m:
    \(k \rightarrow [v_1, v_2 \dots v_n]\) to \(k \rightarrow [v_1', v_2' \dots v_m']\)
  • Convert back to TypedPipe[(K, V)] when done:
    \(k \rightarrow [v_1, v_2 \dots v_m]\) to \([k \rightarrow v_1, k \rightarrow v_2 \dots k \rightarrow v_m]\)

Safe Scalding

Grouped[K, V] value reduction operations

  • reduce(fn: (V, V) => V) values n-to-1
    both mapper and reducer side
  • foldLeft(fn: (B, V) => B) values n-to-1
    reducer side only
  • max, min, size, sum, product, etc. also n-to-1
  • count(fn: V => Boolean)
    forall(fn: V => Boolean)
    also n-to-1 with predicate
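
A minimal sketch (assuming a TSV of user/playCount pairs, inside a Job):

val plays = TypedTsv[(String, Int)](args("input"))  // TypedPipe[(String, Int)]
plays.group.reduce(_ + _)       // n-to-1 per key, mapper + reducer side
plays.group.foldLeft(0)(_ + _)  // same totals, reducer side only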

Safe Scalding

Grouped[K, V] value n-to-m operations

  • mapValues(fn: V => U) map values, n-to-n
  • mapValueStream(fn: Iterator[V] => Iterator[U])
    map values lazily, n-to-m
  • mapGroup(fn: (K, Iterator[V]) => Iterator[U])
    with key as additional input, n-to-m
  • take, takeWhile, drop, dropWhile, and the sort* family (sortBy, sorted, sortWith)
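
For example, top-n values per key (a sketch, same plays pipe as above):

plays                 // TypedPipe[(String, Int)]
  .group              // Grouped[String, Int]
  .sortBy(-_)         // sort values descending within each key
  .take(2)            // n-to-m: keep at most 2 values per key
  .toTypedPipe        // TypedPipe[(String, Int)]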

Safe Scalding

Type safe word count

package com.spotify.scalding.tutorial

import com.twitter.scalding._
import TDsl._

class Tutorial1(args: Args) extends Job(args) {
  TypedTsv[String](args("input"))              // TypedPipe[String]
    .filter(_ != null)                         // TypedPipe[String], fewer
    .flatMap(_.split("""\s+"""))               // TypedPipe[String], more
    .groupBy(identity)                         // Grouped[String, String]
    .size                                      // UnsortedGrouped[String, Long]
    .toTypedPipe                               // TypedPipe[(String, Long)]
    .write(TypedTsv[(String, Long)](args("output")))
}

Spark

Spark

Introduction

  • In memory computation
  • Multiple M/R stages without intermediate I/O
  • Input/output - local, HDFS, stream, DB
  • RDD* - parallelized collection across nodes
  • One master, many workers
  • Works on standalone, Mesos or YARN clusters
* Resilient Distributed Dataset

Spark

Word count

import org.apache.spark._
import org.apache.spark.SparkContext._

object WordCount {
  def main(args: Array[String]) {
    // args(0) is master, local or yarn-standalone
    val sc = new SparkContext(args(0), "Tutorial0")  // one context per job

    sc.textFile(args(1))        // local/HDFS input, RDD[String]
      .flatMap { line => line.split("""\s+""") }  // RDD[String]
      .map(word => (word, 1))   // RDD[(String, Int)]
      .reduceByKey(_ + _)       // RDD[(String, Int)], fewer items
      .saveAsTextFile(args(2))  // local/HDFS output
  }
}

Spark

RDD API

  • RDD - Resilient Distributed Dataset
  • More operators via Pimp My Library pattern
  • RDD[Double] - DoubleRDDFunctions
    histogram, mean, stdev, sum, variance, etc.
  • RDD[(K, V)] - PairRDDFunctions
    group, join, {count/fold/reduce}ByKey, etc.
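
A quick sketch of the implicits at work (hypothetical data, inside an existing SparkContext sc):

import org.apache.spark.SparkContext._  // brings the implicit conversions in

sc.parallelize(Seq(1.0, 2.0, 3.0)).mean()          // DoubleRDDFunctions
sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  .reduceByKey(_ + _)                              // PairRDDFunctions
  .collect()                                       // Array((a,4), (b,2))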

Spark

Workflow

  • Code in main runs sequentially on master (driver)
  • Transformations (RDD → RDD): in parallel on executors
  • Actions (RDD → local value): executors → driver
  • Broadcast (local value → executors): driver → executors
  • Transformations are either
    within local partitions (map, flatMap, filter, ...)
    or over network shuffle (reduceByKey, groupByKey, ...)
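
A sketch putting the pieces together (hypothetical stop-word filter, placeholder paths):

val stopWords = sc.broadcast(Set("the", "a", "an"))  // driver -> executors
val counts = sc.textFile("input")                    // RDD[String]
  .flatMap(_.split("""\s+"""))                       // local transformation
  .filter(w => !stopWords.value(w))                  // reads broadcast value
  .map(w => (w, 1))
  .reduceByKey(_ + _)                                // network shuffle
val total = counts.map(_._2).reduce(_ + _)           // action: executors -> driver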

Spark

MLlib

  • Common ML functionality on Spark
  • Classification, regression, clustering, CF
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.recommendation.{Rating, ALS}

object ImplicitALS {
  def main(args: Array[String]) {
    val sc = new SparkContext(args(0), "ImplicitALS")
    val ratings = sc.textFile(args(1)).map { l: String =>
      val t = l.split('\t')
      Rating(t(0).toInt, t(1).toInt, t(2).toFloat)
    }
    // trainImplicit(ratings, rank, iterations, lambda, alpha)
    ALS.trainImplicit(ratings, 40, 20, 0.8, 0.2)
      .productFeatures
      .map { case (id, vec) => id + "\t" + vec.mkString(" ") }
      .saveAsTextFile(args(2))
  }
}

Spark

Performance tuning

  • Use Kryo instead of Java serialization
  • Use simple data structures
  • Tune partitioning to avoid network overhead
  • Use web UI (when YARN application is RUNNING)
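
For example, Kryo can be switched on via the job configuration (a sketch; SparkConf-era API):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("Tutorial0")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)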

Exercises

Programming

Exercises

Test data

  • Play count of top 10k users and 1k artists
  • projects/data/user-artist-1k
    TSV of userId, artistId, playCount
  • projects/data/id-to-name
    TSV of artistId, artistName

Exercises

Options

  • Pick either Scalding or Spark
  • Top 10 artists by total play count
  • Top 10 artists with most unique listeners
  • Top 10 artists of each user
  • What do users who listen to Megadeth also listen to?
  • Join results above with id-to-name

Exercises

Spark

  • Implicit matrix factorization (MLlib ALS)
  • Save item vectors to disk
  • Find the cosine similarity between 2 artists you like (formula below)
  • Find 10 most similar artists to your favorite
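
As a reminder, cosine similarity between two item vectors \(\mathbf{v}_a\) and \(\mathbf{v}_b\):
\( \mathrm{sim}(a, b) = \frac{\mathbf{v}_a \cdot \mathbf{v}_b}{\lVert \mathbf{v}_a \rVert \, \lVert \mathbf{v}_b \rVert} \)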

That's It

Further reading

The End

Thank You