HandsOn-SparkSQL_with_solutions.ipynb

This notebook is part of the Apache Spark training delivered by CERN-IT

Spark SQL Hands-On Lab with Solutions

Contact: Luca.Canali@cern.ch

Run this notebook from Jupyter with a Python kernel

  • When running on CERN SWAN, do not attach the notebook to a Spark cluster; run it locally on the SWAN container instead
  • If running this outside CERN SWAN, please make sure that PySpark is installed: pip install pyspark

Example datasets

The following examples use sample data provided in the repository.
We will use the MovieLens dataset from Kaggle, credits: https://www.kaggle.com/grouplens/movielens-20m-dataset

In [ ]:
# Create Spark Session, you need this to work with Spark
from pyspark.sql import SparkSession
spark = SparkSession.builder \
        .appName("My spark example app")  \
        .master("local[*]") \
        .config("spark.driver.memory","8g") \
        .config("spark.ui.showConsoleProgress", "false") \
        .getOrCreate()
In [2]:
spark
Out[2]:

SparkSession - in-memory

SparkContext

Spark UI

Version
v3.3.1
Master
local[*]
AppName
My spark example app
In [3]:
# sets the path to the directory with datafiles
PATH = "../data/"
In [4]:
ratings = spark.read.option("header","true").csv(PATH + "ratings1.csv.gz")
In [5]:
ratings.show(5)
+------+-------+------+----------+
|userId|movieId|rating| timestamp|
+------+-------+------+----------+
|     1|    110|   1.0|1425941529|
|     1|    147|   4.5|1425942435|
|     1|    858|   5.0|1425941523|
|     1|   1221|   5.0|1425941546|
|     1|   1246|   5.0|1425941556|
+------+-------+------+----------+
only showing top 5 rows

In [6]:
ratings.printSchema()
root
 |-- userId: string (nullable = true)
 |-- movieId: string (nullable = true)
 |-- rating: string (nullable = true)
 |-- timestamp: string (nullable = true)

In [7]:
# schema inference needs to scan through the data to estimate the schema, which takes time

%time ratings = spark.read.option("header","true").option("inferSchema", "true").csv(PATH + "ratings1.csv.gz")

# note: ratings*.csv.gz would read both ratings1.csv.gz and ratings2.csv.gz; more data, but slower to run
# spark.read.option("header","true").option("inferSchema", "true").csv(PATH + "ratings*.csv.gz")
CPU times: user 7.62 ms, sys: 6.18 ms, total: 13.8 ms
Wall time: 26.9 s
In [8]:
ratings.printSchema()
root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)
 |-- timestamp: integer (nullable = true)

In [9]:
# movielens dataset
movies = spark.read.option("header","true").option("inferSchema", "true").csv(PATH + "movies.csv.gz")
In [10]:
movies.show(5, False)
+-------+----------------------------------+-------------------------------------------+
|movieId|title                             |genres                                     |
+-------+----------------------------------+-------------------------------------------+
|1      |Toy Story (1995)                  |Adventure|Animation|Children|Comedy|Fantasy|
|2      |Jumanji (1995)                    |Adventure|Children|Fantasy                 |
|3      |Grumpier Old Men (1995)           |Comedy|Romance                             |
|4      |Waiting to Exhale (1995)          |Comedy|Drama|Romance                       |
|5      |Father of the Bride Part II (1995)|Comedy                                     |
+-------+----------------------------------+-------------------------------------------+
only showing top 5 rows

In [11]:
movies.printSchema()
root
 |-- movieId: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- genres: string (nullable = true)

In [12]:
tags = spark.read.option("header","true").option("inferSchema", "true").csv(PATH + "tags.csv.gz")
In [13]:
tags.show(5)
+------+-------+----------+----------+
|userId|movieId|       tag| timestamp|
+------+-------+----------+----------+
|     1|    318|  narrated|1425942391|
|    20|   4306|Dreamworks|1459855607|
|    20|  89302|   England|1400778834|
|    20|  89302| espionage|1400778836|
|    20|  89302|      jazz|1400778841|
+------+-------+----------+----------+
only showing top 5 rows

Register the dataframes as Spark Temporary Views

In [14]:
ratings.createOrReplaceTempView("ratings")
movies.createOrReplaceTempView("movies")
tags.createOrReplaceTempView("tags")
In [15]:
# note what happens when we query a table in a csv file with a filter
spark.sql("select movieId, title from movies where movieId=1").explain()
== Physical Plan ==
*(1) Filter (isnotnull(movieId#85) AND (movieId#85 = 1))
+- FileScan csv [movieId#85,title#86] Batched: false, DataFilters: [isnotnull(movieId#85), (movieId#85 = 1)], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/eos/home-c/canali/SWAN_projects/HadoopSparkTraining/Spark/data/m..., PartitionFilters: [], PushedFilters: [IsNotNull(movieId), EqualTo(movieId,1)], ReadSchema: struct<movieId:int,title:string>


In [16]:
# cache the tables, to improve the performance of the rest of the queries in the notebook
# note: default caching level is MEMORY_AND_DISK (i.e. caching in memory if enough heap is available)
# note: caching is lazily executed, so a count() action is added to make the operation happen
# this operation may take a couple of minutes

r = ratings.cache().count()
m = movies.cache().count()
t = tags.cache().count()
In [17]:
print(f"Num ratings = {r}\nNum tags = {t}\nNum movies = {m}")
Num ratings = 13012144
Num tags = 753170
Num movies = 45843

SQL Queries on the loaded tables

1) How many movies produced per year?

In [18]:
# Add the column Year to "movies"
movies_year = spark.sql("select *, regexp_extract(title,'^(.*) \\\\(([0-9 \\\\-]*)\\\\)$',2) as Year from movies")
movies_year.show(5,False)
movies_year.createOrReplaceTempView("movies_year")

# This is the DataFrame API equivalent; note that \\ oddly needs to be changed to \\\\ when using SQL, at least in this version of Spark
# from pyspark.sql.functions import regexp_extract
# movies_year = movies.withColumn("Year",regexp_extract("title",'^(.*) \\(([0-9 \\-]*)\\)$',2))
+-------+----------------------------------+-------------------------------------------+----+
|movieId|title                             |genres                                     |Year|
+-------+----------------------------------+-------------------------------------------+----+
|1      |Toy Story (1995)                  |Adventure|Animation|Children|Comedy|Fantasy|1995|
|2      |Jumanji (1995)                    |Adventure|Children|Fantasy                 |1995|
|3      |Grumpier Old Men (1995)           |Comedy|Romance                             |1995|
|4      |Waiting to Exhale (1995)          |Comedy|Drama|Romance                       |1995|
|5      |Father of the Bride Part II (1995)|Comedy                                     |1995|
+-------+----------------------------------+-------------------------------------------+----+
only showing top 5 rows

In [19]:
# number of movies per year
m_yr = spark.sql("select year, count(1) as count from movies_year group by year order by year").toPandas()
In [20]:
%matplotlib notebook
import matplotlib.pyplot as plt
import pandas as pd
plt.style.use('seaborn-darkgrid')
In [21]:
m_yr.plot(x='year',y='count',kind='line', title='Movies per year');
[Line plot: number of movies produced per year]

2) Top movies by number of ratings

In [22]:
# A query to perform a join operation between movies and ratings
# Find the movies with the largest number of ratings

spark.sql("""
select title, count(*) 
from movies m, ratings r 
where m.movieId = r.movieId
group by title 
order by 2 desc""").limit(5).toPandas()
Out[22]:
title count(1)
0 Forrest Gump (1994) 45782
1 Shawshank Redemption, The (1994) 45546
2 Pulp Fiction (1994) 43755
3 Silence of the Lambs, The (1991) 41807
4 Matrix, The (1999) 38860

What happened in the background? How did the join happen?

How is the query executed? Can we get more information?

In [23]:
spark.sql("""
select title, count(*) 
from movies m, ratings r 
where m.movieId = r.movieId
group by title 
order by 2 desc""").explain(True)
== Parsed Logical Plan ==
'Sort [2 DESC NULLS LAST], true
+- 'Aggregate ['title], ['title, unresolvedalias('count(1), None)]
   +- 'Filter ('m.movieId = 'r.movieId)
      +- 'Join Inner
         :- 'SubqueryAlias m
         :  +- 'UnresolvedRelation [movies], [], false
         +- 'SubqueryAlias r
            +- 'UnresolvedRelation [ratings], [], false

== Analyzed Logical Plan ==
title: string, count(1): bigint
Sort [count(1)#866L DESC NULLS LAST], true
+- Aggregate [title#86], [title#86, count(1) AS count(1)#866L]
   +- Filter (movieId#85 = movieId#62)
      +- Join Inner
         :- SubqueryAlias m
         :  +- SubqueryAlias movies
         :     +- View (`movies`, [movieId#85,title#86,genres#87])
         :        +- Relation [movieId#85,title#86,genres#87] csv
         +- SubqueryAlias r
            +- SubqueryAlias ratings
               +- View (`ratings`, [userId#61,movieId#62,rating#63,timestamp#64])
                  +- Relation [userId#61,movieId#62,rating#63,timestamp#64] csv

== Optimized Logical Plan ==
Sort [count(1)#866L DESC NULLS LAST], true
+- Aggregate [title#86], [title#86, count(1) AS count(1)#866L]
   +- Project [title#86]
      +- Join Inner, (movieId#85 = movieId#62)
         :- Project [movieId#85, title#86]
         :  +- Filter isnotnull(movieId#85)
         :     +- InMemoryRelation [movieId#85, title#86, genres#87], StorageLevel(disk, memory, deserialized, 1 replicas)
         :           +- FileScan csv [movieId#85,title#86,genres#87] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/eos/home-c/canali/SWAN_projects/HadoopSparkTraining/Spark/data/m..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<movieId:int,title:string,genres:string>
         +- Project [movieId#62]
            +- Filter isnotnull(movieId#62)
               +- InMemoryRelation [userId#61, movieId#62, rating#63, timestamp#64], StorageLevel(disk, memory, deserialized, 1 replicas)
                     +- FileScan csv [userId#61,movieId#62,rating#63,timestamp#64] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/eos/home-c/canali/SWAN_projects/HadoopSparkTraining/Spark/data/r..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<userId:int,movieId:int,rating:double,timestamp:int>

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count(1)#866L DESC NULLS LAST], true, 0
   +- Exchange rangepartitioning(count(1)#866L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#514]
      +- HashAggregate(keys=[title#86], functions=[count(1)], output=[title#86, count(1)#866L])
         +- Exchange hashpartitioning(title#86, 200), ENSURE_REQUIREMENTS, [id=#511]
            +- HashAggregate(keys=[title#86], functions=[partial_count(1)], output=[title#86, count#975L])
               +- Project [title#86]
                  +- BroadcastHashJoin [movieId#85], [movieId#62], Inner, BuildLeft, false
                     :- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [id=#506]
                     :  +- Filter isnotnull(movieId#85)
                     :     +- InMemoryTableScan [movieId#85, title#86], [isnotnull(movieId#85)]
                     :           +- InMemoryRelation [movieId#85, title#86, genres#87], StorageLevel(disk, memory, deserialized, 1 replicas)
                     :                 +- FileScan csv [movieId#85,title#86,genres#87] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/eos/home-c/canali/SWAN_projects/HadoopSparkTraining/Spark/data/m..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<movieId:int,title:string,genres:string>
                     +- Filter isnotnull(movieId#62)
                        +- InMemoryTableScan [movieId#62], [isnotnull(movieId#62)]
                              +- InMemoryRelation [userId#61, movieId#62, rating#63, timestamp#64], StorageLevel(disk, memory, deserialized, 1 replicas)
                                    +- FileScan csv [userId#61,movieId#62,rating#63,timestamp#64] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/eos/home-c/canali/SWAN_projects/HadoopSparkTraining/Spark/data/r..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<userId:int,movieId:int,rating:double,timestamp:int>

3) Highly rated movies

Find the top 5 highly rated movies

In [24]:
spark.sql("""select title, avg(rating) as avg_rating from movies m, ratings r
            where m.movieId = r.movieId
            group by title
            order by 2 desc""").show(5, False)
+-----------------------------------+----------+
|title                              |avg_rating|
+-----------------------------------+----------+
|The Hardy Bucks Movie (2013)       |5.0       |
|Naked Souls (1996)                 |5.0       |
|Paul Goodman Changed My Life (2011)|5.0       |
|Jimi Plays Berkeley (2012)         |5.0       |
|Punk in London (1977)              |5.0       |
+-----------------------------------+----------+
only showing top 5 rows

Drill down on the top entries:

  • How many reviews contributed to this rating?
In [25]:
spark.sql("""select title, avg(rating) as avg_rating, count(*) as count from movies m, ratings r
            where m.movieId = r.movieId
            group by title
            order by 2 desc""").show(5, False)
+-----------------------------------+----------+-----+
|title                              |avg_rating|count|
+-----------------------------------+----------+-----+
|The Hardy Bucks Movie (2013)       |5.0       |1    |
|Naked Souls (1996)                 |5.0       |1    |
|Paul Goodman Changed My Life (2011)|5.0       |1    |
|Jimi Plays Berkeley (2012)         |5.0       |1    |
|Punk in London (1977)              |5.0       |1    |
+-----------------------------------+----------+-----+
only showing top 5 rows

Let's only take into account movies that have more than 100 ratings

In [26]:
spark.sql("""select title, avg(rating) as avg_rating, count(*) as count from movies m, ratings r
            where m.movieId = r.movieId
            group by title
            having count(*) > 100
            order by 2 desc""").limit(20).toPandas()
Out[26]:
title avg_rating count
0 Planet Earth (2006) 4.467391 368
1 Band of Brothers (2001) 4.431655 139
2 Shawshank Redemption, The (1994) 4.426338 45546
3 Godfather, The (1972) 4.335648 28582
4 Usual Suspects, The (1995) 4.299494 29635
5 Godfather: Part II, The (1974) 4.266718 18319
6 Seven Samurai (Shichinin no samurai) (1954) 4.265507 6900
7 Schindler's List (1993) 4.261945 33780
8 The Blue Planet (2001) 4.234615 130
9 Fight Club (1999) 4.232034 29931
10 One Flew Over the Cuckoo's Nest (1975) 4.230852 19937
11 12 Angry Men (1957) 4.229520 8374
12 Rear Window (1954) 4.229511 10542
13 Paths of Glory (1957) 4.218140 2150
14 Casablanca (1942) 4.215292 14903
15 Sunset Blvd. (a.k.a. Sunset Boulevard) (1950) 4.214214 3975
16 North by Northwest (1959) 4.211488 9445
17 Third Man, The (1949) 4.210196 3825
18 Spirited Away (Sen to Chihiro no kamikakushi) ... 4.209656 10398
19 Dr. Strangelove or: How I Learned to Stop Worr... 4.209441 13992

4) Find the top-rated movie of every year after 2000

In [27]:
avg_ratings = spark.sql("""select year, title, round(avg(rating),2) as avg_rating, count(*) as count
                           from movies_year m, ratings r where m.movieId = r.movieId 
                           group by year, title
                           having count(*) > 100""")

avg_ratings.createOrReplaceTempView("avg_ratings")

# note this is just the definition of a helper view
# because of lazy execution no query is run at this step
In [28]:
# the query for top-rated movies is run here, triggered by the action to show the first 20 rows

spark.sql("""select a.year, a.title, avg_rating from avg_ratings a,
            (select year, max(avg_rating) as max_rating from avg_ratings group by year) m
            where a.year = m.year
            and a.avg_rating = m.max_rating
            and a.year > 2000
            order by year""").show(20, False)
+----+--------------------------------------------------------------------------+----------+
|year|title                                                                     |avg_rating|
+----+--------------------------------------------------------------------------+----------+
|2001|Band of Brothers (2001)                                                   |4.43      |
|2002|City of God (Cidade de Deus) (2002)                                       |4.19      |
|2003|Fog of War: Eleven Lessons from the Life of Robert S. McNamara, The (2003)|4.11      |
|2003|Lord of the Rings: The Return of the King, The (2003)                     |4.11      |
|2004|Voices from the List (2004)                                               |4.15      |
|2005|49 Up (2005)                                                              |4.06      |
|2006|Planet Earth (2006)                                                       |4.47      |
|2007|Like Stars on Earth (Taare Zameen Par) (2007)                             |4.04      |
|2008|Dark Knight, The (2008)                                                   |4.18      |
|2009|Secret in Their Eyes, The (El secreto de sus ojos) (2009)                 |4.03      |
|2010|Inception (2010)                                                          |4.16      |
|2011|Intouchables (2011)                                                       |4.12      |
|2012|Hunt, The (Jagten) (2012)                                                 |4.07      |
|2013|Doctor Who: The Time of the Doctor (2013)                                 |4.07      |
|2014|Wild Tales (2014)                                                         |4.15      |
|2015|The Jinx: The Life and Deaths of Robert Durst (2015)                      |4.09      |
|2016|Piper (2016)                                                              |4.1       |
|2017|Dunkirk (2017)                                                            |4.11      |
+----+--------------------------------------------------------------------------+----------+

In [29]:
# End the Spark application
spark.stop()
In [ ]: