Tutorial-DataFrame.ipynb
This notebook is part of the Apache Spark training delivered by CERN IT¶
Run this notebook from Jupyter with a Python kernel
- When running on CERN SWAN, do not attach the notebook to a Spark cluster, but rather run it locally on the SWAN container (which is the default)
- If running this outside CERN SWAN, please make sure to have PySpark installed:
pip install pyspark
In order to run this notebook as slides:
- on SWAN click on the button "Enter/Exit RISE slideshow" in the ribbon
- on other environments please make sure to have the RISE extension installed
pip install RISE
Getting started: create the SparkSession¶
# !pip install pyspark
from pyspark.sql import SparkSession
spark = (SparkSession.builder
.master("local[*]")
.appName("DataFrame HandsOn 1")
.config("spark.ui.showConsoleProgress", "false")
.getOrCreate()
)
spark
The master local[*]
means that Spark runs locally, in the same JVM as the driver, rather than on a cluster. The *
tells Spark to use as many worker threads as there are logical cores available.
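As a quick check (an illustrative addition, not part of the original hands-on), you can verify how many worker threads Spark is using:
# default parallelism equals the number of local worker threads, i.e. the logical cores with local[*]
print(spark.sparkContext.defaultParallelism)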
Hands-On 1 - Construct a DataFrame from csv file¶
This demonstrates how to read a CSV file and construct a DataFrame.
We will use the online retail dataset from Kaggle, credits: https://www.kaggle.com/datasets/vijayuv/onlineretail
First, let's inspect the csv content¶
!gzip -cd ../data/online-retail-dataset.csv.gz 2>&1| head -n3
online_retail_schema="InvoiceNo int, StockCode string, Description string, Quantity int,\
InvoiceDate timestamp,UnitPrice float,CustomerId int, Country string"
df = (spark.read
.option("header", "true")
.option("timestampFormat", "M/d/yyyy H:m")
.csv("../data/online-retail-dataset.csv.gz",
schema=online_retail_schema)
)
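The schema above is passed as a DDL-style string. As a sketch, an equivalent schema can also be built programmatically with StructType (shown only for illustration; the DDL string is what the rest of the notebook uses):
# equivalent schema built with StructType (illustrative, not used below)
from pyspark.sql.types import (StructType, StructField, IntegerType,
                               StringType, TimestampType, FloatType)
online_retail_struct = StructType([
    StructField("InvoiceNo", IntegerType()),
    StructField("StockCode", StringType()),
    StructField("Description", StringType()),
    StructField("Quantity", IntegerType()),
    StructField("InvoiceDate", TimestampType()),
    StructField("UnitPrice", FloatType()),
    StructField("CustomerId", IntegerType()),
    StructField("Country", StringType()),
])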
Inspect the data¶
df.show(2, False)
Show the schema¶
df.printSchema()
Hands-On 2 - Spark Transformations - select, add, rename and drop columns¶
Select DataFrame columns
# select single column
df.select("Country").show(2)
Select multiple columns
df.select("StockCode","Description","UnitPrice").show(n=2, truncate=False)
df.columns
# select first 5 columns
df.select(df.columns[0:5]).show(2)
# selects all the original columns and adds a new column that specifies high value item
(df.selectExpr(
"*", # all original columns
"(UnitPrice > 100) as HighValueItem")
.show(2)
)
# aggregate over the whole DataFrame: total quantity and total inventory value (cast to int)
(df.selectExpr(
"sum(Quantity) as TotalQuantity",
"cast(sum(UnitPrice) as int) as InventoryValue")
.show()
)
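As a sketch, the same aggregation can also be written with column functions instead of a SQL expression (equivalent output, assuming the same df):
# same aggregation with the DataFrame API instead of selectExpr
from pyspark.sql.functions import sum as sum_
(df.select(
     sum_("Quantity").alias("TotalQuantity"),
     sum_("UnitPrice").cast("int").alias("InventoryValue"))
 .show()
)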
Adding, renaming and dropping columns¶
# add a new column called InvoiceValue
from pyspark.sql.functions import expr
df_1 = (df
.withColumn("InvoiceValue", expr("UnitPrice * Quantity"))
.select("InvoiceNo","Description","UnitPrice","Quantity","InvoiceValue")
)
df_1.show(2, False)
# rename InvoiceValue to LineTotal
df_2 = df_1.withColumnRenamed("InvoiceValue","LineTotal")
df_2.show(2, False)
# drop a column
df_2.drop("LineTotal").show(2, False)
Hands-On 3 - Spark Transformations - filter, sort and cast¶
from pyspark.sql.functions import col
# select invoice lines with quantity > 20 and unitprice > 50
df.where(col("Quantity") > 20).where(col("UnitPrice") > 50).show(2)
df.filter(df.Quantity > 20).filter(df.UnitPrice > 50).show(2)
df.filter("Quantity > 20 and UnitPrice > 50").show(2)
# select invoice lines with quantity > 100 or unitprice > 20
df.where((col("Quantity") > 100) | (col("UnitPrice") > 20)).show(2)
from pyspark.sql.functions import desc, asc
# sort in the default order: ascending
df.orderBy(expr("UnitPrice")).show(2)
df.orderBy(col("Quantity").desc(), col("UnitPrice").asc()).show(10)
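The section title also mentions casting; here is a minimal casting sketch on this dataset (column names as above):
# cast a column with Column.cast; here UnitPrice from float to double
from pyspark.sql.functions import col
df.select(col("UnitPrice"),
          col("UnitPrice").cast("double").alias("UnitPriceDouble")).show(2)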
Hands-On 4 - Spark Transformations - aggregations¶
Full list of built-in functions: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#functions
%%time
# Count distinct customers
from pyspark.sql.functions import countDistinct
df.select(countDistinct("CustomerID")).show()
%%time
# approximate count of distinct customers (maximum relative standard deviation 0.1)
from pyspark.sql.functions import approx_count_distinct
df.select(approx_count_distinct("CustomerID", 0.1)).show()
# average, maximum and minimum purchase quantity
from pyspark.sql.functions import avg, max, min
( df.select(
avg("Quantity").alias("avg_purchases"),
max("Quantity").alias("max_purchases"),
min("Quantity").alias("min_purchases"))
.show()
)
Hands-On 5 - Spark Transformations - grouping and windows¶
# count of lines per invoice (grouped by invoice and customer)
df.groupBy("InvoiceNo", "CustomerId").count().show(5)
# grouping with expressions
df.groupBy("InvoiceNo").agg(expr("avg(Quantity)"),expr("stddev_pop(Quantity)"))\
.show(5)
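The heading also mentions windows; as a minimal sketch (an illustrative addition), a window function can rank the lines of each invoice by quantity:
# rank lines within each invoice by Quantity using a window function
from pyspark.sql.window import Window
from pyspark.sql.functions import rank, col

invoice_window = Window.partitionBy("InvoiceNo").orderBy(col("Quantity").desc())
(df.withColumn("QuantityRank", rank().over(invoice_window))
   .select("InvoiceNo", "StockCode", "Quantity", "QuantityRank")
   .show(5)
)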
Read the CSV file into a DataFrame¶
%%time
is an IPython magic: https://ipython.readthedocs.io/en/stable/interactive/magics.html
It is possible to read files without specifying the schema. Some file formats (Parquet is one of them) embed the schema, which means that Spark can start reading the file right away. For formats without an embedded schema (CSV, JSON, ...) Spark can infer it. Let's see what the difference is in terms of time and results:
online_retail_schema="InvoiceNo int, StockCode string, Description string, Quantity int,\
InvoiceDate timestamp,UnitPrice float,CustomerId int, Country string"
%%time
df = spark.read \
.option("header", "true") \
.option("timestampFormat", "M/d/yyyy H:m")\
.csv("../data/online-retail-dataset.csv.gz",
schema=online_retail_schema)
%%time
df_infer = spark.read \
.option("header", "true") \
.option("inferSchema", "true") \
.csv("../data/online-retail-dataset.csv.gz")
Exercises¶
Reminder: documentation at https://spark.apache.org/docs/latest/api/python/index.html
If you didn't run the previous cells, run the following one:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.master("local[*]") \
.appName("DataFrame HandsOn 1") \
.config("spark.ui.showConsoleProgress","false") \
.getOrCreate()
online_retail_schema="InvoiceNo int, StockCode string, Description string, Quantity int,\
InvoiceDate timestamp,UnitPrice float,CustomerId int, Country string"
df = spark.read \
.option("header", "true") \
.option("timestampFormat", "M/d/yyyy H:m")\
.csv("../data/online-retail-dataset.csv.gz",
schema=online_retail_schema)
Task: Show 5 lines of the "Description" column
Task: Count the number of distinct invoices in the dataframe
Task: Find out in which month most invoices have been issued
Task: Filter the lines where the Quantity is more than 30
Task: Show the four most sold items (by quantity)
Bonus question: why do these two operations return different results? Hint: look at the documentation
print(df.select("InvoiceNo").distinct().count())
from pyspark.sql.functions import countDistinct
df.select(countDistinct("InvoiceNo")).show()