{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## This notebook is part of the Apache Spark training delivered by CERN-IT\n", "### Spark SQL Hands-On Lab with Solutions\n", "Contact: Luca.Canali@cern.ch" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run this notebook from Jupyter with a Python kernel\n", "- When using CERN SWAN, do not attach the notebook to a Spark cluster, but rather run it locally on the SWAN container\n", "- If running this outside CERN SWAN, please make sure to have PySpark installed: `pip install pyspark`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example datasets\n", "The following examples use sample data provided in the repository. \n", "We will use the MovieLens dataset from Kaggle, credits: https://www.kaggle.com/grouplens/movielens-20m-dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create Spark Session, you need this to work with Spark\n", "from pyspark.sql import SparkSession\n", "spark = SparkSession.builder \\\n", " .appName(\"My spark example app\") \\\n", " .master(\"local[*]\") \\\n", " .config(\"spark.driver.memory\",\"8g\") \\\n", " .config(\"spark.ui.showConsoleProgress\", \"false\") \\\n", " .getOrCreate()" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
SparkSession - in-memory
SparkContext
Version: v3.3.1
Master: local[*]
AppName: My spark example app

| | title | count(1) |
|---|---|---|
| 0 | Forrest Gump (1994) | 45782 |
| 1 | Shawshank Redemption, The (1994) | 45546 |
| 2 | Pulp Fiction (1994) | 43755 |
| 3 | Silence of the Lambs, The (1991) | 41807 |
| 4 | Matrix, The (1999) | 38860 |
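
The table above lists titles with a `count(1)` column, which is the shape of a join followed by a count per title. The query that produced it is not shown in this section, so the following is only a minimal sketch of that pattern. It assumes hypothetical `movies` and `ratings` tables with `movieId`, `title` and `rating` columns, uses a few made-up rows, and runs on Python's built-in `sqlite3` rather than Spark so that it needs no Spark installation. The same SQL text should also be usable with `spark.sql(...)` once the DataFrames are registered as temporary views, but that is an assumption, not something shown here.

```python
import sqlite3

# Toy stand-ins for the movies and ratings data (hypothetical rows, not the real dataset)
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE movies (movieId INTEGER, title TEXT)')
conn.execute('CREATE TABLE ratings (userId INTEGER, movieId INTEGER, rating REAL)')
conn.executemany('INSERT INTO movies VALUES (?, ?)',
                 [(1, 'Movie A'), (2, 'Movie B'), (3, 'Movie C')])
conn.executemany('INSERT INTO ratings VALUES (?, ?, ?)',
                 [(10, 1, 4.0), (11, 1, 5.0), (12, 1, 3.5),
                  (10, 2, 2.0), (11, 2, 4.5),
                  (10, 3, 3.0)])

# Count ratings per title and keep only the most-rated titles
top_rated_sql = '''
    SELECT m.title, COUNT(*) AS num_ratings
    FROM ratings r
    JOIN movies m ON r.movieId = m.movieId
    GROUP BY m.title
    ORDER BY num_ratings DESC
    LIMIT 2
'''
top_rated = conn.execute(top_rated_sql).fetchall()
print(top_rated)
```

The `LIMIT 2` is arbitrary for the toy data; a top-5 listing would use `LIMIT 5`.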

| | title | avg_rating | count |
|---|---|---|---|
| 0 | Planet Earth (2006) | 4.467391 | 368 |
| 1 | Band of Brothers (2001) | 4.431655 | 139 |
| 2 | Shawshank Redemption, The (1994) | 4.426338 | 45546 |
| 3 | Godfather, The (1972) | 4.335648 | 28582 |
| 4 | Usual Suspects, The (1995) | 4.299494 | 29635 |
| 5 | Godfather: Part II, The (1974) | 4.266718 | 18319 |
| 6 | Seven Samurai (Shichinin no samurai) (1954) | 4.265507 | 6900 |
| 7 | Schindler's List (1993) | 4.261945 | 33780 |
| 8 | The Blue Planet (2001) | 4.234615 | 130 |
| 9 | Fight Club (1999) | 4.232034 | 29931 |
| 10 | One Flew Over the Cuckoo's Nest (1975) | 4.230852 | 19937 |
| 11 | 12 Angry Men (1957) | 4.229520 | 8374 |
| 12 | Rear Window (1954) | 4.229511 | 10542 |
| 13 | Paths of Glory (1957) | 4.218140 | 2150 |
| 14 | Casablanca (1942) | 4.215292 | 14903 |
| 15 | Sunset Blvd. (a.k.a. Sunset Boulevard) (1950) | 4.214214 | 3975 |
| 16 | North by Northwest (1959) | 4.211488 | 9445 |
| 17 | Third Man, The (1949) | 4.210196 | 3825 |
| 18 | Spirited Away (Sen to Chihiro no kamikakushi) ... | 4.209656 | 10398 |
| 19 | Dr. Strangelove or: How I Learned to Stop Worr... | 4.209441 | 13992 |
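
The table above ranks titles by `avg_rating` and also reports a `count` column. A ranking like this is commonly built with an average per group plus a minimum-count filter, so that titles with only a handful of ratings do not dominate. The original query and any threshold it used are not shown in this section. The sketch below therefore uses a hypothetical threshold of 2 on made-up rows, with hypothetical `movies` and `ratings` tables, and runs on Python's built-in `sqlite3` instead of Spark so it is self-contained.

```python
import sqlite3

# Toy stand-ins for the movies and ratings data (hypothetical rows, not the real dataset)
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE movies (movieId INTEGER, title TEXT)')
conn.execute('CREATE TABLE ratings (userId INTEGER, movieId INTEGER, rating REAL)')
conn.executemany('INSERT INTO movies VALUES (?, ?)',
                 [(1, 'Movie A'), (2, 'Movie B'), (3, 'Movie C')])
conn.executemany('INSERT INTO ratings VALUES (?, ?, ?)',
                 [(10, 1, 5.0), (11, 1, 4.0), (12, 1, 4.5),  # Movie A: 3 ratings
                  (10, 2, 5.0),                              # Movie B: 1 rating
                  (10, 3, 3.0), (11, 3, 4.0)])               # Movie C: 2 ratings

# Average rating per title, keeping only titles with enough ratings.
# The threshold (2) is an assumption chosen for the toy data.
best_rated_sql = '''
    SELECT m.title, AVG(r.rating) AS avg_rating, COUNT(*) AS num_ratings
    FROM ratings r
    JOIN movies m ON r.movieId = m.movieId
    GROUP BY m.title
    HAVING COUNT(*) >= 2
    ORDER BY avg_rating DESC
'''
best_rated = conn.execute(best_rated_sql).fetchall()
print(best_rated)
```

Movie B has the highest average in the toy data but is dropped by the `HAVING` clause because it has a single rating, which is the effect a minimum-count filter is meant to have.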