{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## This notebook is part of the Spark training delivered by CERN IT\n", "### Regression with spark.ml\n", "Contact: Luca.Canali@cern.ch\n", "\n", "This notebook is an implementation of a regression system trained using `spark.ml` to predict house prices.\n", "\n", "The data used for this exercise is the \"California Housing Prices dataset\" from the StatLib repository, originally featured in the following paper: Pace, R. Kelley, and Ronald Barry. \"Sparse spatial autoregressions.\" Statistics & Probability Letters 33.3 (1997): 291-297.\n", "The code and steps we follow in this notebook are inspired by the book \"Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, Aurelien Geron, 2nd Edition\".\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run this notebook from Jupyter with a Python kernel\n", "- When running on CERN SWAN, do not attach the notebook to a Spark cluster, but rather run locally on the SWAN container\n", "- If running this outside CERN SWAN, please make sure you have PySpark installed: `pip install pyspark`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create the Spark session and read the data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#\n", "# Local mode: run this when using CERN SWAN not connected to a cluster \n", "# or run it on a private Jupyter notebook instance\n", "# Dependency: PySpark (use SWAN or pip install pyspark)\n", "#\n", "\n", "from pyspark.sql import SparkSession\n", "spark = SparkSession.builder \\\n", " .master(\"local[*]\") \\\n", " .appName(\"ML HandsOn Regression\") \\\n", " .config(\"spark.driver.memory\",\"4g\") \\\n", " .config(\"spark.ui.showConsoleProgress\", \"false\") \\\n", " .getOrCreate()" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
SparkSession - in-memory
SparkContext
Version: v3.3.1
Master: local[*]
AppName: ML HandsOn Regression

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -124.35 | 40.54 | 52.0 | 1820.0 | 300.0 | 806.0 | 270.0 | 3.0147 | 94600.0 | NEAR OCEAN |
| 1 | -124.30 | 41.80 | 19.0 | 2672.0 | 552.0 | 1298.0 | 478.0 | 1.9797 | 85800.0 | NEAR OCEAN |
| 2 | -124.30 | 41.84 | 17.0 | 2677.0 | 531.0 | 1244.0 | 456.0 | 3.0313 | 103600.0 | NEAR OCEAN |
| 3 | -124.27 | 40.69 | 36.0 | 2349.0 | 528.0 | 1194.0 | 465.0 | 2.5179 | 79000.0 | NEAR OCEAN |
| 4 | -124.26 | 40.58 | 52.0 | 2217.0 | 394.0 | 907.0 | 369.0 | 2.3571 | 111400.0 | NEAR OCEAN |
| 5 | -124.25 | 40.28 | 32.0 | 1430.0 | 419.0 | 434.0 | 187.0 | 1.9417 | 76100.0 | NEAR OCEAN |
| 6 | -124.23 | 40.54 | 52.0 | 2694.0 | 453.0 | 1152.0 | 435.0 | 3.0806 | 106700.0 | NEAR OCEAN |
| 7 | -124.23 | 41.75 | 11.0 | 3159.0 | 616.0 | 1343.0 | 479.0 | 2.4805 | 73200.0 | NEAR OCEAN |
| 8 | -124.22 | 41.73 | 28.0 | 3003.0 | 699.0 | 1530.0 | 653.0 | 1.7038 | 78300.0 | NEAR OCEAN |
| 9 | -124.21 | 40.75 | 32.0 | 1218.0 | 331.0 | 620.0 | 268.0 | 1.6528 | 58100.0 | NEAR OCEAN |

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | indexed_ocean_proximity | oh_ocean_proximity | total_bedrooms_filled |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -124.35 | 40.54 | 52.0 | 1820.0 | 300.0 | 806.0 | 270.0 | 3.0147 | 94600.0 | NEAR OCEAN | 2.0 | (0.0, 0.0, 1.0, 0.0, 0.0) | 300.0 |
| 1 | -124.30 | 41.80 | 19.0 | 2672.0 | 552.0 | 1298.0 | 478.0 | 1.9797 | 85800.0 | NEAR OCEAN | 2.0 | (0.0, 0.0, 1.0, 0.0, 0.0) | 552.0 |
| 2 | -124.30 | 41.84 | 17.0 | 2677.0 | 531.0 | 1244.0 | 456.0 | 3.0313 | 103600.0 | NEAR OCEAN | 2.0 | (0.0, 0.0, 1.0, 0.0, 0.0) | 531.0 |
| 3 | -124.27 | 40.69 | 36.0 | 2349.0 | 528.0 | 1194.0 | 465.0 | 2.5179 | 79000.0 | NEAR OCEAN | 2.0 | (0.0, 0.0, 1.0, 0.0, 0.0) | 528.0 |
| 4 | -124.26 | 40.58 | 52.0 | 2217.0 | 394.0 | 907.0 | 369.0 | 2.3571 | 111400.0 | NEAR OCEAN | 2.0 | (0.0, 0.0, 1.0, 0.0, 0.0) | 394.0 |
| 5 | -124.25 | 40.28 | 32.0 | 1430.0 | 419.0 | 434.0 | 187.0 | 1.9417 | 76100.0 | NEAR OCEAN | 2.0 | (0.0, 0.0, 1.0, 0.0, 0.0) | 419.0 |
| 6 | -124.23 | 40.54 | 52.0 | 2694.0 | 453.0 | 1152.0 | 435.0 | 3.0806 | 106700.0 | NEAR OCEAN | 2.0 | (0.0, 0.0, 1.0, 0.0, 0.0) | 453.0 |
| 7 | -124.23 | 41.75 | 11.0 | 3159.0 | 616.0 | 1343.0 | 479.0 | 2.4805 | 73200.0 | NEAR OCEAN | 2.0 | (0.0, 0.0, 1.0, 0.0, 0.0) | 616.0 |
| 8 | -124.22 | 41.73 | 28.0 | 3003.0 | 699.0 | 1530.0 | 653.0 | 1.7038 | 78300.0 | NEAR OCEAN | 2.0 | (0.0, 0.0, 1.0, 0.0, 0.0) | 699.0 |
| 9 | -124.21 | 40.75 | 32.0 | 1218.0 | 331.0 | 620.0 | 268.0 | 1.6528 | 58100.0 | NEAR OCEAN | 2.0 | (0.0, 0.0, 1.0, 0.0, 0.0) | 331.0 |
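
The three derived columns in the table above (`indexed_ocean_proximity`, `oh_ocean_proximity`, `total_bedrooms_filled`) can be illustrated with a small plain-Python sketch. The cells that produced them are not visible here, so this is not the notebook's own code. It only mimics the kind of result that `spark.ml` transformers such as `StringIndexer`, `OneHotEncoder` and `Imputer` would typically give. The helper names, the toy label counts and the median fill strategy are all assumptions made for illustration.

```python
from collections import Counter
from statistics import median


def index_by_frequency(values):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(values)
    # Descending count, then alphabetical order so ties are deterministic.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def one_hot(index, size):
    """Return a tuple of `size` zeros with a single 1.0 at position `index`."""
    vec = [0.0] * size
    vec[int(index)] = 1.0
    return tuple(vec)


def fill_missing(values, fill_value):
    """Replace None entries with `fill_value`, leaving other values untouched."""
    return [fill_value if v is None else v for v in values]


# Toy labels with hypothetical counts (not the real dataset frequencies).
labels = (["<1H OCEAN"] * 5 + ["INLAND"] * 4 + ["NEAR OCEAN"] * 3
          + ["NEAR BAY"] * 2 + ["ISLAND"])
mapping = index_by_frequency(labels)
encoded = one_hot(mapping["NEAR OCEAN"], len(mapping))

# Toy bedroom counts with one missing value; median fill is an assumption.
bedrooms = [300.0, None, 531.0, 528.0]
fill_value = median(v for v in bedrooms if v is not None)
filled = fill_missing(bedrooms, fill_value)

print(mapping["NEAR OCEAN"], encoded, filled)
```

With these toy counts, `NEAR OCEAN` is the third most frequent label, so it gets index 2.0 and a 1.0 in the third slot, which matches the pattern in the table. Rows whose `total_bedrooms` is already present keep their value in `total_bedrooms_filled`, as the ten displayed rows do.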

| | unscaled_features | features |
|---|---|---|
| 0 | [-124.35, 40.54, 52.0, 1820.0, 806.0, 270.0, 3... | [-62.03326688689804, 18.957877035296093, 4.137... |
| 1 | [-124.3, 41.8, 19.0, 2672.0, 1298.0, 478.0, 1.... | [-62.00832387648916, 19.547095709802086, 1.511... |
| 2 | [-124.3, 41.84, 17.0, 2677.0, 1244.0, 456.0, 3... | [-62.00832387648916, 19.56580106454831, 1.3527... |
| 3 | [-124.27, 40.69, 36.0, 2349.0, 1194.0, 465.0, ... | [-61.99335807024382, 19.028022115594425, 2.864... |
| 4 | [-124.26, 40.58, 52.0, 2217.0, 907.0, 369.0, 2... | [-61.988369468162055, 18.976582390042314, 4.13... |
| 5 | [-124.25, 40.28, 32.0, 1430.0, 434.0, 187.0, 1... | [-61.983380866080275, 18.83629222944565, 2.546... |
| 6 | [-124.23, 40.54, 52.0, 2694.0, 1152.0, 435.0, ... | [-61.97340366191672, 18.957877035296093, 4.137... |
| 7 | [-124.23, 41.75, 11.0, 3159.0, 1343.0, 479.0, ... | [-61.97340366191672, 19.52371401636931, 0.8753... |
| 8 | [-124.22, 41.73, 28.0, 3003.0, 1530.0, 653.0, ... | [-61.96841505983494, 19.5143613389962, 2.22808... |
| 9 | [-124.21, 40.75, 32.0, 1218.0, 620.0, 268.0, 1... | [-61.96342645775316, 19.056080147713757, 2.546... |
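
The relationship between `unscaled_features` and `features` in the table above can be checked with a little arithmetic. The scaled longitudes stay large and negative, which suggests each feature was divided by a constant without subtracting the mean. That is how `StandardScaler` in `spark.ml` behaves with its default settings (`withStd=True`, `withMean=False`). The scaling cell itself is not shown here, so treat the plain-Python sketch below as an illustration under that assumption. `scale_by_std` is a hypothetical helper, and the five-value sample is taken from the displayed rows only.

```python
from statistics import stdev


def scale_by_std(column):
    """Divide every value by the column's sample standard deviation (no centering)."""
    s = stdev(column)  # sample standard deviation, n - 1 in the denominator
    return [v / s for v in column]


# Divisor implied by row 0 of the table: unscaled longitude / scaled longitude.
implied_std = -124.35 / -62.03326688689804

# The same divisor should reproduce the scaled longitude of row 1.
row1_scaled = -124.30 / implied_std

# Toy column: the first five longitudes shown above. Its standard deviation
# differs from that of the full dataset, so the scaled values differ as well.
longitudes = [-124.35, -124.30, -124.30, -124.27, -124.26]
scaled = scale_by_std(longitudes)

print(round(implied_std, 4), row1_scaled, scaled)
```

The implied divisor comes out at roughly 2.0046, and applying it to row 1 reproduces the tabulated -62.008... value. This supports the reading that the column was scaled to unit standard deviation without being centered.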