Query_Oracle.ipynb Open in SWAN Download

How to Connect and Query Oracle for Python

This provides and example of how to use Python and Jupyer notebooks on SWAN to connect and query Oracle databases at CERN. The example provided is about generating frequency histograms using Oracle SQL.

Setup and prerequisites

Prerquites: you need to have credential to connect to an Oracle database. It is recommended that you use a test account, for example on devdb19. Contact Oracle service at CERN to request an account if needed.

In [ ]:
# oracledb is the Python driver for Oracle
# it is the next generation from the legacy cx_Oracle

# If oracledb is not yet installed, install it and restart the notebook
! pip install oracledb

import oracledb
In [ ]:
# Edit DB credentials
import getpass

db_user = 'myaccount' #edit

# To avoid storing connection passwords use getpas or db_config
db_pass = getpass.getpass()

db_connect_string = 'devdb19'
In [3]:
# When using oracledb with CERN Oracle services you need to set it in client mode

oracledb.init_oracle_client()

Create the test table

In [4]:
with oracledb.connect(user=db_user, password=db_pass, dsn=db_connect_string) as ora_conn:

    cursor = ora_conn.cursor()

    # use this drop statement if you need to recreate the table
    cursor.execute("drop table data")

    # cursor.execute("begin dbms_random.seed(4242); end;")

    cursor.execute("""
        create table data as 
        select dbms_random.value * 100 random_value 
        from dual connect by level <=100
        """)

Define the query to compute the histogram

In [5]:
table_name = "data" # table or temporary view containing the data
value_col = "random_value" # column name on which to compute the histogram
min = -20  # min: minimum value in the histogram
max = 90   # maximum value in the histogram
bins = 11  # number of histogram buckets to compute
step = (max - min) / bins


query = f"""
with bucketized as (
  select width_bucket({value_col}, {min}, {max}, {bins}) as bucket
  from {table_name}
),
hist as (
  select bucket, count(*) as cnt
  from bucketized
  group by bucket
),
buckets as (
  select rownum as bucket from dual connect by level <= {bins}
)
select
    bucket, {min} + (bucket - 1/2) * {step} as value,
    nvl(cnt, 0) as count
from hist right outer join buckets using(bucket)
order by bucket
"""

Fetch the histogram data into a Pandas dataframe using SQLAlchemy

In [6]:
import pandas as pd
from sqlalchemy import create_engine

# create SQLAlchemy engine
engine = create_engine(f"oracle+oracledb://{db_user}:{db_pass}@{db_connect_string}")
In [7]:
# query Oracle using ora_conn and put the result into a pandas Dataframe
hist_pandasDF = pd.read_sql(query, engine)

# Decription of the resulting dataframe
#
# BUCKET: the bucket number, range from 1 to bins (included)
# VALUE: midpoint value of the given bucket
# COUNT: number of values in the bucket 

hist_pandasDF
Out[7]:
bucket value count
0 1 -15 0
1 2 -5 0
2 3 5 9
3 4 15 7
4 5 25 12
5 6 35 13
6 7 45 7
7 8 55 11
8 9 65 8
9 10 75 13
10 11 85 11
In [8]:
# Optionally normalize the event count into a frequency
# dividing by the total number of events

hist_pandasDF["frequency"] = hist_pandasDF["count"] / sum(hist_pandasDF["count"]) 

hist_pandasDF
Out[8]:
bucket value count frequency
0 1 -15 0 0.000000
1 2 -5 0 0.000000
2 3 5 9 0.098901
3 4 15 7 0.076923
4 5 25 12 0.131868
5 6 35 13 0.142857
6 7 45 7 0.076923
7 8 55 11 0.120879
8 9 65 8 0.087912
9 10 75 13 0.142857
10 11 85 11 0.120879

Histogram plotting

The first plot is a histogram with the event counts (number of events per bin). The second plot is a histogram of the events frequencies (number of events per bin normalized by the sum of the events).

In [9]:
import matplotlib.pyplot as plt 

plt.rcParams.update({'font.size': 20, 'figure.figsize': [14,10]})

f, ax = plt.subplots()

# histogram data
x = hist_pandasDF["value"]
y = hist_pandasDF["count"]

# bar plot
ax.bar(x, y, width = 3.0, color='red')

ax.set_xlabel("Bucket values")
ax.set_ylabel("Event count")
ax.set_title("Distribution of event counts")

# Label for the resonances spectrum peaks
txt_opts = {'horizontalalignment': 'center',
            'verticalalignment': 'center',
            'transform': ax.transAxes}

plt.show()
No description has been provided for this image
In [10]:
import matplotlib.pyplot as plt 
plt.rcParams.update({'font.size': 20, 'figure.figsize': [14,10]})

f, ax = plt.subplots()

# histogram data
x = hist_pandasDF["value"]
y = hist_pandasDF["frequency"]

# bar plot
ax.bar(x, y, width = 3.0, color='blue')

ax.set_xlabel("Bucket values")
ax.set_ylabel("Event frequency")
ax.set_title("Distribution of event frequencies")

# Label for the resonances spectrum peaks
txt_opts = {'horizontalalignment': 'center',
            'verticalalignment': 'center',
            'transform': ax.transAxes}

plt.show()
No description has been provided for this image
In [ ]: