LangChain_OpenSearch_semantic_search_with_Vector_DB.ipynb

Semantic Search with Vector Databases and LLM

This notebook implements a basic example of semantic search.
The required technology is now easy to access: the frameworks and models are largely available as open software and open resources, and also as cloud services with a subscription.
Semantic search can be applied to querying a set of documents. For simplicity, this example uses a single PDF document, the article "A Roadmap for HEP Software and Computing R&D for the 2020s".

The implementation steps are:

  1. Take an example document and split it into chunks
  2. Create embeddings for each document chunk
  3. Store the embeddings in a Vector Database
  4. Perform semantic search using embeddings
  5. Transform the results of the search into natural language using a Large Language Model
In [1]:
# This requires langchain and pypdf, pip install if not already available
# !pip install langchain
# !pip install pypdf

1. Take an example document and split it into chunks

In [1]:
# Download the document used in this example,
# the article "A Roadmap for HEP Software and Computing R&D for the 2020s"
# see https://arxiv.org/abs/1712.06982

# Download a copy of the document and save it as WLCG_roadmap.pdf:
! curl https://arxiv.org/pdf/1712.06982 -o WLCG_roadmap.pdf
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  833k  100  833k    0     0  2396k      0 --:--:-- --:--:-- --:--:-- 2403k
In [2]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("WLCG_roadmap.pdf")
pages = loader.load_and_split()
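
As an optional sanity check (not part of the original notebook), you can inspect how many chunks the loader produced and peek at one of them; `pages` is a list of LangChain `Document` objects with `page_content` and `metadata`:

In [ ]:
# Optional sanity check (sketch): inspect the chunks produced by load_and_split()
print(f"Number of chunks: {len(pages)}")
print(pages[0].metadata)            # e.g. {'source': 'WLCG_roadmap.pdf', 'page': 0}
print(pages[0].page_content[:200])  # first 200 characters of the first chunk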

2. Create embeddings for each document chunk

In [3]:
# !pip install sentence_transformers
In [ ]:
from langchain.embeddings import HuggingFaceEmbeddings

model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}  # change to {"device": "cpu"} if no GPU is available

embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)
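
Optionally, you can verify that the embedding model loads and returns vectors of the expected size (all-mpnet-base-v2 produces 768-dimensional embeddings). This is a small check sketch, not part of the original notebook:

In [ ]:
# Optional check (sketch): embed a test sentence and inspect the vector size
vector = embeddings.embed_query("test sentence")
print(len(vector))  # expected: 768 for all-mpnet-base-v2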

3. Store the embeddings in a Vector Database


Option 1 (small data): this uses FAISS as the Vector Database

In [4]:
# This example uses FAISS as an in-memory vector database
# !pip install faiss-cpu
In [5]:
from langchain.vectorstores import FAISS

# Create the embeddings and store them in an in-memory DB with FAISS
faiss_index = FAISS.from_documents(pages, embeddings)

# Optionally, save the index to disk
faiss_index.save_local("faiss_index")
In [11]:
# This is how you can load the index with embeddings saved to a file
# for future runs of the notebook

# from langchain.vectorstores import FAISS
# faiss_index = FAISS.load_local("faiss_index", embeddings)
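
As a quick check that the index was built (a sketch, assuming the LangChain FAISS wrapper exposes the underlying faiss index as `.index`), you can print the number of stored vectors:

In [ ]:
# Optional check (sketch): number of vectors stored in the FAISS index
print(faiss_index.index.ntotal)  # should match the number of document chunks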

Option 2 (large data): this uses Open Search as the Vector Database

Use this option instead of FAISS if you have Open Search configured as a vector DB

In [4]:
# This example uses Open Search as remote vector database
# !pip install opensearch-py
In [ ]:
# This creates the embeddings and stores them in an Open Search index
# For future runs of the notebook, you can skip this and link to 
# the Open Search index directly

from langchain.vectorstores import OpenSearchVectorSearch
from getpass import getpass

# Contact Open Search service at CERN to get an instance and the credentials
opensearch_url="https://es-testspark1.cern.ch:443/es"
opensearch_user="test1"
opensearch_pass = getpass()


# Compute the embeddings and store them in Open Search
docsearch = OpenSearchVectorSearch.from_documents(
    documents=pages,
    embedding=embeddings,
    index_name="embd1",
    opensearch_url=opensearch_url,
    http_auth=(opensearch_user, opensearch_pass),
    use_ssl=True,
    verify_certs=False,
    ssl_assert_hostname=False,
    ssl_show_warn=False
)
In [7]:
# This is how you can load the index with embeddings stored in Open Search
# for future runs of the notebook

from langchain.vectorstores import OpenSearchVectorSearch
from getpass import getpass

# Open Search instance and the credentials
opensearch_url="https://es-testspark1.cern.ch:443/es"
opensearch_user="test1"
opensearch_pass = getpass()


# use pre-loaded embeddings in OpenSearch
docsearch = OpenSearchVectorSearch(
    embedding_function=embeddings,
    index_name="embd1",
    opensearch_url=opensearch_url,
    http_auth=(opensearch_user, opensearch_pass),
    use_ssl=True,
    verify_certs=False,
    ssl_assert_hostname=False,
    ssl_show_warn=False
)
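
Before moving on, you can optionally run a quick query against the Open Search index to confirm that it responds (a sketch, not in the original notebook; `k=1` just returns the single closest chunk):

In [ ]:
# Optional check (sketch): make sure the Open Search index answers similarity queries
test_hits = docsearch.similarity_search("high luminosity LHC computing", k=1)
print(test_hits[0].metadata, test_hits[0].page_content[:200])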

4. Perform semantic search using embeddings


Choose the index you have created and want to use for this (FAISS or Open Search)

In [6]:
# Choose the index (FAISS or Open Search)

index = faiss_index # use FAISS in-memory index
# index = docsearch # use OpenSearch Index
In [7]:
# Perform a simple similarity search

query = "How will computing evolve in the next decade with LHC high luminosity?"

found_docs = index.similarity_search(query, k=2)

for doc in found_docs:
    print(str(doc.metadata["page"]) + ":", doc.page_content[:300])
37: the same data volumes as ATLAS. The HL-LHC storage requirements per year are
expected to jump by a factor close to 10, which is a growth rate faster than can
be accommodated by projected technology gains. Storage will remain one of the
major cost drivers for HEP computing, at a level roughly equal t
4: and the nuclear matter in the universe today. The ALICE experiment at the LHC [14]
and the CBM [15] and PANDA [16] experiments at the Facility for Antiproton and
Ion Research (FAIR) are specically designed to probe this aspect of nuclear and
particle physics. In addition ATLAS, CMS and LHCb all con
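
If you also want to see how close each match is, both vector stores provide `similarity_search_with_score`, which returns (document, score) pairs; the meaning of the score depends on the backend (for FAISS it is typically an L2 distance, so lower is better). A minimal sketch, not part of the original notebook:

In [ ]:
# Optional (sketch): retrieve the similarity scores together with the documents
scored_docs = index.similarity_search_with_score(query, k=2)
for doc, score in scored_docs:
    print(f"score={score:.3f} page={doc.metadata['page']}")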

5. Transform the results of the search into natural language using a Large Language Model

In [8]:
# OpenAI
#! pip install openai

from langchain.llms import OpenAI
import os

from getpass import getpass
OPENAI_API_KEY = getpass()

os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY

# This will use the specified OpenAI model, see also https://openai.com/pricing
model = "gpt-3.5-turbo-instruct"
llm = OpenAI(model=model)
········
In [9]:
from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # "stuff" inserts the retrieved chunks directly into the LLM prompt
    retriever=index.as_retriever(search_type="similarity", search_kwargs={"k": 2}),
    return_source_documents=True)
In [10]:
query = "How will computing evolve in the next decade with LHC high luminosity?"

result = qa({"query": query})
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
    - Avoid using `tokenizers` before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
In [17]:
result['result']
Out[17]:
' Computing will likely evolve to include advanced techniques such as machine learning and high rate data query systems to meet the computational constraints and extend the physics reach. This will also require more dynamic data management and access systems, as well as specialised processor resources such as GPUs.'
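
Since the chain was created with return_source_documents=True, the result also contains the chunks that were retrieved and passed to the LLM; printing them is a useful way to check which parts of the paper the answer is based on. A short sketch:

In [ ]:
# Inspect the source chunks that were retrieved and passed to the LLM
for doc in result["source_documents"]:
    print(doc.metadata["page"], doc.page_content[:150])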
In [ ]: