LangChain_OpenSearch_semantic_search_with_Vector_DB.ipynb

Semantic Search with Vector Databases and LLM

This notebook implements a basic example of semantic search.
The required technology is now easy to access: the frameworks and models are largely available as open software and open resources, and also as cloud services with a subscription.
Semantic search can be applied to querying a set of documents. For simplicity, this example uses a single PDF document, the article "A Roadmap for HEP Software and Computing R&D for the 2020s".

The implementation steps are:

  1. Take an example document and split it into chunks
  2. Create embeddings for each document chunk
  3. Store the embeddings in a Vector Database
  4. Perform semantic search using embeddings
  5. Transform the results of the search into natural language using a Large Language Model
In [1]:
# This requires langchain and pypdf, pip install if not already available
# !pip install langchain
# !pip install pypdf

1. Take an example document and split it into chunks

In [1]:
# Download the document used in this example,
# the article "A Roadmap for HEP Software and Computing R&D for the 2020s"
# see https://arxiv.org/abs/1712.06982

# Download a copy of the document and save it as WLCG_roadmap.pdf:
! curl https://arxiv.org/pdf/1712.06982 -o WLCG_roadmap.pdf
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  833k  100  833k    0     0  2396k      0 --:--:-- --:--:-- --:--:-- 2403k
In [2]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("WLCG_roadmap.pdf")
pages = loader.load_and_split()
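
As an optional sanity check (not part of the original notebook), you can inspect how many chunks the loader produced and peek at one of them; `pages` is a list of LangChain `Document` objects with `page_content` and `metadata`:

In [ ]:
# Optional sanity check (sketch): inspect the chunks produced by load_and_split()
print(f"Number of chunks: {len(pages)}")
print(pages[0].metadata)            # e.g. {'source': 'WLCG_roadmap.pdf', 'page': 0}
print(pages[0].page_content[:200])  # first 200 characters of the first chunk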

2. Create embeddings for each document chunk

In [3]:
# !pip install sentence_transformers
In [ ]:
from langchain.embeddings import HuggingFaceEmbeddings

model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}  # change to {"device": "cpu"} if no GPU is available

embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)
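
Optionally, you can verify that the embedding model loads and returns vectors of the expected size (all-mpnet-base-v2 produces 768-dimensional embeddings). This is a small check sketch, not part of the original notebook:

In [ ]:
# Optional check (sketch): embed a test sentence and inspect the vector size
vector = embeddings.embed_query("test sentence")
print(len(vector))  # expected: 768 for all-mpnet-base-v2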

3. Store the embeddings in a Vector Database


Option 1 (small data): this uses FAISS as the Vector Database

In [4]:
# This example uses FAISS as an in-memory vector database
# !pip install faiss-cpu
In [5]:
from langchain.vectorstores import FAISS

# Create the embeddings and store them in an in-memory DB with FAISS
faiss_index = FAISS.from_documents(pages, embeddings)

# Optionally, save the index to disk
faiss_index.save_local("faiss_index")
In [11]:
# This is how you can load the index with embeddings saved to a file
# for future runs of the notebook

# from langchain.vectorstores import FAISS
# faiss_index = FAISS.load_local("faiss_index", embeddings)
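
As a quick check that the index was built (a sketch, assuming the LangChain FAISS wrapper exposes the underlying faiss index as `.index`), you can print the number of stored vectors:

In [ ]:
# Optional check (sketch): number of vectors stored in the FAISS index
print(faiss_index.index.ntotal)  # should match the number of document chunks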

Option 2 (large data): this uses Open Search as the Vector Database

Use this option instead of FAISS if you have Open Search configured as a vector DB

In [4]:
# This example uses Open Search as remote vector database
# !pip install opensearch-py
In [ ]:
# This creates the embeddings and stores them in an Open Search index
# For future runs of the notebook, you can skip this and link to 
# the Open Search index directly

from langchain.vectorstores import OpenSearchVectorSearch
from getpass import getpass

# Contact Open Search service at CERN to get an instance and the credentials
opensearch_url="https://es-testspark1.cern.ch:443/es"
opensearch_user="test1"
opensearch_pass = getpass()


# Compute the embeddings and store them in Open Search
docsearch = OpenSearchVectorSearch.from_documents(
    documents=pages,
    embedding=embeddings,
    index_name="embd1",
    opensearch_url=opensearch_url,
    http_auth=(opensearch_user, opensearch_pass),
    use_ssl=True,
    verify_certs=False,
    ssl_assert_hostname=False,
    ssl_show_warn=False
)
In [7]:
# This is how you can load the index with embeddings stored in Open Search
# for future runs of the notebook

from langchain.vectorstores import OpenSearchVectorSearch
from getpass import getpass

# Open Search instance and the credentials
opensearch_url="https://es-testspark1.cern.ch:443/es"
opensearch_user="test1"
opensearch_pass = getpass()


# use pre-loaded embeddings in OpenSearch
docsearch = OpenSearchVectorSearch(
    embedding_function=embeddings,
    index_name="embd1",
    opensearch_url=opensearch_url,
    http_auth=(opensearch_user, opensearch_pass),
    use_ssl=True,
    verify_certs=False,
    ssl_assert_hostname=False,
    ssl_show_warn=False
)
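
Before moving on, you can optionally run a quick query against the Open Search index to confirm that it responds (a sketch, not in the original notebook; `k=1` just returns the single closest chunk):

In [ ]:
# Optional check (sketch): make sure the Open Search index answers similarity queries
test_hits = docsearch.similarity_search("high luminosity LHC computing", k=1)
print(test_hits[0].metadata, test_hits[0].page_content[:200])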

4. Perform semantic search using embeddings


Choose the index you have created and want to use for this (FAISS or Open Search)

In [6]:
# Choose the index (FAISS or Open Search)

index = faiss_index # use FAISS in-memory index
# index = docsearch # use OpenSearch Index
In [7]:
# Perform a simple similarity search

query = "How will computing evolve in the next decade with LHC high luminosity?"

found_docs = index.similarity_search(query, k=2)

for doc in found_docs:
    print(str(doc.metadata["page"]) + ":", doc.page_content[:300])
37: the same data volumes as ATLAS. The HL-LHC storage requirements per year are
expected to jump by a factor close to 10, which is a growth rate faster than can
be accommodated by projected technology gains. Storage will remain one of the
major cost drivers for HEP computing, at a level roughly equal t
4: and the nuclear matter in the universe today. The ALICE experiment at the LHC [14]
and the CBM [15] and PANDA [16] experiments at the Facility for Antiproton and
Ion Research (FAIR) are specically designed to probe this aspect of nuclear and
particle physics. In addition ATLAS, CMS and LHCb all con
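
If you also want to see how close each match is, both vector stores provide `similarity_search_with_score`, which returns (document, score) pairs; the meaning of the score depends on the backend (for FAISS it is typically an L2 distance, so lower is better). A minimal sketch, not part of the original notebook:

In [ ]:
# Optional (sketch): retrieve the similarity scores together with the documents
scored_docs = index.similarity_search_with_score(query, k=2)
for doc, score in scored_docs:
    print(f"score={score:.3f} page={doc.metadata['page']}")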

5. Transform the results of the search into natural language using a Large Language Model

In [8]:
# OpenAI
#! pip install openai

from langchain.llms import OpenAI
import os

from getpass import getpass
OPENAI_API_KEY = getpass()

os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY

# This will use the specified OpenAI model, see also https://openai.com/pricing
model = "gpt-3.5-turbo-instruct"
llm = OpenAI(model=model)
········
In [9]:
from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # "stuff" inserts the retrieved chunks directly into the LLM prompt
    retriever=index.as_retriever(search_type="similarity", search_kwargs={"k": 2}),
    return_source_documents=True)
In [10]:
query = "How will computing evolve in the next decade with LHC high luminosity?"

result = qa({"query": query})
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
    - Avoid using `tokenizers` before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
In [17]:
result['result']
Out[17]:
' Computing will likely evolve to include advanced techniques such as machine learning and high rate data query systems to meet the computational constraints and extend the physics reach. This will also require more dynamic data management and access systems, as well as specialised processor resources such as GPUs.'
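
Since the chain was created with return_source_documents=True, the result also contains the chunks that were retrieved and passed to the LLM; printing them is a useful way to check which parts of the paper the answer is based on. A short sketch:

In [ ]:
# Inspect the source chunks that were retrieved and passed to the LLM
for doc in result["source_documents"]:
    print(doc.metadata["page"], doc.page_content[:150])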
In [ ]: