LangChain_OpenSearch_semantic_search_with_Vector_DB.ipynb

Semantic Search with Vector Databases and LLM

This notebook implements a basic example of semantic search.
The building blocks are now easy to obtain: the frameworks and models used here are readily available, for the most part as open software and open resources, and also as cloud services offered under a subscription.
Semantic search can be applied to querying a set of documents. For simplicity, this example uses a single PDF document, the article "A Roadmap for HEP Software and Computing R&D for the 2020s".

The implementation steps are:

  1. Take an example document and split it into chunks
  2. Create embeddings for each document chunk
  3. Store the embeddings in a Vector Database
  4. Perform semantic search using embeddings
  5. Transform the results of the search into natural language using a Large Language Model
In [1]:
# This requires langchain and pypdf, pip install if not already available
# !pip install langchain
# !pip install pypdf

1. Take an example document and split it into chunks

In [1]:
# Download the document used in this example,
# the article "A Roadmap for HEP Software and Computing R&D for the 2020s"
# see https://arxiv.org/abs/1712.06982

# Download a copy of the document and save it as WLCG_roadmap.pdf:
! curl https://arxiv.org/pdf/1712.06982.pdf -o WLCG_roadmap.pdf
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  833k  100  833k    0     0   336k      0  0:00:02  0:00:02 --:--:--  336k
In [2]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("WLCG_roadmap.pdf")
pages = loader.load_and_split()
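
As an optional check, you can inspect what load_and_split returned: a list of LangChain Document objects, each holding the chunk text plus metadata such as the page number.

In [ ]:
# Optional: inspect the chunks produced by the loader
print("Number of chunks:", len(pages))
print(pages[0].metadata)
print(pages[0].page_content[:300])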

2. Create embeddings for each document chunk

In [2]:
# !pip install sentence_transformers
In [ ]:
from langchain.embeddings import HuggingFaceEmbeddings

model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}

embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)
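
To verify the embedding setup, the minimal sketch below embeds a single test sentence with embed_query and prints the vector size; all-mpnet-base-v2 produces 768-dimensional vectors. The test sentence is only an illustration.

In [ ]:
# Optional: embed a test sentence and check the vector dimension
test_vector = embeddings.embed_query("A test sentence about HEP computing")
print(len(test_vector))  # 768 for all-mpnet-base-v2
print(test_vector[:5])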

3. Store the embeddings in a Vector Database


Option 1 (small data): use FAISS as the Vector Database

In [4]:
# This example uses FAISS as an in-memory vector store
# !pip install faiss-cpu
In [ ]:
from langchain.vectorstores import FAISS

# Create the embeddings and store them in an in-memory DB with FAISS
faiss_index = FAISS.from_documents(pages, embeddings)

# Optionally, save the index to a local file for reuse
faiss_index.save_local("faiss_index")
In [ ]:
# This is how you can load in the index with embeddings saved to a file
# for future runs of the notebook

# from langchain.vectorstores import FAISS
# faiss_index = FAISS.load_local("faiss_index", embeddings)
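
If you went with the FAISS option, a minimal sanity check (assuming the LangChain wrapper exposes the underlying faiss index as .index) is to print how many vectors were stored:

In [ ]:
# Optional sanity check: number of vectors stored in the FAISS index
# (assumes the LangChain FAISS wrapper exposes the raw faiss index as .index)
print(faiss_index.index.ntotal)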

Option 2 (large data): use Open Search as the Vector Database

In [4]:
# This example uses Open Search as remote vector database
# !pip install opensearch-py
In [6]:
# This creates the embeddings and stores them in an Open Search index
# For future runs of the notebook, you can skip this and link to 
# the Open Search index directly

from langchain.vectorstores import OpenSearchVectorSearch
from getpass import getpass

# Contact Open Search service at CERN to get an instance and the credentials
opensearch_url="https://es-testspark1.cern.ch:443/es"
opensearch_user="test1"
opensearch_pass = getpass()


# compute the embeddings and store them in OpenSearch
docsearch = OpenSearchVectorSearch.from_documents(
     documents=pages, 
     embedding=embeddings, 
     index_name="embd1",
     opensearch_url=opensearch_url, 
     http_auth=(opensearch_user, opensearch_pass),     
     use_ssl = True,
     verify_certs = False,
     ssl_assert_hostname = False,
     ssl_show_warn = False
)
In [7]:
# This is how you can load in the index with embeddings stored to Open Search
# for future runs of the notebook

from langchain.vectorstores import OpenSearchVectorSearch
from getpass import getpass

# Open Search instance and the credentials
opensearch_url="https://es-testspark1.cern.ch:443/es"
opensearch_user="test1"
opensearch_pass = getpass()


# use pre-loaded embeddings in OpenSearch
docsearch = OpenSearchVectorSearch(
     embedding_function=embeddings, 
     index_name="embd1",
     opensearch_url=opensearch_url, 
     http_auth=(opensearch_user, opensearch_pass),     
     use_ssl = True,
     verify_certs = False,
     ssl_assert_hostname = False,
     ssl_show_warn = False
)
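
As an optional check, you can query OpenSearch directly with the opensearch-py client and count the documents stored in the index; the connection options below simply mirror the ones used for the vector store above.

In [ ]:
# Optional: count the documents stored in the "embd1" index via opensearch-py
from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[opensearch_url],
    http_auth=(opensearch_user, opensearch_pass),
    use_ssl=True,
    verify_certs=False,
    ssl_assert_hostname=False,
    ssl_show_warn=False,
)
print(client.count(index="embd1"))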

4. Perform semantic search using embeddings


In [8]:
# Choose the index you have created and want to use for this (FAISS or Open Search)
# index = faiss_index # use FAISS in-memory index

index = docsearch # use OpenSearch Index
In [10]:
# Perform a simple similarity search

query = "How will computing evolve in the next decade with LHC high luminosity?"

found_docs = index.similarity_search(query, k=2)

for doc in found_docs:
    print(str(doc.metadata["page"]) + ":", doc.page_content[:300])
37: the same data volumes as ATLAS. The HL-LHC storage requirements per year are
expected to jump by a factor close to 10, which is a growth rate faster than can
be accommodated by projected technology gains. Storage will remain one of the
major cost drivers for HEP computing, at a level roughly equal t
37: the same data volumes as ATLAS. The HL-LHC storage requirements per year are
expected to jump by a factor close to 10, which is a growth rate faster than can
be accommodated by projected technology gains. Storage will remain one of the
major cost drivers for HEP computing, at a level roughly equal t
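
If you also want to see how close each hit is, the vector stores used here provide similarity_search_with_score, which returns (document, score) pairs; FAISS supports it, as do recent versions of the OpenSearch store, so treat the sketch below as optional.

In [ ]:
# Optional: retrieve documents together with their similarity scores
docs_and_scores = index.similarity_search_with_score(query, k=2)
for doc, score in docs_and_scores:
    print(score, doc.metadata["page"], doc.page_content[:100])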

5. Transform the results of the search into natural language using a Large Language Model

In [11]:
# OpenAI
#! pip install openai

from langchain.llms import OpenAI
import os

from getpass import getpass
OPENAI_API_KEY = getpass()

os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY

llm=OpenAI(temperature=0)
········
In [12]:
from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=index.as_retriever(search_type="similarity", search_kwargs={"k":2}), 
    return_source_documents=True)
In [13]:
query = "How will computing evolve in the next decade with LHC high luminosity?"

result = qa({"query": query})
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
    - Avoid using `tokenizers` before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
In [14]:
result['result']
Out[14]:
' Computing will need to evolve to handle the increased data rate and volume, as well as the increased computational requirements. This will likely involve shifts in data presentation and analysis models, such as the use of event-based data streaming, and the use of new types of computing resources, such as cloud and HPC. New applications, such as training for machine learning, may also be employed to meet the computational constraints and extend physics reach.'
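
Because the chain was created with return_source_documents=True, the result dictionary also carries the retrieved chunks, which makes it easy to check which pages the answer was grounded on:

In [ ]:
# Show the source chunks used to generate the answer
for doc in result["source_documents"]:
    print(str(doc.metadata["page"]) + ":", doc.page_content[:200])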
In [ ]: