Semantic Search with Vector Databases and LLM¶
This notebook implements a basic example of semantic search.
The building blocks are now readily accessible: the frameworks and models used here are for the most part open source, and equivalent capabilities are also offered as subscription-based cloud services.
Semantic search can be applied to querying a set of documents. For simplicity, this example uses a single PDF document, the article "A Roadmap for HEP Software and Computing R&D for the 2020s".
The implementation steps are:
- Take an example document and split it into chunks
- Create embeddings for each document chunk
- Store the embeddings in a Vector Database
- Perform semantic search using embeddings
- Transform the results of the search into natural language using a Large Language Model
In [1]:
# This requires langchain and pypdf, pip install if not already available
# !pip install langchain
# !pip install pypdf
1. Take an example document and split it into chunks¶
In [1]:
# Download the document used in this example,
# the article "A Roadmap for HEP Software and Computing R&D for the 2020s"
# see https://arxiv.org/abs/1712.06982
# Download a copy of the document and save it as WLCG_roadmap.pdf:
! curl https://arxiv.org/pdf/1712.06982 -o WLCG_roadmap.pdf
In [2]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("WLCG_roadmap.pdf")
pages = loader.load_and_split()
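As an optional sanity check (not part of the original notebook), you can inspect how many chunks the loader produced and preview the first one; the snippet length used below is an arbitrary choice.
In [ ]:
# Optional: inspect the chunks produced by the loader
print(f"Number of chunks: {len(pages)}")
print(pages[0].page_content[:200])  # preview the first chunk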
2. Create embeddings for each document chunk¶
In [3]:
# !pip install sentence_transformers
In [ ]:
from langchain.embeddings import HuggingFaceEmbeddings
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}
embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)
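If a GPU is not available, the same embedding model can run on CPU by changing the device setting. A quick, optional way to verify the setup is to embed a short test string; all-mpnet-base-v2 produces 768-dimensional vectors.
In [ ]:
# If no GPU is available, fall back to CPU (slower, but functional):
# embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs={"device": "cpu"})

# Optional check: embed a test sentence and print the vector size
vector = embeddings.embed_query("High Luminosity LHC computing")
print(len(vector))  # 768 for all-mpnet-base-v2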
3. Store the embeddings in a Vector Database¶
Option 1 (small data), This uses FAISS as Vector Database¶
In [4]:
# This example uses FAISS as an in-memory vector database
# !pip install faiss-cpu
In [5]:
from langchain.vectorstores import FAISS
# Create the embeddings and store them in an in-memory DB with FAISS
faiss_index = FAISS.from_documents(pages, embeddings)
# Optionally, save the index to a local file
faiss_index.save_local("faiss_index")
In [11]:
# This is how you can load the index with embeddings saved to a file
# for future runs of the notebook
# from langchain.vectorstores import FAISS
# faiss_index = FAISS.load_local("faiss_index", embeddings)
Option 2 (large data), This uses Open Search as Vector Database¶
Use this option instead of FAISS if you have Open Search configured as a vector DB
In [4]:
# This example uses Open Search as remote vector database
# !pip install opensearch-py
In [ ]:
# This creates the embeddings and stores them in an Open Search index
# For future runs of the notebook, you can skip this and link to
# the Open Search index directly
from langchain.vectorstores import OpenSearchVectorSearch
from getpass import getpass
# Contact the Open Search service at CERN to get an instance and the credentials
opensearch_url="https://es-testspark1.cern.ch:443/es"
opensearch_user="test1"
opensearch_pass = getpass()
# Compute the embeddings and store them in OpenSearch
docsearch = OpenSearchVectorSearch.from_documents(
    documents=pages,
    embedding=embeddings,
    index_name="embd1",
    opensearch_url=opensearch_url,
    http_auth=(opensearch_user, opensearch_pass),
    use_ssl=True,
    verify_certs=False,
    ssl_assert_hostname=False,
    ssl_show_warn=False
)
In [7]:
# This is how you can load the index with embeddings stored in Open Search
# for future runs of the notebook
from langchain.vectorstores import OpenSearchVectorSearch
from getpass import getpass
# Open Search instance and the credentials
opensearch_url="https://es-testspark1.cern.ch:443/es"
opensearch_user="test1"
opensearch_pass = getpass()
# use pre-loaded embeddings in OpenSearch
docsearch = OpenSearchVectorSearch(
    embedding_function=embeddings,
    index_name="embd1",
    opensearch_url=opensearch_url,
    http_auth=(opensearch_user, opensearch_pass),
    use_ssl=True,
    verify_certs=False,
    ssl_assert_hostname=False,
    ssl_show_warn=False
)
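As an optional check (not in the original notebook), a small similarity search against the index confirms that the connection and credentials work; the query string here is just an example.
In [ ]:
# Optional: verify the connection by running a minimal search against the index
test_docs = docsearch.similarity_search("HEP software", k=1)
print(test_docs[0].metadata)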
4. Perform semantic search using embeddings¶
Choose the index you have created and want to use for this (FAISS or Open Search)¶
In [6]:
# Choose the index (FAISS or Open Search)
index = faiss_index # use FAISS in-memory index
# index = docsearch # use OpenSearch Index
In [7]:
# Perform a simple similarity search
query = "How will computing evolve in the next decade with LHC high luminosity?"
found_docs = index.similarity_search(query, k=2)
for doc in found_docs:
    print(str(doc.metadata["page"]) + ":", doc.page_content[:300])
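If you also want to see how close each match is, the vector stores used here expose similarity_search_with_score, which returns (document, score) pairs; note that with FAISS the score is a distance, so lower means more similar. This variant is an optional addition.
In [ ]:
# Optional variant: retrieve the matches together with their scores
docs_and_scores = index.similarity_search_with_score(query, k=2)
for doc, score in docs_and_scores:
    print(f"score={score:.4f} page={doc.metadata['page']}:", doc.page_content[:150])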
5. Transform the results of the search into natural language using a Large Language Model¶
In [8]:
# OpenAI
#! pip install openai
from langchain.llms import OpenAI
import os
from getpass import getpass
OPENAI_API_KEY = getpass()
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY
# This uses the specified OpenAI model; see https://openai.com/pricing for available models and their cost
model = "gpt-3.5-turbo-instruct"
llm = OpenAI(model=model)
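An optional smoke test (not part of the original notebook) confirms that the API key and model name are valid before building the retrieval chain; it consumes a small number of tokens.
In [ ]:
# Optional smoke test of the LLM (uses a few tokens)
print(llm("Reply with a one-line greeting."))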
In [9]:
from langchain.chains import RetrievalQA
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=index.as_retriever(search_type="similarity", search_kwargs={"k": 2}),
    return_source_documents=True)
In [10]:
query = "How will computing evolve in the next decade with LHC high luminosity?"
result = qa({"query": query})
In [17]:
result['result']
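Since return_source_documents=True was set, the result also contains the retrieved chunks; printing their page numbers shows which parts of the document the answer is based on.
In [ ]:
# Show which pages of the document the answer was generated from
for doc in result["source_documents"]:
    print("page", doc.metadata["page"], "-", doc.page_content[:150])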