{ "cells": [ { "cell_type": "markdown", "id": "b847596f", "metadata": {}, "source": [ "# Semantic Search with Vector Databases and LLM\n", "This example is about implementing a basic example of Semantic Search. \n", "The technology is now easily available by combining frameworks and models easily available and for the most part also available as open software/resources, as well as cloud services with a subscription. \n", "Semantic search can be applied to querying a set of documents. In this example we will use just one pdf document for simplicity, the article \"A Roadmap for HEP Software and Computing R&D for the 2020s\"\n", "\n", "The implementation steps are: \n", "1. Take an example document and split it in chunks\n", "2. Create embeddings for each document chunk\n", "3. Store the embeddings in a Vector Database\n", "4. Perform semantic search using embeddings\n", "5. Transform the results of the search into natural language using a Large Language Model\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "d3790453", "metadata": {}, "outputs": [], "source": [ "# This requires langchain and pypdf, pip install if not already available\n", "# !pip install langchain\n", "# !pip install pypdf" ] }, { "cell_type": "code", "execution_count": 1, "id": "d6bf3bd8", "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 833k 100 833k 0 0 2396k 0 --:--:-- --:--:-- --:--:-- 2403k\n" ] } ], "source": [ "# Download the document used in this exmaple,\n", "# the article \"A Roadmap for HEP Software and Computing R&D for the 2020s\"\n", "# see https://arxiv.org/abs/1712.06982\n", "\n", "# Download a copy of the document and save it as WLCG_roadmap.pdf:\n", "! curl https://arxiv.org/pdf/1712.06982 -o WLCG_roadmap.pdf\n" ] }, { "cell_type": "code", "execution_count": 2, "id": "7b536561", "metadata": {}, "outputs": [], "source": [ "from langchain.document_loaders import PyPDFLoader\n", "\n", "loader = PyPDFLoader(\"WLCG_roadmap.pdf\")\n", "pages = loader.load_and_split()" ] }, { "cell_type": "markdown", "id": "1d785f03", "metadata": {}, "source": [ "## 2. Create embeddings for each document chunk" ] }, { "cell_type": "code", "execution_count": 3, "id": "2b72d16a", "metadata": {}, "outputs": [], "source": [ "# !pip install sentence_transformers" ] }, { "cell_type": "code", "execution_count": null, "id": "f0400697", "metadata": {}, "outputs": [], "source": [ "from langchain.embeddings import HuggingFaceEmbeddings\n", "\n", "model_name = \"sentence-transformers/all-mpnet-base-v2\"\n", "model_kwargs = {\"device\": \"cuda\"}\n", "\n", "embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)" ] }, { "cell_type": "markdown", "id": "7b49b00f", "metadata": {}, "source": [ "## 3. Store the embeddings in a Vector Database\n", "![Figure1](Figure1_backend_preparation_vectorDB.png)" ] }, { "cell_type": "markdown", "id": "dc972909", "metadata": {}, "source": [ "## Option 1 (small data), This uses FAISS as Vector Database" ] }, { "cell_type": "code", "execution_count": 4, "id": "ad1a525f", "metadata": {}, "outputs": [], "source": [ "# This example uses FAISS and in-memory \n", "# !pip install faiss-cpu" ] }, { "cell_type": "code", "execution_count": 5, "id": "c0463b1d", "metadata": { "scrolled": false }, "outputs": [], "source": [ "from langchain.vectorstores import FAISS\n", "\n", "# Create the embeddings and store them in an in-memory DB with FAISS\n", "faiss_index = FAISS.from_documents(pages, embeddings)\n", "\n", "# Optionall save the index\n", "faiss_index.save_local(\"faiss_index\")" ] }, { "cell_type": "code", "execution_count": 11, "id": "9b30f666", "metadata": {}, "outputs": [], "source": [ "# This is how you can load in the index with embeddings saved to a file\n", "# for future runs of the notebook\n", "\n", "# from langchain.vectorstores import FAISS\n", "# faiss_index = FAISS.load_local(\"faiss_index\", embeddings)" ] }, { "cell_type": "markdown", "id": "b4d249a4", "metadata": {}, "source": [ "## Option 2 (large data), This uses Open Search as Vector Database\n", "Use this opition instead of FAISS if you have Open Search configured as vector DB" ] }, { "cell_type": "code", "execution_count": 4, "id": "ef0f011d", "metadata": {}, "outputs": [], "source": [ "# This example uses Open Search as remote vector database\n", "# !pip install opensearch-py" ] }, { "cell_type": "code", "execution_count": null, "id": "4d00e998", "metadata": { "scrolled": true }, "outputs": [], "source": [ "# This creates the embeddings and stored them in an Open Search index\n", "# For future runs of the notebook, you can skip this and link to \n", "# the Open Search index directly\n", "\n", "from langchain.vectorstores import OpenSearchVectorSearch\n", "from getpass import getpass\n", "\n", "# Contact Open Search service at CERN to get an instance and the credentials\n", "opensearch_url=\"https://es-testspark1.cern.ch:443/es\"\n", "opensearch_user=\"test1\"\n", "opensearch_pass = getpass()\n", "\n", "\n", "# perform the embeddings and store in OpenSearch\n", "docsearch = OpenSearchVectorSearch.from_documents(\n", " documents=pages, \n", " embedding=embeddings, \n", " index_name=\"embd1\",\n", " opensearch_url=opensearch_url, \n", " http_auth=(opensearch_user, opensearch_pass), \n", " use_ssl = True,\n", " verify_certs = False,\n", " ssl_assert_hostname = False,\n", " ssl_show_warn = False\n", ")\n" ] }, { "cell_type": "code", "execution_count": 7, "id": "cc2c52f1", "metadata": {}, "outputs": [], "source": [ "# This is how you can load in the index with embeddings stored to Open Search\n", "# for future runs of the notebook\n", "\n", "from langchain.vectorstores import OpenSearchVectorSearch\n", "from getpass import getpass\n", "\n", "# Open Search instance and the credentials\n", "opensearch_url=\"https://es-testspark1.cern.ch:443/es\"\n", "opensearch_user=\"test1\"\n", "opensearch_pass = getpass()\n", "\n", "\n", "# use pre-loaded embeddings in OpenSearch\n", "docsearch = OpenSearchVectorSearch(\n", " embedding_function=embeddings, \n", " index_name=\"embd1\",\n", " opensearch_url=opensearch_url, \n", " http_auth=(opensearch_user, opensearch_pass), \n", " use_ssl = True,\n", " verify_certs = False,\n", " ssl_assert_hostname = False,\n", " ssl_show_warn = False\n", ")\n" ] }, { "cell_type": "markdown", "id": "746db788", "metadata": {}, "source": [ "## 4. Perform semantic search using embeddings\n", "![Figure2](Figure2_semantic_search.png)" ] }, { "cell_type": "markdown", "id": "dab815bb", "metadata": {}, "source": [ "## Choose the index you have created and want to use for this (FAISS or Open Search)\n" ] }, { "cell_type": "code", "execution_count": 6, "id": "225336dd", "metadata": {}, "outputs": [], "source": [ "# Choose the index (FAISS or Open Search)\n", "\n", "index = faiss_index # use FAISS in-memory index\n", "# index = docsearch # use OpenSearch Index\n" ] }, { "cell_type": "code", "execution_count": 7, "id": "cc08486b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "37: the same data volumes as ATLAS. The HL-LHC storage requirements per year are\n", "expected to jump by a factor close to 10, which is a growth rate faster than can\n", "be accommodated by projected technology gains. Storage will remain one of the\n", "major cost drivers for HEP computing, at a level roughly equal t\n", "4: and the nuclear matter in the universe today. The ALICE experiment at the LHC [14]\n", "and the CBM [15] and PANDA [16] experiments at the Facility for Antiproton and\n", "Ion Research (FAIR) are speci\f", "cally designed to probe this aspect of nuclear and\n", "particle physics. In addition ATLAS, CMS and LHCb all con\n" ] } ], "source": [ "# Perform a simple similarity search\n", "\n", "query = \"How will computing evolve in the next decade with LHC high luminosity?\"\n", "\n", "found_docs = index.similarity_search(query, k=2)\n", "\n", "found_docs\n", "for doc in found_docs:\n", " print(str(doc.metadata[\"page\"]) + \":\", doc.page_content[:300])\n", " " ] }, { "cell_type": "markdown", "id": "43f1b538", "metadata": {}, "source": [ "## 5. Transform the results of the search into natural language using a Large Language Model" ] }, { "cell_type": "code", "execution_count": 8, "id": "6c0dc431", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "········\n" ] } ], "source": [ "# OpenAI\n", "#! pip install openai\n", "\n", "from langchain.llms import OpenAI\n", "import os\n", "\n", "from getpass import getpass\n", "OPENAI_API_KEY = getpass()\n", "\n", "os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY\n", "\n", "# This will use OpenAI models, the specified model, see also https://openai.com/pricing\n", "model = \"gpt-3.5-turbo-instruct\"\n", "llm=OpenAI(model=model)" ] }, { "cell_type": "code", "execution_count": 9, "id": "99656e8d", "metadata": {}, "outputs": [], "source": [ "from langchain.chains import RetrievalQA\n", "\n", "qa = RetrievalQA.from_chain_type(\n", " llm=llm, \n", " chain_type=\"stuff\", \n", " retriever=index.as_retriever(search_type=\"similarity\", search_kwargs={\"k\":2}), \n", " return_source_documents=True)\n" ] }, { "cell_type": "code", "execution_count": 10, "id": "8f0ee654", "metadata": { "scrolled": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n", "To disable this warning, you can either:\n", "\t- Avoid using `tokenizers` before the fork if possible\n", "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n" ] } ], "source": [ "query = \"How will computing evolve in the next decade with LHC high luminosity?\"\n", "\n", "result = qa({\"query\": query})" ] }, { "cell_type": "code", "execution_count": 17, "id": "2bbddec2", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "' Computing will likely evolve to include advanced techniques such as machine learning and high rate data query systems to meet the computational constraints and extend the physics reach. This will also require more dynamic data management and access systems, as well as specialised processor resources such as GPUs.'" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result['result']" ] }, { "cell_type": "code", "execution_count": null, "id": "f0c04538", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "@webio": { "lastCommId": null, "lastKernelId": null }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 5 }