Written with GPT-4
You can now utilize the power of large language models (LLMs) like ChatGPT to answer questions about your data. In this blog post, I will guide you through the process of implementing a simple web application that integrates ChatGPT to answer queries about your own data.
We will be using the Langchain prompt chaining library to build an interactive chat interface that communicates with OpenAI’s GPT endpoints, and we will perform a semantic search to obtain relevant context for user queries. This approach doesn’t involve any fine-tuning of language models. Instead, we’ll leverage a vector database to power our search and provide the context to the LLM for generating accurate answers.
Understanding the Basics:
Before diving into the implementation, it’s crucial to understand some key concepts related to large language models, semantic search, and vector databases.
Context for Large Language Models (LLMs):
Context is the background information or relevant details that guide an LLM like ChatGPT in generating meaningful responses. In our case, context is extracted from the user’s data to help the model understand the specific domain and provide accurate answers. By providing an appropriate context, we ensure that the model’s responses are tailored to the user’s dataset and query.
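To make that concrete, here is a tiny illustrative sketch (not code from the repo) of how retrieved passages become the context block of a prompt before it is sent to the model. The passages and question are made up for the example.

# Illustrative only: retrieved passages are stitched into the prompt as "context"
retrieved_passages = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Refunds are processed to the original payment method within 5 business days.",
]

question = "How long do refunds take?"

prompt = (
    "Use the following pieces of context to answer the user's question.\n"
    "If you cannot find the answer in the context, say you don't know.\n"
    "----------------\n"
    + "\n".join(retrieved_passages)
    + f"\n\nQuestion: {question}"
)
# `prompt` is what ultimately gets sent to the chat model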
Semantic Search:
Semantic search is an advanced search technique that aims to understand the meaning and intent behind a query, rather than just matching keywords. It uses natural language processing (NLP) algorithms to identify relevant context and relationships between words, phrases, and concepts in a dataset. This results in more accurate and relevant search results compared to traditional keyword-based search methods.
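Under the hood, this usually comes down to comparing embedding vectors. Here is a minimal sketch with numpy; the vectors are made up purely to show the mechanics, since real ones would come from an embedding model such as OpenAI embeddings.

import numpy as np

def cosine_similarity(a, b):
    # Higher score = more semantically similar
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Tiny made-up vectors stand in for real embeddings here
query_vec = np.array([0.9, 0.1, 0.3])
doc_vecs = {
    "refund policy": np.array([0.8, 0.2, 0.4]),
    "office holiday schedule": np.array([0.1, 0.9, 0.2]),
}

scores = {doc: cosine_similarity(query_vec, vec) for doc, vec in doc_vecs.items()}
best_match = max(scores, key=scores.get)  # the most relevant document for the query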
Vector Databases:
A vector database is a specialized database designed to store and search high-dimensional vectors efficiently. In our implementation, we use Pinecone, a powerful and scalable vector database service. Pinecone enables us to perform fast and accurate semantic search by converting our data into numerical representations (vectors) and indexing them for efficient retrieval.
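To give a feel for what that looks like in code, here is a rough sketch using the pinecone.init-style client this post relies on. The API key, index name, vectors, and metadata are placeholders; a real index built for OpenAI embeddings would use 1536 dimensions, while tiny 3-dimensional vectors are used here just to show the mechanics.

import pinecone

# Placeholder credentials and index name; in this project they come from env vars
pinecone.init(api_key="YOUR_PINECONE_API_KEY", environment="YOUR_PINECONE_ENV")
index = pinecone.Index("my-index")  # assumes an index of matching dimension already exists

# Store vectors along with metadata (the original text) for later retrieval
index.upsert(vectors=[
    ("doc-1", [0.1, 0.9, 0.3], {"text": "Refunds take 5 business days."}),
    ("doc-2", [0.8, 0.1, 0.2], {"text": "Returns are accepted within 30 days."}),
])

# Query with an embedded question to get the closest stored vectors back
results = index.query(vector=[0.2, 0.8, 0.3], top_k=2, include_metadata=True)
for match in results["matches"]:
    print(match["id"], match["score"], match["metadata"]["text"])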
Implementation Steps:
You can find the full code here — https://github.com/miranthajayatilake/nanoQA2
Steps to quickstart locally:
1. Clone the repo: git clone https://github.com/miranthajayatilake/nanoQA2.git
2. Move into the directory: cd nanoQA2
3. Set up your python>3.8 environment (a virtual environment is preferred).
4. Install the dependencies: pip install -r requirements.txt
5. Assign the environment variables by running bash env-local.sh. Make sure you have the following API keys and variables replaced:
- OpenAI API key (OPENAI_API_KEY). You can obtain this by creating an account at OpenAI.
- Pinecone API key and environment name (PINECONE_API_KEY, PINECONE_ENV). Obtain these by making an account at Pinecone.
6. Create an index in your Pinecone account. You can use the create_index.py script to do this (a rough sketch of such a script follows these steps). Make sure to provide the parameters below:
create_index.py --pinecone_api_key <asdf> --pinecone_environment <asdf> --index_name <asdf>
7. Copy the index name used above (INDEX_NAME).
8. Provide a namespace as well (NAMESPACE), just to organize data in the database.
9. Provide a name that you want your chatbot to have (EGPTNAME).
10. Run the web app with streamlit run Chat.py
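For context, a script like create_index.py usually just wraps pinecone.create_index behind a small CLI. Here is a rough sketch of what it could look like (the actual script in the repo may differ); the 1536 dimension assumes OpenAI's text-embedding-ada-002 embeddings.

# Rough sketch only; see create_index.py in the repo for the real thing
import argparse
import pinecone

parser = argparse.ArgumentParser()
parser.add_argument("--pinecone_api_key", required=True)
parser.add_argument("--pinecone_environment", required=True)
parser.add_argument("--index_name", required=True)
args = parser.parse_args()

pinecone.init(api_key=args.pinecone_api_key, environment=args.pinecone_environment)

# 1536 matches the dimensionality of OpenAI's text-embedding-ada-002 embeddings
pinecone.create_index(name=args.index_name, dimension=1536, metric="cosine")
print(f"Created index: {args.index_name}")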
The repo also gives you instructions on how to deploy the app to the cloud easily.
Code walkthrough:
First, let's look at the Chat.py script, which contains the chat-related components.
We will import the necessary libraries and initialize Pinecone.
# Import necessary libraries
import os
import streamlit as st
import pinecone
from PIL import Image
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
from langchain.chat_models import ChatOpenAI
from langchain.prompts.chat import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
)
from langchain.schema import (
    AIMessage,
    HumanMessage,
    SystemMessage,
)
from langchain.chains import ChatVectorDBChain
# Initialize Pinecone
pinecone.init(api_key=os.environ["PINECONE_API_KEY"], environment=os.environ["PINECONE_ENV"])
# Load logo and display it
image = Image.open('utils/logo.jpg')
st.image(image, width=200)
# Set the title
st.title(f'{os.environ["EGPTNAME"]}')
Next, we'll initialize the embeddings and vector store using OpenAI embeddings and Pinecone. We'll also create the chat prompt templates and initialize the ChatVectorDBChain, which lets us use the ChatOpenAI model with a temperature parameter to control the randomness of the generated responses.
# Initialize embeddings and vector store
embeddings = OpenAIEmbeddings()
index = pinecone.Index(os.environ["INDEX_NAME"])
vectorstore = Pinecone(index, embeddings.embed_query, text_key='text', namespace=os.environ["NAMESPACE"])
# Define the system message template
system_template = """Use the following pieces of context to answer the users question.
If you cannot find the answer from the pieces of context, just say that you don't know, don't try to make up an answer.
----------------
{context}"""
# Create the chat prompt templates
messages = [
    SystemMessagePromptTemplate.from_template(system_template),
    HumanMessagePromptTemplate.from_template("{question}"),
]
prompt = ChatPromptTemplate.from_messages(messages)
# Initialize the ChatVectorDBChain
qa = ChatVectorDBChain.from_llm(ChatOpenAI(temperature=0), vectorstore, qa_prompt=prompt, return_source_documents=True)
Next, we will initialize the chat history and define a function to execute a query. The function will display the generated response from GPT-3.5 and the sources used for the latest answer.
# Initialize the chat history
chat_history = []
if 'chat_history' not in st.session_state:
    st.session_state['chat_history'] = []

# Define the function to execute a query
def execute_query(query):
    with st.spinner('Thinking...'):
        # Rebuild the running chat history from the session state
        for i in range(len(st.session_state['chat_history'])-1, -1, -1):
            chat_history.append(st.session_state['chat_history'][i])
        result = qa({"question": query, "chat_history": chat_history})
        st.session_state.chat_history.append((query, result["answer"]))
        chat_history.append((query, result["answer"]))
        st.info(query)
        st.success(result['answer'])

        # Display the sources for the latest answer
        with st.expander("Sources for the latest answer"):
            sources = result['source_documents']
            for idx, source in enumerate(sources):
                st.markdown(f"**Source number {idx + 1}** \n")
                st.markdown(source)
                st.write('-' * 10)

        # Display the previous chat history
        if len(chat_history) > 1:
            for query, answer in chat_history[:-1]:
                st.info(query)
                st.success(answer)

# Create the input field
query = st.text_input("Ask a question or tell what to do:", key="input")

# Execute the query when one is entered
if query:
    execute_query(query)
That completes the main portion of the code.
I created a separate page to handle uploading data to Pinecone. The most useful features here, I think, are the ability to upload PDFs and to provide URLs. Given a URL, the app will automatically scrape the page and index its text into the database.
The code for this resides in the pages/Contribute_data.py script. I encourage you to clone the repo and go through it, but here's a snippet of how URLs are handled.
from langchain.document_loaders import UnstructuredURLLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone

with st.expander("URL"):
    url_input = st.text_input('URL', '')
    if st.button("Read from URL"):
        with st.spinner('Wait for it...'):
            # Scrape the page and split its text into chunks
            urls = [url_input]
            loader = UnstructuredURLLoader(urls=urls)
            documents = loader.load()
            text_splitter = CharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
            documents = text_splitter.split_documents(documents)
            embeddings = OpenAIEmbeddings()
            # Retry the upsert until it succeeds, re-initializing the Pinecone connection on failure
            isNotDone = True
            while isNotDone:
                try:
                    reinitate_connetion()
                    vectorstore = Pinecone.from_documents(documents, embeddings, text_key='text', index_name=INDEX_NAME, namespace=NAMESPACE)
                    isNotDone = False
                except:
                    pass
            st.info('Done')
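PDF uploads can follow the same pattern. Below is my rough sketch of how a PDF path could look using Streamlit's file uploader and LangChain's PyPDFLoader; the actual implementation in pages/Contribute_data.py may differ. It assumes st, INDEX_NAME, and NAMESPACE are already defined in the surrounding script, and that the pypdf package is installed.

# Hedged sketch of a possible PDF path; the repo's version may differ.
import tempfile
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone

with st.expander("PDF"):
    uploaded_pdf = st.file_uploader("Upload a PDF", type="pdf")
    if uploaded_pdf is not None and st.button("Read from PDF"):
        with st.spinner('Wait for it...'):
            # PyPDFLoader reads from a file path, so persist the upload to a temp file
            with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
                tmp.write(uploaded_pdf.read())
                pdf_path = tmp.name

            documents = PyPDFLoader(pdf_path).load()
            text_splitter = CharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
            documents = text_splitter.split_documents(documents)

            embeddings = OpenAIEmbeddings()
            Pinecone.from_documents(
                documents, embeddings, text_key='text',
                index_name=INDEX_NAME, namespace=NAMESPACE,
            )
            st.info('Done')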
That’s pretty much all of the main components you need. Again, the full code can be found at https://github.com/miranthajayatilake/nanoQA2
🌟 Please feel free to fork the repository, star it, and contribute to the project. I am looking forward to hearing your thoughts, suggestions, and any success stories that result from using this code. 🚀
I’ve enjoyed writing and sharing this. Thanks for reading!