LangChain in Detail!!

Nikhil Pentapalli
5 min read · Jul 26, 2023


What is LangChain, and why did it become so popular?

LangChain is a high-level abstraction over the complexities of working with recent large language models (LLMs). It is a framework for developing applications powered by LLMs.

We can prompt the OpenAI API, or any recent LLM API, directly without LangChain (using variables and Python f-strings). But LangChain provides abstractions and reusable components that can be integrated easily when building large applications.
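For example, a one-off prompt built with a plain f-string needs no framework at all. A minimal sketch, assuming the pre-1.0 openai package and OPENAI_API_KEY set in the environment; the model name, style, and text are only placeholders:

import openai

style = "a calm, polite tone"
text = "hey!! where is my order??"
# plain f-string prompt, no framework involved
prompt = f"Rewrite the following text in {style}: {text}"

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(response["choices"][0]["message"]["content"])

This works fine for a single call; LangChain's value shows up once you want to reuse the same prompt, swap models, or chain several steps together.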

Let’s understand how data is ingested and processed in LangChain, end to end.

image credits: https://docs.langchain.com/

Document Loading stage: You can pass any input format and set up extractors at this stage. Let’s understand!
1) If the document is .doc or .docx, you can directly parse the text.
2) If the document is a PDF with a native text layer, you can use PyPDF or PyMuPDF to parse the text.
3) If the PDF is scanned, or the document is an image, use a free OCR engine (Tesseract) or a commercial OCR service (Azure Form Recognizer / AWS Textract) to extract the text.

If it is any other format, you can write your own parsers or use Unstructured.io, which has many built-in extractors that can handle and ingest data from multiple formats (refer to https://www.unstructured.io/).
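As a rough sketch of points 2) and 3) above: PyMuPDF for pages with a native text layer, and Tesseract via pytesseract as the free OCR fallback for scanned pages. The file path is a placeholder:

import fitz  # PyMuPDF
import pytesseract
from PIL import Image

doc = fitz.open("path/to/your.pdf")  # placeholder path
pages_text = []
for page in doc:
    text = page.get_text()
    if text.strip():
        # native text layer present, use it directly
        pages_text.append(text)
    else:
        # scanned page: render it to an image and OCR it with Tesseract
        pix = page.get_pixmap(dpi=300)
        img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
        pages_text.append(pytesseract.image_to_string(img))
full_text = "\n".join(pages_text)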

Splitting: We split the text into chunks and store them using a text-splitter module from LangChain. Why is this needed? Keep reading!

VectorDB: These text chunks are stored in a vector DB (e.g. ChromaDB). You can install it using pip install chromadb.

Earlier, vectors were often stored in traditional DBs, where search and retrieval consumed more time. Vector DBs are specially designed to remove that bottleneck.

How do vector DBs work?

You can use a FAISS (Facebook AI Similarity Search) vector index to store and retrieve similar vectors, but it lacks a proper storage mechanism and scalability. Vector DBs are designed specifically to handle these scenarios, with improved algorithms and integrations with tools like LangChain.
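For context, here is a minimal FAISS sketch. Random vectors stand in for real embeddings, and a flat L2 index is used for exact (uncompressed) search:

import numpy as np
import faiss

d = 384                                                       # embedding dimension (example)
doc_vectors = np.random.random((1000, d)).astype("float32")   # stand-ins for document embeddings
query_vector = np.random.random((1, d)).astype("float32")     # stand-in for a query embedding

index = faiss.IndexFlatL2(d)                     # exact L2 search, no compression
index.add(doc_vectors)
distances, ids = index.search(query_vector, 5)   # 5 nearest neighbours
print(ids[0])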

  1. Convert text to vectors: each text is converted into a vector (an embedding).
  2. Indexing: the vector database indexes the vectors using algorithms such as PQ (product quantization), LSH (locality-sensitive hashing), or HNSW (Hierarchical Navigable Small World). This step maps the vectors to a data structure that enables faster searching. These algorithms are not hard to understand, and you can read more about them here.
  3. Querying: the input is again converted to a vector using the same embedding model, and we query the DB for the closest matches. There are many similarity measures, such as cosine similarity, dot product, and Euclidean distance (L2).
  4. Post-processing: along with the vector data, we use vector metadata to filter results before or after querying (a small ChromaDB sketch of these steps follows the list).
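Here is that small ChromaDB sketch of the four steps, relying on Chroma's default embedding function and using a metadata filter as the post-processing step. The documents, metadata, and query are made up for illustration:

import chromadb

client = chromadb.Client()                        # in-memory instance
collection = client.create_collection("articles")

# steps 1-2: Chroma embeds and indexes the documents for us
collection.add(
    documents=["LangChain is a framework for building LLM applications",
               "Chroma is an open-source vector database"],
    metadatas=[{"topic": "framework"}, {"topic": "database"}],
    ids=["doc1", "doc2"],
)

# steps 3-4: the query text is embedded, searched, and filtered on metadata
results = collection.query(
    query_texts=["What is a vector DB?"],
    n_results=1,
    where={"topic": "database"},
)
print(results["documents"])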

Retrieval:
Prompt Engineering/Templating — you can customize the prompt and set variables in it, so the same prompt can be reused across multiple calls and use cases.
e.g.:

from langchain.prompts import ChatPromptTemplate
template_string = """Translate the text that is delimited by triple backticks into a style that is {style}.
text: ```{text}```"""
prompt_template = ChatPromptTemplate.from_template(template_string)

Prompt Input Variables:

prompt_template.messages[0].prompt.input_variables
# outputs ['style', 'text'] as variables that can be configured
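Filling in those variables then produces a ready-to-send message list. The style and customer text below are just placeholders:

messages = prompt_template.format_messages(
    style="calm and respectful English",
    text="I am fuming that my blender lid flew off and splattered the walls!",
)
print(messages[0].content)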

Output Parsers:
LangChain provides output parsers because data coming back from LLMs can vary; these parsers get it into a consistent format. Think of them as post-processors! We can specify a JSON schema or any other output structure.

# Example taken from the LangChain documentation
from langchain.output_parsers import ResponseSchema, StructuredOutputParser
from langchain.prompts import PromptTemplate
from langchain.llms import OpenAI

# describe your response schemas here
response_schemas = [
    ResponseSchema(name="answer", description="answer to the user's question"),
    ResponseSchema(name="source", description="source used to answer the user's question, should be a website."),
]
output_parser = StructuredOutputParser.from_response_schemas(response_schemas)
format_instructions = output_parser.get_format_instructions()
prompt = PromptTemplate(
    template="answer the users question as best as possible.\n{format_instructions}\n{question}",
    input_variables=["question"],
    partial_variables={"format_instructions": format_instructions},
)
model = OpenAI(temperature=0)
_input = prompt.format_prompt(question="what's the capital of france?")
output = model(_input.to_string())
output_parser.parse(output)

#Result
{'answer': 'Paris',
'source': 'https://www.worldatlas.com/articles/what-is-the-capital-of-france.html'}

So we defined the output structure to have two fields: answer and source.

Understanding the LangChain Data Ecosystem and Use Cases:

Image credits: https://docs.langchain.com

The above flow describes how data is ingested and converted into vector storage. Currently, LangChain supports integration with multiple vector DBs, which can be found here.

Use case: Given an N-page document/PDF or a huge text, you can extract any entity from it without worrying about context length/token limits. You can also chat with the PDF, assuming the context and previous responses are stored as well. Let’s focus on the first part.

Let’s consider the ChatGPT API, which currently has a limit of 4,096 tokens. So how do we know what text from the N pages should be sent to the LLM to extract specific entities? It might be on any page or part of the document!

That is where text chunking and similarity search help: only the relevant text is sent to the LLM, reducing the tokens used and thereby reducing cost.
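To see how chunking keeps a request under the limit, you can count tokens with tiktoken before sending anything. A quick sketch; the chunk text is a placeholder:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
chunk = "one 500-character chunk retrieved from the vector DB..."  # placeholder
prompt_tokens = len(enc.encode(chunk))
print(prompt_tokens)  # must stay comfortably below the 4,096-token limit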

This also helps speed up inference with other open-source LLMs, since we are reducing the input tokens. We will cover recent open-source LLMs like Dolly, Falcon, and GPT4All in the next series.

Let’s Build!!

import os
os.environ["OPENAI_API_KEY"] = "your-api-key-here"
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain.llms import OpenAI


# Document loader
loader = PyPDFLoader("path/to/your.pdf")  # replace with the path to your PDF
data = loader.load()
# if you want to load/scrape text from a webpage instead, you can use
from langchain.document_loaders import WebBaseLoader
loader = WebBaseLoader("<URL>")
data = loader.load()

Text Chunking:

#This is used to split text based on chunk_size provided.
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 500, chunk_overlap = 0)
all_splits = text_splitter.split_documents(data)

# Store: vector DB using OpenAI embeddings to convert text to vectors
vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())
question = "What is the price of the car in USD?"
docs = vectorstore.similarity_search(question)
#return docs of length n containing the text related to the question
#which are then sent to LLM
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
qa_chain = RetrievalQA.from_chain_type(llm,retriever=vectorstore.as_retriever())
qa_chain({"query": question})# Build prompt
from langchain.prompts import PromptTemplate
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Use three sentences maximum and keep the answer as concise as possible. Always say "thanks for asking!" at the end of the answer.
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate(input_variables=["context", "question"],template=template,)
#here context and question are input variables from the template prompt
# Run chain
from langchain.chains import RetrievalQA
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
qa_chain = RetrievalQA.from_chain_type(llm, retriever=vectorstore.as_retriever(),chain_type_kwargs={"prompt": QA_CHAIN_PROMPT})
result = qa_chain({"query": question})
result["result"]
#to return Source Documents
from langchain.chains import RetrievalQA
qa_chain = RetrievalQA.from_chain_type(llm,retriever=vectorstore.as_retriever(),return_source_documents=True)
result = qa_chain({"query": question})
print(len(result['source_documents']))
result['source_documents'][0]
# with sources/citations chain
from langchain.chains import RetrievalQAWithSourcesChain
qa_chain = RetrievalQAWithSourcesChain.from_chain_type(llm, retriever=vectorstore.as_retriever())
result = qa_chain({"question": question})
result
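The imports at the top also pulled in ConversationalRetrievalChain and ConversationBufferMemory; here is a short sketch of the "chat with the PDF" half of the use case, reusing the same vectorstore. The questions are placeholders:

memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
chat_chain = ConversationalRetrievalChain.from_llm(
    llm,
    retriever=vectorstore.as_retriever(),
    memory=memory,
)
result = chat_chain({"question": "What is the price of the car in USD?"})
print(result["answer"])
# follow-up: previous turns are pulled from memory automatically
result = chat_chain({"question": "And how does that compare to the average price?"})
print(result["answer"])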

Bonus:
Explore the perplexity.ai website (a startup focused on search and discovery). It is basically ChatGPT with internet access, and it uses the citations-and-sources feature to show its search results, which is really useful.

Open-source LLMs are improving quickly (leaderboards can be found here) and they are free to use. I will be diving deep into some open-source LLMs in the coming posts.

