ChatGPT said:
You’re very close — just a few syntax and logic fixes are needed in your code. Here’s the corrected and working version of your snippet to read a PDF, extract text, and split it into chunks using LangChain’s RecursiveCharacterTextSplitter:
from pypdf import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
# Load the PDF file
pdf = PdfReader('./pdf-docs/oci-ai-foundations.pdf')
# Extract text from all pages
text = ""
for page in pdf.pages:
    text += page.extract_text() or ""  # <-- add parentheses to call the method; `or ""` guards pages with no extractable text
# Split the text into smaller chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,     # you can adjust this value
    chunk_overlap=100,   # overlap between chunks
    separators=["\n\n", "\n", " ", ""]
)
# Create document objects
docs = [Document(page_content=chunk) for chunk in text_splitter.split_text(text)]
# Optional: print the first chunk to verify
print(docs[0].page_content)
✅ Fixes made:
- PdfReader*'...' → should be PdfReader('...').
- page.extract_text → must be page.extract_text().
- Added a chunk_size and chunk_overlap for the text splitter.
- Wrapped each chunk in a Document object, as LangChain typically expects this format.
Would you like me to show how to feed these docs into an embedding model (like OpenAI or Hugging Face) next?
You said:
ChatGPT said:
Here’s a clearer and more beginner-friendly version of your script with comments explaining every step:
# Import the required libraries
from pypdf import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
# Step 1: Read the PDF file
pdf = PdfReader('./pdf-docs/oci-ai-foundations.pdf')
# Step 2: Extract text from all pages
text = ""
for page in pdf.pages:
    text += page.extract_text() or ""  # extract text from each page and add it together (`or ""` handles image-only pages)
# Step 3: Split the long text into smaller, manageable chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,     # maximum characters per chunk
    chunk_overlap=100,   # overlap so context isn’t lost between chunks
    separators=["\n\n", "\n", " ", ""]
)
# Step 4: Split the text using the splitter
chunks = text_splitter.split_text(text)
# Step 5: Wrap each chunk in a LangChain Document object
docs = [Document(page_content=chunk) for chunk in chunks]
# Step 6: (Optional) Print the first chunk to verify the output
print(docs[0].page_content)
🧩 Explanation
- PdfReader reads your PDF file.
- extract_text() pulls out all the text from each page.
- RecursiveCharacterTextSplitter divides the big text into smaller chunks that are easier for AI models to process later.
- Document objects store each chunk, which you can later feed into embedding or question-answering models.
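To build intuition for chunk_size and chunk_overlap, here is a deliberately simplified, pure-Python sketch of fixed-size chunking with overlap. It is an illustration only, not LangChain's actual algorithm: RecursiveCharacterTextSplitter additionally tries the separators in order so it avoids cutting words and paragraphs in half.

```python
# Simplified illustration only: fixed-size character chunking with overlap.
# The real RecursiveCharacterTextSplitter also respects separators
# ("\n\n", "\n", " ", "") so chunks break at natural boundaries.
def naive_chunk(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = naive_chunk("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Note how each chunk repeats the last two characters of the previous one: that shared overlap is what preserves context across chunk boundaries.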
Would you like me to add the next step — how to create embeddings from these chunks for search or question answering?
You said:
ChatGPT said:
Good — that line is correct, but let’s make it a bit clearer by explaining exactly what it does and how you can extend it if needed.
Here’s the improved version with explanation and options:
# Step 5: Convert each text chunk into a LangChain Document object.
# Each Document stores both the text (page_content) and optional metadata (like page number, source, etc.)
docs = [
    Document(
        page_content=chunk,  # the actual chunk of text
        metadata={"source": "oci-ai-foundations.pdf"}  # optional metadata
    )
    for chunk in chunks
]
💡 What this does:
- Document is a LangChain data structure used to store text and its context (metadata).
- Adding metadata helps later when you want to trace where a specific answer or chunk came from — for example, the filename, section, or page number.
If you want to record the position of each chunk (for better traceability), you can modify it like this:
docs = []
for i, chunk in enumerate(chunks):
    docs.append(Document(
        page_content=chunk,
        metadata={"source": "oci-ai-foundations.pdf", "chunk_index": i}
    ))
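The Document idea is easy to mimic without LangChain installed. This stand-in dataclass (purely illustrative, not the real langchain_core class) shows how content and metadata travel together so each chunk can be traced back to its source:

```python
from dataclasses import dataclass, field

# Illustrative stand-in for langchain_core.documents.Document,
# just to show how text and metadata ride along together.
@dataclass
class Doc:
    page_content: str
    metadata: dict = field(default_factory=dict)

chunks = ["Chunk one text.", "Chunk two text."]
docs = [
    Doc(page_content=c, metadata={"source": "oci-ai-foundations.pdf", "chunk_index": i})
    for i, c in enumerate(chunks)
]
print(docs[1].metadata["chunk_index"])  # 1
```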
Would you like me to show how to connect these docs to an embedding model (e.g., OpenAI or Hugging Face) to make them searchable next?
You said:
ChatGPT said:
Excellent — you’re now moving toward vectorizing your PDF text and storing it with Oracle AI Vector Search (part of Oracle Database 23ai, the release formerly labeled 23c).
Let’s write a clean, reusable Python function that:
- Takes your text chunks (chunks) as input.
- Formats them as Document objects with metadata.
- Creates embeddings.
- Stores them in Oracle’s AI Vector Store.
Below is a conceptual + practical example using LangChain’s Oracle vector store integration. (Heads-up: in current langchain_community releases the class is actually named OracleVS and expects a live python-oracledb connection via client= plus a table_name=, rather than collection_name/connection_string — treat the class and parameter names below as a sketch and adapt them to your installed version.)
from langchain_community.vectorstores import OracleVectorStore
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document
def store_to_oracle_vector(chunks, source_file, collection_name="oci_ai_docs"):
    """
    Formats text chunks with metadata and stores them in Oracle 23 AI Vector Store.

    Parameters:
    - chunks (list): List of text chunks to store.
    - source_file (str): The name/path of the PDF source.
    - collection_name (str): The target collection name in Oracle vector store.

    Returns:
    - vector_store (OracleVectorStore): The connected and populated vector store instance.
    """
    # Step 1: Format each chunk as a LangChain Document with metadata
    docs = [
        Document(
            page_content=chunk,
            metadata={
                "source": source_file,
                "chunk_index": i,
                "category": "training_material"
            }
        )
        for i, chunk in enumerate(chunks)
    ]

    # Step 2: Create embeddings (using OpenAI or any embedding model)
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

    # Step 3: Connect to Oracle Vector Store
    vector_store = OracleVectorStore.from_documents(
        documents=docs,
        embedding=embeddings,
        collection_name=collection_name,
        connection_string="oracle+oracledb://user:password@hostname:port/service_name"
    )

    print(f"✅ Successfully stored {len(docs)} documents to collection '{collection_name}' in Oracle 23 AI Vector Store.")
    return vector_store
🧩 Explanation
- Document: Wraps each text chunk with metadata (like filename and index).
- OpenAIEmbeddings: Converts your text chunks into numerical vectors.
- OracleVectorStore: Saves the embeddings in Oracle Database 23c’s vector store.
- collection_name: Defines a logical group in the database for your document embeddings.
- connection_string: Replace with your actual Oracle DB credentials.
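You can sanity-check the metadata logic without any database connection by exercising Step 1 on its own. This hypothetical dry-run helper (not part of the function above) builds just the metadata dicts each chunk would be stored with:

```python
# Dry-run of Step 1 only: no embeddings, no database.
# Builds the metadata each chunk would carry into the vector store.
def build_metadata(chunks: list[str], source_file: str) -> list[dict]:
    return [
        {"source": source_file, "chunk_index": i, "category": "training_material"}
        for i, _ in enumerate(chunks)
    ]

meta = build_metadata(["first chunk", "second chunk"], "oci-ai-foundations.pdf")
print(meta[0])  # {'source': 'oci-ai-foundations.pdf', 'chunk_index': 0, 'category': 'training_material'}
```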
Would you like me to extend this function to automatically retrieve and query similar chunks (semantic search) from the Oracle vector store too?
You said:
ChatGPT said:
Perfect — using the langchain_core API directly gives you a cleaner and more framework-agnostic design, while keeping compatibility with Oracle’s Vector Store and any embedding model you choose later.
Here’s a rewritten version of the function that:
- Uses only langchain_core primitives for documents and embeddings.
- Keeps the Oracle Vector Store integration modular and future-proof.
from langchain_core.documents import Document
from langchain_core.embeddings import Embeddings
from langchain_community.vectorstores import OracleVectorStore
def store_to_oracle_vector(
    chunks: list[str],
    source_file: str,
    embedding_model: Embeddings,
    connection_string: str,
    collection_name: str = "oci_ai_docs"
):
    """
    Convert text chunks into Documents, embed them, and store them in Oracle 23 AI Vector Store.

    Parameters
    ----------
    chunks : list[str]
        The list of text chunks extracted from your PDF or other sources.
    source_file : str
        Name or path of the source document (used in metadata).
    embedding_model : Embeddings
        Any LangChain-compatible embedding model (e.g., OpenAIEmbeddings, HuggingFaceEmbeddings).
    connection_string : str
        Oracle Database connection string in the format:
        oracle+oracledb://user:password@hostname:port/service_name
    collection_name : str, optional
        Logical name of the vector collection in Oracle (default "oci_ai_docs").

    Returns
    -------
    OracleVectorStore
        The initialized and populated Oracle Vector Store object.
    """
    # Step 1: Create Document objects with metadata
    docs = [
        Document(
            page_content=chunk,
            metadata={
                "source": source_file,
                "chunk_index": i,
                "doc_type": "pdf"
            }
        )
        for i, chunk in enumerate(chunks)
    ]

    # Step 2: Initialize Oracle Vector Store and store documents
    vector_store = OracleVectorStore.from_documents(
        documents=docs,
        embedding=embedding_model,
        collection_name=collection_name,
        connection_string=connection_string
    )

    print(f"✅ Stored {len(docs)} chunks from '{source_file}' into Oracle Vector Store collection '{collection_name}'.")
    return vector_store
🧩 Example Usage
from langchain_openai import OpenAIEmbeddings
# Example: create embeddings and store chunks
embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")
connection = "oracle+oracledb://admin:MyPass@myhost:1521/myservice"
store_to_oracle_vector(
    chunks=chunks,
    source_file="oci-ai-foundations.pdf",
    embedding_model=embedding_model,
    connection_string=connection,
    collection_name="ai_docs_demo"
)
Would you like me to show the matching query/search function next — to retrieve the most similar text chunks from your Oracle 23 AI vector store?
You said:
ChatGPT said:
Excellent — you’re thinking in the LangChain Core 0.2+ way, where reusable logic (like formatting, metadata injection, and pre-processing) belongs in a BaseDocumentTransformer.
Let’s build a clean, production-style class that:
- Inherits from BaseDocumentTransformer
- Takes raw text chunks
- Wraps them into Document objects with metadata
- (Optionally) pushes them into Oracle 23 AI Vector Store
Here’s how to do it step-by-step 👇
✅ Complete Example Using BaseDocumentTransformer
from typing import List
from langchain_core.documents import Document
from langchain_core.documents import BaseDocumentTransformer
from langchain_core.embeddings import Embeddings
from langchain_community.vectorstores import OracleVectorStore
class OracleDocumentTransformer(BaseDocumentTransformer):
    """
    Custom transformer that formats text chunks into LangChain Documents
    with metadata and stores them in Oracle 23 AI Vector Store.
    """

    def __init__(
        self,
        source_file: str,
        embedding_model: Embeddings,
        connection_string: str,
        collection_name: str = "oci_ai_docs"
    ):
        self.source_file = source_file
        self.embedding_model = embedding_model
        self.connection_string = connection_string
        self.collection_name = collection_name

    def transform_documents(self, documents: List[Document]) -> List[Document]:
        """
        Adds metadata to Documents (like source name and chunk index)
        before they are stored in the vector database.
        """
        transformed_docs = []
        for i, doc in enumerate(documents):
            # merge metadata safely
            metadata = {
                **doc.metadata,
                "source": self.source_file,
                "chunk_index": i,
                "doc_type": "pdf"
            }
            transformed_docs.append(Document(page_content=doc.page_content, metadata=metadata))
        return transformed_docs

    def store_to_oracle(self, documents: List[Document]):
        """
        Stores the transformed documents in Oracle 23 AI Vector Store.
        """
        # Apply transformation first
        transformed_docs = self.transform_documents(documents)

        # Store embeddings in Oracle
        vector_store = OracleVectorStore.from_documents(
            documents=transformed_docs,
            embedding=self.embedding_model,
            collection_name=self.collection_name,
            connection_string=self.connection_string
        )
        print(f"✅ Stored {len(transformed_docs)} documents to '{self.collection_name}' in Oracle Vector Store.")
        return vector_store
💡 How to Use It
from langchain_openai import OpenAIEmbeddings
# 1️⃣ Prepare your chunks from PDF
chunks = ["AI is transforming cloud platforms.", "OCI integrates AI services deeply."]
# 2️⃣ Wrap chunks into Document objects
docs = [Document(page_content=chunk) for chunk in chunks]
# 3️⃣ Initialize your embedding model
embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")
# 4️⃣ Create the transformer
transformer = OracleDocumentTransformer(
    source_file="oci-ai-foundations.pdf",
    embedding_model=embedding_model,
    connection_string="oracle+oracledb://user:password@hostname:1521/service_name",
    collection_name="ai_docs_demo"
)
# 5️⃣ Transform and store documents
transformer.store_to_oracle(docs)
🧠 Why This Is Better
- Reusable — The transformer can be applied to any set of Document objects.
- Composable — Works in a LangChain pipeline (before retrievers, loaders, etc.).
- Maintainable — Metadata logic stays inside a single class.
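The `**doc.metadata` merge used in transform_documents relies on ordinary dict-unpacking semantics: when the same key appears again later in the literal, the later value wins while earlier keys keep their original position. A quick standalone check, independent of LangChain:

```python
# Later keys in a dict literal override earlier ones, which is why
# the transformer can safely overwrite "source" while preserving
# any other metadata the Document already had (here, "page").
existing = {"page": 3, "source": "old.pdf"}
merged = {**existing, "source": "oci-ai-foundations.pdf", "chunk_index": 0}
print(merged)  # {'page': 3, 'source': 'oci-ai-foundations.pdf', 'chunk_index': 0}
```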
Would you like me to extend this to also retrieve similar documents (semantic search) from Oracle Vector Store using the same transformer class?
You said:
ChatGPT said:
Perfect — using a chunks_to_docs_wrapper method is a clean, modular pattern inside LangChain Core when you want to keep chunk-to-Document conversion separate from vector storage logic.
Below is a best-practice example showing how to combine:
✅ BaseDocumentTransformer
✅ A chunks_to_docs_wrapper() helper method
✅ Integration with Oracle 23 AI Vector Store
✅ Full Implementation: Using BaseDocumentTransformer + chunks_to_docs_wrapper
from typing import List
from langchain_core.documents import Document
from langchain_core.documents import BaseDocumentTransformer
from langchain_core.embeddings import Embeddings
from langchain_community.vectorstores import OracleVectorStore
class OracleDocumentTransformer(BaseDocumentTransformer):
    """
    Transformer that converts text chunks into Documents with metadata,
    and can store them in Oracle 23 AI Vector Store.
    """

    def __init__(
        self,
        source_file: str,
        embedding_model: Embeddings,
        connection_string: str,
        collection_name: str = "oci_ai_docs"
    ):
        self.source_file = source_file
        self.embedding_model = embedding_model
        self.connection_string = connection_string
        self.collection_name = collection_name

    # -----------------------------------------
    # ✅ Helper: Convert chunks → Documents
    # -----------------------------------------
    def chunks_to_docs_wrapper(self, chunks: List[str]) -> List[Document]:
        """
        Converts a list of text chunks into LangChain Document objects,
        adding useful metadata like source name and chunk index.
        """
        docs = [
            Document(
                page_content=chunk,
                metadata={
                    "source": self.source_file,
                    "chunk_index": i,
                    "doc_type": "pdf"
                }
            )
            for i, chunk in enumerate(chunks)
        ]
        return docs

    # -----------------------------------------
    # ✅ Core: Transform existing Documents
    # -----------------------------------------
    def transform_documents(self, documents: List[Document]) -> List[Document]:
        """
        Adds or updates metadata for existing Document objects.
        """
        transformed = []
        for i, doc in enumerate(documents):
            metadata = {
                **doc.metadata,
                "source": self.source_file,
                "chunk_index": i,
                "doc_type": "pdf"
            }
            transformed.append(Document(page_content=doc.page_content, metadata=metadata))
        return transformed

    # -----------------------------------------
    # ✅ Store Documents into Oracle Vector Store
    # -----------------------------------------
    def store_to_oracle(self, chunks: List[str]):
        """
        Converts chunks to Documents, applies transformations, and stores in Oracle 23 AI Vector Store.
        """
        # Step 1: Convert chunks → Document objects
        docs = self.chunks_to_docs_wrapper(chunks)

        # Step 2: Apply any metadata transformations
        transformed_docs = self.transform_documents(docs)

        # Step 3: Store in Oracle vector store
        vector_store = OracleVectorStore.from_documents(
            documents=transformed_docs,
            embedding=self.embedding_model,
            collection_name=self.collection_name,
            connection_string=self.connection_string
        )
        print(f"✅ Stored {len(transformed_docs)} documents to Oracle Vector Store collection '{self.collection_name}'.")
        return vector_store
💡 Example Usage
from langchain_openai import OpenAIEmbeddings
# 1️⃣ Sample text chunks
chunks = [
    "Oracle 23c introduces native AI Vector Search capabilities.",
    "LangChain integrates with Oracle Vector Store for AI-powered retrieval."
]
# 2️⃣ Initialize embedding model
embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")
# 3️⃣ Create transformer
transformer = OracleDocumentTransformer(
    source_file="oci-ai-foundations.pdf",
    embedding_model=embedding_model,
    connection_string="oracle+oracledb://user:password@hostname:1521/service_name",
    collection_name="oci_ai_demo"
)
# 4️⃣ Store the chunks
transformer.store_to_oracle(chunks)
🧠 Key Advantages
- chunks_to_docs_wrapper() cleanly separates formatting from storage — reusable across pipelines.
- transform_documents() aligns with LangChain’s BaseDocumentTransformer pattern.
- Oracle 23 AI Vector Store integration is direct and production-ready.
