ChatGPT said:
You’re very close — just a few syntax and logic fixes are needed in your code. Here’s the corrected and working version of your snippet to read a PDF, extract text, and split it into chunks using LangChain’s RecursiveCharacterTextSplitter:
from pypdf import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
# Load the PDF file
pdf = PdfReader('./pdf-docs/oci-ai-foundations.pdf')
# Extract text from all pages
text = ""
for page in pdf.pages:
    text += page.extract_text() or ""  # <-- add parentheses to call the method; `or ""` guards pages with no extractable text
# Split the text into smaller chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,     # you can adjust this value
    chunk_overlap=100,   # overlap between chunks
    separators=["\n\n", "\n", " ", ""]
)
# Create document objects
docs = [Document(page_content=chunk) for chunk in text_splitter.split_text(text)]
# Optional: print the first chunk to verify
print(docs[0].page_content)
✅ Fixes made:
- PdfReader*'...' → should be PdfReader('...').
- page.extract_text → must be page.extract_text().
- Added a chunk_size and chunk_overlap for the text splitter.
- Wrapped each chunk in a Document object, as LangChain typically expects this format.
Would you like me to show how to feed these docs into an embedding model (like OpenAI or Hugging Face) next?
You said:
ChatGPT said:
Here’s a clearer and more beginner-friendly version of your script with comments explaining every step:
# Import the required libraries
from pypdf import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
# Step 1: Read the PDF file
pdf = PdfReader('./pdf-docs/oci-ai-foundations.pdf')
# Step 2: Extract text from all pages
text = ""
for page in pdf.pages:
    text += page.extract_text() or ""  # extract text from each page and add it together (`or ""` handles image-only pages)
# Step 3: Split the long text into smaller, manageable chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,     # maximum characters per chunk
    chunk_overlap=100,   # overlap so context isn’t lost between chunks
    separators=["\n\n", "\n", " ", ""]
)
# Step 4: Split the text using the splitter
chunks = text_splitter.split_text(text)
# Step 5: Wrap each chunk in a LangChain Document object
docs = [Document(page_content=chunk) for chunk in chunks]
# Step 6: (Optional) Print the first chunk to verify the output
print(docs[0].page_content)
🧩 Explanation
- PdfReader reads your PDF file.
- extract_text() pulls out all the text from each page.
- RecursiveCharacterTextSplitter divides the big text into smaller chunks that are easier for AI models to process later.
- Document objects store each chunk, which you can later feed into embedding or question-answering models.
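To build intuition for chunk_size and chunk_overlap, here is a deliberately simplified, pure-Python sketch of fixed-size chunking with overlap. It is an illustration only, not LangChain's actual algorithm: RecursiveCharacterTextSplitter additionally tries the separators in order so it avoids cutting words and paragraphs in half.

```python
# Simplified illustration only: fixed-size character chunking with overlap.
# The real RecursiveCharacterTextSplitter also respects separators
# ("\n\n", "\n", " ", "") so chunks break at natural boundaries.
def naive_chunk(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = naive_chunk("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Note how each chunk repeats the last two characters of the previous one: that shared overlap is what preserves context across chunk boundaries.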
Would you like me to add the next step — how to create embeddings from these chunks for search or question answering?
You said:
ChatGPT said:
Good — that line is correct, but let’s make it a bit clearer by explaining exactly what it does and how you can extend it if needed.
Here’s the improved version with explanation and options:
# Step 5: Convert each text chunk into a LangChain Document object.
# Each Document stores both the text (page_content) and optional metadata (like page number, source, etc.)
docs = [
    Document(
        page_content=chunk,  # the actual chunk of text
        metadata={"source": "oci-ai-foundations.pdf"}  # optional metadata
    )
    for chunk in chunks
]
💡 What this does:
- Document is a LangChain data structure used to store text and its context (metadata).
- Adding metadata helps later when you want to trace where a specific answer or chunk came from — for example, the filename, section, or page number.
If you want to record the position of each chunk (for better traceability), you can modify it like this:
docs = []
for i, chunk in enumerate(chunks):
    docs.append(Document(
        page_content=chunk,
        metadata={"source": "oci-ai-foundations.pdf", "chunk_index": i}
    ))
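The Document idea is easy to mimic without LangChain installed. This stand-in dataclass (purely illustrative, not the real langchain_core class) shows how content and metadata travel together so each chunk can be traced back to its source:

```python
from dataclasses import dataclass, field

# Illustrative stand-in for langchain_core.documents.Document,
# just to show how text and metadata ride along together.
@dataclass
class Doc:
    page_content: str
    metadata: dict = field(default_factory=dict)

chunks = ["Chunk one text.", "Chunk two text."]
docs = [
    Doc(page_content=c, metadata={"source": "oci-ai-foundations.pdf", "chunk_index": i})
    for i, c in enumerate(chunks)
]
print(docs[1].metadata["chunk_index"])  # 1
```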
Would you like me to show how to connect these docs to an embedding model (e.g., OpenAI or Hugging Face) to make them searchable next?
You said:
ChatGPT said:
Excellent — you’re now moving toward vectorizing your PDF text and storing it with Oracle AI Vector Search (part of Oracle Database 23ai, the release formerly labeled 23c).
Let’s write a clean, reusable Python function that:
- Takes your text chunks (chunks) as input.
- Formats them as Document objects with metadata.
- Creates embeddings.
- Stores them in Oracle’s AI Vector Store.
Below is a conceptual + practical example using LangChain’s Oracle vector store integration. (Heads-up: in current langchain_community releases the class is actually named OracleVS and expects a live python-oracledb connection via client= plus a table_name=, rather than collection_name/connection_string — treat the class and parameter names below as a sketch and adapt them to your installed version.)
from langchain_community.vectorstores import OracleVectorStore
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document
def store_to_oracle_vector(chunks, source_file, collection_name="oci_ai_docs"):
    """
    Formats text chunks with metadata and stores them in Oracle 23 AI Vector Store.

    Parameters:
    - chunks (list): List of text chunks to store.
    - source_file (str): The name/path of the PDF source.
    - collection_name (str): The target collection name in Oracle vector store.

    Returns:
    - vector_store (OracleVectorStore): The connected and populated vector store instance.
    """
    # Step 1: Format each chunk as a LangChain Document with metadata
    docs = [
        Document(
            page_content=chunk,
            metadata={
                "source": source_file,
                "chunk_index": i,
                "category": "training_material"
            }
        )
        for i, chunk in enumerate(chunks)
    ]

    # Step 2: Create embeddings (using OpenAI or any embedding model)
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

    # Step 3: Connect to Oracle Vector Store
    vector_store = OracleVectorStore.from_documents(
        documents=docs,
        embedding=embeddings,
        collection_name=collection_name,
        connection_string="oracle+oracledb://user:password@hostname:port/service_name"
    )

    print(f"✅ Successfully stored {len(docs)} documents to collection '{collection_name}' in Oracle 23 AI Vector Store.")
    return vector_store
🧩 Explanation
- Document: Wraps each text chunk with metadata (like filename and index).
- OpenAIEmbeddings: Converts your text chunks into numerical vectors.
- OracleVectorStore: Saves the embeddings in Oracle Database 23c’s vector store.
- collection_name: Defines a logical group in the database for your document embeddings.
- connection_string: Replace with your actual Oracle DB credentials.
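You can sanity-check the metadata logic without any database connection by exercising Step 1 on its own. This hypothetical dry-run helper (not part of the function above) builds just the metadata dicts each chunk would be stored with:

```python
# Dry-run of Step 1 only: no embeddings, no database.
# Builds the metadata each chunk would carry into the vector store.
def build_metadata(chunks: list[str], source_file: str) -> list[dict]:
    return [
        {"source": source_file, "chunk_index": i, "category": "training_material"}
        for i, _ in enumerate(chunks)
    ]

meta = build_metadata(["first chunk", "second chunk"], "oci-ai-foundations.pdf")
print(meta[0])  # {'source': 'oci-ai-foundations.pdf', 'chunk_index': 0, 'category': 'training_material'}
```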
Would you like me to extend this function to automatically retrieve and query similar chunks (semantic search) from the Oracle vector store too?
You said:
ChatGPT said:
Perfect — using the langchain_core API directly gives you a cleaner and more framework-agnostic design, while keeping compatibility with Oracle’s Vector Store and any embedding model you choose later.
Here’s a rewritten version of the function that:
- Uses only langchain_core primitives for documents and embeddings.
- Keeps the Oracle Vector Store integration modular and future-proof.
from langchain_core.documents import Document
from langchain_core.embeddings import Embeddings
from langchain_community.vectorstores import OracleVectorStore
def store_to_oracle_vector(
    chunks: list[str],
    source_file: str,
    embedding_model: Embeddings,
    connection_string: str,
    collection_name: str = "oci_ai_docs"
):
    """
    Convert text chunks into Documents, embed them, and store them in Oracle 23 AI Vector Store.

    Parameters
    ----------
    chunks : list[str]
        The list of text chunks extracted from your PDF or other sources.
    source_file : str
        Name or path of the source document (used in metadata).
    embedding_model : Embeddings
        Any LangChain-compatible embedding model (e.g., OpenAIEmbeddings, HuggingFaceEmbeddings).
    connection_string : str
        Oracle Database connection string in the format:
        oracle+oracledb://user:password@hostname:port/service_name
    collection_name : str, optional
        Logical name of the vector collection in Oracle (default "oci_ai_docs").

    Returns
    -------
    OracleVectorStore
        The initialized and populated Oracle Vector Store object.
    """
    # Step 1: Create Document objects with metadata
    docs = [
        Document(
            page_content=chunk,
            metadata={
                "source": source_file,
                "chunk_index": i,
                "doc_type": "pdf"
            }
        )
        for i, chunk in enumerate(chunks)
    ]

    # Step 2: Initialize Oracle Vector Store and store documents
    vector_store = OracleVectorStore.from_documents(
        documents=docs,
        embedding=embedding_model,
        collection_name=collection_name,
        connection_string=connection_string
    )

    print(f"✅ Stored {len(docs)} chunks from '{source_file}' into Oracle Vector Store collection '{collection_name}'.")
    return vector_store
🧩 Example Usage
from langchain_openai import OpenAIEmbeddings
# Example: create embeddings and store chunks
embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")
connection = "oracle+oracledb://admin:MyPass@myhost:1521/myservice"
store_to_oracle_vector(
    chunks=chunks,
    source_file="oci-ai-foundations.pdf",
    embedding_model=embedding_model,
    connection_string=connection,
    collection_name="ai_docs_demo"
)
Would you like me to show the matching query/search function next — to retrieve the most similar text chunks from your Oracle 23 AI vector store?
You said:
ChatGPT said:
Excellent — you’re thinking in the LangChain Core 0.2+ way, where reusable logic (like formatting, metadata injection, and pre-processing) belongs in a BaseDocumentTransformer.
Let’s build a clean, production-style class that:
- Inherits from BaseDocumentTransformer
- Takes raw text chunks
- Wraps them into Document objects with metadata
- (Optionally) pushes them into Oracle 23 AI Vector Store
Here’s how to do it step-by-step 👇
✅ Complete Example Using BaseDocumentTransformer
from typing import List
from langchain_core.documents import Document
from langchain_core.documents import BaseDocumentTransformer
from langchain_core.embeddings import Embeddings
from langchain_community.vectorstores import OracleVectorStore
class OracleDocumentTransformer(BaseDocumentTransformer):
    """
    Custom transformer that formats text chunks into LangChain Documents
    with metadata and stores them in Oracle 23 AI Vector Store.
    """

    def __init__(
        self,
        source_file: str,
        embedding_model: Embeddings,
        connection_string: str,
        collection_name: str = "oci_ai_docs"
    ):
        self.source_file = source_file
        self.embedding_model = embedding_model
        self.connection_string = connection_string
        self.collection_name = collection_name

    def transform_documents(self, documents: List[Document]) -> List[Document]:
        """
        Adds metadata to Documents (like source name and chunk index)
        before they are stored in the vector database.
        """
        transformed_docs = []
        for i, doc in enumerate(documents):
            # merge metadata safely
            metadata = {
                **doc.metadata,
                "source": self.source_file,
                "chunk_index": i,
                "doc_type": "pdf"
            }
            transformed_docs.append(Document(page_content=doc.page_content, metadata=metadata))
        return transformed_docs

    def store_to_oracle(self, documents: List[Document]):
        """
        Stores the transformed documents in Oracle 23 AI Vector Store.
        """
        # Apply transformation first
        transformed_docs = self.transform_documents(documents)

        # Store embeddings in Oracle
        vector_store = OracleVectorStore.from_documents(
            documents=transformed_docs,
            embedding=self.embedding_model,
            collection_name=self.collection_name,
            connection_string=self.connection_string
        )
        print(f"✅ Stored {len(transformed_docs)} documents to '{self.collection_name}' in Oracle Vector Store.")
        return vector_store
💡 How to Use It
from langchain_openai import OpenAIEmbeddings
# 1️⃣ Prepare your chunks from PDF
chunks = ["AI is transforming cloud platforms.", "OCI integrates AI services deeply."]
# 2️⃣ Wrap chunks into Document objects
docs = [Document(page_content=chunk) for chunk in chunks]
# 3️⃣ Initialize your embedding model
embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")
# 4️⃣ Create the transformer
transformer = OracleDocumentTransformer(
    source_file="oci-ai-foundations.pdf",
    embedding_model=embedding_model,
    connection_string="oracle+oracledb://user:password@hostname:1521/service_name",
    collection_name="ai_docs_demo"
)
# 5️⃣ Transform and store documents
transformer.store_to_oracle(docs)
🧠 Why This Is Better
- Reusable — The transformer can be applied to any set of Document objects.
- Composable — Works in a LangChain pipeline (before retrievers, loaders, etc.).
- Maintainable — Metadata logic stays inside a single class.
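The `**doc.metadata` merge used in transform_documents relies on ordinary dict-unpacking semantics: when the same key appears again later in the literal, the later value wins while earlier keys keep their original position. A quick standalone check, independent of LangChain:

```python
# Later keys in a dict literal override earlier ones, which is why
# the transformer can safely overwrite "source" while preserving
# any other metadata the Document already had (here, "page").
existing = {"page": 3, "source": "old.pdf"}
merged = {**existing, "source": "oci-ai-foundations.pdf", "chunk_index": 0}
print(merged)  # {'page': 3, 'source': 'oci-ai-foundations.pdf', 'chunk_index': 0}
```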
Would you like me to extend this to also retrieve similar documents (semantic search) from Oracle Vector Store using the same transformer class?
You said:
ChatGPT said:
Perfect — using a chunks_to_docs_wrapper method is a clean, modular pattern inside LangChain Core when you want to keep chunk-to-Document conversion separate from vector storage logic.
Below is a best-practice example showing how to combine:
✅ BaseDocumentTransformer
✅ A chunks_to_docs_wrapper() helper method
✅ Integration with Oracle 23 AI Vector Store
✅ Full Implementation: Using BaseDocumentTransformer + chunks_to_docs_wrapper
from typing import List
from langchain_core.documents import Document
from langchain_core.documents import BaseDocumentTransformer
from langchain_core.embeddings import Embeddings
from langchain_community.vectorstores import OracleVectorStore
class OracleDocumentTransformer(BaseDocumentTransformer):
    """
    Transformer that converts text chunks into Documents with metadata,
    and can store them in Oracle 23 AI Vector Store.
    """

    def __init__(
        self,
        source_file: str,
        embedding_model: Embeddings,
        connection_string: str,
        collection_name: str = "oci_ai_docs"
    ):
        self.source_file = source_file
        self.embedding_model = embedding_model
        self.connection_string = connection_string
        self.collection_name = collection_name

    # -----------------------------------------
    # ✅ Helper: Convert chunks → Documents
    # -----------------------------------------
    def chunks_to_docs_wrapper(self, chunks: List[str]) -> List[Document]:
        """
        Converts a list of text chunks into LangChain Document objects,
        adding useful metadata like source name and chunk index.
        """
        docs = [
            Document(
                page_content=chunk,
                metadata={
                    "source": self.source_file,
                    "chunk_index": i,
                    "doc_type": "pdf"
                }
            )
            for i, chunk in enumerate(chunks)
        ]
        return docs

    # -----------------------------------------
    # ✅ Core: Transform existing Documents
    # -----------------------------------------
    def transform_documents(self, documents: List[Document]) -> List[Document]:
        """
        Adds or updates metadata for existing Document objects.
        """
        transformed = []
        for i, doc in enumerate(documents):
            metadata = {
                **doc.metadata,
                "source": self.source_file,
                "chunk_index": i,
                "doc_type": "pdf"
            }
            transformed.append(Document(page_content=doc.page_content, metadata=metadata))
        return transformed

    # -----------------------------------------
    # ✅ Store Documents into Oracle Vector Store
    # -----------------------------------------
    def store_to_oracle(self, chunks: List[str]):
        """
        Converts chunks to Documents, applies transformations, and stores in Oracle 23 AI Vector Store.
        """
        # Step 1: Convert chunks → Document objects
        docs = self.chunks_to_docs_wrapper(chunks)

        # Step 2: Apply any metadata transformations
        transformed_docs = self.transform_documents(docs)

        # Step 3: Store in Oracle vector store
        vector_store = OracleVectorStore.from_documents(
            documents=transformed_docs,
            embedding=self.embedding_model,
            collection_name=self.collection_name,
            connection_string=self.connection_string
        )
        print(f"✅ Stored {len(transformed_docs)} documents to Oracle Vector Store collection '{self.collection_name}'.")
        return vector_store
💡 Example Usage
from langchain_openai import OpenAIEmbeddings
# 1️⃣ Sample text chunks
chunks = [
    "Oracle 23c introduces native AI Vector Search capabilities.",
    "LangChain integrates with Oracle Vector Store for AI-powered retrieval."
]
# 2️⃣ Initialize embedding model
embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")
# 3️⃣ Create transformer
transformer = OracleDocumentTransformer(
    source_file="oci-ai-foundations.pdf",
    embedding_model=embedding_model,
    connection_string="oracle+oracledb://user:password@hostname:1521/service_name",
    collection_name="oci_ai_demo"
)
# 4️⃣ Store the chunks
transformer.store_to_oracle(chunks)
🧠 Key Advantages
- chunks_to_docs_wrapper() cleanly separates formatting from storage — reusable across pipelines.
- transform_documents() aligns with LangChain’s BaseDocumentTransformer pattern.
- Oracle 23 AI Vector Store integration is direct and production-ready.
