오지's blog

CSV file chunker for a Snowflake RAG setup



잡스러운노트, 잡스노트 2024. 10. 10. 13:50
create or replace function csv_text_chunker(file_url string)
returns table (chunk varchar)
language python
runtime_version = '3.9'
handler = 'csv_text_chunker'
packages = ('snowflake-snowpark-python','pandas', 'langchain')
as
$$
from snowflake.snowpark.types import StringType, StructField, StructType
from langchain.text_splitter import RecursiveCharacterTextSplitter
from snowflake.snowpark.files import SnowflakeFile
import io
import logging
import pandas as pd

class csv_text_chunker:

    def read_csv_chunk(self, file_url: str) -> str:
    
        logger = logging.getLogger("udf_logger")
        logger.info(f"Opening file {file_url}")
    
        with SnowflakeFile.open(file_url, 'rb') as f:
            buffer = io.BytesIO(f.readall())
        df = pd.read_csv(buffer)
        text = " ".join(df.astype(str).values.flatten())
        return text

    def process(self,file_url: str):

        text = self.read_csv_chunk(file_url)
        
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size = 4000, # Maximum characters per chunk; adjust as needed
            chunk_overlap = 400, # Characters shared between neighboring chunks, which helps each chunk keep its context
            length_function = len
        )
    
        chunks = text_splitter.split_text(text)
        # Emit one single-column row per chunk, matching "returns table (chunk varchar)"
        for chunk in chunks:
            yield (chunk,)
$$;

 

 

https://quickstarts.snowflake.com/guide/asking_questions_to_your_own_documents_with_snowflake_cortex

 

Build A Document Search Assistant using Vector Embeddings in Cortex AI


 

Following the guide above, I built a CSV file chunker instead of the PDF file chunker it uses.
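To try the flatten-then-chunk idea outside Snowflake, here is a minimal local sketch using only the standard library. The names `flatten_csv` and `sliding_chunks` are hypothetical helpers, not part of the UDTF above. `sliding_chunks` is a plain fixed-size window, a simplified stand-in for `RecursiveCharacterTextSplitter`, which additionally tries to split on separators; and the `csv` module keeps cells as raw strings, so the flattened text can differ from what `pd.read_csv` plus `astype(str)` produces (for example, empty cells become "nan" in pandas).

```python
import csv
import io


def flatten_csv(raw: bytes) -> str:
    """Join every data cell into one space-separated string (header row skipped)."""
    reader = csv.reader(io.StringIO(raw.decode("utf-8")))
    next(reader, None)  # skip the header row, as pd.read_csv does by default
    return " ".join(cell for row in reader for cell in row)


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows; each window repeats the tail of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# Tiny made-up sample, only to show the shape of the result
sample = b"name,city\nAlice,Seoul\nBob,Busan\n"
text = flatten_csv(sample)
for chunk in sliding_chunks(text, chunk_size=12, chunk_overlap=4):
    print(repr(chunk))
```

Running this on a small file first makes it easier to judge whether 4000 characters with a 400-character overlap suits your data before deploying the function.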
