如何在 CSV 文件上进行问答
LLMs 非常适合构建各种数据源上的问答系统。在本节中,我们将讨论如何构建基于存储在 CSV 文件中的数据的问答系统。与 SQL 数据库的工作类似,处理 CSV 文件的关键是为 LLM 提供查询和与数据交互的工具。主要有两种方法可以做到这一点:
- 推荐:将 CSV 文件加载到 SQL 数据库中,并使用 SQL 教程 中概述的方法。
- 让 LLM 访问 Python 环境,以便它可以使用 Pandas 等库与数据进行交互。
本指南将涵盖这两种方法。
⚠️ 安全注意事项 ⚠️
上述两种方法都存在重大风险。使用 SQL 需要执行模型生成的 SQL 查询。使用类似 Pandas 的库则需要让模型执行 Python 代码。由于严格限制 SQL 连接权限和清理 SQL 查询比沙箱化 Python 环境更容易,我们强烈建议通过 SQL 与 CSV 数据进行交互。 有关一般安全最佳实践的更多信息,请参见此处.
设置
本指南的依赖项:
%pip install -qU langchain langchain-openai langchain-community langchain-experimental pandas
设置所需的环境变量:
# 推荐使用 LangSmith,但不是必需的。取消注释以下行以使用。
# import os
# os.environ["LANGCHAIN_TRACING_V2"] = "true"
# os.environ["LANGCHAIN_API_KEY"] = getpass.getpass()
如果您还没有下载 Titanic 数据集,请下载:
!wget https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv -O titanic.csv
import pandas as pd
df = pd.read_csv("titanic.csv")
print(df.shape)
print(df.columns.tolist())
(887, 8)
['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard', 'Fare']
SQL
使用 SQL 与 CSV 数据交互是推荐的方法,因为与任意 Python 相比,它更容易限制权限和清理查询。
大多数 SQL 数据库都可以轻松地将 CSV 文件加载为表格(DuckDB,SQLite 等)。完成此操作后,您可以使用 SQL 教程 中概述的所有链和代理创建技术。以下是如何使用 SQLite 进行此操作的快速示例:
from langchain_community.utilities import SQLDatabase
from sqlalchemy import create_engine
engine = create_engine("sqlite:///titanic.db")
df.to_sql("titanic", engine, index=False)
887
db = SQLDatabase(engine=engine)
print(db.dialect)
print(db.get_usable_table_names())
print(db.run("SELECT * FROM titanic WHERE Age < 2;"))
sqlite
['titanic']
[(1, 2, 'Master. Alden Gates Caldwell', 'male', 0.83, 0, 2, 29.0), (0, 3, 'Master. Eino Viljami Panula', 'male', 1.0, 4, 1, 39.6875), (1, 3, 'Miss. Eleanor Ileen Johnson', 'female', 1.0, 1, 1, 11.1333), (1, 2, 'Master. Richard F Becker', 'male', 1.0, 2, 1, 39.0), (1, 1, 'Master. Hudson Trevor Allison', 'male', 0.92, 1, 2, 151.55), (1, 3, 'Miss. Maria Nakid', 'female', 1.0, 0, 2, 15.7417), (0, 3, 'Master. Sidney Leonard Goodwin', 'male', 1.0, 5, 2, 46.9), (1, 3, 'Miss. Helene Barbara Baclini', 'female', 0.75, 2, 1, 19.2583), (1, 3, 'Miss. Eugenie Baclini', 'female', 0.75, 2, 1, 19.2583), (1, 2, 'Master. Viljo Hamalainen', 'male', 0.67, 1, 1, 14.5), (1, 3, 'Master. Bertram Vere Dean', 'male', 1.0, 1, 2, 20.575), (1, 3, 'Master. Assad Alexander Thomas', 'male', 0.42, 0, 1, 8.5167), (1, 2, 'Master. Andre Mallet', 'male', 1.0, 0, 2, 37.0042), (1, 2, 'Master. George Sibley Richards', 'male', 0.83, 1, 1, 18.75)]
并创建一个 SQL 代理 与之交互:
- OpenAI
- Anthropic
- Azure
- Cohere
- NVIDIA
- FireworksAI
- Groq
- MistralAI
- TogetherAI
pip install -qU langchain-openai
import getpass
import os
os.environ["OPENAI_API_KEY"] = getpass.getpass()
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o-mini")
pip install -qU langchain-anthropic
import getpass
import os
os.environ["ANTHROPIC_API_KEY"] = getpass.getpass()
from langchain_anthropic import ChatAnthropic
llm = ChatAnthropic(model="claude-3-5-sonnet-20240620")
pip install -qU langchain-openai
import getpass
import os
os.environ["AZURE_OPENAI_API_KEY"] = getpass.getpass()
from langchain_openai import AzureChatOpenAI
llm = AzureChatOpenAI(
azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
azure_deployment=os.environ["AZURE_OPENAI_DEPLOYMENT_NAME"],
openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
)
pip install -qU langchain-google-vertexai
import getpass
import os
os.environ["GOOGLE_API_KEY"] = getpass.getpass()
from langchain_google_vertexai import ChatVertexAI
llm = ChatVertexAI(model="gemini-1.5-flash")
pip install -qU langchain-cohere
import getpass
import os
os.environ["COHERE_API_KEY"] = getpass.getpass()
from langchain_cohere import ChatCohere
llm = ChatCohere(model="command-r-plus")
pip install -qU langchain-nvidia-ai-endpoints
import getpass
import os
os.environ["NVIDIA_API_KEY"] = getpass.getpass()
from langchain import ChatNVIDIA
llm = ChatNVIDIA(model="meta/llama3-70b-instruct")
pip install -qU langchain-fireworks
import getpass
import os
os.environ["FIREWORKS_API_KEY"] = getpass.getpass()
from langchain_fireworks import ChatFireworks
llm = ChatFireworks(model="accounts/fireworks/models/llama-v3p1-70b-instruct")
pip install -qU langchain-groq
import getpass
import os
os.environ["GROQ_API_KEY"] = getpass.getpass()
from langchain_groq import ChatGroq
llm = ChatGroq(model="llama3-8b-8192")
pip install -qU langchain-mistralai
import getpass
import os
os.environ["MISTRAL_API_KEY"] = getpass.getpass()
from langchain_mistralai import ChatMistralAI
llm = ChatMistralAI(model="mistral-large-latest")
pip install -qU langchain-openai
import getpass
import os
os.environ["TOGETHER_API_KEY"] = getpass.getpass()
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
base_url="https://api.together.xyz/v1",
api_key=os.environ["TOGETHER_API_KEY"],
model="mistralai/Mixtral-8x7B-Instruct-v0.1",
)
from langchain_community.agent_toolkits import create_sql_agent
agent_executor = create_sql_agent(llm, db=db, agent_type="openai-tools", verbose=True)
agent_executor.invoke({"input": "what's the average age of survivors"})
[1m> Entering new SQL Agent Executor chain...[0m
[32;1m[1;3m
Invoking: `sql_db_list_tables` with `{}`
[0m[38;5;200m[1;3mtitanic[0m[32;1m[1;3m
Invoking: `sql_db_schema` with `{'table_names': 'titanic'}`
[0m[33;1m[1;3m
CREATE TABLE titanic (
"Survived" BIGINT,
"Pclass" BIGINT,
"Name" TEXT,
"Sex" TEXT,
"Age" FLOAT,
"Siblings/Spouses Aboard" BIGINT,
"Parents/Children Aboard" BIGINT,
"Fare" FLOAT
)
/*
3 rows from titanic table:
Survived Pclass Name Sex Age Siblings/Spouses Aboard Parents/Children Aboard Fare
0 3 Mr. Owen Harris Braund male 22.0 1 0 7.25
1 1 Mrs. John Bradley (Florence Briggs Thayer) Cumings female 38.0 1 0 71.2833
1 3 Miss. Laina Heikkinen female 26.0 0 0 7.925
*/[0m[32;1m[1;3m
Invoking: `sql_db_query` with `{'query': 'SELECT AVG(Age) AS Average_Age FROM titanic WHERE Survived = 1'}`
[0m[36;1m[1;3m[(28.408391812865496,)][0m[32;1m[1;3mTitanic 数据集中幸存者的平均年龄约为 28.41 岁。[0m
[1m> Finished chain.[0m
{'input': "what's the average age of survivors",
'output': 'Titanic 数据集中幸存者的平均年龄约为 28.41 岁。'}
这种方法很容易推广到多个 CSV,因为我们只需将每个 CSV 加载到数据库中作为其自己的表。请参见下面的 多个 CSV 部分。
Pandas
除了 SQL,我们还可以使用数据分析库如 pandas 和 LLM 的代码生成能力来与 CSV 数据进行交互。同样,除非您有广泛的安全保障措施,否则这种方法不适合生产用例。因此,我们的代码执行工具和构造函数位于 langchain-experimental
包中。
Chain
大多数 LLM 已经接受了足够的 pandas Python 代码训练,因此它们可以通过简单的请求生成代码:
ai_msg = llm.invoke(
"I have a pandas DataFrame 'df' with columns 'Age' and 'Fare'. Write code to compute the correlation between the two columns. Return Markdown for a Python code snippet and nothing else."
)
print(ai_msg.content)
```python
correlation = df['Age'].corr(df['Fare'])
correlation
我们可以将这种能力与一个执行 Python 代码的工具结合起来,创建一个简单的数据分析链。我们首先需要将 CSV 表加载为数据框,并让工具访问这个数据框:
```python
import pandas as pd
from langchain_core.prompts import ChatPromptTemplate
from langchain_experimental.tools import PythonAstREPLTool
df = pd.read_csv("titanic.csv")
tool = PythonAstREPLTool(locals={"df": df})
tool.invoke("df['Fare'].mean()")
32.30542018038331
为了帮助确保我们正确使用 Python 工具,我们将使用 工具调用:
llm_with_tools = llm.bind_tools([tool], tool_choice=tool.name)
response = llm_with_tools.invoke(
"I have a dataframe 'df' and want to know the correlation between the 'Age' and 'Fare' columns"
)
response
AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_SBrK246yUbdnJemXFC8Iod05', 'function': {'arguments': '{"query":"df.corr()[\'Age\'][\'Fare\']"}', 'name': 'python_repl_ast'}, 'type': 'function'}]}, response_metadata={'token_usage': {'completion_tokens': 13, 'prompt_tokens': 125, 'total_tokens': 138}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': 'fp_3b956da36b', 'finish_reason': 'stop', 'logprobs': None}, id='run-1fd332ba-fa72-4351-8182-d464e7368311-0', tool_calls=[{'name': 'python_repl_ast', 'args': {'query': "df.corr()['Age']['Fare']"}, 'id': 'call_SBrK246yUbdnJemXFC8Iod05'}])
response.tool_calls
[{'name': 'python_repl_ast',
'args': {'query': "df.corr()['Age']['Fare']"},
'id': 'call_SBrK246yUbdnJemXFC8Iod05'}]
我们将添加一个工具输出解析器,以提取函数调用作为字典:
from langchain_core.output_parsers.openai_tools import JsonOutputKeyToolsParser
parser = JsonOutputKeyToolsParser(key_name=tool.name, first_tool_only=True)
(llm_with_tools | parser).invoke(
"I have a dataframe 'df' and want to know the correlation between the 'Age' and 'Fare' columns"
)
{'query': "df[['Age', 'Fare']].corr()"}
并结合一个提示,以便我们可以在每次调用时仅指定问题,而无需指定数据框信息:
system = f"""You have access to a pandas dataframe `df`. \
Here is the output of `df.head().to_markdown()`:
{df.head().to_markdown()}
Given a user question, write the Python code to answer it. \
Return ONLY the valid Python code and nothing else. \
Don't assume you have access to any libraries other than built-in Python ones and pandas."""
prompt = ChatPromptTemplate.from_messages([("system", system), ("human", "{question}")])
code_chain = prompt | llm_with_tools | parser
code_chain.invoke({"question": "What's the correlation between age and fare"})
{'query': "df[['Age', 'Fare']].corr()"}
最后,我们将添加我们的 Python 工具,以便生成的代码能够被实际执行:
chain = prompt | llm_with_tools | parser | tool
chain.invoke({"question": "What's the correlation between age and fare"})
0.11232863699941621
就这样,我们有了一个简单的数据分析链。我们可以通过查看 LangSmith 跟踪来了解中间步骤:https://smith.langchain.com/public/b1309290-7212-49b7-bde2-75b39a32b49a/r
我们可以在最后添加一个额外的 LLM 调用,以生成一个对话响应,这样我们就不仅仅是用工具输出进行回应。为此,我们希望在提示中添加一个聊天历史 MessagesPlaceholder
:
from operator import itemgetter
from langchain_core.messages import ToolMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import MessagesPlaceholder
from langchain_core.runnables import RunnablePassthrough
system = f"""You have access to a pandas dataframe `df`. \
Here is the output of `df.head().to_markdown()`:
{df.head().to_markdown()}
Given a user question, write the Python code to answer it. \
Don't assume you have access to any libraries other than built-in Python ones and pandas.
Respond directly to the question once you have enough information to answer it."""
prompt = ChatPromptTemplate.from_messages(
[
(
"system",
system,
),
("human", "{question}"),
# This MessagesPlaceholder allows us to optionally append an arbitrary number of messages
# at the end of the prompt using the 'chat_history' arg.
MessagesPlaceholder("chat_history", optional=True),
]
)
def _get_chat_history(x: dict) -> list:
"""Parse the chain output up to this point into a list of chat history messages to insert in the prompt."""
ai_msg = x["ai_msg"]
tool_call_id = x["ai_msg"].additional_kwargs["tool_calls"][0]["id"]
tool_msg = ToolMessage(tool_call_id=tool_call_id, content=str(x["tool_output"]))
return [ai_msg, tool_msg]
chain = (
RunnablePassthrough.assign(ai_msg=prompt | llm_with_tools)
.assign(tool_output=itemgetter("ai_msg") | parser | tool)
.assign(chat_history=_get_chat_history)
.assign(response=prompt | llm | StrOutputParser())
.pick(["tool_output", "response"])
)
chain.invoke({"question": "What's the correlation between age and fare"})
{'tool_output': 0.11232863699941616,
'response': 'The correlation between age and fare is approximately 0.1123.'}
这是该运行的 LangSmith 跟踪:https://smith.langchain.com/public/14e38d70-45b1-4b81-8477-9fd2b7c07ea6/r
代理
对于复杂问题,LLM能够在保持先前执行的输入和输出的同时,迭代执行代码是非常有帮助的。这就是代理的作用。它们允许LLM决定需要调用工具的次数,并跟踪到目前为止的执行情况。create_pandas_dataframe_agent是一个内置代理,使得处理数据框变得简单:
from langchain_experimental.agents import create_pandas_dataframe_agent
agent = create_pandas_dataframe_agent(llm, df, agent_type="openai-tools", verbose=True)
agent.invoke(
{
"input": "What's the correlation between age and fare? is that greater than the correlation between fare and survival?"
}
)
[1m> Entering new AgentExecutor chain...[0m
[32;1m[1;3m
Invoking: `python_repl_ast` with `{'query': "df[['Age', 'Fare']].corr().iloc[0,1]"}`
[0m[36;1m[1;3m0.11232863699941621[0m[32;1m[1;3m
Invoking: `python_repl_ast` with `{'query': "df[['Fare', 'Survived']].corr().iloc[0,1]"}`
[0m[36;1m[1;3m0.2561785496289603[0m[32;1m[1;3mThe correlation between Age and Fare is approximately 0.112, and the correlation between Fare and Survival is approximately 0.256.
Therefore, the correlation between Fare and Survival (0.256) is greater than the correlation between Age and Fare (0.112).[0m
[1m> Finished chain.[0m
{'input': "What's the correlation between age and fare? is that greater than the correlation between fare and survival?",
'output': 'The correlation between Age and Fare is approximately 0.112, and the correlation between Fare and Survival is approximately 0.256.\n\nTherefore, the correlation between Fare and Survival (0.256) is greater than the correlation between Age and Fare (0.112).'}
这是此次运行的LangSmith跟踪:https://smith.langchain.com/public/6a86aee2-4f22-474a-9264-bd4c7283e665/r
多个 CSV 文件
要处理多个 CSV 文件(或数据框),我们只需将多个数据框传递给我们的 Python 工具。我们的 create_pandas_dataframe_agent
构造函数可以直接实现这一点,我们可以传入一个数据框列表,而不仅仅是一个。如果我们自己构建链,可以这样做:
df_1 = df[["Age", "Fare"]]
df_2 = df[["Fare", "Survived"]]
tool = PythonAstREPLTool(locals={"df_1": df_1, "df_2": df_2})
llm_with_tool = llm.bind_tools(tools=[tool], tool_choice=tool.name)
df_template = """```python
{df_name}.head().to_markdown()
>>> {df_head}
```"""
df_context = "\n\n".join(
df_template.format(df_head=_df.head().to_markdown(), df_name=df_name)
for _df, df_name in [(df_1, "df_1"), (df_2, "df_2")]
)
system = f"""You have access to a number of pandas dataframes. \
Here is a sample of rows from each dataframe and the python code that was used to generate the sample:
{df_context}
Given a user question about the dataframes, write the Python code to answer it. \
Don't assume you have access to any libraries other than built-in Python ones and pandas. \
Make sure to refer only to the variables mentioned above."""
prompt = ChatPromptTemplate.from_messages([("system", system), ("human", "{question}")])
chain = prompt | llm_with_tool | parser | tool
chain.invoke(
{
"question": "return the difference in the correlation between age and fare and the correlation between fare and survival"
}
)
0.14384991262954416
这是此运行的 LangSmith 跟踪记录: https://smith.langchain.com/public/cc2a7d7f-7c5a-4e77-a10c-7b5420fcd07f/r
沙箱代码执行
有许多工具,如 E2B 和 Bearly,提供沙箱环境以执行 Python 代码,从而允许更安全的代码执行链和代理。
下一步
对于更高级的数据分析应用,我们建议查看:
- SQL 教程:处理 SQL 数据库和 CSV 文件时面临的许多挑战是任何结构化数据类型的通用问题,因此即使您使用 Pandas 进行 CSV 数据分析,阅读 SQL 技巧也是有用的。
- 工具使用:关于在使用调用工具的链和代理时的一般最佳实践的指南
- 代理:了解构建 LLM 代理的基本原理。
- 集成:如 E2B 和 Bearly 的沙箱环境,实用工具如 SQLDatabase,相关代理如 Spark DataFrame 代理。