Grobid

GROBID is a machine learning library for extracting, parsing, and restructuring raw documents.

It is designed for, and expected to be used on, academic papers, where it works particularly well.

Note: If the articles supplied to Grobid are large documents (e.g. dissertations) that exceed a certain number of elements, they might not be processed.

This page covers how to use Grobid to parse articles for LangChain.

Installation

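Installing grobid is described in detail at https://grobid.readthedocs.io/en/latest/Install-Grobid/.
However, it is probably easier and less troublesome to run grobid in a docker container, as described in the Grobid documentation.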

Use Grobid with LangChain

Once grobid is installed and running (you can check by visiting http://localhost:8070), you're ready to go.
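The same check can be scripted rather than done in a browser. The following is a minimal sketch, assuming the server exposes the `GET /api/isalive` liveness endpoint of the Grobid REST API and answers with the text `true`. The helper name `is_grobid_alive` and the 5-second timeout are illustrative choices, not part of Grobid or LangChain.

```python
import urllib.error
import urllib.request


def is_grobid_alive(base_url: str = "http://localhost:8070", timeout: float = 5.0) -> bool:
    """Return True if a Grobid server at base_url answers its liveness check."""
    # Assumption: the liveness endpoint is GET /api/isalive and replies "true".
    url = base_url.rstrip("/") + "/api/isalive"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace").strip().lower()
            return response.status == 200 and body == "true"
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or malformed URL:
        # treat all of these as "not reachable".
        return False


if __name__ == "__main__":
    print("Grobid reachable:", is_grobid_alive())
```

Calling this before building a loader lets a script stop early with a clear message instead of failing partway through a batch of PDFs.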

You can now use the GrobidParser to produce documents:

from langchain_community.document_loaders.parsers import GrobidParser
from langchain_community.document_loaders.generic import GenericLoader

# Produce chunks from article paragraphs
loader = GenericLoader.from_filesystem(
    "/Users/31treehaus/Desktop/Papers/",
    glob="*",
    suffixes=[".pdf"],
    parser=GrobidParser(segment_sentences=False),
)
docs = loader.load()

# Produce chunks from article sentences
loader = GenericLoader.from_filesystem(
    "/Users/31treehaus/Desktop/Papers/",
    glob="*",
    suffixes=[".pdf"],
    parser=GrobidParser(segment_sentences=True),
)
docs = loader.load()

Chunk metadata will include bounding boxes. These are a bit awkward to parse, but they are explained in detail at https://grobid.readthedocs.io/en/latest/Coordinates-in-PDF/.
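To make the coordinate format more concrete, here is a minimal sketch of parsing a raw Grobid coordinate string. It assumes the layout described in the Grobid coordinates documentation: boxes separated by `;`, each box given as `page,x,y,width,height`, with `(x, y)` the upper-left corner. The `BoundingBox` class and `parse_coords` function are hypothetical helpers, and the way LangChain stores boxes in chunk metadata may differ from this raw string, so inspect the metadata of a loaded document before relying on it.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BoundingBox:
    page: int      # 1-based page number
    x: float       # x coordinate of the upper-left corner
    y: float       # y coordinate of the upper-left corner
    width: float
    height: float


def parse_coords(coords: str) -> List[BoundingBox]:
    """Parse a string such as "1,53.80,194.57,292.22,8.04;1,53.80,205.53,292.22,8.04"."""
    boxes: List[BoundingBox] = []
    for chunk in coords.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # tolerate empty input and trailing separators
        parts = chunk.split(",")
        if len(parts) != 5:
            raise ValueError(f"expected 5 comma-separated values, got {chunk!r}")
        page, x, y, width, height = parts
        boxes.append(
            BoundingBox(int(page), float(x), float(y), float(width), float(height))
        )
    return boxes


if __name__ == "__main__":
    for box in parse_coords("1,53.80,194.57,292.22,8.04;1,53.80,205.53,292.22,8.04"):
        print(box)
```

A chunk that spans several lines or pages yields several boxes, which is why the function returns a list rather than a single box.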
