How to extract text from PDFs in folders with Python and save them in a DataFrame?

I have many folders, each containing a couple of PDF files (other file types such as .xlsx or .doc are there as well). My goal is to extract the text of the PDFs in each folder and create a data frame where each record is keyed by the "Folder Name" and each column holds the text content of one PDF file in that folder, as a string.

I managed to extract text from one PDF file with the tika package (code below), but I cannot write a loop that iterates over the other PDFs in the folder, or over the other folders, to build a structured dataframe.

# import the parser object from tika
from tika import parser

# open the PDF file
parsed_pdf = parser.from_file("ducument_1.pdf")

# parsed_pdf['content'] returns the extracted text as a string
data = parsed_pdf['content']

# print the content
print(data)

# <class 'str'>
print(type(data))

The desired output should look like this:

Folder_Name   pdf1               pdf2
17534         text of the pdf1   text of the pdf 2
63546         text of the pdf1   text of the pdf1
26374         text of the pdf1   -

If you want to find all the PDFs in a directory and its subdirectories, you can use os.walk and glob; see "Recursive sub folder search and return files in a list python". I've gone for a slightly longer form so that beginners can follow what is happening more easily.

Then, for each file, call Apache Tika and save the result to the next row of the pandas DataFrame.

#!/usr/bin/python3

import os, glob
from tika import parser 
from pandas import DataFrame

# What file extension to find, and where to look from
ext = "*.pdf"
PATH = "."

# Find all the files with that extension
files = []
for dirpath, dirnames, filenames in os.walk(PATH):
    files += glob.glob(os.path.join(dirpath, ext))

# Create a Pandas Dataframe to hold the filenames and the text
df = DataFrame(columns=("filename","text"))

# Process each file in turn, parsing with Tika and storing in the dataframe
for idx, filename in enumerate(files):
    data = parser.from_file(filename)
    text = data["content"]
    df.loc[idx] = [filename, text]

# For debugging, print what we found
print(df)
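
The loop above yields one row per file, while the question asks for one row per folder with a column per PDF. A minimal sketch of that reshaping follows. The names group_by_folder, folder_frame and extract_text are hypothetical helpers introduced here, not part of tika or pandas, and the sketch assumes PDFs are numbered pdf1, pdf2, ... in sorted path order within each folder.

```python
import os
from collections import defaultdict


def group_by_folder(paths):
    """Map each folder name to {"pdf1": path, "pdf2": path, ...} in sorted order."""
    by_folder = defaultdict(list)
    for path in sorted(paths):
        # Use the name of the directory that directly contains the file
        folder = os.path.basename(os.path.dirname(os.path.abspath(path)))
        by_folder[folder].append(path)
    return {
        folder: {f"pdf{i}": p for i, p in enumerate(folder_paths, start=1)}
        for folder, folder_paths in by_folder.items()
    }


def folder_frame(paths, extract_text):
    """Build a DataFrame with one row per folder and one column per PDF.

    extract_text is any callable that takes a path and returns its text
    (hypothetical parameter; pass a Tika-based function in practice).
    """
    from pandas import DataFrame  # imported lazily; only this function needs pandas

    rows = {
        folder: {col: extract_text(p) for col, p in cols.items()}
        for folder, cols in group_by_folder(paths).items()
    }
    frame = DataFrame.from_dict(rows, orient="index")
    frame.index.name = "Folder_Name"
    return frame
```

With the files list from the code above, this could be called as folder_frame(files, lambda p: parser.from_file(p)["content"]). Folders with fewer PDFs than the largest one get missing values in the extra columns. Two folders that share the same name under different parents would be merged, so key on the full directory path instead if that can happen.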

It is extremely easy to get a list of all PDFs on Unix.

import os

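# collects all PDF paths in one newline-separated string;
# [2:-1] strips the leading "./" of the first path and the trailing newline
# (raw string so that \. reaches grep unchanged; paths containing spaces will break awk's $2)
a = os.popen(r"du -a|awk '{print $2}'|grep '.*\.pdf$'").read()[2:-1]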
print(a)

On my computer the output was:

[luca@artix tmp]$ python3 forum.py
a.pdf
./foo/test.pdf

You can then do something like

for line in a.split('\n'):
    print(line, line.split('/'))

and you'll know which folder each PDF is in. I hope this helps.
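
For comparison, here is a sketch of the same listing done with pathlib only, which avoids the shell pipeline and should also work outside Unix. The function name pdfs_with_folders is made up for this example, and the pattern "*.pdf" is matched case-sensitively on most Unix systems, so files ending in .PDF would be skipped.

```python
from pathlib import Path


def pdfs_with_folders(root="."):
    """Return sorted (folder, filename) pairs for every PDF under root."""
    root = Path(root)
    return sorted(
        # Folder is given relative to root; "." means the root itself
        (str(path.parent.relative_to(root)), path.name)
        for path in root.rglob("*.pdf")
    )


if __name__ == "__main__":
    for folder, name in pdfs_with_folders("."):
        print(folder, name)
```

Each pair already separates the folder from the file name, so no string splitting on '/' is needed.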
