How to extract text from PDFs in folders with Python and save them in a dataframe?

Question

I have many folders, each containing a couple of PDF files (other file types, such as .xlsx or .doc, are there as well). My goal is to extract the text of the PDFs in each folder and create a data frame where each record is a "Folder Name" and each column holds the text content of one PDF file in that folder, as a string.

I managed to extract text from one PDF file with the tika package (code below), but I cannot write a loop that iterates over the other PDFs in the folder, or over the other folders, to build a structured dataframe.

# import the parser object from tika
from tika import parser

# open and parse the pdf file
parsed_pdf = parser.from_file("ducument_1.pdf")

# parsed_pdf['content'] returns the extracted text as a string
data = parsed_pdf['content']

# print the content
print(data)

# <class 'str'>
print(type(data))

The desired output should look like this:

Folder_Name    pdf1                pdf2
17534          text of the pdf1    text of the pdf2
63546          text of the pdf1    text of the pdf2
26374          text of the pdf1    -

Answer 1

If you want to find all the PDFs in a directory and its subdirectories, you can use os.walk and glob; see "Recursive sub folder search and return files in a list python". I've gone for a slightly longer form so that beginners can follow what is happening more easily.

Then, for each file, call Apache Tika and save the result to the next row of the Pandas DataFrame.

#!/usr/bin/python3

import glob
import os

from pandas import DataFrame
from tika import parser

# What file extension to find, and where to look from
ext = "*.pdf"
PATH = "."

# Find all the files with that extension, walking every subdirectory
files = []
for dirpath, dirnames, filenames in os.walk(PATH):
    files += glob.glob(os.path.join(dirpath, ext))

# Create a Pandas DataFrame to hold the filenames and the text
df = DataFrame(columns=("filename", "text"))

# Process each file in turn, parsing with Tika and storing in the dataframe
for idx, filename in enumerate(files):
    data = parser.from_file(filename)
    text = data["content"]
    df.loc[idx] = [filename, text]

# For debugging, print what we found
print(df)
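
The dataframe above has one row per file, whereas the question asks for one row per folder with columns pdf1, pdf2, and so on. Here is a minimal sketch of that reshaping, not part of the original answer. The helper names `collect_pdfs_by_folder`, `build_rows`, `tika_text` and `to_dataframe` are hypothetical. The text extractor is passed in as a parameter, so the grouping logic can be tried without Tika installed. Folders are labelled by their path relative to the starting directory, on the assumption that this is an acceptable "Folder Name".

```python
import os


def collect_pdfs_by_folder(root):
    """Map each folder (path relative to root) to its sorted list of PDF paths."""
    by_folder = {}
    for dirpath, dirnames, filenames in os.walk(root):
        pdfs = sorted(f for f in filenames if f.lower().endswith(".pdf"))
        if pdfs:  # skip folders that contain no PDFs at all
            folder = os.path.relpath(dirpath, root)
            by_folder[folder] = [os.path.join(dirpath, f) for f in pdfs]
    return by_folder


def build_rows(root, extract_text):
    """Build one dict per folder: Folder_Name plus pdf1, pdf2, ... columns.

    extract_text is any callable taking a file path and returning its text.
    """
    rows = []
    for folder, paths in sorted(collect_pdfs_by_folder(root).items()):
        row = {"Folder_Name": folder}
        for i, path in enumerate(paths, start=1):
            row[f"pdf{i}"] = extract_text(path)
        rows.append(row)
    return rows


def tika_text(path):
    """Extract text with Apache Tika (imported lazily; needs the tika package)."""
    from tika import parser
    return parser.from_file(path)["content"]


def to_dataframe(rows):
    """Turn the row dicts into a DataFrame; missing pdfN cells become NaN."""
    import pandas as pd
    return pd.DataFrame(rows)
```

Calling `to_dataframe(build_rows(".", tika_text))` should then give one row per folder. A folder with fewer PDFs than the others gets NaN in the unused columns, which corresponds to the "-" in the desired output.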

Answer 2

It is extremely easy to get a list of all PDFs on Unix.

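import os

# Save the paths of all PDFs under the current directory in one string.
# The raw string keeps the backslash in the grep pattern intact; [2:-1] drops
# the leading "./" of the first path and the trailing newline.
# Note: awk's $2 truncates paths that contain whitespace.
a = os.popen(r"du -a|awk '{print $2}'|grep '.*\.pdf$'").read()[2:-1]
print(a)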

On my computer the output was:

[luca@artix tmp]$ python3 forum.py
a.pdf
./foo/test.pdf

You can then do something like

for line in a.split('\n'):
    print(line, line.split('/'))

and you'll know the folder of each PDF. I hope this helps.
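
As a small illustration of reading the folder out of each listed path, here is a sketch using `pathlib` from the standard library instead of splitting on '/' by hand. The helper name `folder_of` and the sample `listing` string are assumptions made for the example; the listing simply mirrors the output shown above.

```python
from pathlib import PurePosixPath


def folder_of(path_line):
    """Return the name of the folder directly containing the file.

    Files at the top level (no folder component) are reported as ".".
    """
    parent = PurePosixPath(path_line).parent
    return parent.name or "."


# Sample listing in the same shape as the output shown above
listing = "a.pdf\n./foo/test.pdf"
for line in listing.splitlines():
    print(line, "->", folder_of(line))
```

The folder name obtained this way could serve as the key when grouping the extracted texts by folder.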
