How to read PDF files one by one from a folder in Python

I am reading PDF files and trying to extract keywords from them with NLP techniques. Right now the program accepts one PDF at a time. I have a folder on my D drive named 'pdf_docs' that contains many PDF documents. My goal is to read each PDF file from the folder, one by one. How can I do that in Python? The code that works so far is below.

import PyPDF2

file = open('abc.pdf', 'rb')
fileReader = PyPDF2.PdfFileReader(file)

count = 0
while count < 3:
    pageObj = fileReader.getPage(count)
    count += 1
    text = pageObj.extractText()

You can use the glob module, which does pattern matching, to get a list of all PDF files in your directory.

import glob

pdf_dir = "/foo/dir"

pdf_files = glob.glob("%s/*.pdf" % pdf_dir)
for file in pdf_files:
    do_your_stuff()
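
One way to flesh out the pattern-matching step is sketched below. The helper name find_pdfs is hypothetical and not part of the answer. It builds the pattern with os.path.join, uses character classes so that .PDF is matched too on case-sensitive file systems, and sorts the result, since glob makes no promise about ordering.

```python
import glob
import os


def find_pdfs(pdf_dir):
    """Return the sorted paths of all PDF files directly inside pdf_dir."""
    # escape the folder name so characters like [ ] in it are taken literally;
    # [pP][dD][fF] matches .pdf, .PDF and mixed-case variants
    pattern = os.path.join(glob.escape(pdf_dir), "*.[pP][dD][fF]")
    # glob returns matches in arbitrary order, so sort for repeatable runs
    return sorted(glob.glob(pattern))
```

For the folder in the question, the call would be something like find_pdfs("D:/pdf_docs"), with each returned path then handed to the PDF-reading code.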

First, list all the files in that directory:

from os import listdir
from os.path import isfile, join
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]

Then run your code for each file in that list:

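import PyPDF2
from os import listdir
from os.path import isfile, join

# mypath is the folder that holds the PDFs
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
for file in onlyfiles:
    # listdir() returns bare names, so join each one with the folder path
    fileReader = PyPDF2.PdfFileReader(open(join(mypath, file), 'rb'))

    count = 0
    # stop at the last page so PDFs shorter than 3 pages do not raise an error
    while count < min(3, fileReader.numPages):
        pageObj = fileReader.getPage(count)
        count += 1
        text = pageObj.extractText()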

os.listdir() returns every entry in a directory, files and subdirectories alike. So either make sure the folder contains only PDF files, or filter the list by extension.
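
The simple filter mentioned above could look something like this. The function name list_pdf_paths is an assumption made for illustration. It keeps only regular files whose names end in .pdf, ignoring case, and returns full paths so they can be passed straight to open().

```python
from os import listdir
from os.path import isfile, join


def list_pdf_paths(mypath):
    """Return full paths of the PDF files directly inside mypath."""
    pdf_paths = []
    for name in sorted(listdir(mypath)):
        full_path = join(mypath, name)
        # skip subdirectories and anything without a .pdf extension
        if isfile(full_path) and name.lower().endswith(".pdf"):
            pdf_paths.append(full_path)
    return pdf_paths
```

Returning full paths rather than bare names avoids the need to change the working directory before opening each file.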

Edit 1

You can also use the glob module, since it does pattern matching.

>>> import glob
>>> print(glob.glob('/home/rszamszur/*.sh'))
['/home/rszamszur/work-monitors.sh', '/home/rszamszur/default-monitor.sh', '/home/rszamszur/home-monitors.sh']

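Both modules work on every platform Python supports; glob uses Unix shell-style wildcards but is not limited to Unix-like systems. The practical difference is that os.listdir() returns the bare name of every entry in the directory, while glob.glob() returns only the paths that match the pattern.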

import PyPDF2
import re
import glob

# full path of your directory
mypath = "dir"
for file in glob.glob(mypath + "/*.pdf"):
    print(file)
    if file.endswith('.pdf'):
        fileReader = PyPDF2.PdfFileReader(open(file, "rb"))
        # walk the pages from first to last
        count = 0
        while count < fileReader.numPages:
            pageObj = fileReader.getPage(count)
            count += 1
            text = pageObj.extractText()
            print(text)
            # runs of digits found on this page
            num = re.findall(r'[0-9]+', text)
            print(num)
    else:
        print("not in format")

Let's go through the code. Python's standard library cannot read PDF files, so we install the PyPDF2 package and import it. glob.glob() lists the files in the directory that match the pattern, and the "for" loop visits each of them. The "if" condition checks that the file name ends in .pdf. PdfFileReader then opens the file, and numPages gives the number of pages in the document. Finally, the while loop fetches each page and prints all the text in the file.
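
The same steps can be sketched as two small functions. This is an illustration under stated assumptions rather than the answer's own code. It assumes PyPDF2 2.0 or later, where PdfReader, reader.pages and extract_text() are available in place of the older PdfFileReader, getPage and extractText names. The function names extract_pdf_text and read_folder are hypothetical.

```python
import glob
import os


def extract_pdf_text(path):
    """Return the text of every page in one PDF as a list of strings.

    Assumes PyPDF2 2.0 or later is installed.
    """
    from PyPDF2 import PdfReader  # third-party package, imported when needed

    with open(path, "rb") as handle:
        reader = PdfReader(handle)
        # one string per page, in page order
        return [page.extract_text() or "" for page in reader.pages]


def read_folder(folder, extract=extract_pdf_text):
    """Map each PDF path in folder to whatever extract returns for it."""
    results = {}
    for path in sorted(glob.glob(os.path.join(folder, "*.pdf"))):
        try:
            results[path] = extract(path)
        except Exception as error:
            # a damaged or unreadable PDF should not stop the whole batch
            print(f"skipping {path}: {error}")
    return results
```

Passing the extractor as a parameter keeps the folder loop separate from the PDF library, so a keyword-extraction step can be swapped in without touching the loop.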

