Read all files in a directory and output the files that contain certain regular expressions

Question

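I am trying to read all the files in a directory and output the files that contain a match for a regular expression, together with the matches found in each file.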

import glob
import re
import PyPDF2
#-------------------------------------------------Input----------------------------------------------------------------------------------------------
folder_path = "/home/"
file_pattern = "/*"
folder_contents = glob.glob(folder_path + file_pattern)

#Search for Emails
regex1= re.compile(r'\S+@\S+')
#Search for Phone Numbers
regex2 = re.compile(r'\d\d\d[-]\d\d\d[-]\d\d\d\d')

match_list=[]

for file in folder_contents:

    if re.search(r".*(?=pdf$)",file):
        #this is pdf
        with open(file, 'rb') as pdfFileObj:
            pdfReader = PyPDF2.PdfFileReader(pdfFileObj) 
            pageObj = pdfReader.getPage(0)  
            content = pageObj.extractText()
            read_file = open(file,'rb')
            #print("{}".format(file))

    elif re.search(r".*(?=csv$)",file):
        #this is csv
        with open(file,"r+",encoding="utf-8") as csv:
            read_file = csv.read()
            #print("{}".format(file))
    elif re.search(r"/jupyter",file):
        print("wow")
    elif re.search(r"/scikit",file):
        print("wow")
    else:
        read_file = open(file, 'rb').read()
       #print("{}".format(file))
        continue
    if regex1.findall(read_file) or regex2.findall(read_file):
                print(read_file)

I managed to write the code above, but it gives the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-39-f614d35e0441> in <module>()
     38        #print("{}".format(file))
     39         continue
---> 40     if regex1.findall(read_file) or regex2.findall(read_file):
     41                 print(read_file)

TypeError: expected string or bytes-like object

Is there any way to make this work without the error?

Answer 1

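Replace your file-reading code with this:

with open(file, mode='rb') as f:
    # read() returns bytes in 'rb' mode; decode it (or compile bytes
    # patterns) before matching against the str patterns regex1/regex2
    read_file = f.read()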
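
One caveat with binary mode: read() returns bytes, while the patterns in the question are compiled from str, so matching needs either a decode step or bytes patterns. A minimal sketch of both options follows. The sample payload is invented, UTF-8 is an assumed encoding, and regex2_bytes is an illustrative addition that is not in the question.

```python
import re

# str pattern taken from the question
regex2 = re.compile(r'\d\d\d[-]\d\d\d[-]\d\d\d\d')
# bytes variant of the same pattern (illustrative addition)
regex2_bytes = re.compile(rb'\d\d\d[-]\d\d\d[-]\d\d\d\d')

# Stand-in for what f.read() yields in 'rb' mode (invented sample)
raw = b"call 555-123-4567 today"

try:
    regex2.findall(raw)
    mixed_ok = True
except TypeError:
    mixed_ok = False  # a str pattern cannot be applied to bytes

# Option 1: decode first, then reuse the str pattern
from_decoded = regex2.findall(raw.decode("utf-8", errors="replace"))
# Option 2: keep the bytes and use a bytes pattern
from_bytes = regex2_bytes.findall(raw)
```

Decoding first is usually the simpler route when the results are meant to be printed, since the matches come back as str.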

Answer 2

Calling read() on open(filename) will work. Just replace your line with this and the problem is solved.

read_file = open(file).read()

Answer 3

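First, I apologize to the others who answered this question, because I am going to say something about the OP's previous question.

To the OP: you should not copy code without understanding it.

content is the page text you have already extracted. That means your code should be read_file = content. I wrote read_file = # in my earlier answer because I assumed you would add extra code there, but it should not open the same file again.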

with open(file, 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    pageObj = pdfReader.getPage(0)
    content = pageObj.extractText()
    read_file = open(file,'rb')
    #^---^---^ according to your former question, `read_file` should be `content`
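
To see why that line triggers the TypeError, here is a minimal sketch that reproduces it without PyPDF2. io.StringIO stands in for the open file handle and the sample text is invented; only the regex1 pattern comes from the question.

```python
import io
import re

regex1 = re.compile(r'\S+@\S+')  # email pattern from the question

# Stand-in for an open file handle; the text is invented for illustration.
handle = io.StringIO("contact: alice@example.com")

try:
    regex1.findall(handle)   # passes the handle itself, as the question does
    error_message = None
except TypeError as exc:
    error_message = str(exc)  # findall only accepts str or bytes

content = handle.read()            # the text is what the regex needs
matches = regex1.findall(content)
```

The same reasoning applies to the PDF branch: the extracted text in content is a string, while open(file,'rb') is only a handle.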

Another problem will also come up. You should add continue after each print("wow").

elif re.search(r"/jupyter",file):
    print("wow")
elif re.search(r"/scikit",file):
    print("wow")

Otherwise your code keeps running into the check below and an error occurs, because nothing was read for that file.

if regex1.findall(read_file) or regex2.findall(read_file):
    print(read_file)
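
Putting both points together, the loop could be restructured roughly as follows. This is only a sketch under stated assumptions, not the OP's final code: scan_folder, read_text, and the readers hook are hypothetical names, UTF-8 with replacement is an assumed way to read non-PDF files, and PDF extraction is left to a caller-supplied function so the example does not depend on a particular PyPDF2 version.

```python
import glob
import os
import re

EMAIL = re.compile(r'\S+@\S+')                      # regex1 in the question
PHONE = re.compile(r'\d\d\d[-]\d\d\d[-]\d\d\d\d')   # regex2 in the question


def read_text(path):
    """Return the file's text; undecodable bytes are replaced, not fatal."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return handle.read()


def scan_folder(folder_path, readers=None, skip=("jupyter", "scikit")):
    """Map each matching file path to the emails and phone numbers found in it.

    `readers` is a hypothetical hook: extension -> function(path) -> str,
    e.g. {".pdf": my_pdf_reader} wrapping whatever PDF library is in use.
    """
    readers = readers or {}
    results = {}
    for path in sorted(glob.glob(os.path.join(folder_path, "*"))):
        name = os.path.basename(path)
        if not os.path.isfile(path) or name.startswith(skip):
            continue  # nothing is read for this entry, so skip the search
        extension = os.path.splitext(name)[1].lower()
        content = readers.get(extension, read_text)(path)  # always a str
        found = EMAIL.findall(content) + PHONE.findall(content)
        if found:
            results[path] = found
    return results
```

Every branch either assigns a string to content or skips the entry with continue, so the regex search never sees a file handle or a stale value from a previous file.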

3 answers

Answer 1
0 2018-12-03 19:30:29

Answer 2
0 2018-12-03 19:31:54

Answer 3
0 2018-12-04 06:24:52
