How to add text from documents in a folder to an array

Good afternoon. Unfortunately, I could not find an answer to a simple question. I have a folder of documents in PDF format. I can use Pandas to open one document and add its text to an array, where the first column is the folder name and the second is the text from the document. But how do I do this for all documents in a folder? I don't know.

category    text
test        first document
test        second document
test        ...

Assuming you put the code you already have into a function that takes a file name and the DataFrame built so far, it's pretty easy to do what you want:

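import os
import pandas as pd

# Start with an empty table; one row is added per file.
dataframe = pd.DataFrame()

folder = "[path/to/folder/]"
files = os.listdir(folder)

for file in files:
    # os.listdir returns bare file names, so join each one with the folder path.
    dataframe = addFileToTable(os.path.join(folder, file), dataframe)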

If you're not sure how to add a new row to the dataframe:

def addFileToTable(file, dataframe):
    # Convert PDF to array
    # ...

    row = {"category": array[0], "text": array[1]}
    # DataFrame.append was removed in pandas 2.0; pd.concat adds the row instead.
    df = pd.concat([dataframe, pd.DataFrame([row])], ignore_index=True)
    return df
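
As an alternative to growing the DataFrame one row at a time, the rows can be collected in a plain list first and turned into a table once at the end. The sketch below is only one way to do it and rests on a few assumptions: `collect_rows` is a hypothetical helper name, `extract_text` stands in for whatever PDF-reading code you already have (it is not a real library function), and the category is taken from the folder name, as described in the question.

```python
from pathlib import Path


def collect_rows(folder, extract_text):
    """Return one {"category", "text"} dict per PDF found in `folder`.

    `extract_text` is a placeholder for your own PDF-reading function:
    it receives a Path and should return the document's text.
    """
    folder = Path(folder)
    rows = []
    # sorted() gives a stable order; the suffix check skips non-PDF files.
    for path in sorted(folder.iterdir()):
        if path.is_file() and path.suffix.lower() == ".pdf":
            rows.append({"category": folder.name, "text": extract_text(path)})
    return rows
```

Passing the result to `pd.DataFrame(rows)` should then give the two-column table from the question. Building the table once avoids copying the whole DataFrame for every file, which can matter when the folder is large.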
