How to extract text from a Word document in Python? (and put the data in a df)

I have a large collection of folders and files (.docx documents). I want to create a df with four columns: folder, file, value, and date. The first two hold the names of the folders and files; the other two are values that I want to extract from inside the Word documents.

I already managed to put the names of the folders and the .docx files in a df, as shown in the following code.

# imports
import os
import pandas as pd

path = ''

data = []
for folder in sorted(os.listdir(path)):
    if folder.startswith('HH'):
        for file in sorted(os.listdir(os.path.join(path, folder))):
            if file.endswith('.docx'):
                data.append((folder, file))

df = pd.DataFrame(data, columns=['Folder', 'File_name'])
df

However, I can't find a way to get the values I want from the .docx files. I first tried to do it separately for a single file, like this:

# Import the module
import docx2txt

path2 = ''
# Open the .docx file
document = docx2txt.process(path2)

document

I got this result:

'Property Nr: \tTEST\n\nProperty Comments\t\t\t\n\n\t\t \n\n\n\n\n\n\n\n\n\nReinstatement value \t\n\nEuro __ 191,250.00 excl VAT\n\n\t\t\n\nReinstatement value \t\n\nEuro __ 191,250.00 excl VAT\n\n\t\t\n\n\n\n\n\n\n\n\n\nSigned:\n\n________________________________\n\nPerit TEST\n\nDate: 24th June 2021\n\n\n\nSigned:\n\n________________________________\n\nTEST\n\nDate: 24th June 2021'

The two values I want are:

  1. The number in Euro __ 191,250.00
  2. The date: 24th June 2021

I would really appreciate it if you could help me at least get these values. Thanks

You can use re.search().
If your document is of type str (as returned by docx2txt.process()), try the following code.

import re

value_match = re.search('Euro __ (.*)excl', document)
value = value_match.group(1).strip()

date_match = re.search('Date:(.*)', document)
date = date_match.group(1).strip()

print(f"Value: {value}, Date: {date}")

Output:

Value: 191,250.00, Date: 24th June 2021
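Note that re.search() returns None when a pattern is not found, so calling .group(1) fails with an AttributeError for any document that lacks one of the two fields. The following is a minimal sketch of how the same idea could be made tolerant of missing fields and applied to every file to get the four columns from the question. The names extract_value_and_date, collect_rows, VALUE_RE, and DATE_RE are hypothetical, the patterns are assumptions based only on the single sample text above, and the date parsing assumes English month names.

```python
import os
import re
from datetime import datetime

# Assumed patterns, derived only from the sample text in the question.
VALUE_RE = re.compile(r"Euro\s+_*\s*([\d,]+(?:\.\d+)?)")
DATE_RE = re.compile(r"Date:\s*(\d{1,2})(?:st|nd|rd|th)?\s+([A-Za-z]+)\s+(\d{4})")


def extract_value_and_date(text):
    """Return (value, date) from the extracted text; None for a field that is not found."""
    value = None
    date = None

    value_match = VALUE_RE.search(text)
    if value_match:
        # Drop the thousands separators before converting to float.
        value = float(value_match.group(1).replace(",", ""))

    date_match = DATE_RE.search(text)
    if date_match:
        day, month, year = date_match.groups()
        try:
            # %B expects a full month name in the current locale (English assumed).
            date = datetime.strptime(f"{day} {month} {year}", "%d %B %Y").date()
        except ValueError:
            date = None

    return value, date


def collect_rows(path, read_text):
    """Return (folder, file, value, date) tuples for .docx files in 'HH*' folders.

    read_text is any callable that maps a .docx path to a str,
    for example docx2txt.process.
    """
    rows = []
    for folder in sorted(os.listdir(path)):
        folder_path = os.path.join(path, folder)
        if not (folder.startswith("HH") and os.path.isdir(folder_path)):
            continue
        for file in sorted(os.listdir(folder_path)):
            if file.endswith(".docx"):
                text = read_text(os.path.join(folder_path, file))
                rows.append((folder, file, *extract_value_and_date(text)))
    return rows
```

With docx2txt and pandas installed, something like pd.DataFrame(collect_rows(path, docx2txt.process), columns=['Folder', 'File_name', 'Value', 'Date']) should then give the four-column df. Only the first match of each pattern is used, which is enough for the sample because the value and the date are repeated identically.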

