How to extract specific information from a multi-line string

Question

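I have extracted some invoice-related information from an email body into a Python string, and my next task is to extract the invoice numbers from that string. The format of the emails can vary, so finding the invoice number in the text is getting difficult. I also tried spaCy's named entity recognition, but in most cases the invoice number appears on the line below the heading "Invoice" or "Invoice#", so NER does not pick up that relationship and returns wrong details.

Below are two examples of the text extracted from the mail body:

Example 1: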

Dear Customer:
The past due invoices listed below are still pending. This includes the 
following:

Invoice   Date     Purchase Order  Due Date  Balance
8754321   8/17/17  7200016508      9/16/18   140.72
5245344   11/7/17  4500199620      12/7/18   301.54

We would appreciate quick payment of these invoices.

Example 2:

Hi - please confirm the status of below two invoices.

Invoice#               Amount               Invoice Date       Due Date          
7651234                $19,579.06          29-Jan-19           28-Apr-19            
9872341                $47,137.20          27-Feb-19           26-Apr-19

My problem is that if I convert the entire text into a single string, it becomes something like this:

Invoice   Date     Purchase Order  Due Date  Balance 8754321   8/17/17 
7200016508     9/16/18   140.72

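As you can see, the invoice number (8754321 in this case) has changed position and no longer sits directly under the keyword "Invoice", which makes it harder to find.

The output I want looks like this: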

Output Example - 1 - 

8754321
5245344

Output Example - 2 - 

7651234                
9872341

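I don't know how to retrieve the text that sits under the keyword "Invoice" or "Invoice#" (that is, the invoice number).

Please let me know if more information is needed. Thanks!

Edit: The invoice number does not have a predefined length; it can be 7 digits or more.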

Answer 1

Here is code based on my comment.

email = '''Dear Customer:
The past due invoices listed below are still pending. This includes the 
following:

Invoice   Date     Purchase Order  Due Date  Balance
8754321   8/17/17  7200016508      9/16/18   140.72
5245344   11/7/17  4500199620      12/7/18   301.54

We would appreciate quick payment of these invoices.'''

index = -1
# Find the table header row and record the column where 'Invoice' starts.
# Heuristic: every word in a header row contains at least one uppercase letter.
for line in email.split('\n'):
    if line and 'Invoice' in line and all(x != x.lower() for x in line.split()):
        index = line.find('Invoice')
        print('--->', line, ' --- index of Invoice:', index)

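This uses the heuristic that every word in the column header row is capitalized or all uppercase (such as ID). It would fail if a header contained an all-lowercase word, for example "Account no." rather than "Account No."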

# Print every integer that starts at the recorded column
for line in email.split('\n'):
    words = line[index:].split()
    if not words:
        continue
    try:
        print(int(words[0]))
    except ValueError:
        continue

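Reliability here depends on the data. In my code, the "Invoice" column must be the first column of the table header: you cannot have "Invoice Date" before "Invoice", because line.find('Invoice') returns the first occurrence. Obviously this would need fixing.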

Answer 2

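Going off what Andrew Allen said, as long as these two assumptions hold:

Invoice numbers are always exactly 7 digits
Invoice numbers are always preceded by whitespace and followed by whitespace

then using a regex should work. Something along the lines of: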

import re

email = '''Dear Customer:
The past due invoices listed below are still pending. This includes the 
following:

Invoice   Date     Purchase Order  Due Date  Balance
8754321   8/17/17  7200016508      9/16/18   140.72
5245344   11/7/17  4500199620      12/7/18   301.54

We would appreciate quick payment of these invoices.'''

invoices = re.findall(r'\s(\d\d\d\d\d\d\d)\s', email)

In this case, invoices contains a list of 2 strings: ['8754321', '5245344']
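The question's edit says the number can be 7 digits or more, which breaks the first assumption. A minimal sketch of one way to relax it, assuming instead that the invoice number is the first thing on its line (true for both sample emails, but an assumption all the same). The name `LEADING_NUMBER` and the 9-digit number in the second row are invented for illustration.

```python
import re

# Assumption: the invoice number is the first thing on its line and has
# at least 7 digits; [ \t]* tolerates leading indentation.
LEADING_NUMBER = re.compile(r"^[ \t]*(\d{7,})\b", re.MULTILINE)

sample = """Invoice   Date     Purchase Order  Due Date  Balance
8754321   8/17/17  7200016508      9/16/18   140.72
123456789 11/7/17  4500199620      12/7/18   301.54
"""

print(LEADING_NUMBER.findall(sample))  # ['8754321', '123456789']
```

Anchoring with ^ under re.MULTILINE is what keeps the 10-digit purchase order numbers out of the result; a plain \d{7,} would pick those up too.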

Answer 3

Use a regex with re.findall.

For example:

import re

email = '''Dear Customer:
The past due invoices listed below are still pending. This includes the 
following:

Invoice   Date     Purchase Order  Due Date  Balance
8754321   8/17/17  7200016508      9/16/18   140.72
5245344   11/7/17  4500199620      12/7/18   301.54

We would appreciate quick payment of these invoices.'''

email2 = """Hi - please confirm the status of below two invoices.

Invoice#               Amount               Invoice Date       Due Date          
7651234                $19,579.06          29-Jan-19           28-Apr-19            
9872341                $47,137.20          27-Feb-19           26-Apr-19 """

for eml in [email, email2]:
    print(re.findall(r"\b\d{7}\b", eml))

Output:

['8754321', '5245344']
['7651234', '9872341']

\b - regex word boundary
\d{7} - matches exactly 7 digits
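To see why the \b boundaries matter, here is a small illustration on one row from the first sample email. Without them, \d{7} also matches the first seven digits of the 10-digit purchase order number.

```python
import re

row = "8754321   8/17/17  7200016508      9/16/18   140.72"

without_boundary = re.findall(r"\d{7}", row)
with_boundary = re.findall(r"\b\d{7}\b", row)

print(without_boundary)  # ['8754321', '7200016']
print(with_boundary)     # ['8754321']
```

Note that this pattern still assumes exactly 7 digits, so it would miss the longer invoice numbers mentioned in the question's edit.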

Answer 1
2 accepted 2019-05-09 07:49:57

Answer 2
1 2019-05-08 13:51:52

Answer 3
1 2019-05-08 14:12:15
