How to extract specific information from a multi-line string
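I have extracted some invoice-related information from an email body into a Python string. My next task is to extract the invoice numbers from that string. The format of the emails can vary, which makes it difficult to find the invoice number in the text. I also tried SpaCy's named entity recognition (NER), but because in most cases the invoice number appears on the line below the heading "Invoice" or "Invoice#", NER does not pick up the relationship and returns wrong details.
Below are two examples of text extracted from email bodies:
Example 1: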
Dear Customer:
The past due invoices listed below are still pending. This includes the
following:
Invoice Date Purchase Order Due Date Balance
8754321 8/17/17 7200016508 9/16/18 140.72
5245344 11/7/17 4500199620 12/7/18 301.54
We would appreciate quick payment of these invoices.
Example 2:
Hi - please confirm the status of below two invoices.
Invoice# Amount Invoice Date Due Date
7651234 $19,579.06 29-Jan-19 28-Apr-19
9872341 $47,137.20 27-Feb-19 26-Apr-19
My problem is that if I convert the whole text into a single string, it becomes something like this:
Invoice Date Purchase Order Due Date Balance 8754321 8/17/17
7200016508 9/16/18 140.72
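As you can see, the invoice number (8754321 in this case) has changed position and no longer sits under the keyword "Invoice", which makes it harder to find.
The output I want looks like this: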
Output Example - 1 -
8754321
5245344
Output Example - 2 -
7651234
9872341
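I don't know how to retrieve the text that sits under the keyword "Invoice" or "Invoice#" (i.e. the invoice number).
Please let me know if you need more information. Thanks!
Edit: The invoice number has no predefined length; it can be 7 digits or more.
Here is code based on my comment.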
email = '''Dear Customer:
The past due invoices listed below are still pending. This includes the
following:
Invoice Date Purchase Order Due Date Balance
8754321 8/17/17 7200016508 9/16/18 140.72
5245344 11/7/17 4500199620 12/7/18 301.54
We would appreciate quick payment of these invoices.'''
index = -1
# Find the table header row: every word contains an uppercase letter and
# 'Invoice' appears in it. Print the row and remember where 'Invoice' starts.
for line in email.split('\n'):
    if all(x != x.lower() for x in line.split()) and ('Invoice' in line) and len(line) > 0:
        print('--->', line, ' --- index of Invoice:', line.find('Invoice'))
        index = line.find('Invoice')
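This uses the heuristic that the row with the column headers is always title case or upper case (e.g. ID). It would fail if a heading were written as, say, "Account no." rather than "Account No."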
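To make the header heuristic concrete, here is a minimal sketch that isolates the check in a helper. The name `is_header_row` and the sample lines are illustrative assumptions, not part of the answer's code.

```python
def is_header_row(line, keyword="Invoice"):
    """Heuristic: a header row contains the keyword and every
    whitespace-separated word has at least one uppercase character."""
    words = line.split()
    return bool(words) and keyword in line and all(w != w.lower() for w in words)


# Title-case header row: accepted
print(is_header_row("Invoice Date Purchase Order Due Date Balance"))  # True
# A fully lowercase word ("no.") breaks the heuristic
print(is_header_row("Invoice no. Balance"))  # False
# An ordinary sentence that only mentions "invoices": rejected
print(is_header_row("We would appreciate quick payment of these invoices."))  # False
```

The second call shows the weak spot: a single lowercase word in the header is enough for the row to be skipped.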
# Print every number that starts at that column index
for line in email.split('\n'):
    words = line[index:].split()
    if not words:
        continue
    word = words[0]
    try:
        print(int(word))
    except ValueError:
        continue
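How reliable this is depends on the data. In my code, the "Invoice" column must come first among the table headers that contain that word, i.e. you cannot have "Invoice Date" before "Invoice". Obviously, this needs fixing.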
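The two steps (find the header row, then read the value at that column offset on the following lines) can be folded into a single pass. The sketch below is one possible arrangement, not the answer's own code: the function name `invoice_numbers_below_header` is hypothetical, and it keeps the same assumptions (title-case header row, and the first occurrence of "Invoice" marks the invoice-number column).

```python
def invoice_numbers_below_header(text, keyword="Invoice"):
    """Return the digit-only tokens found under the first `keyword` column."""
    numbers = []
    offset = None  # character offset of the keyword in the header row
    for line in text.splitlines():
        if offset is None:
            words = line.split()
            # Header heuristic: keyword present, every word has an uppercase char
            if words and keyword in line and all(w != w.lower() for w in words):
                offset = line.find(keyword)
            continue
        # Rows after the header: take the first token at the column offset
        tokens = line[offset:].split()
        if tokens and tokens[0].isdigit():
            numbers.append(tokens[0])
    return numbers


email = '''Dear Customer:
The past due invoices listed below are still pending. This includes the
following:
Invoice Date Purchase Order Due Date Balance
8754321 8/17/17 7200016508 9/16/18 140.72
5245344 11/7/17 4500199620 12/7/18 301.54
We would appreciate quick payment of these invoices.'''

print(invoice_numbers_below_header(email))  # ['8754321', '5245344']
```

Because the lines are processed one at a time, the relationship between the header and the rows beneath it is preserved, which is what gets lost when the text is flattened into a single line. Tokens are returned as strings so that leading zeros, if any, are not dropped.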
Going with what Andrew Allen said, as long as these two assumptions hold:
Using a regular expression should work. Something along the lines of:
import re
email = '''Dear Customer:
The past due invoices listed below are still pending. This includes the
following:
Invoice Date Purchase Order Due Date Balance
8754321 8/17/17 7200016508 9/16/18 140.72
5245344 11/7/17 4500199620 12/7/18 301.54
We would appreciate quick payment of these invoices.'''
invoices = re.findall(r'\s(\d\d\d\d\d\d\d)\s', email)
In this case, invoices is a list of 2 strings: ['8754321', '5245344']
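One caveat with a pattern that requires `\s` on both sides: the whitespace is consumed by each match, so a number at the very start or end of the string, or two numbers separated by a single space, can be missed. A small sketch, using a made-up one-line sample, compares it with lookarounds that only assert the absence of a non-space character:

```python
import re

text = "8754321 5245344 9872341"  # made-up sample, not from the emails

# \s on both sides consumes the separating spaces, so neighbouring numbers
# and numbers at the start or end of the string are missed.
consuming = re.findall(r"\s(\d{7})\s", text)

# Lookarounds check the surroundings without consuming any characters.
lookaround = re.findall(r"(?<!\S)\d{7}(?!\S)", text)

print(consuming)   # ['5245344']
print(lookaround)  # ['8754321', '5245344', '9872341']
```

On the sample email this difference does not show up, because each invoice number is preceded by a newline and followed by a space.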
Use a regular expression with re.findall.
For example:
import re
email = '''Dear Customer:
The past due invoices listed below are still pending. This includes the
following:
Invoice Date Purchase Order Due Date Balance
8754321 8/17/17 7200016508 9/16/18 140.72
5245344 11/7/17 4500199620 12/7/18 301.54
We would appreciate quick payment of these invoices.'''
email2 = """Hi - please confirm the status of below two invoices.
Invoice# Amount Invoice Date Due Date
7651234 $19,579.06 29-Jan-19 28-Apr-19
9872341 $47,137.20 27-Feb-19 26-Apr-19 """
for eml in [email, email2]:
    print(re.findall(r"\b\d{7}\b", eml, flags=re.DOTALL))
Output:
['8754321', '5245344']
['7651234', '9872341']
\b - regex word boundary
\d{7} - matches exactly 7 digits
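The question's edit says an invoice number can be 7 digits or more, which a fixed `\d{7}` does not cover, and simply widening it to `\d{7,}` would also pick up the 10-digit purchase order numbers. A possible variation, assuming (as in both sample emails) that the invoice number is the first token on its row, is to anchor the match at the start of each line. The name `LINE_START_NUMBER` is illustrative.

```python
import re

# 7 or more digits at the start of a line (optional indentation allowed),
# followed by whitespace or the end of the text.
LINE_START_NUMBER = re.compile(r"^[ \t]*(\d{7,})(?!\S)", flags=re.MULTILINE)

email = '''Dear Customer:
The past due invoices listed below are still pending. This includes the
following:
Invoice Date Purchase Order Due Date Balance
8754321 8/17/17 7200016508 9/16/18 140.72
5245344 11/7/17 4500199620 12/7/18 301.54
We would appreciate quick payment of these invoices.'''

email2 = """Hi - please confirm the status of below two invoices.
Invoice# Amount Invoice Date Due Date
7651234 $19,579.06 29-Jan-19 28-Apr-19
9872341 $47,137.20 27-Feb-19 26-Apr-19 """

for eml in (email, email2):
    print(LINE_START_NUMBER.findall(eml))
# ['8754321', '5245344']
# ['7651234', '9872341']
```

This only works on text that still has its line breaks; if the invoice column is not the first one, the column-offset approach is the better fit.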