How to extract specific information from a multi-line string

I have extracted some invoice-related information from email bodies into Python strings; my next task is to extract the invoice numbers from those strings. The format of the emails can vary, so it is difficult to locate the invoice number in the text. I also tried Named Entity Recognition (NER) from spaCy, but in most cases the invoice number appears on the line below the heading 'Invoice' or 'Invoice#', so NER doesn't pick up the relation and returns incorrect details.

Below are 2 examples of the text extracted from the mail body:

Example - 1.

Dear Customer:
The past due invoices listed below are still pending. This includes the 
following:

Invoice   Date     Purchase Order  Due Date  Balance
8754321   8/17/17  7200016508      9/16/18   140.72
5245344   11/7/17  4500199620      12/7/18   301.54

We would appreciate quick payment of these invoices.

Example - 2.

Hi - please confirm the status of below two invoices.

Invoice#               Amount               Invoice Date       Due Date          
7651234                $19,579.06          29-Jan-19           28-Apr-19            
9872341                $47,137.20          27-Feb-19           26-Apr-19 

My problem is that if I convert this entire text to a single string, it becomes something like this:

Invoice   Date     Purchase Order  Due Date  Balance 8754321   8/17/17 
7200016508     9/16/18   140.72

As you can see, the invoice number (8754321 in this case) has changed position and no longer directly follows the keyword "Invoice", which makes it harder to find.

My desired output is something like this:

Output Example - 1 - 

8754321
5245344

Output Example - 2 - 

7651234                
9872341        

I don't know how to retrieve the text just under the keyword "Invoice" or "Invoice#", which is the invoice number.

Please let me know if further information is required. Thanks!!

Edit: The invoice number doesn't have a pre-defined length; it can be 7 digits or more.

Code per my comments.

email = '''Dear Customer:
The past due invoices listed below are still pending. This includes the 
following:

Invoice   Date     Purchase Order  Due Date  Balance
8754321   8/17/17  7200016508      9/16/18   140.72
5245344   11/7/17  4500199620      12/7/18   301.54

We would appreciate quick payment of these invoices.'''

index = -1
# Get first line of table, print line and index of 'Invoice'
for line in email.split('\n'):
    if all(x != x.lower() for x in line.split()) and ('Invoice' in line) and len(line) > 0:
        print('--->', line, ' --- index of Invoice:', line.find('Invoice'))
        index = line.find('Invoice')

This uses the heuristic that every word in the column header row contains a capital letter (capitalised words, or all-caps such as ID). It would fail if, say, a heading were exactly 'Account no.' rather than 'Account No.'
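To make the heuristic easier to see in isolation, here is a minimal sketch of the same check pulled out into a helper. The name looks_like_header is made up for illustration and is not part of the code above.

```python
def looks_like_header(line):
    """Return True if every whitespace-separated word contains an uppercase letter."""
    words = line.split()
    # A word that equals its own lowercase form has no uppercase letters in it.
    return bool(words) and all(w != w.lower() for w in words)


print(looks_like_header('Invoice   Date     Purchase Order  Due Date  Balance'))  # True
print(looks_like_header('Invoice   Account no.   Balance'))  # False, 'no.' is all lowercase
print(looks_like_header('8754321   8/17/17  7200016508      9/16/18   140.72'))  # False, digits have no uppercase form
```

The second call shows the failure mode mentioned above: a single all-lowercase word in the header is enough for the row to be rejected.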

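# Print every integer that starts at the column where 'Invoice' was found
for line in email.split('\n'):
    words = line[index:].split()
    if not words:
        continue
    word = words[0]
    try:
        print(int(word))
    except ValueError:
        # first word at that column is not a number; skip the line
        continue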

Reliability here depends on the data. In my code, the 'Invoice' column must be the first column in the table header, i.e. you can't have 'Invoice Date' before 'Invoice'. Obviously this would need fixing.
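One possible way around that limitation is sketched below. It is only a sketch under stated assumptions, not a tested solution for every email layout: the function name invoice_numbers is hypothetical, a two-word heading such as 'Invoice Date' is assumed to use a single space, each value is assumed to start at or after the column where its heading starts, and a blank line is assumed to end the table.

```python
import re


def invoice_numbers(text):
    """Collect the digits that sit under an 'Invoice' / 'Invoice#' column heading."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        # 'Invoice' or 'Invoice#' as a heading of its own, but not 'Invoice Date'
        # (assumption: a two-word heading is separated by exactly one space).
        header = re.search(r'\bInvoice#?(?! Date)(?=\s|$)', line)
        if header is None:
            continue
        col = header.start()
        found = []
        for row in lines[i + 1:]:
            if not row.strip():
                break  # assumption: a blank line ends the table
            # Take the run of digits that starts at (or just after) the heading's column.
            cell = re.match(r'\d+', row[col:].lstrip())
            if cell:
                found.append(cell.group())
        if found:
            return found
    return []


email1 = '''The past due invoices listed below are still pending.

Invoice   Date     Purchase Order  Due Date  Balance
8754321   8/17/17  7200016508      9/16/18   140.72
5245344   11/7/17  4500199620      12/7/18   301.54

We would appreciate quick payment of these invoices.'''

print(invoice_numbers(email1))  # ['8754321', '5245344']
```

Because the digits are matched with \d+, this does not depend on the invoice number having a particular length. It will still go wrong if the rows are not aligned with the header.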

Going off what Andrew Allen was saying, as long as these 2 assumptions are true:

  1. Invoice numbers are always exactly 7 numerical digits
  2. Invoice numbers always follow a whitespace character and are followed by a whitespace character

Using regex should work. Something along the lines of:

import re

email = '''Dear Customer:
The past due invoices listed below are still pending. This includes the 
following:

Invoice   Date     Purchase Order  Due Date  Balance
8754321   8/17/17  7200016508      9/16/18   140.72
5245344   11/7/17  4500199620      12/7/18   301.54

We would appreciate quick payment of these invoices.'''

invoices = re.findall(r'\s(\d\d\d\d\d\d\d)\s', email)

invoices in this case is a list of 2 strings: ['8754321', '5245344']
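Since the edit to the question says the number can be longer than 7 digits, a variation on the same idea might look like the sketch below. It swaps the two assumptions above for different ones: the invoice number is the first thing on its line and has at least 7 digits. The last row of the sample text is made up to show a longer number.

```python
import re

email = '''Invoice   Date     Purchase Order  Due Date  Balance
8754321   8/17/17  7200016508      9/16/18   140.72
5245344   11/7/17  4500199620      12/7/18   301.54
123456789 1/2/18   4500199621      2/1/19    10.00'''  # last row is a made-up example

# With re.MULTILINE, ^ matches at the start of every line, so only a number
# that begins a line is captured; the purchase order numbers are skipped.
invoices = re.findall(r'^[ \t]*(\d{7,})\b', email, flags=re.MULTILINE)
print(invoices)  # ['8754321', '5245344', '123456789']
```

This would miss invoice numbers that are not in the first column, so it only fits tables laid out like the two examples in the question.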

Using regex: re.findall

Ex:

import re

email = '''Dear Customer:
The past due invoices listed below are still pending. This includes the 
following:

Invoice   Date     Purchase Order  Due Date  Balance
8754321   8/17/17  7200016508      9/16/18   140.72
5245344   11/7/17  4500199620      12/7/18   301.54

We would appreciate quick payment of these invoices.'''

email2 = """Hi - please confirm the status of below two invoices.

Invoice#               Amount               Invoice Date       Due Date          
7651234                $19,579.06          29-Jan-19           28-Apr-19            
9872341                $47,137.20          27-Feb-19           26-Apr-19 """

for eml in [email, email2]:
    print(re.findall(r"\b\d{7}\b", eml))  # re.DOTALL dropped: the pattern contains no '.'

Output:

['8754321', '5245344']
['7651234', '9872341']
  • \b - regex word boundary
  • \d{7} - matches a 7-digit number
