Extract everything but a regex match

I have a dataframe where each row contains an e-mail's raw text. I need to clean up the data to extract the following columns: From, To, CC, Subject, and the body of the text. The e-mails typically look like this:

From   : Vincent Adultman
To     : Business Person, 
Cc     : 
Subject: On the subject of business Transactions

Dear blabla,

We would like to bla bla to improve our bla bla by X%.


Thanks in advance

I was able to extract the first four columns using the following regular expressions:

import pandas as pd
df = pd.DataFrame(data=data,columns=['text'],dtype='string')

df['from'] = df.loc[:,'text'].str.extract(pat=r'(\bFrom .+)')
df['to'] = df.loc[:,'text'].str.extract(pat=r'(\bTo .+)')
df['cc'] = df.loc[:,'text'].str.extract(pat=r'(\bCc .+)')
df['bcc'] = df.loc[:,'text'].str.extract(pat=r'(\bBcc .+)')
df['subject'] = df.loc[:,'text'].str.extract(pat=r'(\bSubject: .+)')

Now I am trying to extract the rest of the body, which starts at Dear blabla. However, since every e-mail is different, I can't match on Dear blabla.

How can I match all the text except the first four matches I have already done?

Here is what I have tried:

df.loc[:,'text'].str.extract(pat=r'^(\bFrom .+|\bTo .+|\bCc .+|Bcc .+|\bSubject .+)')
df.loc[:,'text'].str.extract(pat=r'^[(\bFrom .+|\bTo .+|\bCc .+|Bcc .+|\bSubject .+)]')

What am I doing wrong?

You can use

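# regex=True is needed on pandas >= 2.0, where str.replace treats the pattern as a literal string by default
df['body'] = df['text'].str.replace(r'^(?:\n?(?:From|To|Cc|Subject)\s*:.*)+\s*', '', regex=True)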

See the regex demo.
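To check the pattern without a dataframe, here is a minimal sketch using only the standard `re` module on the sample e-mail from the question. The pattern is the one from the answer. The names `HEADER_BLOCK`, `strip_headers`, `sample` and `body` are made up for this illustration.

```python
import re

# Same pattern as in the answer; everything around it is illustrative.
HEADER_BLOCK = re.compile(r'^(?:\n?(?:From|To|Cc|Subject)\s*:.*)+\s*')

def strip_headers(raw: str) -> str:
    """Hypothetical helper: drop the leading header block, keep the rest."""
    # count=1 is enough because ^ (without re.MULTILINE) only matches at the start
    return HEADER_BLOCK.sub('', raw, count=1)

# The sample e-mail from the question, written out line by line
sample = (
    "From   : Vincent Adultman\n"
    "To     : Business Person, \n"
    "Cc     : \n"
    "Subject: On the subject of business Transactions\n"
    "\n"
    "Dear blabla,\n"
    "\n"
    "We would like to bla bla to improve our bla bla by X%.\n"
    "\n"
    "\n"
    "Thanks in advance"
)

body = strip_headers(sample)
print(body)  # the text starting at "Dear blabla,"
```

On a dataframe, something like `df['text'].map(strip_headers, na_action='ignore')` should give the same result as the `str.replace` call, assuming the column holds plain strings. Note that the pattern does not list Bcc, so if Bcc lines occur in your data you would probably want to add it to the alternation.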

Details

  • ^ - start of string
  • (?:\n?(?:From|To|Cc|Subject)\s*:.*)+ - one or more repetitions of:
    • \n? - an optional newline (line feed) char
    • (?:From|To|Cc|Subject) - either From, To, Cc, or Subject
    • \s*: - 0 or more whitespace chars and a : char
    • .* - any 0 or more chars other than line break chars, as many as possible
  • \s* - 0 or more whitespace chars.
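As for what goes wrong with the two attempts in the question, the sketch below reproduces them with plain `re` on a shortened sample (the variable names are made up for the illustration). The first attempt is an ordinary alternation, so it captures a header line rather than everything else: extracting returns what the pattern matches, never its complement. In the second attempt the square brackets turn the whole expression into a character class, so it matches a single character from that set.

```python
import re

text = "From   : Vincent Adultman\nTo     : Business Person\n\nDear blabla,"

# Attempt 1: anchored alternation. It matches the first header line itself,
# not the text that follows the headers.
attempt1 = re.search(r'^(\bFrom .+|\bTo .+|\bCc .+|Bcc .+|\bSubject .+)', text)
print(repr(attempt1.group(1)))

# Attempt 2: [...] is a character class, so the parentheses, pipes and letters
# inside it are just individual allowed characters. The match is one character.
attempt2 = re.search(r'^[(\bFrom .+|\bTo .+|\bCc .+|Bcc .+|\bSubject .+)]', text)
print(repr(attempt2.group(0)))
```

Since the second pattern no longer contains a real capture group, `str.extract` in pandas would, as far as I know, refuse it with an error about missing capture groups. Removing the header block with a replace, as above, sidesteps the problem of matching "everything except".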
