Extract everything except the regex matches

Question

I have a dataframe where each row contains an e-mail's raw text. I need to clean up the data to extract the following columns: From, To, CC, Subject, and the body of the text. The e-mails typically look like this:

From   : Vincent Adultman
To     : Business Person, 
Cc     : 
Subject: On the subject of business Transactions

Dear blabla,

We would like to bla bla to improve our bla bla by X%.


Thanks in advance

I was able to extract the first four columns using the following regular expressions:

import pandas as pd
df = pd.DataFrame(data=data,columns=['text'],dtype='string')

df['from'] = df.loc[:,'text'].str.extract(pat=r'(\bFrom .+)')
df['to'] = df.loc[:,'text'].str.extract(pat=r'(\bTo .+)')
df['cc'] = df.loc[:,'text'].str.extract(pat=r'(\bCc .+)')
df['bcc'] = df.loc[:,'text'].str.extract(pat=r'(\bBcc .+)')
df['subject'] = df.loc[:,'text'].str.extract(pat=r'(\bSubject: .+)')

Now I am trying to extract the rest of the body, which starts at Dear blabla. However, since every e-mail is different, I can't match on Dear blabla.

How can I match all of the text except the first four matches I have already made?

Here is what I have tried:

df.loc[:,'text'].str.extract(pat=r'^(\bFrom .+|\bTo .+|\bCc .+|Bcc .+|\bSubject .+)')
df.loc[:,'text'].str.extract(pat=r'^[(\bFrom .+|\bTo .+|\bCc .+|Bcc .+|\bSubject .+)]')

What am I doing wrong?

Answer 1

You can use

df['body'] = df['text'].str.replace(r'^(?:\n?(?:From|To|Cc|Subject)\s*:.*)+\s*', '', regex=True)

See the regex demo.
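As a rough sketch of how the replacement behaves on a small dataframe, the snippet below applies the pattern to two made-up e-mails. The sample texts, the column names body and body_ext, and the extended pattern with Bcc added are assumptions for illustration and are not part of the original answer. Note that pandas 2.0 and later treat the pattern as a literal string unless regex=True is passed.

```python
import pandas as pd

# Two made-up e-mails: one shaped like the question's sample,
# one with a hypothetical Bcc line between To and Subject.
plain = (
    "From   : Vincent Adultman\n"
    "To     : Business Person, \n"
    "Cc     : \n"
    "Subject: On the subject of business Transactions\n"
    "\n"
    "Dear blabla,\n"
    "\n"
    "Thanks in advance"
)
with_bcc = (
    "From   : Vincent Adultman\n"
    "To     : Business Person, \n"
    "Bcc    : Someone Else\n"
    "Subject: Quarterly numbers\n"
    "\n"
    "Hello,\n"
    "\n"
    "Regards"
)
df = pd.DataFrame({"text": [plain, with_bcc]}, dtype="string")

# Pattern from the answer, unchanged.
answer_pattern = r'^(?:\n?(?:From|To|Cc|Subject)\s*:.*)+\s*'
# Assumed extension (not in the answer): also treat Bcc as a header.
extended_pattern = r'^(?:\n?(?:From|To|Cc|Bcc|Subject)\s*:.*)+\s*'

# regex=True is required on pandas >= 2.0, where the default is a literal match.
df["body"] = df["text"].str.replace(answer_pattern, "", regex=True)
df["body_ext"] = df["text"].str.replace(extended_pattern, "", regex=True)

for body in df["body_ext"]:
    print(repr(body))
```

With the answer's pattern, the header run stops at the first line whose name is not in the alternation, so the second e-mail's body column still begins with its Bcc line. Adding Bcc to the alternation is one possible way to cover that case.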

Details

^ - start of string
(?:\n?(?:From|To|Cc|Subject)\s*:.*)+ - one or more repetitions of
- \n? - an optional newline (line feed) char
- (?:From|To|Cc|Subject) - either From, To, Cc, or Subject
- \s*: - 0 or more whitespace chars and a : char
- .* - any 0 or more chars other than line break chars, as many as possible
\s* - 0 or more whitespace chars.
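To see which part of an e-mail the pattern consumes, here is a minimal sketch using Python's built-in re module rather than pandas. The sample text is modelled on the e-mail in the question, and the names HEADER_BLOCK, header_block, and body are chosen here for illustration only.

```python
import re

# The pattern from the answer, unchanged.
HEADER_BLOCK = re.compile(r'^(?:\n?(?:From|To|Cc|Subject)\s*:.*)+\s*')

# Sample e-mail modelled on the one in the question.
email = (
    "From   : Vincent Adultman\n"
    "To     : Business Person, \n"
    "Cc     : \n"
    "Subject: On the subject of business Transactions\n"
    "\n"
    "Dear blabla,\n"
    "\n"
    "We would like to bla bla to improve our bla bla by X%.\n"
    "\n"
    "\n"
    "Thanks in advance\n"
)

match = HEADER_BLOCK.match(email)
header_block = match.group(0)  # the header lines plus the whitespace after them
body = email[match.end():]     # whatever is left once the match is removed

print(repr(header_block))
print(body)
```

The repeated group consumes one header line per repetition, and the trailing \s* swallows the blank line that separates the headers from the body, so the remainder starts directly at the greeting. Because ^ is used without re.MULTILINE, the pattern can only match at the very start of the text.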


Solution 1
1 Accepted 2020-11-05 12:44:58
