Extract everything but a regex match
I have a dataframe where each row contains an e-mail's raw text. I need to clean up the data to extract the following columns: From, To, CC, Subject, and the body of the text.
The e-mails typically look like this:
From : Vincent Adultman
To : Business Person,
Cc :
Subject: On the subject of business Transactions
Dear blabla,
We would like to bla bla to improve our bla bla by X%.
Thanks in advance
I was able to extract the first four columns using the following regular expressions:
import pandas as pd
df = pd.DataFrame(data=data,columns=['text'],dtype='string')
df['from'] = df.loc[:,'text'].str.extract(pat=r'(\bFrom .+)')
df['to'] = df.loc[:,'text'].str.extract(pat=r'(\bTo .+)')
df['cc'] = df.loc[:,'text'].str.extract(pat=r'(\bCc .+)')
df['bcc'] = df.loc[:,'text'].str.extract(pat=r'(\bBcc .+)')
df['subject'] = df.loc[:,'text'].str.extract(pat=r'(\bSubject: .+)')
Now I am trying to extract the rest of the body, which starts at "Dear blabla". However, since every e-mail is different, I can't match on "Dear blabla".
How can I match all the text except the first four matches I have already done?
Here is what I have tried:
df.loc[:,'text'].str.extract(pat=r'^(\bFrom .+|\bTo .+|\bCc .+|Bcc .+|\bSubject .+)')
df.loc[:,'text'].str.extract(pat=r'^[(\bFrom .+|\bTo .+|\bCc .+|Bcc .+|\bSubject .+)]')
What am I doing wrong?
You can use
df['body'] = df['text'].str.replace(r'^(?:\n?(?:From|To|Cc|Subject)\s*:.*)+\s*', '', regex=True)
See the regex demo.
Details
^ - start of string
(?:\n?(?:From|To|Cc|Subject)\s*:.*)+ - one or more repetitions of
    \n? - an optional newline (line feed) char
    (?:From|To|Cc|Subject) - either From, To, Cc, or Subject
    \s*: - 0 or more whitespace chars and a : char
    .* - any 0 or more chars other than line break chars, as many as possible
\s* - 0 or more whitespace chars.
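To see the pattern at work outside of pandas, here is a minimal sketch using only the standard `re` module. It assumes the e-mail text is newline-separated as in the question's sample; the names `HEADER_RE`, `sample`, and `strip_headers` are made up for illustration and are not part of the original answer.

```python
import re

# Same pattern as above: one or more header lines anchored at the very
# start of the string, followed by any trailing whitespace.
HEADER_RE = re.compile(r'^(?:\n?(?:From|To|Cc|Subject)\s*:.*)+\s*')

# Sample e-mail modeled on the one in the question
# (assumption: lines are separated by "\n").
sample = (
    "From : Vincent Adultman\n"
    "To : Business Person,\n"
    "Cc :\n"
    "Subject: On the subject of business Transactions\n"
    "Dear blabla,\n"
    "We would like to bla bla to improve our bla bla by X%.\n"
    "Thanks in advance"
)


def strip_headers(text):
    """Return text with the leading header block removed (hypothetical helper)."""
    return HEADER_RE.sub('', text, count=1)


body = strip_headers(sample)
print(body)
```

Because the pattern is anchored with ^ and the header group must match at least once, text that does not begin with one of the listed headers is returned unchanged. The question also extracts a Bcc column; if Bcc lines occur in the data, adding Bcc to the alternation would presumably be needed, though the original answer does not include it.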