Multiline regex: How to extract text between dates in pandas dataframe?
I have a dataframe with a description column. Each row's description contains multiple lines of text; basically, these are the set of updates for each record.
Example: for info no. 1, at 07-01-2019 we got the update "The sky is blue", and at 05-22-2019 we got another update, "Apples are red". Each update sits between two dates. First, I would like to extract the text between the dates and split the details into new columns: date, name, and description.
The raw description looks like this:
info no| Description
--------------------------------------------------------------------------
1 |07-01-2019 12:59:41 - XYZ (Work notes) The sky is blue in color.
| Clouds are looking lovely.
| 05-22-2019 12:00:49 - MNX (Work notes) Apples are red in color.
--------------------------------------------------------------------------
| 02-26-2019 12:53:18 - ABC (Work notes) Task is to separate balls.
2 | 02-25-2019 16:57:57 - lMN (Work notes) He came by train.
| That train was 15 min late.
| He missed the concert.
| 02-25-2019 11:08:01 - sbc (Work notes) She is my grandmother.
The desired output is:
info No | DATE                | NAME | DESCRIPTION
--------|---------------------|------|----------------------------
1       | 07-01-2019 12:59:41 | XYZ  | The sky is blue in color.
        |                     |      | Clouds are looking lovely.
--------|---------------------|------|----------------------------
1       | 05-22-2019 12:00:49 | MNX  | Apples are red in color.
--------|---------------------|------|----------------------------
2       | 02-26-2019 12:53:18 | ABC  | Task is to separate balls.
--------|---------------------|------|----------------------------
2       | 02-25-2019 16:57:57 | lMN  | He came by train.
        |                     |      | That train was 15 min late.
        |                     |      | He missed the concert.
--------|---------------------|------|----------------------------
2       | 02-25-2019 11:08:01 | sbc  | She is my grandmother.
I tried:
import re
import pandas as pd

myDf = pd.DataFrame(re.split(r'(\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2} -.*)', Description), columns=['date'])
myDf['date'] = myDf['date'].replace(r'\(Work notes\)', '-', regex=True)
newQueue = myDf['date'].str.split('-', n=3)
Given this dataframe:
df
Description
Sl No
1 07-01-2019 12:59:41 - XYZ (Work notes) The sky...
2 05-22-2019 12:00:49 - MNX (Work notes) Apples...
3 02-26-2019 12:53:18 - ABC (Work notes) Task is...
4 02-25-2019 16:57:57 - lMN (Work notes) He came...
5 02-25-2019 11:08:01 - sbc (Work notes) She is ...
You can split the strings in the Description column at "(Work notes)" and then use values.tolist to spread the result into 2 columns, as follows:
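# Split each description at the literal marker into [date and name, text]
df['Description'] = df['Description'].apply(lambda s: s.split('(Work notes)'))
# Expand the two-element lists into two columns, keeping the original index
df = pd.DataFrame(df['Description'].values.tolist(), index=df.index)
print(df)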
0 1
Sl No
1 07-01-2019 12:59:41 - XYZ The sky is blue in color.
2 05-22-2019 12:00:49 - MNX Apples are red in color.
3 02-26-2019 12:53:18 - ABC Task is to separate balls.
4 02-25-2019 16:57:57 - lMN He came by train.
5 02-25-2019 11:08:01 - sbc She is my grandmother.
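The split above assumes each row already holds exactly one note. If instead each record keeps all of its notes in one multi-line string, as in the question, something along these lines might work. This is only a sketch under assumptions: the frame name `raw`, the column name `info_no`, and the pattern names `DATE_RE` and `NOTE_RE` are made up for illustration, and it assumes every note starts with a timestamp, " - ", a name, and the literal "(Work notes)", with lines separated by newlines.

```python
import re
import pandas as pd

# Hypothetical input: one row per record, all notes in one multi-line string.
raw = pd.DataFrame(
    {
        "info_no": [1, 2],
        "Description": [
            "07-01-2019 12:59:41 - XYZ (Work notes) The sky is blue in color.\n"
            "Clouds are looking lovely.\n"
            "05-22-2019 12:00:49 - MNX (Work notes) Apples are red in color.",
            "02-26-2019 12:53:18 - ABC (Work notes) Task is to separate balls.\n"
            "02-25-2019 16:57:57 - lMN (Work notes) He came by train.\n"
            "That train was 15 min late.\n"
            "He missed the concert.\n"
            "02-25-2019 11:08:01 - sbc (Work notes) She is my grandmother.",
        ],
    }
)

# Timestamp format used in the notes: MM-DD-YYYY HH:MM:SS
DATE_RE = r"\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}"

# One note = timestamp, " - ", name, "(Work notes)", then text that runs
# (lazily) until the next timestamp or the end of the string.
NOTE_RE = (
    rf"(?P<DATE>{DATE_RE}) - (?P<NAME>[^\n]+?) \(Work notes\)\s*"
    rf"(?P<DESCRIPTION>.*?)(?=\s*{DATE_RE} - |\s*\Z)"
)

notes = (
    raw.set_index("info_no")["Description"]
    # re.DOTALL lets "." cross line breaks, so multi-line notes stay together.
    .str.extractall(NOTE_RE, flags=re.DOTALL)
    .reset_index(level="match", drop=True)  # drop the per-record match counter
    .reset_index()                          # bring info_no back as a column
)

# Optional: parse the timestamps into real datetimes.
notes["DATE"] = pd.to_datetime(notes["DATE"], format="%m-%d-%Y %H:%M:%S")
print(notes)
```

The lookahead is what makes this "text between dates": the description stops just before the next timestamp, so continuation lines such as "Clouds are looking lovely." stay attached to the note they belong to. Because `str.extractall` returns one row per match, each record fans out into as many rows as it has notes.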