Multiline regex: How to extract text between dates in pandas dataframe?
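I have a dataframe with a Description column. Each Description cell holds several lines of text, which together form the set of updates for one record.
Example: for info 1, on 07-01-2019 we got an update that the sky is blue, and on 05-22-2019 we got another update that apples are red; each update's text sits between two dates. I want to extract the text between the dates first, then split the details into new columns: date, name, description.
The original description looks like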
info no| Description
--------------------------------------------------------------------------
1 |07-01-2019 12:59:41 - XYZ (Work notes) The sky is blue in color.
| Clouds are looking lovely.
| 05-22-2019 12:00:49 - MNX (Work notes) Apples are red in color.
--------------------------------------------------------------------------
| 02-26-2019 12:53:18 - ABC (Work notes) Task is to separate balls.
2 | 02-25-2019 16:57:57 - lMN (Work notes) He came by train.
| That train was 15 min late.
| He missed the concert.
| 02-25-2019 11:08:01 - sbc (Work notes) She is my grandmother.
The expected output is
info No |DATE | NAME | DESCRIPTION
--------|------------------------------------------------------
1 |07-01-2019 12:59:41 | xyz | The sky is blue in color.
| | | Clouds are looking lovely.
--------|---------------------------------------------------------
1 |05-22-2019 12:00:49 | MNX | Apples are red in color
--------|---------------------------------------------------------
2 | 02-26-2019 12:53:18 | ABC | Task is to separate blue balls.
--------|---------------------------------------------------------
2 | 02-25-2019 16:57:57 | IMN | He came by train
| | | That train was 15 min late.
| | | He missed the concert.
--------|---------------------------------------------------------
| 02-25-2019 11:08:01 | sbc | She is my grandmother.
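I tried:
import re
import pandas as pd
# Description is assumed to hold the raw multi-line text of one record
myDf = pd.DataFrame(re.split(r'(\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2} -.*)', Description), columns=['date'])
# Escape the parentheses so they match literally instead of forming a regex group
myDf['date'] = myDf['date'].replace(r'\(Work notes\)', '-', regex=True)
newQueue = myDf.date.str.split('-', n=3)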
Given this dataframe
df
Description
Sl No
1 07-01-2019 12:59:41 - XYZ (Work notes) The sky...
2 05-22-2019 12:00:49 - MNX (Work notes) Apples...
3 02-26-2019 12:53:18 - ABC (Work notes) Task is...
4 02-25-2019 16:57:57 - lMN (Work notes) He came...
5 02-25-2019 11:08:01 - sbc (Work notes) She is ...
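You can split the strings in the Description column on '(Work notes)', then use values.tolist to spread the two parts into 2 columns, as follows:
# Split each string into [date and name, description]
df['Description'] = df['Description'].apply(lambda s: s.split('(Work notes)'))
# Expand the two-element lists into columns 0 and 1, keeping the original index
df = pd.DataFrame(df['Description'].values.tolist(), index=df.index)
print(df)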
0 1
Sl No
1 07-01-2019 12:59:41 - XYZ The sky is blue in color.
2 05-22-2019 12:00:49 - MNX Apples are red in color.
3 02-26-2019 12:53:18 - ABC Task is to separate balls.
4 02-25-2019 16:57:57 - lMN He came by train.
5 02-25-2019 11:08:01 - sbc She is my grandmother.
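The split above assumes each row already holds exactly one note. For the multi-line cells in the question, one possible approach is `Series.str.extractall` with a pattern that captures everything up to the next timestamp. The following is only a minimal sketch: the sample frame `raw`, the index name `info_no`, and the names `STAMP` and `pattern` are assumptions made for illustration, and the pattern assumes every note starts with `MM-DD-YYYY HH:MM:SS - NAME (Work notes)`.

```python
import re
import pandas as pd

# Hypothetical sample mirroring the question: one multi-line Description per record.
raw = pd.DataFrame(
    {
        "Description": [
            "07-01-2019 12:59:41 - XYZ (Work notes) The sky is blue in color.\n"
            "Clouds are looking lovely.\n"
            "05-22-2019 12:00:49 - MNX (Work notes) Apples are red in color.",
            "02-26-2019 12:53:18 - ABC (Work notes) Task is to separate balls.\n"
            "02-25-2019 16:57:57 - lMN (Work notes) He came by train.\n"
            "That train was 15 min late.\n"
            "He missed the concert.\n"
            "02-25-2019 11:08:01 - sbc (Work notes) She is my grandmother.",
        ]
    },
    index=pd.Index([1, 2], name="info_no"),
)

# Timestamp that opens every note.
STAMP = r"\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}"

# DESCRIPTION is matched lazily and stops right before the next
# timestamp (or at the end of the cell), so it may span several lines.
pattern = (
    rf"(?P<DATE>{STAMP})\s*-\s*(?P<NAME>\S+)\s*\(Work notes\)\s*"
    rf"(?P<DESCRIPTION>.*?)(?=\s*{STAMP}\s*-|\s*\Z)"
)

# re.DOTALL lets "." cross newlines; extractall returns one row per note,
# indexed by (info_no, match).
out = (
    raw["Description"]
    .str.extractall(pattern, flags=re.DOTALL)
    .reset_index(level="match", drop=True)
    .reset_index()
)
print(out)
```

Each note becomes its own row with `info_no`, `DATE`, `NAME` and `DESCRIPTION` columns, and continuation lines stay inside `DESCRIPTION`, separated by newlines. If real timestamps are needed, `pd.to_datetime(out["DATE"], format="%m-%d-%Y %H:%M:%S")` should convert the `DATE` column.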