
Multiline regex: How to extract text between dates in pandas dataframe?

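I have a DataFrame with a Description column. Each row's description holds several lines of text; in effect, every record carries a set of dated notes.

Example: for info 1, we got an update on 07-01-2019 because the sky is blue in color, and another update on 05-22-2019 because apples are red in color. Each note sits between two dates. First, I want to extract the text between the dates, then split the individual details into new columns: date, name, and description.

The original description looks like this: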

info no|           Description
--------------------------------------------------------------------------
1      |07-01-2019 12:59:41 - XYZ (Work notes) The sky is blue in color.
       |                                        Clouds are looking lovely.
       | 05-22-2019 12:00:49 - MNX  (Work notes) Apples are red in color.
--------------------------------------------------------------------------    
       |  02-26-2019 12:53:18 - ABC (Work notes) Task is to separate balls.
2      |  02-25-2019 16:57:57 - lMN (Work notes) He came by train.
       |                                         That train was 15 min late.
       |                                         He missed the concert.
       |  02-25-2019 11:08:01 - sbc (Work notes) She is my grandmother.

The expected output is:

info No | DATE                | NAME | DESCRIPTION
--------|---------------------|------|----------------------------
   1    | 07-01-2019 12:59:41 | XYZ  | The sky is blue in color.
        |                     |      | Clouds are looking lovely.
--------|---------------------|------|----------------------------
   1    | 05-22-2019 12:00:49 | MNX  | Apples are red in color.
--------|---------------------|------|----------------------------
   2    | 02-26-2019 12:53:18 | ABC  | Task is to separate balls.
--------|---------------------|------|----------------------------
   2    | 02-25-2019 16:57:57 | lMN  | He came by train.
        |                     |      | That train was 15 min late.
        |                     |      | He missed the concert.
--------|---------------------|------|----------------------------
   2    | 02-25-2019 11:08:01 | sbc  | She is my grandmother.

I tried:

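 import re
 import pandas as pd
 # Description is one multi-line string; split it at each timestamped header line
 myDf = pd.DataFrame(re.split(r'(\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2} -.*)', Description), columns=['date'])
 myDf['date'] = myDf['date'].str.replace('(Work notes)', '-', regex=False)
 newQueue = myDf['date'].str.split('-', n=3)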

Given this DataFrame:

df
                                             Description
Sl No                                                   
1      07-01-2019 12:59:41 - XYZ (Work notes) The sky...
2      05-22-2019 12:00:49 - MNX  (Work notes) Apples...
3      02-26-2019 12:53:18 - ABC (Work notes) Task is...
4      02-25-2019 16:57:57 - lMN (Work notes) He came...
5      02-25-2019 11:08:01 - sbc (Work notes) She is ...

You can split each string in the Description column on "(Work notes)", then use values.tolist to expand the result into 2 columns, as follows:

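# Split each description at the "(Work notes)" marker
parts = df['Description'].apply(lambda s: s.split('(Work notes)'))

# Expand the resulting two-item lists into two columns, keeping the index
x = pd.DataFrame(parts.values.tolist(), index=df.index)

print(x)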

                                 0                            1
Sl No                                                          
1       07-01-2019 12:59:41 - XYZ     The sky is blue in color.
2      05-22-2019 12:00:49 - MNX       Apples are red in color.
3       02-26-2019 12:53:18 - ABC    Task is to separate balls.
4       02-25-2019 16:57:57 - lMN             He came by train.
5       02-25-2019 11:08:01 - sbc        She is my grandmother.
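
The split above assumes each row already holds a single note. If the Description cells still contain several dated notes across multiple lines, as in the question, one possible approach is a regex with a lookahead that stops each note at the next timestamp. The following is only a sketch under stated assumptions: the column name info_no, the helper split_notes, and the pattern ENTRY are made up for illustration, every note is assumed to start with "MM-DD-YYYY HH:MM:SS - NAME (Work notes)", and line breaks inside a note are collapsed to single spaces.

```python
import re
import pandas as pd

# Hypothetical pattern: timestamp, " - ", name, "(Work notes)", then free text
# running up to the next timestamp header or the end of the string.
ENTRY = re.compile(
    r"(?P<DATE>\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2})\s*-\s*"
    r"(?P<NAME>\S+)\s*\(Work notes\)\s*"
    r"(?P<DESCRIPTION>.*?)"
    r"(?=\s*\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}\s*-|\s*\Z)",
    re.DOTALL,  # let "." cross line breaks so multi-line notes stay together
)

def split_notes(text):
    """Return one dict (DATE, NAME, DESCRIPTION) per dated note in text."""
    rows = []
    for match in ENTRY.finditer(text):
        row = match.groupdict()
        # Collapse line breaks and indentation inside a multi-line note.
        row["DESCRIPTION"] = " ".join(row["DESCRIPTION"].split())
        rows.append(row)
    return rows

# Sample data copied from the question; the column names are assumptions.
raw = pd.DataFrame({
    "info_no": [1, 2],
    "Description": [
        "07-01-2019 12:59:41 - XYZ (Work notes) The sky is blue in color.\n"
        "      Clouds are looking lovely.\n"
        " 05-22-2019 12:00:49 - MNX  (Work notes) Apples are red in color.",
        "  02-26-2019 12:53:18 - ABC (Work notes) Task is to separate balls.\n"
        "  02-25-2019 16:57:57 - lMN (Work notes) He came by train.\n"
        "      That train was 15 min late.\n"
        "      He missed the concert.\n"
        "  02-25-2019 11:08:01 - sbc (Work notes) She is my grandmother.",
    ],
})

# One output row per note, repeating the record number for each of its notes.
records = [
    {"info_no": info_no, **entry}
    for info_no, text in zip(raw["info_no"], raw["Description"])
    for entry in split_notes(text)
]
result = pd.DataFrame(records, columns=["info_no", "DATE", "NAME", "DESCRIPTION"])
print(result)
```

With the sample data this gives five rows, one per timestamp. A note whose text itself contains something shaped like a timestamp followed by a dash would be cut short, so the pattern may need tightening for real data.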
