Multiline regex: How to extract text between dates in pandas dataframe?

I have a dataframe with a Description column. Each row's description holds multiple lines of text, which together form the set of updates for that record.

Example: for info no 1, at 07-01-2019 we got an update that the sky is blue, and at 05-22-2019 we got another update that apples are red. Each update sits between two dates. I would like to extract the text between the dates and split the details into new columns: date, name, description.

The raw description looks like:

info no|           Description
--------------------------------------------------------------------------
1      |07-01-2019 12:59:41 - XYZ (Work notes) The sky is blue in color.
       |                                        Clouds are looking lovely.
       | 05-22-2019 12:00:49 - MNX  (Work notes) Apples are red in color.
--------------------------------------------------------------------------    
       |  02-26-2019 12:53:18 - ABC (Work notes) Task is to separate balls.
2      |  02-25-2019 16:57:57 - lMN (Work notes) He came by train.
       |                                         That train was 15 min late.
       |                                         He missed the concert.
       |  02-25-2019 11:08:01 - sbc (Work notes) She is my grandmother.

Desired output is:

info No |DATE                   |  NAME |   DESCRIPTION
--------|------------------------------------------------------
   1    |07-01-2019 12:59:41    |   xyz  |  The sky is blue in color.
        |                       |        |  Clouds are looking lovely.
--------|---------------------------------------------------------
   1    |05-22-2019 12:00:49    |   MNX  |  Apples are red in color                     
--------|---------------------------------------------------------
   2    | 02-26-2019 12:53:18   |   ABC  |  Task is to separate blue balls.
--------|---------------------------------------------------------
   2    |  02-25-2019 16:57:57  |   IMN   |  He came by train
        |                       |         |  That train was 15 min late.
        |                       |         |  He missed the concert.
--------|---------------------------------------------------------
        |  02-25-2019 11:08:01  |   sbc   | She is my grandmother.

I tried:

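 import re
 import pandas as pd
 # Description is assumed to be one multi-line string holding all notes
 myDf = pd.DataFrame(re.split(r'(\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2} -.*)', Description), columns=['date'])
 myDf['date'] = myDf['date'].replace(r'\(Work notes\)', '-', regex=True)
 newQueue = myDf['date'].str.split('-', n=3)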

Having this dataframe:

df
                                             Description
Sl No                                                   
1      07-01-2019 12:59:41 - XYZ (Work notes) The sky...
2      05-22-2019 12:00:49 - MNX  (Work notes) Apples...
3      02-26-2019 12:53:18 - ABC (Work notes) Task is...
4      02-25-2019 16:57:57 - lMN (Work notes) He came...
5      02-25-2019 11:08:01 - sbc (Work notes) She is ...

you can split the strings in the Description column at "(Work notes)" and then use values.tolist to spread the result into 2 columns, as follows:

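# Split each string at the "(Work notes)" marker into a two-item list
df['Description'] = df['Description'].apply(lambda s: s.split('(Work notes)'))

# Expand the lists into two columns, keeping the original index
df = pd.DataFrame(df['Description'].values.tolist(), index=df.index)

print(df)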

                                 0                            1
Sl No                                                          
1       07-01-2019 12:59:41 - XYZ     The sky is blue in color.
2      05-22-2019 12:00:49 - MNX       Apples are red in color.
3       02-26-2019 12:53:18 - ABC    Task is to separate balls.
4       02-25-2019 16:57:57 - lMN             He came by train.
5       02-25-2019 11:08:01 - sbc        She is my grandmother.
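
The split above assumes one note per row. If each record still holds several notes in a single multi-line string, as in the question, a regex with named groups and a lookahead can pull out the date, name, and description in one pass. The following is only a sketch under assumptions: the frame `raw`, the index name `info_no`, and the names `TIMESTAMP`, `pattern`, and `notes` are made up for illustration, the notes are assumed to be newline-separated, and the dates are assumed to be month-day-year.

```python
import re

import pandas as pd

# Hypothetical reconstruction of the raw data: one multi-line string per record.
raw = pd.DataFrame(
    {
        "info_no": [1, 2],
        "Description": [
            "07-01-2019 12:59:41 - XYZ (Work notes) The sky is blue in color.\n"
            "Clouds are looking lovely.\n"
            "05-22-2019 12:00:49 - MNX  (Work notes) Apples are red in color.",
            "02-26-2019 12:53:18 - ABC (Work notes) Task is to separate balls.\n"
            "02-25-2019 16:57:57 - lMN (Work notes) He came by train.\n"
            "That train was 15 min late.\n"
            "He missed the concert.\n"
            "02-25-2019 11:08:01 - sbc (Work notes) She is my grandmother.",
        ],
    }
).set_index("info_no")

# A note starts with a timestamp; its text runs until the next timestamp
# (or the end of the string), which the lookahead checks without consuming.
TIMESTAMP = r"\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}"
pattern = (
    rf"(?P<DATE>{TIMESTAMP})\s*-\s*"
    r"(?P<NAME>\S+)\s*\(Work notes\)\s*"
    rf"(?P<DESCRIPTION>.*?)(?=\s*{TIMESTAMP}|\Z)"
)

# re.DOTALL lets "." cross line breaks so multi-line descriptions stay together.
notes = (
    raw["Description"]
    .str.extractall(pattern, flags=re.DOTALL)
    .reset_index(level="match", drop=True)  # drop the per-record match counter
    .reset_index()                          # turn info_no back into a column
)
notes["DESCRIPTION"] = notes["DESCRIPTION"].str.strip()

# Assumption: dates are month-day-year, as 05-22-2019 suggests.
notes["DATE"] = pd.to_datetime(notes["DATE"], format="%m-%d-%Y %H:%M:%S")

print(notes)
```

Each note becomes its own row with its `info_no` repeated, and continuation lines stay inside `DESCRIPTION` with their line breaks. If the name can contain spaces, `\S+` would need to be loosened, for example to `.+?`.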
