
Multiline regex: How to extract text between dates in pandas dataframe?

I have a dataframe with a Description column. Each cell in that column holds several lines of text, which together make up the set of notes for one record.

Example: for info no 1, the update at 07-01-2019 says the sky is blue, and the update at 05-22-2019 says apples are red; each update's text sits between two dates. I would like to extract the text between consecutive dates and split each update into new columns: date, name, description.

The raw description looks like

info no|           Description
--------------------------------------------------------------------------
1      |07-01-2019 12:59:41 - XYZ (Work notes) The sky is blue in color.
       |                                        Clouds are looking lovely.
       | 05-22-2019 12:00:49 - MNX  (Work notes) Apples are red in color.
--------------------------------------------------------------------------    
       |  02-26-2019 12:53:18 - ABC (Work notes) Task is to separate balls.
2      |  02-25-2019 16:57:57 - lMN (Work notes) He came by train.
       |                                         That train was 15 min late.
       |                                         He missed the concert.
       |  02-25-2019 11:08:01 - sbc (Work notes) She is my grandmother.

Desired output is

info no | DATE                | NAME | DESCRIPTION
--------|---------------------|------|----------------------------
   1    | 07-01-2019 12:59:41 | XYZ  | The sky is blue in color.
        |                     |      | Clouds are looking lovely.
--------|---------------------|------|----------------------------
   1    | 05-22-2019 12:00:49 | MNX  | Apples are red in color.
--------|---------------------|------|----------------------------
   2    | 02-26-2019 12:53:18 | ABC  | Task is to separate balls.
--------|---------------------|------|----------------------------
   2    | 02-25-2019 16:57:57 | lMN  | He came by train.
        |                     |      | That train was 15 min late.
        |                     |      | He missed the concert.
--------|---------------------|------|----------------------------
   2    | 02-25-2019 11:08:01 | sbc  | She is my grandmother.

I tried:

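 import re
 import pandas as pd

 # Description is the multi-line text of one cell; the capture group keeps each "date - ..." header in the split result
 myDf = pd.DataFrame(re.split(r'(\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2} -.*)', Description), columns=['date'])
 myDf['date'] = myDf['date'].replace(r'\(Work notes\)', '-', regex=True)
 newQueue = myDf['date'].str.split('-', n=3)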

Given a dataframe that already has one note per row, such as

df
                                             Description
Sl No                                                   
1      07-01-2019 12:59:41 - XYZ (Work notes) The sky...
2      05-22-2019 12:00:49 - MNX  (Work notes) Apples...
3      02-26-2019 12:53:18 - ABC (Work notes) Task is...
4      02-25-2019 16:57:57 - lMN (Work notes) He came...
5      02-25-2019 11:08:01 - sbc (Work notes) She is ...

you can split each string in the Description column at "(Work notes)" and then use values.tolist() to spread the two parts into separate columns, as follows:

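# Split each note into [header, text] at the "(Work notes)" marker
df['Description'] = df['Description'].apply(lambda s: s.split('(Work notes)'))

# Expand the two-element lists into columns 0 and 1, keeping the original index
x = pd.DataFrame(df['Description'].values.tolist(), index=df.index)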

print(x)

                                 0                            1
Sl No                                                          
1       07-01-2019 12:59:41 - XYZ     The sky is blue in color.
2      05-22-2019 12:00:49 - MNX       Apples are red in color.
3       02-26-2019 12:53:18 - ABC    Task is to separate balls.
4       02-25-2019 16:57:57 - lMN             He came by train.
5       02-25-2019 11:08:01 - sbc        She is my grandmother.
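
The split above assumes one note per row and leaves the date and name together in column 0. If the cells still hold several notes each, as in the raw description from the question, one possible approach is a single regex with a lookahead that stops each description at the next timestamp. The sketch below is only an illustration under assumptions: the sample frame, the column name info_no, and the helper split_notes are made up for the example, and it assumes every note starts with "MM-DD-YYYY HH:MM:SS - NAME (Work notes)".

```python
import re
import pandas as pd

# Hypothetical sample: one multi-line Description cell per "info no".
raw = pd.DataFrame({
    "info_no": [1, 2],
    "Description": [
        "07-01-2019 12:59:41 - XYZ (Work notes) The sky is blue in color.\n"
        "    Clouds are looking lovely.\n"
        " 05-22-2019 12:00:49 - MNX  (Work notes) Apples are red in color.",
        "02-26-2019 12:53:18 - ABC (Work notes) Task is to separate balls.\n"
        "02-25-2019 16:57:57 - lMN (Work notes) He came by train.\n"
        "    That train was 15 min late.\n"
        "    He missed the concert.\n"
        "02-25-2019 11:08:01 - sbc (Work notes) She is my grandmother.",
    ],
})

STAMP = r"\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}"
# DOTALL lets the lazy description group run across newlines; the lookahead
# ends it just before the next "timestamp -" header or at the end of the cell.
ENTRY = re.compile(
    rf"(?P<date>{STAMP})\s*-\s*(?P<name>\S+)\s*\(Work notes\)\s*"
    rf"(?P<description>.*?)(?=\s*{STAMP}\s*-|\Z)",
    flags=re.DOTALL,
)

def split_notes(frame):
    """Return one row per note with info_no, date, name, description."""
    rows = []
    for info_no, text in zip(frame["info_no"], frame["Description"]):
        for match in ENTRY.finditer(text):
            entry = match.groupdict()
            # Drop the indentation around continuation lines, keep the line breaks.
            entry["description"] = re.sub(
                r"\s*\n\s*", "\n", entry["description"].strip()
            )
            rows.append({"info_no": info_no, **entry})
    out = pd.DataFrame(rows, columns=["info_no", "date", "name", "description"])
    out["date"] = pd.to_datetime(out["date"], format="%m-%d-%Y %H:%M:%S")
    return out

result = split_notes(raw)
print(result)
```

Multi-line descriptions stay in a single cell with embedded newlines, which matches the desired output in the question. If the real data uses a different marker than "(Work notes)" or names containing spaces, the pattern would need adjusting.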


 