How to separate sentences in a dataframe based on the last occurrence of a lowercase letter followed by a capital one

I have a dataframe containing sentences. The first sentence (the title) is followed by the text. They were merged without a space.

I would like to split the text into two parts (sentence 1 and sentence 2) at the last occurrence of a capital letter that directly follows a lowercase letter, with no space in between. Out of curiosity, I would also be interested in a solution based on the first occurrence.

The solution is supposed to be stored in the original dataframe.

I tried

re.findall('(?<!\s)[A-ZÄÖÜ](?:[a-zäöüß\s]|(?<=\s)[A-ZÄÖÜ])*')

but could not work it out.

import pandas
from pandas import DataFrame

Sentences = {'Sentence': ['RnB music all nightI love going out','Example sentence with no meaningThe space is missing.','Third exampleAlso numbers 1.23 and signs -. should appear in column 2.', 'BestMusic tonightAt 12:00.']}

df = DataFrame(Sentences,columns= ['Sentence'])

print(df)

Since the split is supposed to be carried out at the last occurrence, the words RnB and BestMusic in the example are not supposed to trigger it.

df['Sentence1'] = ['RnB music all night','Example sentence with no meaning','Third example', 'BestMusic tonight']

df['Sentence2'] = ['I love going out','The space is missing.', 'Also numbers 1.23 and signs -. should appear in column 2.' ,'At 12:00.']

Here is one way, using str.split with a capture group so that the matched title is kept in the result:

Yourdf=df.Sentence.str.split(r'(.*[a-z])(?=[A-Z])',n=-1,expand=True)[[1,2]]
Yourdf
Out[610]: 
                                  1                                                  2
0               RnB music all night                                   I love going out
1  Example sentence with no meaning                              The space is missing.
2                     Third example  Also numbers 1.23 and signs -. should appear i...
3                 BestMusic tonight                                          At 12:00.
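
The question also asks for the result to be stored in the original dataframe, and the attempted pattern suggests that German umlauts may occur. The following is a minimal sketch of that, not part of the answer above: it assumes that str.extract with named groups is an acceptable substitute for str.split, and that äöüß / ÄÖÜ are the only non-ASCII letters that matter. The column names Sentence1 and Sentence2 are taken from the expected output in the question.

```python
import pandas as pd

df = pd.DataFrame({'Sentence': [
    'RnB music all nightI love going out',
    'Example sentence with no meaningThe space is missing.',
    'Third exampleAlso numbers 1.23 and signs -. should appear in column 2.',
    'BestMusic tonightAt 12:00.',
]})

# The greedy `.*` consumes as much as possible, so the split lands on the
# LAST lowercase letter that is directly followed by a capital letter.
# Assumption: umlauts and ß are added to the character classes by hand.
last_split = r'^(?P<Sentence1>.*[a-zäöüß])(?P<Sentence2>[A-ZÄÖÜ].*)$'

# str.extract returns one column per named group; assign them back to df.
df[['Sentence1', 'Sentence2']] = df['Sentence'].str.extract(last_split)
print(df[['Sentence1', 'Sentence2']])
```

Rows that contain no lowercase-then-capital boundary would not match and would get NaN in both new columns.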

This only works if A-Z covers all of your capital letters:

pattern = r'(?P<Sentence1>.*)(?P<Sentence2>[A-Z].*)$'
df['Sentence'].str.extract(pattern)

gives:

    Sentence1                           Sentence2
0   RnB music all night                 I love going out
1   Example sentence with no meaning    The space is missing.
2   Third example                       Also numbers 1.23 and signs -. should appear i...
3   BestMusic tonight                   At 12:00.
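
For the side question about splitting at the first occurrence instead, a possible sketch is to make the leading group non-greedy and require a lowercase letter right before the capital. This is an illustration under assumptions rather than a tested part of either answer: the group names First1 and First2 are hypothetical, and the umlaut ranges are borrowed from the pattern attempted in the question.

```python
import pandas as pd

df = pd.DataFrame({'Sentence': [
    'RnB music all nightI love going out',
    'Example sentence with no meaningThe space is missing.',
    'Third exampleAlso numbers 1.23 and signs -. should appear in column 2.',
    'BestMusic tonightAt 12:00.',
]})

# The non-greedy `.*?` consumes as little as possible, so the split lands on
# the FIRST lowercase letter that is directly followed by a capital letter.
# Hypothetical group names; they only serve as column labels here.
first_split = r'^(?P<First1>.*?[a-zäöüß])(?P<First2>[A-ZÄÖÜ].*)$'

first_parts = df['Sentence'].str.extract(first_split)
print(first_parts)
```

With the sample data this variant splits inside RnB and BestMusic, which is exactly why the question asks for the last occurrence.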

 