
Splitting a string ignoring case in Pandas

What I would like to do is something like:

df[col].str.split(my_regexp, re.IGNORECASE, expand=True)

However, the pandas Series.str.split method does not take a regexp flags argument.

Since I need to expand the results, I cannot do something like

df.apply(lambda x: re.split(my_regexp, x[col], flags=re.IGNORECASE), axis=1, result_type='expand')

because the lists don't have the same length.

What I need is a way either to make re.split return lists that all have the same length, or to pass re.IGNORECASE to the Series.str.split method. Or is there an even better way?

Thank you everyone!

Edit: Here is some data for a better explanation

series = pd.Series([
    'First paRt foo second part FOO third part',
    'test1 FoO test2', 
    'hi1 bar HI2',
    'This is a Test',
    'first baR second BAr third',
    'final'
])

With the regexp r'foo|bar', this should return:


    0               1               2
0   First paRt      second part     third part
1   test1           test2           None
2   hi1             HI2             None
3   This is a Test  None            None
4   first           second          third
5   final           None            None

Method 1: if lowercase / uppercase needs to be retained:

series.apply(lambda x: ', '.join(re.split(r'foo|bar', x, flags=re.IGNORECASE)))\
      .str.split(', ', expand=True)

Output

                0              1            2
0     First paRt    second part    third part
1          test1           test2         None
2            hi1             HI2         None
3  This is a Test           None         None
4          first         second         third
5           final           None         None
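
The join-and-split round trip in Method 1 relies on ', ' never appearing in the data. A minimal sketch of one way around that, assuming it is acceptable to build the frame directly from the lists that re.split returns (the names parts and result are only illustrative):

```python
import re
import pandas as pd

series = pd.Series([
    'First paRt foo second part FOO third part',
    'test1 FoO test2',
    'hi1 bar HI2',
    'This is a Test',
    'first baR second BAr third',
    'final',
])

# re.split accepts flags directly and keeps the original case of each piece.
parts = [re.split(r'foo|bar', s, flags=re.IGNORECASE) for s in series]

# The DataFrame constructor pads rows that are shorter than the longest one
# with missing values, so the lists do not need to have the same length.
result = pd.DataFrame(parts, index=series.index)
print(result)
```

The pieces still carry the spaces that surrounded each separator, exactly as in Method 1.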

Method 2: if lowercase / uppercase is not an issue:

As stated in the comments, convert your series to lowercase with str.lower() and then use str.split:

series.str.lower().str.split(r'foo|bar', expand=True)

Output

                0              1            2
0     first part    second part    third part
1          test1           test2         None
2            hi1             hi2         None
3  this is a test           None         None
4          first         second         third
5           final           None         None
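
If the original case has to survive but a single str.split call is still preferred, the flag can probably be moved into the pattern itself. This is a sketch, assuming the series has object dtype so that the pattern is handled by Python's re module, which understands the inline (?i) flag:

```python
import pandas as pd

# dtype=object is an assumption made here so that the split goes through
# Python's re module rather than another string backend.
series = pd.Series([
    'First paRt foo second part FOO third part',
    'test1 FoO test2',
    'hi1 bar HI2',
    'This is a Test',
    'first baR second BAr third',
    'final',
], dtype=object)

# A pattern longer than one character is treated as a regular expression,
# and (?i) at its start switches on case-insensitive matching.
result = series.str.split(r'(?i)foo|bar', expand=True)
print(result)
```

Recent pandas versions also document a compiled pattern as a valid pat argument, so re.compile(r'foo|bar', re.IGNORECASE) may work in the same place; check the documentation for the installed version.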

Method 3: removing the unnecessary whitespace as well:

series.str.lower().str.split(r'foo|bar', expand=True).apply(lambda x: x.str.strip())

Output

                0            1           2
0      first part  second part  third part
1           test1        test2        None
2             hi1          hi2        None
3  this is a test         None        None
4           first       second       third
5           final         None        None
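
Method 3 lowercases the text before trimming it. A sketch of a variant that keeps the case, assuming object dtype and assuming it is fine for the separator pattern to swallow the whitespace around each match:

```python
import pandas as pd

# dtype=object is an assumption so that Python's re module does the matching.
series = pd.Series([
    'First paRt foo second part FOO third part',
    'test1 FoO test2',
    'hi1 bar HI2',
    'This is a Test',
    'first baR second BAr third',
    'final',
], dtype=object)

# \s* on both sides makes the spaces part of the separator, so the pieces
# come out trimmed; (?i) makes the match case-insensitive.
# Whitespace at the very start or end of a string is not removed by this.
result = series.str.split(r'(?i)\s*(?:foo|bar)\s*', expand=True)
print(result)
```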
