
Parse pandas df column with regex extracting substrings

I have a pandas df containing a column composed of text like:

String1::some_text::some_text;String2::some_text::;String3::some_text::some_text;String4::some_text::some_text

I can see that:

  1. The start of the text always contains the first string I want to extract
  2. The rest of the strings are in between "::" and ";"

I want to create a new column containing:

String1, String2, String3, String4

All separated by commas, but still in the same column.

How should I approach this problem?

Thanks for your help

Try this:

In [136]: df.txt.str.findall(r'String\d+').str.join(', ')
Out[136]:
0    String1, String2, String3, String4
Name: txt, dtype: object

Data:

In [137]: df
Out[137]:
                                                                                                   txt
0  String1::some_text::some_text;String2::some_text::;String3::some_text::some_text;String4::some_t...

Setup:

df = pd.DataFrame({'txt': ['String1::some_text::some_text;String2::some_text::;String3::some_text::some_text;String4::some_text::some_text']})
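
If the real labels do not literally start with `String`, the pattern above will miss them. Here is a minimal sketch of a more general pattern, assuming each wanted label sits at the start of the text or right after a `;` and is followed by `::`. The column name `labels` is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'txt': ['String1::some_text::some_text;String2::some_text::;'
                           'String3::some_text::some_text;String4::some_text::some_text']})

# (?:^|;)   anchor at the start of the text or just after a ';'
# ([^:;]+)  capture the label: one or more characters that are not ':' or ';'
# ::        the label must be followed by '::'
pattern = r'(?:^|;)([^:;]+)::'

# With a single capture group, findall returns only the captured labels.
# 'labels' is a hypothetical column name.
df['labels'] = df['txt'].str.findall(pattern).str.join(', ')
print(df['labels'].iloc[0])
```

Fields in the middle of a segment are preceded by `::` rather than `;`, so they should not be picked up.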

Consider the dataframe df with a column txt:

df = pd.DataFrame(['String1::some_text::some_text;String2::some_text::;String3::some_text::some_text;String4::some_text::some_text'] * 10,
                  columns=['txt'])
df


Use a combination of str.split and groupby:

df.txt.str.split(';', expand=True).stack() \
      .str.split('::').str[0].groupby(level=0).apply(list)

0    [String1, String2, String3, String4]
1    [String1, String2, String3, String4]
2    [String1, String2, String3, String4]
3    [String1, String2, String3, String4]
4    [String1, String2, String3, String4]
5    [String1, String2, String3, String4]
6    [String1, String2, String3, String4]
7    [String1, String2, String3, String4]
8    [String1, String2, String3, String4]
9    [String1, String2, String3, String4]
dtype: object
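
The output above holds Python lists, while the question asks for one comma-separated string per row. A possible variant is sketched below. It swaps `expand=True` plus `stack` for `explode` (available since pandas 0.25) and joins each group instead of collecting a list. The sample data and the column name `new_col` are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({'txt': ['String1::a::b;String2::c::;String3::d::e',
                           'String4::f::g;String5::h::i']})

joined = (
    df['txt']
    .str.split(';')           # one list of segments per row
    .explode()                # one segment per row, original index repeated
    .str.split('::').str[0]   # keep the text before the first '::'
    .groupby(level=0)         # regroup by the original row index
    .agg(', '.join)           # join each group's labels into one string
)

# Assignment aligns on the index, so each row gets its own joined string.
df['new_col'] = joined
print(df['new_col'].tolist())
```

Because `explode` works on lists of any length, rows with different numbers of segments need no special handling.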

I would just apply a lambda function to do the operation you want to do (split first on ";", then split on "::" and keep the first element, and join them back):

df['new_col'] = df['old_col'].apply(lambda s: ", ".join(t.split("::")[0] for t in s.split(";")))

You could also avoid splitting on :: since simply stopping before the first : is enough:

df['new_col'] = df['old_col'].apply(lambda s: ", ".join(t[:t.index(":")] for t in s.split(";")))
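
Note that `t.index(":")` raises a `ValueError` when a segment has no colon, for example the empty segment left by a trailing `;`. A slightly more defensive sketch uses `str.partition`, which returns the whole segment when the separator is missing. The helper name `first_fields` is hypothetical:

```python
def first_fields(s):
    """Return the first '::'-delimited field of each ';'-separated segment,
    joined with ', '. Empty segments are skipped."""
    return ", ".join(
        t.partition("::")[0]   # text before the first '::' (or all of t if absent)
        for t in s.split(";")
        if t                   # skip empty segments, e.g. after a trailing ';'
    )

print(first_fields("String1::a::b;String2::c::;"))
```

With a DataFrame it would be used as `df['new_col'] = df['old_col'].apply(first_fields)`.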
