Parse pandas df column with regex extracting substrings

Question

I have a pandas DataFrame with a column containing text like:

String1::some_text::some_text;String2::some_text::;String3::some_text::some_text;String4::some_text::some_text

I can see that:

The text always starts with the first string I want to extract
Each of the remaining strings sits between a ";" and the next "::"

I want to create a new column containing:

String1, String2, String3, String4

All separated by commas, but still in the same column.

How should I approach this problem?

Thanks for your help

Answer 1

Try this:

In [136]: df.txt.str.findall(r'String\d+').str.join(', ')
Out[136]:
0    String1, String2, String3, String4
Name: txt, dtype: object

Data:

In [137]: df
Out[137]:
                                                                                                   txt
0  String1::some_text::some_text;String2::some_text::;String3::some_text::some_text;String4::some_t...

Setup:

df = pd.DataFrame({'txt': ['String1::some_text::some_text;String2::some_text::;String3::some_text::some_text;String4::some_text::some_text']})
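The pattern String\d+ works because every label in the sample happens to look like "String" followed by digits. If the real labels are arbitrary, a position-based pattern may be more robust. The following is only a sketch under the assumption that each label sits at the start of the text or right after a ";", and is followed by "::"; the second row and the column name labels are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'txt': [
    'String1::some_text::some_text;String2::some_text::;'
    'String3::some_text::some_text;String4::some_text::some_text',
    'alpha::x::y;beta::z::',  # hypothetical row with non-"StringN" labels
]})

# (?:^|;)   -> start of the text or a ";" separator (not captured)
# ([^:;]+)  -> the label itself: anything up to the next ":" or ";"
# ::        -> the label must be followed by "::"
pattern = r'(?:^|;)([^:;]+)::'

# With a single capture group, findall returns only the captured labels.
df['labels'] = df['txt'].str.findall(pattern).str.join(', ')
print(df['labels'].tolist())
```

Because the pattern is anchored on the separators rather than on the label text, it does not pick up the "some_text" fields that also precede a "::".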

Answer 2

Consider the DataFrame df with column txt:

df = pd.DataFrame(['String1::some_text::some_text;String2::some_text::;String3::some_text::some_text;String4::some_text::some_text'] * 10,
                  columns=['txt'])
df

Use a combination of str.split and groupby:

df.txt.str.split(';', expand=True).stack() \
      .str.split('::').str[0].groupby(level=0).apply(list)

0    [String1, String2, String3, String4]
1    [String1, String2, String3, String4]
2    [String1, String2, String3, String4]
3    [String1, String2, String3, String4]
4    [String1, String2, String3, String4]
5    [String1, String2, String3, String4]
6    [String1, String2, String3, String4]
7    [String1, String2, String3, String4]
8    [String1, String2, String3, String4]
9    [String1, String2, String3, String4]
dtype: object
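The result above is a list per row, whereas the question asks for a single comma-separated string. One possible variant is sketched below: it aggregates each group with ', '.join instead of list. Note that it swaps expand=True plus stack for Series.explode (which needs pandas 0.25 or later), and the column name new_col is only an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame(
    ['String1::some_text::some_text;String2::some_text::;'
     'String3::some_text::some_text;String4::some_text::some_text'] * 3,
    columns=['txt'],
)

joined = (
    df['txt']
    .str.split(';')        # one list of "label::..." chunks per row
    .explode()             # one chunk per line, original index repeated
    .str.split('::').str[0]  # keep only the label before the first "::"
    .groupby(level=0)      # regroup the labels by original row
    .agg(', '.join)        # join each row's labels into one string
)

# Assignment aligns on the index, so each row gets its own joined labels.
df['new_col'] = joined
print(df['new_col'].tolist())
```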

Answer 3

I would just apply a lambda function that does the operation you describe (split on ";", then split each piece on "::" and keep the first element, then join the results back together):

df['new_col'] = df['old_col'].apply(lambda s: ", ".join(t.split("::")[0] for t in s.split(";")))

You could also avoid splitting on "::", since stopping just before the first ":" is enough:

df['new_col'] = df['old_col'].apply(lambda s: ", ".join(t[:t.index(":")] for t in s.split(";")))
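One caveat with the t.index(":") version is that it raises ValueError for any piece that contains no ":" at all. If that can happen in the data, str.partition is a possible alternative, since it returns the whole piece when the separator is missing. A small runnable sketch follows; the helper name first_tokens and the second sample row are hypothetical.

```python
import pandas as pd

def first_tokens(s):
    """Return the leading label of every ';'-separated piece, comma-joined."""
    # partition("::")[0] is the text before the first "::",
    # or the whole piece if "::" does not occur in it.
    return ", ".join(t.partition("::")[0] for t in s.split(";"))

df = pd.DataFrame({'old_col': [
    'String1::some_text::some_text;String2::some_text::;'
    'String3::some_text::some_text;String4::some_text::some_text',
    'Alpha::x;Beta',  # hypothetical row whose last piece has no "::"
]})

df['new_col'] = df['old_col'].apply(first_tokens)
print(df['new_col'].tolist())
```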

3 answers

solution1
1 2016-09-29 11:45:23

solution2
0 2016-09-29 14:18:45

solution3
0 ACCEPTED 2016-09-29 14:24:40
