str.split by regex (complex pattern)

Question

How do I split the ID from annotation by using regex in the data frame below?

import pandas as pd
df=pd.DataFrame({"header":["SS50377_28860 All-trans-retinol 13,14-reductase"]})

The resulting columns should look like this:

df_new=pd.DataFrame({"id":"SS50377_28860","header":["All-trans-retinol 13,14-reductase"]})

The following code doesn't work properly.

df.join(df["header"].str.split(r'\d+', 0, expand=True))

Thanks in advance!!

Answer 1

You can split on one or more whitespace characters that sit between a digit and an uppercase letter:

df[['id','header']] = df['header'].str.split(r'(?<=\d)\s+(?=[A-Z])', n=1, expand=True)

Or, you may capture the ID pattern into one group and the rest into another:

df[['id', 'header']] = df['header'].str.extract(r'^([A-Z0-9]+_[A-Z0-9]+)\s+(.*)', expand=True)

Or, you can simply use Series.str.split on the first whitespace chunk:

df[['id', 'header']] = df['header'].str.split(r"\s+", n=1, expand=True)

Output:

>>> df
                              header             id
0  All-trans-retinol 13,14-reductase  SS50377_28860

Details:

(?<=\d)\s+(?=[A-Z]) - matches one or more whitespaces ( \s+ ) that are immediately preceded by a digit ( (?<=\d) ) and immediately followed by an uppercase ASCII letter ( (?=[A-Z]) )
^([A-Z0-9]+_[A-Z0-9]+)\s+(.*) - matches the start of the string ( ^ ), then captures one or more uppercase ASCII letters or digits, an underscore ( _ ), and again one or more uppercase ASCII letters or digits into Group 1 (column "id"), then matches one or more whitespaces ( \s+ ), and finally captures the rest of the line into Group 2 (with (.*) ).
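
To see the two patterns at work outside pandas, here is a minimal sketch using Python's built-in re module on the sample string from the question. Using re directly is only my choice for illustration; the variable names are not part of the code above.

```python
import re

# Sample value taken from the question's data frame
s = "SS50377_28860 All-trans-retinol 13,14-reductase"

# Lookaround split: only the whitespace is consumed; the digit before it
# and the uppercase letter after it are checked but stay in the output.
parts = re.split(r"(?<=\d)\s+(?=[A-Z])", s, maxsplit=1)
print(parts)

# Capture groups: Group 1 holds the ID, Group 2 holds the rest of the line.
m = re.match(r"^([A-Z0-9]+_[A-Z0-9]+)\s+(.*)", s)
print(m.group(1), "|", m.group(2))
```

Note that the space inside "retinol 13,14" is not a split point for the first pattern, because it is preceded by a letter rather than a digit.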

Which solution you choose depends on how varied your input is and how much validation you want to apply.
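
As a rough illustration of that trade-off, the sketch below runs all three approaches on a few headers. Only the first row comes from the question; the other two are hypothetical inputs I made up to show where the approaches diverge.

```python
import pandas as pd

headers = pd.Series([
    "SS50377_28860 All-trans-retinol 13,14-reductase",  # row from the question
    "SS50377_100 hypothetical protein",  # hypothetical: lowercase annotation
    "orf19 Protein kinase",              # hypothetical: ID without underscore
])

# 1) Split on whitespace between a digit and an uppercase letter
lookaround = headers.str.split(r"(?<=\d)\s+(?=[A-Z])", n=1, expand=True)

# 2) Capture a strictly validated ID and the rest of the line
extracted = headers.str.extract(r"^([A-Z0-9]+_[A-Z0-9]+)\s+(.*)", expand=True)

# 3) Split on the first whitespace chunk, with no validation at all
first_ws = headers.str.split(r"\s+", n=1, expand=True)

# Put the "id" candidate from each approach side by side
comparison = pd.DataFrame({
    "lookaround_id": lookaround[0],
    "extract_id": extracted[0],
    "first_ws_id": first_ws[0],
})
print(comparison)
```

On these made-up rows, the lookaround split leaves the second header unsplit because its annotation starts with a lowercase letter, str.extract yields a missing value for the third because the ID has no underscore, and the plain whitespace split separates all three but checks nothing about the ID.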