str.split by regex (complex pattern)

How do I split the ID from the annotation using a regex in the DataFrame below?

df=pd.DataFrame({"header":["SS50377_28860 All-trans-retinol 13,14-reductase"]})

The columns are supposed to look like this:

df_new=pd.DataFrame({"id":"SS50377_28860","header":["All-trans-retinol 13,14-reductase"]})

The following code doesn't work properly.

df.join(df["header"].str.split(r'\d+', 0, expand=True))

Thanks in advance!!

You can split on one or more whitespace characters that sit between a digit and an uppercase letter:

df[['id','header']] = df['header'].str.split(r'(?<=\d)\s+(?=[A-Z])', n=1, expand=True)

Or, you may capture the ID pattern into one group and the rest into another:

df[['id', 'header']] = df['header'].str.extract(r'^([A-Z0-9]+_[A-Z0-9]+)\s+(.*)', expand=True)

Or, you may simply use Series.str.split on the first whitespace chunk:

df[['id', 'header']] = df['header'].str.split(r"\s+", n=1, expand=True)

Output:

>>> df
                              header             id
0  All-trans-retinol 13,14-reductase  SS50377_28860
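The output above lists header before id because id is appended as a new column. If the order from the question's df_new matters, here is a minimal sketch, assuming the plain whitespace split is good enough for your data. The names parts and df_new are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"header": ["SS50377_28860 All-trans-retinol 13,14-reductase"]})

# Split once on the first run of whitespace: ID on the left, annotation on the right.
parts = df["header"].str.split(r"\s+", n=1, expand=True)

# Build the result with the columns in the order the question asks for.
df_new = pd.DataFrame({"id": parts[0], "header": parts[1]})

print(df_new)
```

Note that n and expand are passed as keywords here; recent pandas versions do not accept them positionally.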

Details:

  • (?<=\d)\s+(?=[A-Z]) - matches one or more whitespace characters ( \s+ ) that are immediately preceded by a digit ( (?<=\d) ) and immediately followed by an uppercase ASCII letter ( (?=[A-Z]) )
  • ^([A-Z0-9]+_[A-Z0-9]+)\s+(.*) - matches the start of the string ( ^ ), then captures one or more uppercase ASCII letters or digits, an underscore, and again one or more uppercase ASCII letters or digits into Group 1 (column "id"), then matches one or more whitespace characters ( \s+ ), and finally captures the rest of the line into Group 2 (with (.*) ).
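To see what the two patterns do outside pandas, you can try them on the sample string with Python's re module. This is only a quick check, and the variable names are made up for the illustration.

```python
import re

sample = "SS50377_28860 All-trans-retinol 13,14-reductase"

# Pattern 1: whitespace preceded by a digit and followed by an uppercase letter.
boundary = re.compile(r"(?<=\d)\s+(?=[A-Z])")
split_parts = boundary.split(sample, maxsplit=1)

# Pattern 2: capture the ID and the remainder as two groups.
capture = re.compile(r"^([A-Z0-9]+_[A-Z0-9]+)\s+(.*)")
match = capture.match(sample)
groups = match.groups() if match else None

print(split_parts)
print(groups)
```

Both approaches give the same two pieces here. The space inside the annotation (before "13,14") is not a split point for the first pattern because it is preceded by a letter, not a digit.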

Which solution you choose depends on how varied your input is and how much validation you want to apply.
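As a rough illustration of that trade-off, the sketch below compares a loose whitespace split with the strict capture pattern, again using plain re. The helpers split_loose and split_strict and the malformed row are hypothetical, made up for this example.

```python
import re

ID_PATTERN = re.compile(r"^([A-Z0-9]+_[A-Z0-9]+)\s+(.*)")

def split_loose(header):
    """Split on the first whitespace run; accepts any leading token as the ID."""
    parts = re.split(r"\s+", header, maxsplit=1)
    return tuple(parts) if len(parts) == 2 else None

def split_strict(header):
    """Return (id, annotation) only when the ID has the expected shape."""
    match = ID_PATTERN.match(header)
    return match.groups() if match else None

good = "SS50377_28860 All-trans-retinol 13,14-reductase"
odd = "unknown-id putative kinase"  # hypothetical malformed row

print(split_loose(good), split_strict(good))
print(split_loose(odd), split_strict(odd))
```

The loose split accepts "unknown-id" as an ID, while the strict pattern rejects the row. In pandas, str.extract behaves like the strict variant and leaves NaN in rows that do not match.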
