How do I split the ID from annotation by using regex in the data frame below?
df=pd.DataFrame({"header":["SS50377_28860 All-trans-retinol 13,14-reductase"]})
The columns are supposed to look like this:
df_new=pd.DataFrame({"id":"SS50377_28860","header":["All-trans-retinol 13,14-reductase"]})
The following code doesn't work properly.
df.join(df["header"].str.split(r'\d+', 0, expand=True))
Thanks in advance!!
You can split with one or more whitespaces between a digit and a letter:
df[['id','header']] = df['header'].str.split(r'(?<=\d)\s+(?=[A-Z])', n=1, expand=True)
Or, you may capture the ID pattern into one group and the rest into another:
df[['id', 'header']] = df['header'].str.extract(r'^([A-Z0-9]+_[A-Z0-9]+)\s+(.*)', expand=True)
Or, you may simply use Series.str.split with the first whitespace chunk:
df[['id', 'header']] = df['header'].str.split(r"\s+", n=1, expand=True)
Output:
>>> df
                              header             id
0  All-trans-retinol 13,14-reductase  SS50377_28860
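Note that the output lists "header" before "id": assigning to df[['id', 'header']] appends the new "id" column after the existing one. A minimal, self-contained sketch of one way to get the order asked for in the question (the variable names parts and df_new are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({"header": ["SS50377_28860 All-trans-retinol 13,14-reductase"]})

# Split once on the first run of whitespace; expand=True returns a two-column frame.
parts = df["header"].str.split(r"\s+", n=1, expand=True)

# "header" already exists, so "id" is appended after it: columns are header, id.
df[["id", "header"]] = parts

# Select the columns explicitly to put "id" first.
df_new = df[["id", "header"]]
print(df_new)
```

Selecting with a list of column names returns a new frame in that order, so the original df is left unchanged.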
Details:
(?<=\d)\s+(?=[A-Z]) - matches one or more whitespaces (\s+) that are immediately preceded by a digit ((?<=\d)) and immediately followed by an uppercase ASCII letter ((?=[A-Z])).
^([A-Z0-9]+_[A-Z0-9]+)\s+(.*) - matches the start of the string (^), then captures one or more uppercase ASCII letters or digits, an underscore (_), and again one or more uppercase ASCII letters or digits into Group 1 (column "id"), then matches one or more whitespaces (\s+), and then captures the rest of the line into Group 2 (with (.*)).
Which solution you choose depends on how varied your input is and how much validation you want to apply here.
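To see how the three patterns differ on less regular input, here is a small sketch using only the standard re module. The second and third sample headers are made up for illustration and are not from the question; pandas applies the same patterns row by row, so the matching behaviour should carry over.

```python
import re

LOOKAROUND = re.compile(r"(?<=\d)\s+(?=[A-Z])")           # whitespace between a digit and an uppercase letter
CAPTURE = re.compile(r"^([A-Z0-9]+_[A-Z0-9]+)\s+(.*)")    # validated ID, then the rest of the line
WHITESPACE = re.compile(r"\s+")                           # first whitespace chunk


def split_header(text):
    """Return what each of the three patterns makes of one header string."""
    match = CAPTURE.match(text)
    return {
        "lookaround": LOOKAROUND.split(text, maxsplit=1),
        "capture": list(match.groups()) if match else None,
        "whitespace": WHITESPACE.split(text, maxsplit=1),
    }


samples = [
    "SS50377_28860 All-trans-retinol 13,14-reductase",  # from the question
    "SS50377_28861 hypothetical protein",               # hypothetical: annotation starts lowercase
    "gene-42 Some protein",                             # hypothetical: ID does not fit the pattern
]

for sample in samples:
    print(sample, "->", split_header(sample))
```

The lookaround split leaves the second sample in one piece because the annotation starts with a lowercase letter, and the capturing pattern rejects the third sample because the ID has no underscore. The plain whitespace split accepts everything, which is the trade-off between flexibility and validation mentioned above.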