
Extract a specific part of a column using regex in pandas

Hello, I have a dataframe such as:

COL1 
scaffold_6202_0_5660-8393_+__Apis_cerana
scaffold_27087_2-HSPs_+__Canis_lupus
LBMM01007576.1_2-HSPs_-__Lasius_niger
NW_019416736.1_1_2-HSPs_-__Cattus_felis
KQ415617.1_114142-115354_+__SPO_E
UXGB01011990.1_1481-2897_-__Apis_mellifera
CM010866.1_742312-745306_-__Cuniculus_griseus
scaffold_10628_4264-5914_-__Rattus_rattus 
IDBA_scaffold30_1_30-466_+__SP_A
IDBA_scaffold43_30-466_+__SP_B

and I would like to use a regular expression to extract only the leading part of each value, that is:

[part to extract]_Number-HSPs_* or, if there is no HSPs pattern, [part to extract]_Number-Number_*

and save it into COL2. Here I should get:

COL1                                          COL2
scaffold_6202_0_5660-8393_+__Apis_cerana      scaffold_6202_0
scaffold_27087_2-HSPs_+__Canis_lupus          scaffold_27087
LBMM01007576.1_2-HSPs_-__Lasius_niger         LBMM01007576.1
NW_019416736.1_1_2-HSPs_-__Cattus_felis       NW_019416736.1_1
KQ415617.1_114142-115354_+__SPO_E             KQ415617.1
UXGB01011990.1_1481-2897_-__Apis_mellifera    UXGB01011990.1
CM010866.1_742312-745306_-__Cuniculus_griseus CM010866.1
scaffold_10628_4264-5914_-__Rattus_rattus     scaffold_10628
IDBA_scaffold30_1_30-466_+__SP_A              IDBA_scaffold30_1
IDBA_scaffold43_30-466_+__SP_B                IDBA_scaffold43

So far I have tried using:

import re 

df['COL2'] = re.sub(r"_[^0-9]*-Number_", "", df['COL1'])
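Note that re.sub expects a single string, so passing the whole column df['COL1'] raises a TypeError. The pandas .str accessor applies a pattern to each row instead. Below is a minimal sketch, assuming the goal is to strip everything from the _Number- part onwards; the pattern and the three sample rows are illustrative choices, not part of the original post.

```python
import pandas as pd

# Three rows taken from the example data in the question.
df = pd.DataFrame({"COL1": [
    "scaffold_27087_2-HSPs_+__Canis_lupus",
    "KQ415617.1_114142-115354_+__SPO_E",
    "IDBA_scaffold30_1_30-466_+__SP_A",
]})

# Illustrative pattern (an assumption, not from the post):
#   _[^_-]+-   an underscore, a run without "_" or "-", then a hyphen
#   .*$        everything after that, up to the end of the string
# Series.str.replace works row by row, unlike re.sub on a whole Series.
df["COL2"] = df["COL1"].str.replace(r"_[^_-]+-.*$", "", regex=True)

print(df)
```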

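For the example data, you could also greedily match one or more word characters and dots, up to the last underscore that can still be reached. This works because \w also matches an underscore, while the hyphen in the Number-HSPs or Number-Number part is not in the character class and therefore stops the match.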

^([\w.]+)_


df['COL2'] = df["COL1"].str.extract(r'^([\w.]+)_')
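As a quick check, the sketch below rebuilds the example column from the question and compares the extracted values with the expected COL2. The names rows and expected are introduced here for illustration, and expand=False is an optional choice so that str.extract returns a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Example values from the question.
rows = [
    "scaffold_6202_0_5660-8393_+__Apis_cerana",
    "scaffold_27087_2-HSPs_+__Canis_lupus",
    "LBMM01007576.1_2-HSPs_-__Lasius_niger",
    "NW_019416736.1_1_2-HSPs_-__Cattus_felis",
    "KQ415617.1_114142-115354_+__SPO_E",
    "UXGB01011990.1_1481-2897_-__Apis_mellifera",
    "CM010866.1_742312-745306_-__Cuniculus_griseus",
    "scaffold_10628_4264-5914_-__Rattus_rattus",
    "IDBA_scaffold30_1_30-466_+__SP_A",
    "IDBA_scaffold43_30-466_+__SP_B",
]

# Expected COL2 values, as listed in the question.
expected = [
    "scaffold_6202_0",
    "scaffold_27087",
    "LBMM01007576.1",
    "NW_019416736.1_1",
    "KQ415617.1",
    "UXGB01011990.1",
    "CM010866.1",
    "scaffold_10628",
    "IDBA_scaffold30_1",
    "IDBA_scaffold43",
]

df = pd.DataFrame({"COL1": rows})

# [\w.]+ is greedy and cannot cross the hyphen, so it backtracks to the
# last underscore before the "Number-" part; the group excludes that underscore.
df["COL2"] = df["COL1"].str.extract(r"^([\w.]+)_", expand=False)

# Compare the result with the expected column.
print(df["COL2"].tolist() == expected)
```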

Using str.extract with a lazy match followed by a lookahead:

df["COL2"] = df["COL1"].str.extract(r'(^.*?(?=_[^_-]+-\w+))')
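To make the lookahead easier to follow, here is a small sketch that applies it to three rows from the question, with each part of the pattern commented. It reads from COL1 and writes to COL2; the name pattern and the use of expand=False are illustrative choices.

```python
import pandas as pd

# Three rows taken from the example data in the question.
df = pd.DataFrame({"COL1": [
    "scaffold_6202_0_5660-8393_+__Apis_cerana",
    "NW_019416736.1_1_2-HSPs_-__Cattus_felis",
    "IDBA_scaffold43_30-466_+__SP_B",
]})

# ^.*?             lazily take as few characters as possible from the start
# (?=_[^_-]+-\w+)  stop where the following text is "_", a run without
#                  "_" or "-", a hyphen, then word characters, i.e. right
#                  before "_Number-HSPs" or "_Number-Number"
pattern = r"(^.*?(?=_[^_-]+-\w+))"

# expand=False returns a Series instead of a one-column DataFrame.
df["COL2"] = df["COL1"].str.extract(pattern, expand=False)

print(df)
```

Rows where the pattern does not match would get NaN in COL2, so it may be worth checking df["COL2"].isna() on real data.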

