Hello, I have a dataframe such as:
COL1
scaffold_6202_0_5660-8393_+__Apis_cerana
scaffold_27087_2-HSPs_+__Canis_lupus
LBMM01007576.1_2-HSPs_-__Lasius_niger
NW_019416736.1_1_2-HSPs_-__Cattus_felis
KQ415617.1_114142-115354_+__SPO_E
UXGB01011990.1_1481-2897_-__Apis_mellifera
CM010866.1_742312-745306_-__Cuniculus_griseus
scaffold_10628_4264-5914_-__Rattus_rattus
IDBA_scaffold30_1_30-466_+__SP_A
IDBA_scaffold43_30-466_+__SP_B
and I would like to use a regular expression to extract only the part that comes before this pattern:
[part to extract]_Number-HSPs_*
or, if there is no HSPs pattern, the part that comes before:
[part to extract]_Number-Number_*
and save it into a new column, COL2.
Here I should get:
COL1 COL2
scaffold_6202_0_5660-8393_+__Apis_cerana scaffold_6202_0
scaffold_27087_2-HSPs_+__Canis_lupus scaffold_27087
LBMM01007576.1_2-HSPs_-__Lasius_niger LBMM01007576.1
NW_019416736.1_1_2-HSPs_-__Cattus_felis NW_019416736.1_1
KQ415617.1_114142-115354_+__SPO_E KQ415617.1
UXGB01011990.1_1481-2897_-__Apis_mellifera UXGB01011990.1
CM010866.1_742312-745306_-__Cuniculus_griseus CM010866.1
scaffold_10628_4264-5914_-__Rattus_rattus scaffold_10628
IDBA_scaffold30_1_30-466_+__SP_A IDBA_scaffold30_1
IDBA_scaffold43_30-466_+__SP_B IDBA_scaffold43
So far I have tried:
import re
df['COL2'] = re.sub(r"_[^0-9]*-Number_", "", df['COL1'])
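For the example data, you could also match word characters and dots greedily, up to the last underscore that still allows a match. Because a word character (\w) also matches an underscore, the capture group runs to the first character outside the class (the - after the number) and then backtracks to the preceding underscore.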
^([\w.]+)_
df['COL2'] = df["COL1"].str.extract(r'^([\w.]+)_')
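If you would rather spell out the rule from the question explicitly, a stricter pattern is one option. The sketch below assumes the rule means "capture everything before `_<digits>-HSPs_` or `_<digits>-<digits>_`"; the names `PATTERN` and `extract_prefix` are made up for illustration, and it uses only the standard `re` module so it can be tried without a dataframe.

```python
import re

# Assumed reading of the rule: capture everything before
# "_<digits>-HSPs_" or "_<digits>-<digits>_".
# The lazy group (.+?) stops at the first place where that suffix fits.
PATTERN = r"^(.+?)_\d+-(?:HSPs|\d+)_"
_compiled = re.compile(PATTERN)


def extract_prefix(name):
    """Return the captured prefix, or None when the rule does not match."""
    m = _compiled.match(name)
    return m.group(1) if m else None


# Sample values taken from COL1 in the question.
samples = [
    "scaffold_6202_0_5660-8393_+__Apis_cerana",
    "scaffold_27087_2-HSPs_+__Canis_lupus",
    "NW_019416736.1_1_2-HSPs_-__Cattus_felis",
    "KQ415617.1_114142-115354_+__SPO_E",
    "IDBA_scaffold30_1_30-466_+__SP_A",
]

for s in samples:
    print(s, extract_prefix(s))
```

With pandas, the same pattern string should work as `df['COL2'] = df['COL1'].str.extract(PATTERN, expand=False)`, which leaves NaN in rows where the pattern does not match. Unlike the shorter `^([\w.]+)_` pattern, this one does not match at all when a value lacks the `Number-HSPs` or `Number-Number` part, which may or may not be what you want.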