Extract specific part in a column using regex in pandas
Hello, I have a dataframe such as
COL1
scaffold_6202_0_5660-8393_+__Apis_cerana
scaffold_27087_2-HSPs_+__Canis_lupus
LBMM01007576.1_2-HSPs_-__Lasius_niger
NW_019416736.1_1_2-HSPs_-__Cattus_felis
KQ415617.1_114142-115354_+__SPO_E
UXGB01011990.1_1481-2897_-__Apis_mellifera
CM010866.1_742312-745306_-__Cuniculus_griseus
scaffold_10628_4264-5914_-__Rattus_rattus
IDBA_scaffold30_1_30-466_+__SP_A
IDBA_scaffold43_30-466_+__SP_B
I would like to use a regex to extract only the part that comes before:
[part to extract]_Number-HSPs_*
or, if there is no HSPs pattern, extract [part to extract]_Number*-Number_*
and save it in COL2.
Here I should get:
COL1 COL2
scaffold_6202_0_5660-8393_+__Apis_cerana scaffold_6202_0
scaffold_27087_2-HSPs_+__Canis_lupus scaffold_27087
LBMM01007576.1_2-HSPs_-__Lasius_niger LBMM01007576.1
NW_019416736.1_1_2-HSPs_-__Cattus_felis NW_019416736.1_1
KQ415617.1_114142-115354_+__SPO_E KQ415617.1
UXGB01011990.1_1481-2897_-__Apis_mellifera UXGB01011990.1
CM010866.1_742312-745306_-__Cuniculus_griseus CM010866.1
scaffold_10628_4264-5914_-__Rattus_rattus scaffold_10628
IDBA_scaffold30_1_30-466_+__SP_A IDBA_scaffold30_1
IDBA_scaffold43_30-466_+__SP_B IDBA_scaffold43
So far I have tried:
import re
df['COL2'] = re.sub(r"_[^0-9]*-Number_", "", df['COL1'])
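For the example data, you could also match word characters and dots up to the last underscore that can still be matched, since word characters also match underscores. The hyphen is not in the character class, so the greedy match stops at the first hyphen and backtracks to the underscore just before the number that precedes it.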
^([\w.]+)_
df['COL2'] = df["COL1"].str.extract(r'^([\w.]+)_')
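To check the pattern against the sample rows, here is a minimal self-contained sketch. It rebuilds the dataframe from the question and passes expand=False so that str.extract returns a Series. The second pattern and the column name COL2_strict are not from the original answer; they are an assumed stricter variant that spells out the _Number-HSPs_ / _Number-Number_ rule explicitly, in case other rows contain a hyphen earlier in the name.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({"COL1": [
    "scaffold_6202_0_5660-8393_+__Apis_cerana",
    "scaffold_27087_2-HSPs_+__Canis_lupus",
    "LBMM01007576.1_2-HSPs_-__Lasius_niger",
    "NW_019416736.1_1_2-HSPs_-__Cattus_felis",
    "KQ415617.1_114142-115354_+__SPO_E",
    "UXGB01011990.1_1481-2897_-__Apis_mellifera",
    "CM010866.1_742312-745306_-__Cuniculus_griseus",
    "scaffold_10628_4264-5914_-__Rattus_rattus",
    "IDBA_scaffold30_1_30-466_+__SP_A",
    "IDBA_scaffold43_30-466_+__SP_B",
]})

# Pattern from the answer: greedy run of word characters and dots,
# backtracking to the last underscore before the first hyphen.
df["COL2"] = df["COL1"].str.extract(r"^([\w.]+)_", expand=False)

# Assumed stricter variant (not from the answer): capture lazily up to
# "_<digits>-HSPs_" or "_<digits>-<digits>_".
strict_pattern = r"^(.+?)_\d+-(?:HSPs|\d+)_"
df["COL2_strict"] = df["COL1"].str.extract(strict_pattern, expand=False)

print(df[["COL2", "COL2_strict"]])
```

On these ten rows both patterns give the values listed in the expected COL2 column above. Rows that match neither form would come back as NaN with the stricter variant.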