Extract elements from data column (String) before and after character
I want to extract the characters before and after certain characters in a string; most of these strings are in a pandas DataFrame column.
Basically, I want to take the following items from the 'Strain' and 'Region' columns of my main DataFrame and merge them together:
i) Original Strain: Streptomyces_sp_QL40_O
ii) Original Region: Region 1.1
Extract:
Desired Output: QL40_1.region001
Example below:
import pandas as pd
data = [['Streptomyces_sp_QL40_O', 'Region 1.1'], ['Streptomyces_sp_QL40_O', 'Region 2.2'], ['Streptomyces_sp_QL40_O', 'Region 2.1']]
df = pd.DataFrame(data, columns = ['Strain', 'Region'])
print(df)
region_list = ['QL40_1.region001', 'QL40_2.region002', 'QL40_3.region001']
I started with something like this:
df['BGC Region'] = df['Strain'].str.split('_').str[2]
print('DataFrame Modified')
df['BGC Region'] = df['BGC Region'].astype(str) + '_'
df['Region No'] = df['Region'].str.split('.').str[1]
I am not really sure if this is what you want, but it does the job:
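regions = []
for i in df['Region'].str.split('.').str[0]:
    # keep only the digits before the dot, e.g. 'Region 1' -> '1'
    regions.append(''.join([d for d in i if d.isdigit()]))
df['BGC Region'] = df['Strain'].str.split('_').str[2] + '_' + regions + '.region'
region_number = df['Region'].str.split('.').str[1]
for i, rn in enumerate(region_number):
    # left-pad the number after the dot with zeros to three digits;
    # df.loc avoids chained assignment (assumes the default 0..n-1 index)
    if int(rn) < 10:
        df.loc[i, 'BGC Region'] += '00' + rn
    elif int(rn) < 100:
        df.loc[i, 'BGC Region'] += '0' + rn
    else:
        df.loc[i, 'BGC Region'] += rn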
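The zero-padding loop above can probably be replaced by vectorized string methods. The following is only a sketch, assuming every 'Region' value has the form 'Region <number>.<number>' and that the strain identifier is always the third '_'-separated token; the names strain_id and numbers are introduced here for illustration and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Strain": ["Streptomyces_sp_QL40_O"] * 3,
        "Region": ["Region 1.1", "Region 2.2", "Region 2.1"],
    }
)

# Third '_'-separated token of the strain name, e.g. 'QL40'
# (assumption: the identifier is always in that position).
strain_id = df["Strain"].str.split("_").str[2]

# Two unnamed groups give a DataFrame with columns 0 and 1:
# the numbers before and after the dot in 'Region 1.1'.
numbers = df["Region"].str.extract(r"(\d+)\.(\d+)")

# str.zfill(3) left-pads with zeros to width 3: '1' -> '001', '12' -> '012'.
df["BGC Region"] = strain_id + "_" + numbers[0] + ".region" + numbers[1].str.zfill(3)
region_list = df["BGC Region"].tolist()
print(region_list)
```

Because str.zfill pads to a fixed width, no separate branches for one-, two-, and three-digit numbers are needed.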
The idea is to use str.extract to extract the parts of interest, specified with a regex pattern with proper named capturing groups, and then merge them into the target string.
To implement it, start by creating an intermediate DataFrame:
df2 = (df.Strain + '_' + df.Region).str.extract(
r'(?:[^_]+_){2}(?P<QL>[^_]+)_[^_]+_(?P<Rg>[^&]+)\D+(?P<D1>\d)\.(?P<D2>\d)')
The result, for your data, is:
     QL      Rg D1 D2
0  QL40  Region  1  1
1  QL40  Region  2  2
2  QL40  Region  2  1
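To see what each named group captures, the same pattern can be tried on a single combined string with the standard re module. This is only an illustrative check, not part of the original answer; the variable names combined and groups are made up here.

```python
import re

# Same pattern as passed to str.extract above.
pattern = re.compile(
    r'(?:[^_]+_){2}(?P<QL>[^_]+)_[^_]+_(?P<Rg>[^&]+)\D+(?P<D1>\d)\.(?P<D2>\d)'
)

# Strain and Region joined with '_', as in (df.Strain + '_' + df.Region).
combined = 'Streptomyces_sp_QL40_O' + '_' + 'Region 1.1'

match = pattern.match(combined)
groups = match.groupdict()
print(groups)
```

Here (?:[^_]+_){2} skips 'Streptomyces_sp_', QL takes 'QL40', the next [^_]+ skips 'O', and Rg backtracks until \D+ can match the space before the digits. Note that \d matches exactly one digit, so a value such as 'Region 1.10' would be captured as D2 = '1'; using \d+ for D1 and D2 would be needed for multi-digit numbers.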
Then define a merging function, to be applied to each row of df2:
def mrg(row):
    # append zeros after the region name so the final digit completes the number
    rg = row.Rg + '0'
    if len(rg) < 11:
        rg += '0'
    return row.QL + '_' + row.D1 + '.' + rg + row.D2
And to get the final result, run:
region_list = df2.apply(mrg, axis=1).tolist()
The result is:
['QL40_1.Region001', 'QL40_2.Region002', 'QL40_2.Region001']