Extract elements from a data column (string) before and after a character

Question

I want to extract the characters before and after certain characters in a string; most of these strings are in a pandas DataFrame column.

Basically, I want to take parts of the 'Strain' and 'Region' columns of my main DataFrame and merge them together, using the following items:

i) Original Strain: Streptomyces_sp_QL40_O

ii) Original Region: Region&nbsp1.1

Extract:

The string after the second underscore. Ex: QL40
The first number, before the '.'. Ex: 1 (the digit before the '.' in nbsp1.1)
The second number, after the '.'. Ex: 1 (the digit after the '.' in nbsp1.1)
The string region before the '&' character
Add two 0's after the string 'region' if the number is less than 10, and one 0 if it is more than ten.

Desired Output: QL40_1.region001

Example below:

    import pandas as pd 

    data = [['Streptomyces_sp_QL40_O', 'Region&nbsp1.1'], ['Streptomyces_sp_QL40_O', 'Region&nbsp2.2'], ['Streptomyces_sp_QL40_O', 'Region&nbsp2.1']]
    df = pd.DataFrame(data, columns = ['Strain', 'Region'])

    print(df)

    region_list = ['QL40_1.region001', 'QL40_2.region002', 'QL40_3.region001']

I started with something like this:

    df['BGC Region'] = df['Strain'].str.split('_').str[2]
    print('DataFrame Modified')
    df['BGC Region'] = df['BGC Region'].astype(str) + '_' 
    df['Region No'] = df['Region'].str.split('.').str[1]

Answer 1

I am not really sure if this is what you want, but it does the job:

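    # Digits before the '.', e.g. 'Region&nbsp1.1' -> '1'
    regions = []
    for i in df['Region'].str.split('.').str[0]:
        regions.append(''.join([d for d in i if d.isdigit()]))

    df['BGC Region'] = df['Strain'].str.split('_').str[2] + '_' + regions + '.region'

    # Zero-pad the number after the '.' to three digits
    region_number = df['Region'].str.split('.').str[1]
    for i, rn in enumerate(region_number):
        if int(rn) < 10:
            pad = '00'
        elif int(rn) < 100:
            pad = '0'
        else:
            pad = ''
        # .loc avoids chained assignment; i is used as a label, so a default RangeIndex is assumed
        df.loc[i, 'BGC Region'] += pad + rn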
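
The same result can probably be reached without an explicit loop. The following is only a sketch, assuming every `Region` value has the form `Region&nbsp<number>.<number>` as in the question's sample data; the helper name `build_bgc_region` is hypothetical and not part of the answer above.

```python
import pandas as pd


def build_bgc_region(df: pd.DataFrame) -> pd.Series:
    """Build labels such as 'QL40_1.region001' (hypothetical helper)."""
    # Third '_'-separated token of Strain: 'Streptomyces_sp_QL40_O' -> 'QL40'
    strain = df['Strain'].str.split('_').str[2]
    # Text after '&nbsp': 'Region&nbsp1.1' -> '1.1'
    tail = df['Region'].str.split('&nbsp').str[1]
    # Split on the '.' into two columns, labelled 0 and 1
    nums = tail.str.split('.', expand=True)
    # str.zfill left-pads with zeros: '1' -> '001', '12' -> '012'
    return strain + '_' + nums[0] + '.region' + nums[1].str.zfill(3)


df = pd.DataFrame(
    [['Streptomyces_sp_QL40_O', 'Region&nbsp1.1'],
     ['Streptomyces_sp_QL40_O', 'Region&nbsp2.2'],
     ['Streptomyces_sp_QL40_O', 'Region&nbsp2.1']],
    columns=['Strain', 'Region'])

df['BGC Region'] = build_bgc_region(df)
print(df['BGC Region'].tolist())
```

`str.zfill(3)` covers both padding cases at once, so no comparison against 10 or 100 is needed.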

Answer 2

The idea is to:

concatenate your 2 columns (inserting a '_' between them),
call str.extract to extract the parts of interest, specified by a regex pattern with named capturing groups,
for each row, merge these parts, adding the required number of zeroes.

To implement it, start by creating an intermediate DataFrame:

    # \d+ (rather than a single \d) lets region numbers of 10 or more match as well
    df2 = (df.Strain + '_' + df.Region).str.extract(
        r'(?:[^_]+_){2}(?P<QL>[^_]+)_[^_]+_(?P<Rg>[^&]+)\D+(?P<D1>\d+)\.(?P<D2>\d+)')

The result, for your data, is:

         QL      Rg D1 D2
    0  QL40  Region  1  1
    1  QL40  Region  2  2
    2  QL40  Region  2  1
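
To see what each named group captures, a pattern of this shape can be tried on a single string with Python's `re` module. This is an illustrative sketch only: the sample string is the first row of the question's data joined with '_', and the numeric groups use `\d+` on the assumption that region numbers may have more than one digit.

```python
import re

# Skip two '_'-terminated tokens, capture the third (QL), skip one more token,
# capture the text up to '&' (Rg), skip non-digits, then capture the two
# numbers on either side of the '.' (D1 and D2).
pattern = re.compile(
    r'(?:[^_]+_){2}(?P<QL>[^_]+)_[^_]+_(?P<Rg>[^&]+)\D+(?P<D1>\d+)\.(?P<D2>\d+)')

sample = 'Streptomyces_sp_QL40_O' + '_' + 'Region&nbsp1.1'
groups = pattern.match(sample).groupdict()
print(groups)
# {'QL': 'QL40', 'Rg': 'Region', 'D1': '1', 'D2': '1'}
```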

Then define a merging function, to be applied to each row of df2:

    def mrg(row):
        # Zero-pad the second number to three digits: '1' -> '001', '12' -> '012'
        rg = row.Rg + row.D2.zfill(3)
        return row.QL + '_' + row.D1 + '.' + rg

And to get the final result, run:

    region_list = df2.apply(mrg, axis=1).tolist()

The result is:

    ['QL40_1.Region001', 'QL40_2.Region002', 'QL40_2.Region001']
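
The list above keeps the capital 'R' captured from the Region column, whereas the question asks for lowercase `region`. If the lowercase form is what is needed, one possible variant (a sketch under that assumption, with `\d+` also assumed so that multi-digit numbers match) lowercases the captured word and builds the labels column-wise instead of row by row:

```python
import pandas as pd

df = pd.DataFrame(
    [['Streptomyces_sp_QL40_O', 'Region&nbsp1.1'],
     ['Streptomyces_sp_QL40_O', 'Region&nbsp2.2'],
     ['Streptomyces_sp_QL40_O', 'Region&nbsp2.1']],
    columns=['Strain', 'Region'])

# Named groups become the column names of the extracted DataFrame
parts = (df['Strain'] + '_' + df['Region']).str.extract(
    r'(?:[^_]+_){2}(?P<QL>[^_]+)_[^_]+_(?P<Rg>[^&]+)\D+(?P<D1>\d+)\.(?P<D2>\d+)')

# Lowercase the captured word and zero-pad the second number to three digits
lowercase_list = (parts['QL'] + '_' + parts['D1'] + '.'
                  + parts['Rg'].str.lower() + parts['D2'].str.zfill(3)).tolist()
print(lowercase_list)
```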

Answer 1: score 5, accepted, 2020-03-09 19:17:22

Answer 2: score 3, 2020-03-09 19:24:24
