
Slice a Pandas object column using other columns

Given this table:

╔═══╦══════════╦═══════════╦═════════════╗
║   ║ position ║ amino_var ║ sequence    ║
╠═══╬══════════╬═══════════╬═════════════╣
║ 0 ║ 3        ║ A         ║ MWSWKCLLFWA ║
║ 1 ║ 4        ║ G         ║ MWSWKCLLFWH ║
║ 2 ║ 6        ║ I         ║ MWSWKCLFLVH ║
║ 3 ║ 3        ║ C         ║ MWSWVESFLVH ║
║ 4 ║ 2        ║ V         ║ MWEQAQPWGAH ║
╚═══╩══════════╩═══════════╩═════════════╝

Or you can construct this dataframe with:

import pandas as pd
uniprots = pd.DataFrame({'position': [3,4,6,3,2], 'amino_var': ['A', 'G', 'I', 'C', 'V'], 'sequence': ['MWSWKCLLFWA', 'MWSWKCLLFWH', 'MWSWKCLFLVH', 'MWSWVESFLVH', 'MWEQAQPWGAH']})

I would like to take the part of sequence from position - 1 to position + 1 (for example), and replace the letter at position with the letter in amino_var.

I tried this:

uniprots.sequence.str[uniprots.position - 1 : uniprots.position + 1]

But I get a Series full of NaNs. My expected output would be:

╔═══╦════════╗
║   ║ output ║
╠═══╬════════╣
║ 0 ║ WAW    ║
║ 1 ║ SGK    ║
║ 2 ║ KIL    ║
║ 3 ║ WCW    ║
║ 4 ║ MVE    ║
╚═══╩════════╝
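The .str accessor's slice takes the same scalar bounds for every row, not a Series of bounds, so per-row bounds need a row-wise approach. A minimal sketch using a plain list comprehension follows; this is only one possible approach, and the name output is chosen here just to match the expected table.

```python
import pandas as pd

uniprots = pd.DataFrame({
    'position': [3, 4, 6, 3, 2],
    'amino_var': ['A', 'G', 'I', 'C', 'V'],
    'sequence': ['MWSWKCLLFWA', 'MWSWKCLLFWH', 'MWSWKCLFLVH',
                 'MWSWVESFLVH', 'MWEQAQPWGAH'],
})

# position is 1-based, so the letter to replace sits at index pos - 1.
# seq[pos - 2:pos - 1] is the letter before it, seq[pos:pos + 1] the letter after.
output = pd.Series(
    [seq[pos - 2:pos - 1] + var + seq[pos:pos + 1]
     for seq, pos, var in zip(uniprots['sequence'],
                              uniprots['position'],
                              uniprots['amino_var'])],
    index=uniprots.index,
)
print(output.tolist())  # ['WAW', 'SGK', 'KIL', 'WCW', 'MVE']
```

Slices are used instead of single-character indexing so that a position at either end of the string yields an empty neighbour rather than an error or a wrapped index.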

I believe you first need to extract the values before the range, then take the range itself and apply the replacement, and finally append all values after the range:

print (uniprots)
  uniprot  position amino amino_var     sequence
0  P11362         3     W         A  WWWWWWWWWWW
1  P11362         4     E         G  MEEEEEELFWH
2  P11362         6     N         I  MWSWKCNNLVH
3  P11362         3     S         C  MWSWVESFLVH
4  P11362         3     W         V  MWEQAQPWGAH

N = 2
def repl(x):
    s = x['sequence']
    p = x['position']
    a1 = x['amino']
    a2 = x['amino_var']
    return s[:p-N-1] + s[p-N-1:p+N].replace(a1,a2) +s[p+N:] 

uniprots['sequence'] = uniprots.apply(repl, axis=1)
print (uniprots)
  uniprot  position amino amino_var     sequence
0  P11362         3     W         A  AAAAAWWWWWW
1  P11362         4     E         G  MGGGGGELFWH
2  P11362         6     N         I  MWSWKCIILVH
3  P11362         3     S         C  MWCWVESFLVH
4  P11362         3     W         V  MVEQAQPWGAH
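
The three slices in repl can be checked on a single string. The sketch below uses a hypothetical helper, replace_in_window (not part of pandas or of the answer above). It also assumes the window start should be clamped at 0, so that a position closer than N letters to the start of the string does not become a negative slice index.

```python
def replace_in_window(seq, position, old, new, n=2):
    """Replace `old` with `new` only within `n` letters of a 1-based position."""
    # Clamp at 0: a negative start would count from the end of the string.
    start = max(position - n - 1, 0)
    stop = position + n
    # prefix (untouched) + window (with replacement) + suffix (untouched)
    return seq[:start] + seq[start:stop].replace(old, new) + seq[stop:]

# Same inputs as rows 3 and 0 of the example above
print(replace_in_window('MWSWVESFLVH', 3, 'S', 'C'))  # MWCWVESFLVH
print(replace_in_window('WWWWWWWWWWW', 3, 'W', 'A'))  # AAAAAWWWWWW
```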

EDIT (following the edited question):

Extract the values on either side of the position and join them with the amino_var column:

N = 1
a = uniprots.apply(lambda x:  x['sequence'][x['position']-N-1 : x['position']-1] , axis=1)
b = uniprots.apply(lambda x:  x['sequence'][x['position'] : x['position']+N] , axis=1)

uniprots['sequence'] = a + uniprots['amino_var'] + b                               
print (uniprots)
   position amino_var sequence
0         3         A      WAW
1         4         G      SGK
2         6         I      KIL
3         3         C      WCW
4         2         V      MVE

You can use DataFrame.apply for this:

def get_subsequence(row, width=1):
    seq = row['sequence']
    pos = row['position']-1
    return seq[pos-width:pos] + row['amino_var'] + seq[pos+1:pos+width+1]

# Store the result in a new column so that the original sequences stay available
uniprots['output'] = uniprots.apply(get_subsequence, axis=1)

We then obtain:

>>> uniprots.apply(get_subsequence, axis=1)
0    WAW
1    SGK
2    KIL
3    WCW
4    MVE
dtype: object

If we want a larger span, we can set the width parameter, for instance with functools.partial:

from functools import partial

uniprots['output'] = uniprots.apply(partial(get_subsequence, width=3), axis=1)

Which results in:

>>> uniprots.apply(partial(get_subsequence, width=3), axis=1)
0       AWKC
1    MWSGKCL
2    SWKILFL
3       CWVE
4       VEQA

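The strings differ in length because the window runs past the start of the string: when position - 1 - width is negative, Python interprets the slice start as an offset from the end, so seq[pos-width:pos] is empty and the left context is dropped.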
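
If the truncated left context is not wanted, the slice start can be clamped at zero. The following is a sketch only: get_subsequence_clamped is a hypothetical variant of the function above, and keeping whatever letters exist to the left is an assumption about the desired behaviour.

```python
from functools import partial

import pandas as pd

uniprots = pd.DataFrame({
    'position': [3, 4, 6, 3, 2],
    'amino_var': ['A', 'G', 'I', 'C', 'V'],
    'sequence': ['MWSWKCLLFWA', 'MWSWKCLLFWH', 'MWSWKCLFLVH',
                 'MWSWVESFLVH', 'MWEQAQPWGAH'],
})

def get_subsequence_clamped(row, width=1):
    seq = row['sequence']
    pos = row['position'] - 1      # convert 1-based position to 0-based index
    start = max(pos - width, 0)    # never let the slice start go negative
    # A stop beyond the end of the string is already handled by slicing.
    return seq[start:pos] + row['amino_var'] + seq[pos + 1:pos + width + 1]

wide = uniprots.apply(partial(get_subsequence_clamped, width=3), axis=1)
print(wide.tolist())  # ['MWAWKC', 'MWSGKCL', 'SWKILFL', 'MWCWVE', 'MVEQA']
```

The results can still differ in length near the ends of a sequence, but no available letters are discarded.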

The following one-liner also works:

uniprots['output'] = uniprots.apply(lambda x: x['sequence'][x['position']-1-1] +x['amino_var']+x['sequence'][x['position']-1+1], axis=1)

The same expression is more readable when split across lines:

uniprots['output'] = uniprots.apply(lambda x: 
            x['sequence'][x['position']-1-1] +
            x['amino_var'] +
            x['sequence'][x['position']-1+1], axis=1)

Output:

print(uniprots)
  amino_var  position     sequence output
0         A         3  MWSWKCLLFWA    WAW
1         G         4  MWSWKCLLFWH    SGK
2         I         6  MWSWKCLFLVH    KIL
3         C         3  MWSWVESFLVH    WCW
4         V         2  MWEQAQPWGAH    MVE

The 'position' values in this table are 1-based, whereas Python indexing is 0-based, hence the -1.
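
The index arithmetic can be checked on the first row by hand. The snippet below is only an illustration with hypothetical variable names (idx, neighbours); it also shows that single-character indexing, unlike slicing, wraps around silently when position is 1.

```python
seq = 'MWSWKCLLFWA'
position = 3                 # 1-based, as in the table

idx = position - 1           # 0-based index of the letter to replace
print(seq[idx])              # S

# Letter before + replacement + letter after
neighbours = seq[idx - 1] + 'A' + seq[idx + 1]
print(neighbours)            # WAW

# Edge case: with position == 1, idx - 1 is -1, which selects the LAST letter
print(seq[(1 - 1) - 1])      # A
```

For a position equal to the length of the string, seq[idx + 1] raises an IndexError, so the one-liner is safe only for positions strictly inside the sequence.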
