Pandas: selecting multiple columns from one row

Question

I have a script that does things for me, but very inefficiently. I asked for some help from code reviewers, and was told to try Pandas instead. This is what I've done, but I'm having some difficulty understanding how it works. I've tried to read the documentation and other questions here, but I can't find an answer.

So, I've got a dataframe with a small number of rows (20 to a couple of hundred) and an even smaller number of columns. I've used the pandas read_table function to load the original data from a .txt file, which looks like this:

[ID1, Gene1, Sequence1, Ratio1, Ratio2, Ratio3]
[ID1, Gene1, Sequence2, Ratio1, Ratio2, Ratio3]
[ID2, Gene2, Sequence3, Ratio1, Ratio2, Ratio3]
[ID2, Gene3, Sequence4, Ratio1, Ratio2, Ratio3]
[ID3, Gene3, Sequence5, Ratio1, Ratio2, Ratio3]

... along with a whole bunch of unimportant columns.

What I want to be able to do is to select all the ratios from each Sequence and perform some calculations and statistics on them (all 3 ratios for each sequence, that is). I've tried

df.groupby('Sequence')
for col in df:
    do something / print(col) / print(col[0])

... but that only makes me more confused. If I pass print(col), I get some kind of df construct printed, whereas if I pass print(col[0]), I only get the sequences. As far as I can see in the construct, I should still have all the other columns and their data, since groupby() doesn't remove any data, it just groups it by some input column. What am I doing wrong?

Although I haven't gotten that far yet, due to the problems above, I also want my script to be able to select all the ratios for every ID and perform the same calculations on them, but this time on every ratio by itself (i.e. Ratio1 for all rows of ID1, the same for Ratio2, etc.). And, lastly, do the same thing for every gene.

EDIT:

So, say I want to perform this calculation on every ratio in the row, and then take the median of the three resulting values:

df['Value1'] = spike[data['ID']] / float(data['Ratio 1']) * (10**-12) * (6.022*10**23) / (1*10**6)
df['Value2'] = spike[data['ID']] / float(data['Ratio 2']) * (10**-12) * (6.022*10**23) / (1*10**6)
df['Value3'] = spike[data['ID']] / float(data['Ratio 3']) * (10**-12) * (6.022*10**23) / (1*10**6)

... where spike is a dictionary, and the keys are the IDs. Ignoring the dict part, I can make calculations (thanks!), but how do I access the dictionary using the dataframe IDs? With the above code, I just get an "Unhashable type: Series" error.

Here's some real data:

ID  Gene    Sequence    Ratio1  Ratio2  Ratio3
1   KRAS    SFEDXXYR    15.822  14.119  14.488
2   KRAS    VEDAXXXLVR  9.8455  8.9279  16.911
3   ELK4    IEXXXCESLNK 15.745  7.9122  9.5966
3   ELK4    IEGXXXSLNKR 1.177   NaN     12.073

Answer 1

df.groupby() does not modify or group df in place, so you have to assign the result to a new variable to use it further. E.g.:
```
grouped = df.groupby('Sequence')
```
BTW, in the example data you give, all values in the Sequence column are unique, so grouping on that column will not do much.
Furthermore, you normally don't need to 'iterate over the df' as you do here. To apply a function to all groups, you can call it directly on the groupby result, e.g. df.groupby().apply(..) or df.groupby().aggregate(..).
Can you give a more specific example of what kind of function you want to apply to the ratios?
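To make the difference between iterating over the dataframe and iterating over the groupby result concrete, here is a minimal sketch using the four rows posted in the question. Grouping by Gene instead of Sequence is my own choice for illustration, since the posted sequences are all unique.

```python
import pandas as pd

# The four sample rows from the question.
df = pd.DataFrame({
    "ID": [1, 2, 3, 3],
    "Gene": ["KRAS", "KRAS", "ELK4", "ELK4"],
    "Sequence": ["SFEDXXYR", "VEDAXXXLVR", "IEXXXCESLNK", "IEGXXXSLNKR"],
    "Ratio1": [15.822, 9.8455, 15.745, 1.177],
    "Ratio2": [14.119, 8.9279, 7.9122, float("nan")],
    "Ratio3": [14.488, 16.911, 9.5966, 12.073],
})

# Iterating over a DataFrame yields its column labels, not rows or groups.
column_labels = [col for col in df]

# groupby returns a new object; df itself is left unchanged.
grouped = df.groupby("Gene")

# Iterating over the groupby object yields (key, sub-DataFrame) pairs.
group_sizes = {key: len(group) for key, group in grouped}

# aggregate applies a function to every group without an explicit loop.
gene_means = grouped[["Ratio1", "Ratio2", "Ratio3"]].aggregate("mean")
print(gene_means)
```

Each `group` in the loop is an ordinary dataframe that still carries all columns, so nothing is lost by grouping.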
To calculate the median of the three ratios for each sequence (each row), you can do:
```
df[['Ratio1', 'Ratio2', 'Ratio3']].median(axis=1)
```
The axis=1 means that the median is taken not down one column (over the rows) but across each row (over the columns).
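As a quick check of the row-wise median, the following sketch runs it on the sample rows from the question. Note that the last row contains a NaN: by default pandas skips missing values, so that row's median is computed from the two remaining ratios. Whether skipping is appropriate for your statistics is something only you can decide.

```python
import pandas as pd

# Ratio columns of the four sample rows from the question.
df = pd.DataFrame({
    "Sequence": ["SFEDXXYR", "VEDAXXXLVR", "IEXXXCESLNK", "IEGXXXSLNKR"],
    "Ratio1": [15.822, 9.8455, 15.745, 1.177],
    "Ratio2": [14.119, 8.9279, 7.9122, float("nan")],
    "Ratio3": [14.488, 16.911, 9.5966, 12.073],
})
ratio_cols = ["Ratio1", "Ratio2", "Ratio3"]

# One median per row; NaN values are skipped by default.
row_median = df[ratio_cols].median(axis=1)

# With skipna=False, any row containing a NaN gets a NaN median instead.
strict_median = df[ratio_cols].median(axis=1, skipna=False)

# Other row-wise statistics work the same way.
row_mean = df[ratio_cols].mean(axis=1)
row_std = df[ratio_cols].std(axis=1)

print(pd.DataFrame({"median": row_median, "mean": row_mean, "std": row_std}))
```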

Another example: to calculate the median of all Ratio1 values for each ID, you can do:

df.groupby('ID')['Ratio1'].median()

Here you group by ID, select column Ratio1, and calculate the median value for each group.
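The same pattern extends to all three ratios at once, and to grouping by gene, which is what the question asks for next. This is a sketch on the posted sample rows; selecting a list of columns after groupby gives one result column per ratio.

```python
import pandas as pd

# The four sample rows from the question.
df = pd.DataFrame({
    "ID": [1, 2, 3, 3],
    "Gene": ["KRAS", "KRAS", "ELK4", "ELK4"],
    "Ratio1": [15.822, 9.8455, 15.745, 1.177],
    "Ratio2": [14.119, 8.9279, 7.9122, float("nan")],
    "Ratio3": [14.488, 16.911, 9.5966, 12.073],
})
ratio_cols = ["Ratio1", "Ratio2", "Ratio3"]

# Median of each ratio column separately, per ID and per gene.
per_id = df.groupby("ID")[ratio_cols].median()
per_gene = df.groupby("Gene")[ratio_cols].median()

# Several statistics at once; "count" shows how many non-NaN values
# each median is based on.
per_id_stats = df.groupby("ID")[ratio_cols].agg(["median", "count"])

print(per_id)
print(per_gene)
```

The result of the multi-statistic `agg` call has two-level column labels, so a single cell is addressed as, for example, `per_id_stats.loc[3, ("Ratio2", "count")]`.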

UPDATE: you should probably split the questions into separate ones, but as an answer to your new question:

data['ID'] will give you the whole ID column (a Series), so you cannot use it as a dictionary key. You want one specific value of that column. To apply a function to each row of a dataframe, you can use apply:

def my_func(row):
    return spike[row['ID']] / float(row['Ratio 1']) * (10**-12) * (6.022*10**23) / (1*10**6)

df['Value1'] = df.apply(my_func, axis=1)
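As an alternative to a row-wise apply, the dictionary lookup can also be done for the whole column at once with Series.map, after which the arithmetic works on entire columns. The sketch below is only an illustration: the spike values are made up, and the column names follow the posted sample data (Ratio1 rather than 'Ratio 1'), so adjust them to match your file.

```python
import pandas as pd

# Hypothetical spike amounts keyed by ID; the real values are not in the question.
spike = {1: 50.0, 2: 25.0, 3: 10.0}

# The four sample rows from the question.
df = pd.DataFrame({
    "ID": [1, 2, 3, 3],
    "Ratio1": [15.822, 9.8455, 15.745, 1.177],
    "Ratio2": [14.119, 8.9279, 7.9122, float("nan")],
    "Ratio3": [14.488, 16.911, 9.5966, 12.073],
})
ratio_cols = ["Ratio1", "Ratio2", "Ratio3"]

# Constant factor from the question's formula: 10**-12 * 6.022*10**23 / 10**6.
scale = 1e-12 * 6.022e23 / 1e6

# Look up the spike value for every row's ID in one step.
spike_per_row = df["ID"].map(spike)

# Compute Value1..Value3 column by column; NaN ratios stay NaN.
value_cols = []
for i, col in enumerate(ratio_cols, start=1):
    name = f"Value{i}"
    df[name] = spike_per_row / df[col] * scale
    value_cols.append(name)

# Median of the three resulting values for each row (NaN skipped).
df["MedianValue"] = df[value_cols].median(axis=1)
print(df[value_cols + ["MedianValue"]])
```

An ID that is missing from the dictionary produces NaN with map instead of raising a KeyError, which may or may not be what you want.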

1 answer

solution1
1 ACCEPTED 2014-01-13 10:44:12
