
Pandas: selecting multiple columns from one row

I have a script that does things for me, but very inefficiently. I asked code reviewers for some help and was told to try Pandas instead. This is what I've done, but I'm having some difficulty understanding how it works. I've tried reading the documentation and other questions here, but I can't find an answer.

So, I've got a dataframe with a small number of rows (20 to a couple of hundred) and a smaller number of columns. I used the pandas read_table function to read the original data from a .txt file, which looks like this:

[ID1, Gene1, Sequence1, Ratio1, Ratio2, Ratio3]
[ID1, Gene1, Sequence2, Ratio1, Ratio2, Ratio3]
[ID2, Gene2, Sequence3, Ratio1, Ratio2, Ratio3]
[ID2, Gene3, Sequence4, Ratio1, Ratio2, Ratio3]
[ID3, Gene3, Sequence5, Ratio1, Ratio2, Ratio3]

... along with a whole bunch of unimportant columns.

What I want to do is select all the ratios for each Sequence and perform some calculations and statistics on them (all 3 ratios for each sequence, that is). I've tried

df.groupby('Sequence')
for col in df:
    do something / print(col) / print(col[0])

... but that only makes me more confused. If I call print(col), I get some kind of df construct printed, whereas if I call print(col[0]), I only get the sequences. As far as I can see in the construct, I should still have all the other columns and their data, since groupby() doesn't remove any data; it just groups it by some input column. What am I doing wrong?

Although I haven't gotten that far yet because of the problems above, I also want my script to select all the ratios for every ID and perform the same calculations on them, but this time on every ratio by itself (i.e. Ratio1 for all rows of ID1, the same for Ratio2, etc.). And, lastly, do the same thing for every gene.

EDIT:

So, say I want to perform this calculation on every ratio in the row, and then take the median of the three resulting values:

df['Value1'] = spike[data['ID']] / float(data['Ratio 1']) * (10**-12) * (6.022*10**23) / (1*10**6)
df['Value2'] = spike[data['ID']] / float(data['Ratio 2']) * (10**-12) * (6.022*10**23) / (1*10**6)
df['Value3'] = spike[data['ID']] / float(data['Ratio 3']) * (10**-12) * (6.022*10**23) / (1*10**6)

... where spike is a dictionary, and the keys are the IDs. Ignoring the dict part, I can make the calculations (thanks!), but how do I access the dictionary using the dataframe IDs? With the above code, I just get an "Unhashable type: Series" error.

Here's some real data:

ID  Gene    Sequence    Ratio1  Ratio2  Ratio3
1   KRAS    SFEDXXYR    15.822  14.119  14.488
2   KRAS    VEDAXXXLVR  9.8455  8.9279  16.911
3   ELK4    IEXXXCESLNK 15.745  7.9122  9.5966
3   ELK4    IEGXXXSLNKR 1.177   NaN     12.073

  1. df.groupby() does not modify or group df in place, so you have to assign the result to a new variable to use it further. E.g.:

     grouped = df.groupby('Sequence') 

    BTW, in the example data you give, all values in the Sequence column are unique, so grouping on that column will not do much.
    Furthermore, you normally don't need to 'iterate over the df' as you do here. To apply a function to all groups, you can do that directly on the groupby result, e.g. df.groupby().apply(..) or df.groupby().aggregate(..).

  2. Can you give a more specific example of the kind of function you want to apply to the ratios?

    To calculate the median of the three ratios for each sequence (each row), you can do:

     df[['Ratio1', 'Ratio2', 'Ratio3']].median(axis=1) 

    axis=1 means that you do not take the median of each column (over the rows), but of each row (over the columns).

Another example: to calculate the median of all Ratio1 values for each ID, you can do:
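To make the difference between the two axes concrete, here is a minimal sketch that rebuilds the four sample rows from the question by hand (the real data would come from read_table) and takes the median in both directions. The names ratio_cols, RatioMedian and column_medians are only illustrative.

```python
import numpy as np
import pandas as pd

# The four sample rows from the question, typed in by hand.
df = pd.DataFrame({
    "ID": [1, 2, 3, 3],
    "Gene": ["KRAS", "KRAS", "ELK4", "ELK4"],
    "Sequence": ["SFEDXXYR", "VEDAXXXLVR", "IEXXXCESLNK", "IEGXXXSLNKR"],
    "Ratio1": [15.822, 9.8455, 15.745, 1.177],
    "Ratio2": [14.119, 8.9279, 7.9122, np.nan],
    "Ratio3": [14.488, 16.911, 9.5966, 12.073],
})

ratio_cols = ["Ratio1", "Ratio2", "Ratio3"]

# axis=1: one median per row, taken across the three ratio columns.
# NaN values are skipped by default (skipna=True).
df["RatioMedian"] = df[ratio_cols].median(axis=1)

# Default axis=0: one median per column, taken down the rows.
column_medians = df[ratio_cols].median()

print(df[["Sequence", "RatioMedian"]])
print(column_medians)
```

Note that the last row has a NaN in Ratio2, so its row median is computed from the two remaining values only.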

df.groupby('ID')['Ratio1'].median()

Here you group by ID, select column Ratio1, and calculate the median value for each group.


UPDATE: you should probably split the questions into separate ones, but as an answer to your new question:

data['ID'] gives you the whole ID column (a Series), so you cannot use it as a dictionary key. You want one specific value of that column. To apply a function to each row of a dataframe, you can use apply:
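As a rough illustration of the same idea on the sample data, the sketch below keeps the groupby result in a variable, computes per-ID medians for one column and for all three ratio columns, and shows what iterating over the groupby object actually yields. The variable names are made up for this example.

```python
import numpy as np
import pandas as pd

# Ratio columns of the sample rows from the question, typed in by hand.
df = pd.DataFrame({
    "ID": [1, 2, 3, 3],
    "Ratio1": [15.822, 9.8455, 15.745, 1.177],
    "Ratio2": [14.119, 8.9279, 7.9122, np.nan],
    "Ratio3": [14.488, 16.911, 9.5966, 12.073],
})

# groupby returns a new object; df itself is left unchanged.
grouped = df.groupby("ID")

# Median of Ratio1 within each ID: a Series indexed by ID.
ratio1_by_id = grouped["Ratio1"].median()

# Median of every ratio column within each ID: a DataFrame indexed by ID.
ratios_by_id = grouped[["Ratio1", "Ratio2", "Ratio3"]].median()

# Iterating over a groupby yields (key, sub-DataFrame) pairs,
# not columns or single rows.
group_sizes = {key: len(group) for key, group in grouped}

print(ratio1_by_id)
print(ratios_by_id)
print(group_sizes)
```

Grouping by 'Gene' instead of 'ID' would work the same way if that column were included in the frame.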

def my_func(row):
    return spike[row['ID']] / float(row['Ratio 1']) * (10**-12) * (6.022*10**23) / (1*10**6)

df['Value1'] = df.apply(my_func, axis=1)
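As an alternative to the row-wise apply above, the dictionary lookup can also be done for the whole column at once with Series.map. This is only a sketch: the spike values are invented for illustration, and the columns are assumed to be named Ratio1, Ratio2 and Ratio3 as in the sample data (the formula itself is the one from the question).

```python
import numpy as np
import pandas as pd

# Ratio columns of the sample rows from the question, typed in by hand.
df = pd.DataFrame({
    "ID": [1, 2, 3, 3],
    "Ratio1": [15.822, 9.8455, 15.745, 1.177],
    "Ratio2": [14.119, 8.9279, 7.9122, np.nan],
    "Ratio3": [14.488, 16.911, 9.5966, 12.073],
})

# Hypothetical spike amounts keyed by ID (made-up values).
spike = {1: 2.0, 2: 4.0, 3: 8.0}

# Constant part of the formula from the question.
SCALE = 1e-12 * 6.022e23 / 1e6

# Series.map looks up every ID in the dict and returns a Series
# aligned with df; IDs missing from the dict would become NaN.
spike_per_row = df["ID"].map(spike)

# Same calculation for each ratio column, done on whole columns.
for i in (1, 2, 3):
    df[f"Value{i}"] = spike_per_row / df[f"Ratio{i}"] * SCALE

# Median of the three resulting values for each row (NaN is skipped).
df["ValueMedian"] = df[["Value1", "Value2", "Value3"]].median(axis=1)

print(df[["ID", "Value1", "Value2", "Value3", "ValueMedian"]])
```

For a few hundred rows either approach is fast enough; the column-wise version mainly avoids writing one function per ratio.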

