Pandas groupby nunique output to list

Question

I have as input a dataset like the following:

labels = ['chrom', 'start', 'end', 'read']
my_data = [['chr1', 784344, 800125, 'read1'],
           ['chr1', 784344, 800124, 'read2'],
           ['chr1', 784344, 800124, 'read3']]

Which I convert to a pandas DataFrame using:

import pandas as pd
my_data_pd = pd.DataFrame.from_records(my_data, columns=labels)

and that looks like this:

  chrom   start     end   read
0  chr1  784344  800125  read1
1  chr1  784344  800124  read2
2  chr1  784344  800124  read3

What I want to do is the following: I want to merge the rows that have identical chrom, start, end values, and count the number of distinct values in the 'read' column among the rows that were merged. Finally, I want to convert that output to a list of tuples, as in this example (note that the last column holds the count):

[('chr1', 784344, 800125,1), ('chr1', 784344, 800124,2)]

What I have been able to do:

Using pandas groupby and nunique() with the command:

my_data_pd.groupby(['chrom','start','end'],sort=False).read.nunique()

I arrive at a pandas Series object (stored as sortedd below) that looks like what I want:

chrom  start   end   
chr1   784344  800125    1
               800124    2
Name: read, dtype: int64

However, when I convert it to a list of tuples using:

 sortedd.index.tolist()

the last column gets excluded, leading to the resulting output:

[('chr1', 784344, 800125), ('chr1', 784344, 800124)]

Any idea how I can get around this problem?

For anyone who might come up with a solution: I am doing this inside a big program thousands of times, so speed is a big issue. That's the reason I am avoiding other tools like BedTools and pybedtools.

Thanks!

Answer 1

You can use set_index to append the counts to the index:

sortedd.to_frame('val').set_index('val',append=True).index.tolist()
Out[277]: [('chr1', 784344, 800125, 1), ('chr1', 784344, 800124, 2)]
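For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the question's data and applies the set_index approach. The name sortedd is taken from the question; the name result is only for illustration, and the groupby column is selected with ['read'] instead of attribute access, which should be equivalent here.

```python
import pandas as pd

labels = ['chrom', 'start', 'end', 'read']
my_data = [['chr1', 784344, 800125, 'read1'],
           ['chr1', 784344, 800124, 'read2'],
           ['chr1', 784344, 800124, 'read3']]
my_data_pd = pd.DataFrame.from_records(my_data, columns=labels)

# Count distinct reads per (chrom, start, end) group.
sortedd = my_data_pd.groupby(['chrom', 'start', 'end'], sort=False)['read'].nunique()

# Turn the counts into a column named 'val', append that column to the
# existing three-level index, then list the resulting four-element tuples.
result = sortedd.to_frame('val').set_index('val', append=True).index.tolist()
print(result)
```

The idea is that index.tolist() only ever sees the index, so the counts have to become an index level before they can show up in the tuples.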

Answer 2

First call reset_index, then convert the rows to tuples in a list comprehension:

L = [tuple(x) for x in sortedd.reset_index().values.tolist()]
print (L)
[('chr1', 784344, 800125, 1), ('chr1', 784344, 800124, 2)]
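Two variations on the same idea may be worth trying, sketched below under the assumption that sortedd is the Series from the question (it is rebuilt here so the snippet runs on its own). The first uses DataFrame.itertuples with index=False and name=None, which yields plain tuples directly; the second skips the intermediate DataFrame and appends each count to its index key via Series.items(). The names via_itertuples and via_items are only for illustration, and no claim is made about which is faster.

```python
import pandas as pd

labels = ['chrom', 'start', 'end', 'read']
my_data = [['chr1', 784344, 800125, 'read1'],
           ['chr1', 784344, 800124, 'read2'],
           ['chr1', 784344, 800124, 'read3']]
my_data_pd = pd.DataFrame.from_records(my_data, columns=labels)
sortedd = my_data_pd.groupby(['chrom', 'start', 'end'], sort=False)['read'].nunique()

# Variation 1: reset_index, then let itertuples build plain tuples per row.
via_itertuples = list(sortedd.reset_index().itertuples(index=False, name=None))

# Variation 2: each key of a MultiIndex-ed Series is already a tuple,
# so the count can simply be appended to it.
via_items = [key + (count,) for key, count in sortedd.items()]

print(via_itertuples)
print(via_items)
```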

Answer 3

You can build a MultiIndex from the reset frame, i.e.:

idx = pd.MultiIndex.from_arrays(sortedd.reset_index().values.T)

idx.tolist()
[('chr1', 784344, 800125, 1), ('chr1', 784344, 800124, 2)]
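Since speed matters in the question, here is a rough timing sketch for comparing the three answers on your own data. The function names, the candidates dict and the repetition count of 200 are all arbitrary choices for illustration; timings on a three-row example say little, so sortedd should be swapped for a realistically sized Series before drawing conclusions.

```python
import timeit
import pandas as pd

labels = ['chrom', 'start', 'end', 'read']
my_data = [['chr1', 784344, 800125, 'read1'],
           ['chr1', 784344, 800124, 'read2'],
           ['chr1', 784344, 800124, 'read3']]
my_data_pd = pd.DataFrame.from_records(my_data, columns=labels)
sortedd = my_data_pd.groupby(['chrom', 'start', 'end'], sort=False)['read'].nunique()

def via_set_index(s):
    # Answer 1: append the counts as an extra index level.
    return s.to_frame('val').set_index('val', append=True).index.tolist()

def via_reset_index(s):
    # Answer 2: flatten to a DataFrame and convert each row to a tuple.
    return [tuple(x) for x in s.reset_index().values.tolist()]

def via_multiindex(s):
    # Answer 3: build a MultiIndex from the transposed values.
    return pd.MultiIndex.from_arrays(s.reset_index().values.T).tolist()

candidates = {'set_index': via_set_index,
              'reset_index': via_reset_index,
              'multiindex': via_multiindex}

# Run each approach once to collect its output for comparison.
results = {name: fn(sortedd) for name, fn in candidates.items()}

for name, fn in candidates.items():
    seconds = timeit.timeit(lambda: fn(sortedd), number=200)
    print(f'{name}: {seconds:.4f} s for 200 runs')
```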

3 answers

Answer 1
3 2018-01-31 15:22:55

Answer 2
3 Accepted 2018-01-31 15:23:10

Answer 3
3 2018-01-31 15:24:33
