How to group rows and sum the values in one column in Python
I have a tab-separated file like this small example:
chr5 112312630 112312650 31 chr5 112312630 112321662 DCP2 ENST00000543319.1
chr5 137676883 137676900 123 chr5 137676883 137676949 FAM53C ENST00000434981.2
chr5 137676900 137676949 42 chr5 137676883 137676949 FAM53C ENST00000434981.2
chr5 139944400 139944450 92 chr5 139944064 139946344 SLC35A4 ENST00000323146.3
chr5 139945450 139945500 77 chr5 139944064 139946344 SLC35A4 ENST00000323146.3
I want to group the lines based on the 5th, 6th and 7th columns and sum the values of the 4th column in each group. Here is the expected output:
chr5 112312630 112312650 31 chr5 112312630 112321662 DCP2 ENST00000543319.1
chr5 137676900 137676949 165 chr5 137676883 137676949 FAM53C ENST00000434981.2
chr5 139944400 139944450 169 chr5 139944064 139946344 SLC35A4 ENST00000323146.3
I am trying to do that in Python using the following code, but it does not work. Do you know how to fix it?
import pandas as pd
df = pd.read_csv('myfile.txt', sep='\t', header=None)
df = df.groupby(5, 6, 7, 8).sum()
You just need to pass the grouping columns as a list:
df.groupby([5,6,7,8]).sum()
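A minimal sketch of that fix, assuming the five sample rows from the question are inlined in a list (`lines` and `raw` are stand-ins for `myfile.txt`). Selecting column 3 before calling `.sum()` restricts the sum to the value column; without that selection pandas also aggregates the remaining columns, and what happens to the string columns depends on the pandas version.

```python
import io
import pandas as pd

# Stand-in for myfile.txt: the five sample rows from the question
lines = [
    "chr5 112312630 112312650 31 chr5 112312630 112321662 DCP2 ENST00000543319.1",
    "chr5 137676883 137676900 123 chr5 137676883 137676949 FAM53C ENST00000434981.2",
    "chr5 137676900 137676949 42 chr5 137676883 137676949 FAM53C ENST00000434981.2",
    "chr5 139944400 139944450 92 chr5 139944064 139946344 SLC35A4 ENST00000323146.3",
    "chr5 139945450 139945500 77 chr5 139944064 139946344 SLC35A4 ENST00000323146.3",
]
# Re-join the fields with tabs so the text matches a tab-separated file
raw = "\n".join("\t".join(line.split()) for line in lines)
df = pd.read_csv(io.StringIO(raw), sep="\t", header=None)

# The grouping keys go in ONE list; in groupby(5, 6, 7, 8) only 5 is
# treated as a key and the other numbers land in unrelated parameters.
totals = df.groupby([5, 6, 7, 8])[3].sum()
print(totals)
```

The result is indexed by the four grouping columns, with one summed value per group.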
You need to aggregate with DataFrameGroupBy.agg, passing a dictionary that maps each column to an aggregation function. Here every column that is not in cols is aggregated by last or first, and only column 3 (the 4th column) is aggregated by sum:
cols = [5, 6, 7, 8]
d = dict.fromkeys(df.columns.difference(cols), 'last')
d[3] = 'sum'
print (d)
{0: 'last', 1: 'last', 2: 'last', 3: 'sum', 4: 'last'}
df_last = df.groupby([5, 6, 7, 8], as_index=False).agg(d).reindex(columns=df.columns)
print (df_last)
0 1 2 3 4 5 6 7 \
0 chr5 112312630 112312650 31 chr5 112312630 112321662 DCP2
1 chr5 137676900 137676949 165 chr5 137676883 137676949 FAM53C
2 chr5 139945450 139945500 169 chr5 139944064 139946344 SLC35A4
8
0 ENST00000543319.1
1 ENST00000434981.2
2 ENST00000323146.3
cols = [5, 6, 7, 8]
d = dict.fromkeys(df.columns.difference(cols), 'first')
d[3] = 'sum'
print (d)
{0: 'first', 1: 'first', 2: 'first', 3: 'sum', 4: 'first'}
df_first = df.groupby([5, 6, 7, 8], as_index=False).agg(d).reindex(columns=df.columns)
print (df_first)
0 1 2 3 4 5 6 7 \
0 chr5 112312630 112312650 31 chr5 112312630 112321662 DCP2
1 chr5 137676883 137676900 165 chr5 137676883 137676949 FAM53C
2 chr5 139944400 139944450 169 chr5 139944064 139946344 SLC35A4
8
0 ENST00000543319.1
1 ENST00000434981.2
2 ENST00000323146.3
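The two variants can be compared side by side on the sample rows. This is only a sketch: the rows are typed in as a list instead of being read from `myfile.txt`, and the helper name `aggregate` is hypothetical, not part of pandas.

```python
import pandas as pd

# The five sample rows from the question, entered by hand
rows = [
    ["chr5", 112312630, 112312650, 31, "chr5", 112312630, 112321662, "DCP2", "ENST00000543319.1"],
    ["chr5", 137676883, 137676900, 123, "chr5", 137676883, 137676949, "FAM53C", "ENST00000434981.2"],
    ["chr5", 137676900, 137676949, 42, "chr5", 137676883, 137676949, "FAM53C", "ENST00000434981.2"],
    ["chr5", 139944400, 139944450, 92, "chr5", 139944064, 139946344, "SLC35A4", "ENST00000323146.3"],
    ["chr5", 139945450, 139945500, 77, "chr5", 139944064, 139946344, "SLC35A4", "ENST00000323146.3"],
]
df = pd.DataFrame(rows)  # columns are labelled 0..8, as with header=None


def aggregate(frame, how):
    """Group by columns 5-8, sum column 3, keep `how` ('first'/'last') elsewhere."""
    keys = [5, 6, 7, 8]
    spec = dict.fromkeys(frame.columns.difference(keys), how)
    spec[3] = "sum"
    grouped = frame.groupby(keys, as_index=False).agg(spec)
    # Restore the original left-to-right column order
    return grouped.reindex(columns=frame.columns)


first = aggregate(df, "first")
last = aggregate(df, "last")
print(first[[1, 2, 3]])
print(last[[1, 2, 3]])
```

Column 3 holds the same sums in both results; only the start and end coordinates in columns 1 and 2 differ, depending on which row of each group is kept.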
Try this:
df.groupby(['column'])[['another column']].sum()
It groups by column and sums another column within each group. I used [] so that you can see how to group by multiple columns, like this:
df.groupby(['column1', 'column2'])
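As a small illustration, here is the same pattern on a cut-down version of the question's rows. The column names `gene`, `transcript` and `count` are placeholders chosen for this sketch, not names from the original file.

```python
import pandas as pd

df = pd.DataFrame({
    "gene": ["FAM53C", "FAM53C", "DCP2"],
    "transcript": ["ENST00000434981.2", "ENST00000434981.2", "ENST00000543319.1"],
    "count": [123, 42, 31],
})

# Double brackets around "count" keep the result a DataFrame;
# single brackets would return a Series instead.
summed = df.groupby(["gene", "transcript"])[["count"]].sum()
print(summed)
```

Groups are sorted by key by default, so the DCP2 row comes before the FAM53C row in the result.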
Input DataFrame, considering only the first 3 rows:
import pandas as pd

data = {'col1': ['chr5', 'chr5', 'chr5'],
'col2': [112312630,137676883,137676900],
'col3': [112312650,137676900,137676949],
'col4': [31, 123,42],
'col5': ['chr5', 'chr5', 'chr5'],
'col6': [112312630 ,137676883 ,137676883 ],
'col7': [112321662, 137676949, 137676949],
'col8': ['DCP2', 'FAM53C', 'FAM53C'],
'col9': ['ENST00000543319.1', 'ENST00000434981.2', 'ENST00000434981.2']
}
df = pd.DataFrame(data = data)
df
Then do this:
cols = ['col5', 'col6', 'col7', 'col8']
col_sum = df.groupby(cols)['col4'].sum()
col_sum
Output: this is a Series with a multi-level index. The last column is your result:
col5 col6 col7 col8
chr5 112312630 112321662 DCP2 31
137676883 137676949 FAM53C 165
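If a flat table is easier to work with, the grouping keys can be turned back into ordinary columns. A sketch, assuming the same three rows as above but keeping only the columns that the grouping touches:

```python
import pandas as pd

# Only col4..col8 are needed for this illustration
data = {
    'col4': [31, 123, 42],
    'col5': ['chr5', 'chr5', 'chr5'],
    'col6': [112312630, 137676883, 137676883],
    'col7': [112321662, 137676949, 137676949],
    'col8': ['DCP2', 'FAM53C', 'FAM53C'],
}
df = pd.DataFrame(data)
cols = ['col5', 'col6', 'col7', 'col8']

# Option 1: sum first, then move the index levels back into columns
flat = df.groupby(cols)['col4'].sum().reset_index()

# Option 2: ask groupby not to build the multi-level index at all
same = df.groupby(cols, as_index=False)['col4'].sum()

print(flat)
```

Both options give one row per group, with the four key columns followed by the summed `col4`.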