Grouping and averaging rows by a specific column in Python

Question

I have a large tab-separated file that looks like this:

chr1    9507728 9517729 0   chr1    9507728 9517729 5S_rRNA
chr1    9537731 9544392 0   chr1    9537731 9547732 5S_rRNA
chr1    9497727 9507728 0   chr1    9497727 9507728 5S_rRNA
chr1    9517729 9527730 0   chr1    9517729 9527730 5S_rRNA
chr8    1118560 1118591 1   chr8    1112435 1122474 AK128400
chr8    1118591 1121351 0   chr8    1112435 1122474 AK128400
chr8    1121351 1121382 1   chr8    1112435 1122474 AK128400
chr8    1132513 1142552 0   chr8    1132513 1142552 AK128400
chr19   53436277    53446295    0   chr19   53436277    53446295    AK128361
chr19   53456313    53465410    0   chr19   53456313    53466331    AK128361
chr19   53465410    53465441    1   chr19   53456313    53466331    AK128361
chr19   53466331    53476349    0   chr19   53466331    53476349    AK128361

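Based on the last column, there are 3 groups, each with 4 rows. Using the values in the fourth column, I want the average of the first rows of all groups, of the second rows of all groups, of the third rows of all groups, and of the fourth rows of all groups. The expected output therefore has 4 rows (because each group has 4 rows) and 2 columns. The first column is an ID, which in this example would be 1, 2, 3 and 4. The second column is the average, computed as described above.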

Expected output:

I am trying to do this in Python 2.7 with the following code:

file = open('myfile.txt', 'r')
average = []
for i in file:
    ave = i[3]/3
    average.append(ave)

This returns only a single, wrong number. Do you know how to fix it so that I get the expected output?

Answer 1

Here is one way to do it:

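with open("myfile.txt") as inFile:
    # Split on any whitespace, so tabs and runs of spaces are both handled
    rows = [line.split() for line in inFile]

# 4 rows per group; the groups start at rows 0, 4 and 8
for i in range(4):
    s = 0
    for j in range(0, 9, 4):
        s += int(rows[i + j][3])
    avg = s / 3.0  # float division, so Python 2.7 does not truncate the result
    print("%d   %.2f" % (i + 1, avg))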

Output:

Alternatively, you can use a list comprehension:

with open("myfile.txt") as inFile:
    rows = [line.split() for line in inFile]

# For each position i, sum the fourth column over rows i, i + 4 and i + 8
s = [sum(int(rows[i + j][3]) for j in range(0, 9, 4)) for i in range(4)]
avg = [elem / 3.0 for elem in s]  # float division, so Python 2.7 does not truncate
for i, value in enumerate(avg):
    print("%d   %.2f" % (i + 1, value))

Keep in mind that both snippets above were tested against the exact data format you provided in the question.
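If the number of groups or rows per group is not known in advance, the same idea can be written without the hard-coded 3 and 4. The following is only a sketch, assuming the rows of each group are contiguous and all groups have the same length (`zip` would otherwise stop at the shortest group). The helper name `positional_averages` and the inline `sample` string are illustrative and not part of the original answer.

```python
from itertools import groupby


def positional_averages(lines):
    """Average the 4th column across groups, position by position.

    Assumes the rows of a group are contiguous and that groups are
    identified by the 8th (last) column.
    """
    rows = [line.split() for line in lines if line.strip()]
    # One list of 4th-column values per run of rows sharing the last column.
    groups = [[int(row[3]) for row in run]
              for _, run in groupby(rows, key=lambda row: row[7])]
    # zip(*groups) pairs up the 1st rows, the 2nd rows, and so on.
    return [sum(column) / float(len(column)) for column in zip(*groups)]


# The sample data from the question (whitespace-separated here).
sample = """\
chr1    9507728 9517729 0   chr1    9507728 9517729 5S_rRNA
chr1    9537731 9544392 0   chr1    9537731 9547732 5S_rRNA
chr1    9497727 9507728 0   chr1    9497727 9507728 5S_rRNA
chr1    9517729 9527730 0   chr1    9517729 9527730 5S_rRNA
chr8    1118560 1118591 1   chr8    1112435 1122474 AK128400
chr8    1118591 1121351 0   chr8    1112435 1122474 AK128400
chr8    1121351 1121382 1   chr8    1112435 1122474 AK128400
chr8    1132513 1142552 0   chr8    1132513 1142552 AK128400
chr19   53436277    53446295    0   chr19   53436277    53446295    AK128361
chr19   53456313    53465410    0   chr19   53456313    53466331    AK128361
chr19   53465410    53465441    1   chr19   53456313    53466331    AK128361
chr19   53466331    53476349    0   chr19   53466331    53476349    AK128361
"""

averages = positional_averages(sample.splitlines())
for position, avg in enumerate(averages, start=1):
    print("%d\t%.2f" % (position, avg))
```

To run it against a real file, pass the open file object instead of `sample.splitlines()`.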

Answer 2

This is very simple if you read the data into a pandas.DataFrame.

import pandas as pd
# name the columns, makes the rest of the code easier to understand
bed_columns = ['chrA','startA','endA','the_value','chrB','startB','endB','group_name']

# read in the file
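# sep=None lets pandas detect the delimiter; this needs the Python parsing engine
df = pd.read_csv('myfile.txt', sep=None, engine='python', header=None, names=bed_columns)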

# incrementing count within each group:
df['position_in_group'] = df.groupby(['group_name']).cumcount()

# average value for each count
desired_output = df.groupby(['position_in_group'])['the_value'].mean()
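To make the `cumcount` step concrete, here is a minimal plain-Python sketch of the numbering it produces: each row gets the count of earlier rows in the same group, starting from 0. The helper name `position_in_group` is hypothetical and not part of pandas.

```python
from collections import defaultdict


def position_in_group(group_names):
    """Return, for each row, how many earlier rows share its group name."""
    seen = defaultdict(int)
    positions = []
    for name in group_names:
        positions.append(seen[name])  # 0-based, like cumcount
        seen[name] += 1
    return positions


# Group names in the order they appear in the question's sample data.
names = ["5S_rRNA"] * 4 + ["AK128400"] * 4 + ["AK128361"] * 4
print(position_in_group(names))
# prints [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]
```

Because the numbering is 0-based, the positions run from 0 to 3; add 1 to them if the IDs should read 1 to 4 as in the question.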

Answer 3

A solution that does not rely on a fixed number of rows or on the last record:

final_dict = {}  # position within group -> running 'sum' and 'count'
count_dict = {}  # group name -> number of rows seen so far
with open("input_file.txt", 'r') as fh:
    for line in fh:
        data = line.split()
        code = data[7]
        count_dict[code] = count_dict.get(code, 0) + 1
        position = count_dict[code]
        totals = final_dict.setdefault(position, {'sum': 0, 'count': 0})
        totals['sum'] += int(data[3])
        totals['count'] += 1

# Sort so the positions print in order on any Python version
for key, value in sorted(final_dict.items()):
    avg = float(value['sum']) / value['count']  # avoid integer division on Python 2.7
    print("{} {:f}".format(key, avg))

Output:

3 answers

Answer 1  0  2018-11-29 13:14:13
Answer 2  0  2018-11-29 13:26:51
Answer 3  0  2018-11-29 13:29:06
