Grouping and averaging rows based on specific columns in Python
I have a large tab-separated file that looks like this:
chr1 9507728 9517729 0 chr1 9507728 9517729 5S_rRNA
chr1 9537731 9544392 0 chr1 9537731 9547732 5S_rRNA
chr1 9497727 9507728 0 chr1 9497727 9507728 5S_rRNA
chr1 9517729 9527730 0 chr1 9517729 9527730 5S_rRNA
chr8 1118560 1118591 1 chr8 1112435 1122474 AK128400
chr8 1118591 1121351 0 chr8 1112435 1122474 AK128400
chr8 1121351 1121382 1 chr8 1112435 1122474 AK128400
chr8 1132513 1142552 0 chr8 1132513 1142552 AK128400
chr19 53436277 53446295 0 chr19 53436277 53446295 AK128361
chr19 53456313 53465410 0 chr19 53456313 53466331 AK128361
chr19 53465410 53465441 1 chr19 53456313 53466331 AK128361
chr19 53466331 53476349 0 chr19 53466331 53476349 AK128361
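Based on the last column there are 3 groups, and each group has 4 rows. Using the values in the fourth column, I want the average of the first rows of all groups, of the second rows of all groups, of the third rows, and of the fourth rows. The expected output therefore has 4 rows (because each group has 4 rows) and 2 columns. The first column is an ID, which in this example is 1, 2, 3 and 4. The second column is the average computed as described above.
Expected output: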
1 0.33
2 0
3 0.66
4 0
I am trying to do this in Python 2.7 with the following code:
file = open('myfile.txt', 'r')
average = []
for i in file:
    ave = i[3]/3
    average.append(ave)
This only returns a single wrong number. Do you know how to fix it to get the expected output?
Here is one way to do it:
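with open("myfile.txt") as inFile:
    # normalize whitespace so every line can be split the same way
    lines = [" ".join(line.split()) for line in inFile]
s = 0
for i in range(4):               # position within a group (4 rows per group)
    for j in range(0, 9, 4):     # first row of each of the 3 groups: 0, 4, 8
        s += int(lines[i + j].split()[3])
    avg = s / 3.0                # 3.0 avoids integer division in Python 2.7
    print("%d %.2f" % (i + 1, avg))
    s = 0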
Output:
1 0.33
2 0.00
3 0.67
4 0.00
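As a side note on why the loop in the question does not work: iterating over a file yields each line as a string, so `i[3]` is the fourth character of the line, not the fourth column. The small illustration below uses one row from the sample data, written with tabs as the question describes; the variable names are only for this example.

```python
# One row from the sample data, assumed to be tab-separated.
line = "chr8\t1118560\t1118591\t1\tchr8\t1112435\t1122474\tAK128400\n"

fourth_char = line[3]            # character at index 3 of the raw string
fourth_field = line.split()[3]   # fourth whitespace-separated column

print(repr(fourth_char))    # '8' (the last character of "chr8")
print(repr(fourth_field))   # '1' (still a string, convert with int())
```

Splitting the line first and converting the field with `int()` is what the snippets in this answer do.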
Alternatively, you can use list comprehensions:
with open("myfile.txt") as inFile:
    lines = [" ".join(line.split()) for line in inFile]
# sum the fourth column over rows i, i+4 and i+8 for each position i
s = [sum([int(lines[i + j].split()[3]) for j in range(0, 9, 4)]) for i in range(4)]
avg = [elem / 3.0 for elem in s]  # 3.0 avoids integer division in Python 2.7
for i, value in enumerate(avg):
    print("%d %.2f" % (i + 1, value))
Keep in mind that both snippets above were tested only against the exact data layout you provided in the question.
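Both snippets hard-code 3 groups of 4 rows. If the counts may differ, a possible variant is sketched below with `itertools.groupby` and `zip`. It still assumes that the rows of each group are contiguous and that all groups have the same number of rows; the function name `positional_means` is made up for this example.

```python
from itertools import groupby

def positional_means(lines):
    """Average the fourth column across groups, position by position.

    Assumes rows of a group are contiguous and groups are equally long;
    zip() would silently truncate to the shortest group otherwise.
    """
    rows = [line.split() for line in lines if line.strip()]
    # Fourth-column values of each contiguous group, keyed by the last column.
    groups = [[int(r[3]) for r in grp]
              for _, grp in groupby(rows, key=lambda r: r[-1])]
    # zip(*groups) pairs up the 1st rows, then the 2nd rows, and so on.
    return [sum(vals) / float(len(vals)) for vals in zip(*groups)]

sample = """\
chr1 9507728 9517729 0 chr1 9507728 9517729 5S_rRNA
chr1 9537731 9544392 0 chr1 9537731 9547732 5S_rRNA
chr1 9497727 9507728 0 chr1 9497727 9507728 5S_rRNA
chr1 9517729 9527730 0 chr1 9517729 9527730 5S_rRNA
chr8 1118560 1118591 1 chr8 1112435 1122474 AK128400
chr8 1118591 1121351 0 chr8 1112435 1122474 AK128400
chr8 1121351 1121382 1 chr8 1112435 1122474 AK128400
chr8 1132513 1142552 0 chr8 1132513 1142552 AK128400
chr19 53436277 53446295 0 chr19 53436277 53446295 AK128361
chr19 53456313 53465410 0 chr19 53456313 53466331 AK128361
chr19 53465410 53465441 1 chr19 53456313 53466331 AK128361
chr19 53466331 53476349 0 chr19 53466331 53476349 AK128361
"""

means = positional_means(sample.splitlines())
for i, m in enumerate(means, 1):
    print("%d %.2f" % (i, m))
```

With a real file, `positional_means(open("myfile.txt"))` should work the same way, since the function only iterates over lines.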
If you read the data into a pandas.DataFrame, this becomes very simple.
import pandas as pd
# name the columns, makes the rest of the code easier to understand
bed_columns = ['chrA','startA','endA','the_value','chrB','startB','endB','group_name']
# read in the file
# sep=None lets pandas detect the delimiter; this needs the python engine
df = pd.read_csv('myfile.txt',sep=None,header=None,names=bed_columns,engine='python')
# incrementing count within each group:
df['position_in_group'] = df.groupby(['group_name']).cumcount()
# average value for each count
desired_output = df.groupby(['position_in_group'])['the_value'].mean()
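To check the result without a file on disk, the same steps can be run on the sample rows through `io.StringIO`. This is only a sketch: the rows are pasted with spaces, so it uses `sep=r"\s+"` instead of delimiter detection, and it adds 1 to `cumcount()` so the IDs start at 1 as in the expected output.

```python
import io
import pandas as pd

bed_columns = ['chrA', 'startA', 'endA', 'the_value',
               'chrB', 'startB', 'endB', 'group_name']

sample = """\
chr1 9507728 9517729 0 chr1 9507728 9517729 5S_rRNA
chr1 9537731 9544392 0 chr1 9537731 9547732 5S_rRNA
chr1 9497727 9507728 0 chr1 9497727 9507728 5S_rRNA
chr1 9517729 9527730 0 chr1 9517729 9527730 5S_rRNA
chr8 1118560 1118591 1 chr8 1112435 1122474 AK128400
chr8 1118591 1121351 0 chr8 1112435 1122474 AK128400
chr8 1121351 1121382 1 chr8 1112435 1122474 AK128400
chr8 1132513 1142552 0 chr8 1132513 1142552 AK128400
chr19 53436277 53446295 0 chr19 53436277 53446295 AK128361
chr19 53456313 53465410 0 chr19 53456313 53466331 AK128361
chr19 53465410 53465441 1 chr19 53456313 53466331 AK128361
chr19 53466331 53476349 0 chr19 53466331 53476349 AK128361
"""

# Whitespace-separated here because the sample is pasted with spaces.
df = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None, names=bed_columns)

# 1-based position of each row within its group, in file order.
df['position_in_group'] = df.groupby('group_name').cumcount() + 1

# Mean of the fourth column for each within-group position.
result = df.groupby('position_in_group')['the_value'].mean()

for position, mean in result.items():
    print("%d %.2f" % (position, mean))
```

Because `cumcount()` numbers rows per group in the order they appear, this does not require the rows of a group to be contiguous.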
A solution that does not rely on a fixed number of rows or on fixed labels in the last column:
final_dict = {}
count_dict = {}
with open("input_file.txt", 'r') as fh:
    for line in fh:
        data = line.rstrip('\n').split()
        code = data[7]
        # 1-based position of this row within its group
        count_dict[code] = count_dict.get(code, 0) + 1
        position = count_dict[code]
        final_dict[position] = final_dict.get(position, {})
        final_dict[position]['sum'] = final_dict[position].get('sum', 0) + int(data[3])
        final_dict[position]['count'] = final_dict[position].get('count', 0) + 1
# sorted() prints positions in order; float() avoids integer division in Python 2.7
for key, value in sorted(final_dict.items()):
    avg = float(value['sum']) / value['count']
    print("{} {:f}".format(key, avg))
Output:
1 0.333333
2 0.000000
3 0.666667
4 0.000000
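To illustrate what "no fixed number of rows" means in practice, here is a sketch of the same running-count idea written with `collections.defaultdict` and applied to made-up rows where the groups have different sizes and are interleaved. The function name, its parameters and the input rows are hypothetical and not part of the original data.

```python
from collections import defaultdict

def running_positional_means(lines, value_col=3, group_col=7):
    """Mean of value_col per within-group position, for any group sizes."""
    seen = defaultdict(int)     # rows seen so far for each group label
    totals = defaultdict(int)   # sum of values at each position
    counts = defaultdict(int)   # number of rows at each position
    for line in lines:
        fields = line.split()
        if not fields:          # skip blank lines
            continue
        seen[fields[group_col]] += 1
        pos = seen[fields[group_col]]
        totals[pos] += int(fields[value_col])
        counts[pos] += 1
    return {pos: totals[pos] / float(counts[pos]) for pos in sorted(totals)}

# Hypothetical input: group A has three rows, group B only two.
rows = [
    "chr1 10 20 1 chr1 10 20 A",
    "chr2 10 20 0 chr2 10 20 B",
    "chr1 20 30 1 chr1 20 30 A",
    "chr2 20 30 0 chr2 20 30 B",
    "chr1 30 40 1 chr1 30 40 A",
]

print(running_positional_means(rows))
```

Position 3 exists only in group A, so its mean is taken over a single row. Whether that is the desired behaviour for uneven groups depends on the data.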