Python: Get average values from multiple columns in multiple files
I am trying to write a program that takes one or more files as input and summarizes the average values from two columns in each file.
For example, I have 2 files:
File1:
ID Feature Total Percent
1.2 ABC 300 75
1.4 CDE 129 68
File2:
ID Feature Total Percent
1.2 ABC 289 34
1.4 CDE 56 94
I want to iterate over each file and convert the percent to a number using:
def ReadFile(File):
    LineCount = 0
    f = open(File)
    Header = f.readline()
    Lines = f.readlines()
    for Line in Lines:
        Info = Line.strip("\n").split("\t")
        ID, Feature, Total, Percent = Info[0], Info[1], int(Info[2]), int(Info[3])
        Num = (Percent/100.0)*Total
I'm not sure what the best way is to store the output so that I have the ID, Feature, Total and Percent for each file. Ultimately, I would like to create an outfile that contains the average percent over all files. In the above example I would get:
ID Feature AveragePercent
1.2 ABC 54.9 # 100 * ((75/100.0)*300 + (34/100.0)*289) / (300+289)
1.4 CDE 75.9 # 100 * ((68/100.0)*129 + (94/100.0)*56) / (129+56)
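The expected values above are a weighted average: each file's Percent is weighted by its Total. A quick check of that arithmetic, as a sketch only (the helper name `weighted_percent` is made up for illustration and is not part of the original code):

```python
def weighted_percent(rows):
    """Return the Total-weighted average percent for (total, percent) pairs."""
    # Convert each percent back to an absolute count, then divide by the
    # combined total and scale the ratio back to a percentage.
    absolute = sum(total * percent / 100.0 for total, percent in rows)
    combined_total = sum(total for total, _ in rows)
    return 100.0 * absolute / combined_total


# (Total, Percent) pairs taken from File1 and File2 in the example.
print(round(weighted_percent([(300, 75), (289, 34)]), 1))  # ID 1.2, ABC -> 54.9
print(round(weighted_percent([(129, 68), (56, 94)]), 1))   # ID 1.4, CDE -> 75.9
```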
The pandas module is the way to go. Assuming that your files are named '1.txt' and '2.txt', the following code will store all your input, output, and intermediate computations in the pandas DataFrame instance df. Additionally, the information of interest will be written to the file 'out.txt'.
import pandas as pd

file_names = ['1.txt', '2.txt']
df = None
for f_name in file_names:
    df_tmp = pd.read_csv(f_name, sep='\t')
    df = df_tmp if df is None else pd.concat([df, df_tmp])
# Percent * Total is 100 times the absolute count, so the ratio computed
# below is already expressed in percent.
df['Absolute'] = df['Percent'] * df['Total']
df['Sum_Total'] = df.groupby('Feature')['Total'].transform('sum')
df['Sum_Absolute'] = df.groupby('Feature')['Absolute'].transform('sum')
df['AveragePercent'] = df['Sum_Absolute'] / df['Sum_Total']
df_out = df[['ID', 'Feature', 'AveragePercent']].drop_duplicates()
df_out.to_csv('out.txt', sep='\t', index=False)
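The same computation can also be written as a single groupby aggregation, which avoids the helper columns and the drop_duplicates step. This is a sketch under assumptions: the example data is read from in-memory strings via `io.StringIO` so it runs without files, it groups on both ID and Feature, and the function name `average_percent` is invented here.

```python
import io

import pandas as pd


def average_percent(sources):
    """Combine tab-separated tables and return the Total-weighted percent."""
    # `sources` may hold file paths or file-like objects; read_csv takes both.
    df = pd.concat([pd.read_csv(src, sep="\t") for src in sources],
                   ignore_index=True)
    # Percent * Total is 100x the absolute count, so the ratio of the two
    # sums below is already expressed in percent.
    df["Absolute"] = df["Percent"] * df["Total"]
    sums = df.groupby(["ID", "Feature"], as_index=False)[["Absolute", "Total"]].sum()
    sums["AveragePercent"] = sums["Absolute"] / sums["Total"]
    return sums[["ID", "Feature", "AveragePercent"]]


# Example data from the question, kept in memory for the demonstration.
file1 = "ID\tFeature\tTotal\tPercent\n1.2\tABC\t300\t75\n1.4\tCDE\t129\t68\n"
file2 = "ID\tFeature\tTotal\tPercent\n1.2\tABC\t289\t34\n1.4\tCDE\t56\t94\n"
result = average_percent([io.StringIO(file1), io.StringIO(file2)])
print(result.round(1).to_string(index=False))
```

The returned frame can be written out with to_csv in the same way as df_out.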
A dictionary will be perfect for this. (I've left the header handling part for you.)
import fileinput
data = {}
for line in fileinput.input(['file1', 'file2']):
    idx, ft, values = line.split(None, 2)
    key = idx, ft  # use the (ID, Feature) tuple as the key
    tot, per = map(int, values.split())
    if key not in data:
        data[key] = {'num': 0, 'den': 0}
    data[key]['num'] += (per/100.0) * tot
    data[key]['den'] += tot
Now data contains:
{('1.2', 'ABC'): {'num': 323.26, 'den': 589},
('1.4', 'CDE'): {'num': 140.36, 'den': 185}}
Now we can loop over this dict and calculate the desired result:
for (idx, ft), v in data.items():
    print(idx, ft, round(v['num']/v['den']*100, 1))
Output:
1.2 ABC 54.9
1.4 CDE 75.9
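The answer above leaves header handling open. One possible way to fill that in, as a sketch rather than the answerer's own code, is the `isfirstline()` method of the `fileinput` object, which is true for the first line of every file. The function names `collect_sums` and `average_percent` and the temporary demo files are assumptions made for this example; it also assumes that no field contains whitespace.

```python
import fileinput
import os
import tempfile


def collect_sums(paths):
    """Accumulate weighted sums per (ID, Feature) over all given files."""
    data = {}
    with fileinput.input(files=paths) as lines:
        for line in lines:
            # Skip the header (first line of every file) and blank lines.
            if lines.isfirstline() or not line.strip():
                continue
            idx, ft, tot, per = line.split()
            tot, per = int(tot), int(per)
            entry = data.setdefault((idx, ft), {'num': 0.0, 'den': 0})
            entry['num'] += (per / 100.0) * tot
            entry['den'] += tot
    return data


def average_percent(data):
    """Map each (ID, Feature) key to its weighted average percent."""
    return {key: round(v['num'] / v['den'] * 100, 1) for key, v in data.items()}


# Demo: write the question's two example files to a temporary directory.
header = "ID\tFeature\tTotal\tPercent\n"
contents = {
    "file1": header + "1.2\tABC\t300\t75\n1.4\tCDE\t129\t68\n",
    "file2": header + "1.2\tABC\t289\t34\n1.4\tCDE\t56\t94\n",
}
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, text in contents.items():
        path = os.path.join(tmp, name)
        with open(path, "w") as fh:
            fh.write(text)
        paths.append(path)
    result = average_percent(collect_sums(paths))
print(result)
```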
You'll need to store some data across reading the files. Say you have a list of file paths in a variable called files:
data = {}
for filepath in files:
    with open(filepath, "r") as f:
        f.readline()  # skip the header line
        for line in f:
            info = line.strip().split("\t")
            id, feature, total, percent = info[0], info[1], int(info[2]), int(info[3])
            if id in data:
                # "total" holds the running absolute count, "count" the running Total
                data[id]["total"] += total * (percent / 100.0)
                data[id]["count"] += total
            else:
                data[id] = {"feature": feature, "total": total * (percent / 100.0), "count": total}
# Output
with open("outfile", "w") as out:
    out.write("ID\tFeature\tAveragePercentage\n")
    for id in data:
        entry = data[id]
        # scale the ratio back to a percentage, rounded to one decimal
        out.write(str(id) + "\t" + entry["feature"] + "\t" + str(round(100 * entry["total"] / entry["count"], 1)) + "\n")
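If the column order might vary between files, the standard csv module can read the columns by name instead of by position. The following is a sketch under assumptions, not a drop-in replacement: the function name `summarize`, the key on (ID, Feature), and the temporary demo files are all invented for this example.

```python
import csv
import os
import tempfile
from collections import defaultdict


def summarize(files, outfile):
    """Write the Total-weighted average percent per (ID, Feature) to outfile."""
    # Each value is [sum of absolute counts, sum of totals].
    sums = defaultdict(lambda: [0.0, 0])
    for filepath in files:
        with open(filepath, newline="") as f:
            # DictReader consumes the header and exposes columns by name.
            for row in csv.DictReader(f, delimiter="\t"):
                total, percent = int(row["Total"]), int(row["Percent"])
                entry = sums[(row["ID"], row["Feature"])]
                entry[0] += total * percent / 100.0
                entry[1] += total
    with open(outfile, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t", lineterminator="\n")
        writer.writerow(["ID", "Feature", "AveragePercent"])
        for (row_id, feature), (absolute, total) in sums.items():
            writer.writerow([row_id, feature, round(100.0 * absolute / total, 1)])


# Demo with the question's example data written to temporary files.
header = "ID\tFeature\tTotal\tPercent\n"
bodies = (("File1.txt", "1.2\tABC\t300\t75\n1.4\tCDE\t129\t68\n"),
          ("File2.txt", "1.2\tABC\t289\t34\n1.4\tCDE\t56\t94\n"))
with tempfile.TemporaryDirectory() as tmp:
    files = []
    for name, body in bodies:
        path = os.path.join(tmp, name)
        with open(path, "w") as f:
            f.write(header + body)
        files.append(path)
    outfile = os.path.join(tmp, "outfile")
    summarize(files, outfile)
    with open(outfile) as f:
        report = f.read()
print(report)
```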
I have tested this using files with ID, Feature, Total and Percent delimited by tabs (like your input file), and it works great, giving the output you want:
globalResultsFromReadDictionary = {}

def ReadFile(File):
    with open(File) as f:
        Header = f.readline()  # skip the header line
        Lines = f.readlines()
    for Line in Lines:
        Info = Line.strip("\n").split("\t")
        ID, Feature, Total, Percent = Info[0], Info[1], int(Info[2]), int(Info[3])
        # Adding to dictionary
        key = ID + "\t" + Feature
        if key in globalResultsFromReadDictionary:
            globalResultsFromReadDictionary[key].append([Total, Percent])
        else:
            globalResultsFromReadDictionary[key] = [[Total, Percent]]

def createFinalReport(File):
    overallReportFile = open(File, 'w')  # the file to write the report to
    overallReportFile.write('ID\tFeature\tAvg%\n')  # writing the header
    for idFeatureCombinationKey in globalResultsFromReadDictionary:
        # Tallying up the total and sum of percent*total for each element of the Id-Feature combination
        sumOfTotals = 0
        sumOfPortionOfTotals = 0
        for totalPercentCombination in globalResultsFromReadDictionary[idFeatureCombinationKey]:
            sumOfTotals += totalPercentCombination[0]
            # 100.0 forces float division, so this also works under Python 2
            sumOfPortionOfTotals += (totalPercentCombination[0]*(totalPercentCombination[1]/100.0))
        # Write to the line (idFeatureCombinationKey is 'ID \t Feature', so can just write that)
        overallReportFile.write(idFeatureCombinationKey + '\t' + str(round((sumOfPortionOfTotals/sumOfTotals)*100, 1)) + '\n')
    overallReportFile.close()

# Calling the functions
ReadFile('File1.txt')
ReadFile('File2.txt')
createFinalReport('dd.txt')