
Python - Calculate average for every column in a csv file

I'm new to Python and I'm trying to get the average of every column (or row) of a csv file, and then select the values that are higher than double the average of their column (or row). My file has hundreds of columns and contains float values like these:

845.123,452.234,653.23,...
432.123,213.452,421.532,...
743.234,532,432.423,...

I've tried several changes to my code to get the average for every column (separately), but at the moment my code looks like this:

def AverageColumn (c):
    f=open(csv,"r")
    average=0
    Sum=0
    column=len(f)
    for i in range(0,column):
        for n in i.split(','):
            n=float(n)
            Sum += n
        average = Sum / len(column)
    return 'The average is:', average

    f.close()


csv="MDT25.csv"
print AverageColumn(csv)

But I always get an error like "f has no len()" or "'int' object is not iterable"...

I'd really appreciate it if someone could show me how to get the average for every column (or row, whichever you prefer), and then select the values that are higher than double the average of their column (or row). I'd rather not import modules such as csv, but whatever you prefer is fine. Thanks!

Here's a cleaned-up version of your function, but it probably doesn't do what you want. Currently, it computes the average of all values in all columns:

def average_column(csv):
    f = open(csv, "r")
    total = 0.0
    value_count = 0
    for row in f:
        for column in row.split(','):
            total += float(column)
            value_count += 1
    f.close()
    # one average over every value in the file, not one per column
    average = total / value_count
    return 'The average is:', average

I would use the csv module (which makes csv parsing easier), with a Counter object to manage the column totals and a context manager to open the file (no need for a close()):

import csv
from collections import Counter

def average_column(csv_filepath):
    column_totals = Counter()
    row_count = 0
    # newline="" is what the csv module expects for files opened in Python 3
    with open(csv_filepath, newline="") as f:
        reader = csv.reader(f)
        # every row counts toward the average; skip a header row
        # before this loop if the file has one
        for row in reader:
            for column_idx, column_value in enumerate(row):
                try:
                    column_totals[column_idx] += float(column_value)
                except ValueError:
                    print("Error -- ({}) Column({}) could not be converted to float!".format(column_value, column_idx))
            row_count += 1

    # make sure column index keys are in order
    column_indexes = sorted(column_totals)

    # calculate per column averages using a list comprehension
    averages = [column_totals[idx] / row_count for idx in column_indexes]
    return averages
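The functions above stop at the averages, while the question also asks for the values that exceed twice their column's average. A minimal sketch of that second step, assuming the rows have already been parsed into equal-length lists of floats (the names `column_averages` and `values_above_double_average` and the sample data are made up for illustration):

```python
def column_averages(rows):
    """Return the average of each column for a list of equal-length rows."""
    return [sum(column) / len(column) for column in zip(*rows)]


def values_above_double_average(rows):
    """Return (row_index, column_index, value) for every value that is
    greater than twice the average of its column."""
    averages = column_averages(rows)
    selected = []
    for row_idx, row in enumerate(rows):
        for col_idx, value in enumerate(row):
            if value > 2 * averages[col_idx]:
                selected.append((row_idx, col_idx, value))
    return selected


# Hypothetical sample data: three rows, two columns.
rows = [
    [1.0, 10.0],
    [1.0, 10.0],
    [10.0, 10.0],
]
print(column_averages(rows))              # [4.0, 10.0]
print(values_above_double_average(rows))  # [(2, 0, 10.0)]
```

Returning the row and column index along with each value is just one possible design; it makes it easy to locate the outliers in the original file.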

First of all, as people say, the CSV format looks simple, but it can be quite nontrivial, especially once strings come into play. monkut already gave you two solutions: the cleaned-up version of your code, and one more that uses the csv library. I'll give yet another option: no libraries, but plenty of idiomatic code to chew on, which gives you averages for all columns at once.

def get_averages(csv):
    with open(csv) as file:
        lines = file.readlines()
        # parse every line into a row of floats
        rows_of_numbers = [list(map(float, line.split(','))) for line in lines]
        # zip(*rows) regroups the values by column, then each column is summed
        sums = map(sum, zip(*rows_of_numbers))
        averages = [sum_item / len(lines) for sum_item in sums]
        return averages

Things to note: In your code, f is a file object. You try to close it after you have already returned the value. That code will never be reached: nothing executes after a return has been processed, unless you have a try...finally construct or a with construct (like the one I am using, which will automatically close the stream).
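To see that a return inside a with block (or a try...finally) still runs the cleanup, here is a small sketch. It uses io.StringIO as a stand-in for a real file so that it runs without any CSV on disk; the function names are made up for illustration.

```python
import io


def first_line_with(stream):
    # The with block closes the stream even though we return from inside it.
    with stream:
        return stream.readline()


def first_line_finally(stream):
    # Equivalent cleanup written out by hand with try...finally.
    try:
        return stream.readline()
    finally:
        stream.close()


a = io.StringIO("845.123,452.234\n")
b = io.StringIO("845.123,452.234\n")
print(first_line_with(a).strip(), a.closed)     # 845.123,452.234 True
print(first_line_finally(b).strip(), b.closed)  # 845.123,452.234 True
```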

map(f, l), or the equivalent [f(x) for x in l], creates a new list whose elements are obtained by applying function f to each element of l. (In Python 3, map returns a lazy iterator instead of a list; wrap it in list(...) if you need a list.)

f(*l) will "unpack" the list l before the function is invoked, passing each element to f as a separate argument.
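A short illustration of both ideas on made-up data, showing how zip combined with unpacking turns rows into columns:

```python
rows = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0]]

# map applies float to each piece of the split string;
# list(...) materialises the result in Python 3.
parsed = list(map(float, "1.5,2.5".split(',')))
print(parsed)                      # [1.5, 2.5]

# zip(*rows) is the same call as zip(rows[0], rows[1]),
# so it pairs up the values that share a column.
columns = list(zip(*rows))
print(columns)                     # [(1.0, 4.0), (2.0, 5.0), (3.0, 6.0)]
print([sum(c) for c in columns])   # [5.0, 7.0, 9.0]
```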

If you want to do it without stdlib modules for some reason:

with open('path/to/csv') as infile:
    columns = list(map(float,next(infile).split(',')))
    for how_many_entries, line in enumerate(infile,start=2):
        for (idx,running_avg), new_data in zip(enumerate(columns), line.split(',')):
            columns[idx] += (float(new_data) - running_avg)/how_many_entries
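The loop above keeps a running average per column rather than a running sum: when the n-th value arrives, the average moves by (value - average) / n. A minimal sketch of the same update on a single list, with the hypothetical name `running_mean`, checked against the plain sum-and-divide result:

```python
def running_mean(values):
    """Average a sequence by updating the mean one value at a time."""
    mean = 0.0
    for count, value in enumerate(values, start=1):
        # shift the current mean toward the new value by 1/count of the gap
        mean += (value - mean) / count
    return mean


data = [2.0, 4.0, 6.0, 8.0]
print(running_mean(data))      # 5.0
print(sum(data) / len(data))   # 5.0
```

One reason to prefer this form is that the file is processed line by line, so it never has to be held in memory in full.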

I suggest breaking this into several smaller steps:

  1. Read the CSV file into a 2D list or 2D array.
  2. Calculate the averages of each column.

These steps can be implemented as two separate functions. (In a realistic situation where the CSV file is large, reading the complete file into memory might be prohibitive due to space constraints. However, for a learning exercise, this is a great way to gain an understanding of writing your own functions.)
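The two steps could be sketched roughly as follows. The function names `read_rows` and `column_averages` are made up for illustration, and the sketch assumes a headerless file in which every row has the same number of numeric values. `read_rows` takes any iterable of lines, so it works with an open file as well as with the sample list used here.

```python
def read_rows(lines):
    """Step 1: parse an iterable of CSV lines into a 2D list of floats."""
    rows = []
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            rows.append([float(cell) for cell in line.split(',')])
    return rows


def column_averages(rows):
    """Step 2: average each column of a 2D list with equal-length rows."""
    row_count = len(rows)
    return [sum(column) / row_count for column in zip(*rows)]


# Hypothetical sample lines; with a real file, pass the open file object:
#     with open("MDT25.csv") as f:
#         rows = read_rows(f)
sample = ["1.0,2.0,3.0\n", "3.0,4.0,5.0\n"]
rows = read_rows(sample)
print(rows)                   # [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
print(column_averages(rows))  # [2.0, 3.0, 4.0]
```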

I hope this helps you out. Here is what I would do, which is to use numpy:

import csv
import numpy as np

# Assume that you have 2 columns and a header row.
# The columns are (1) question 1 and (2) question 2.

with open('filename.csv', 'r', newline='') as f:  # the file containing your data
    data = list(csv.reader(f))

header = data.pop(0)  # remove the header row
q1 = []
q2 = []

for row in data:
    q1.append(float(row[0]))
    q2.append(float(row[1]))

# === Means/Variance - Work-up Section ===
print('Mean - Question-1:            ', np.mean(q1))
print('Variance,Question-1:          ', np.var(q1))
print('==============================================')
print('Mean - Question-2:            ', np.mean(q2))
print('Variance,Question-2:          ', np.var(q2))

This definitely worked for me!

import numpy as np
import csv

readdata = csv.reader(open('C:\\...\\your_file_name.csv', 'r'))
data = []

for row in readdata:
  data.append(row)

# in case you have a header/title in the first row of your csv file, do the next line; otherwise skip it
data.pop(0) 

q1 = []  

for i in range(len(data)):
  q1.append(int(data[i][your_column_number]))

print ('Mean of your_column_number :            ', (np.mean(q1)))

import csv
from statistics import mean
values = []
with open(r'path/to/csv', 'r', newline='') as f:
    for row in csv.reader(f):
        try:
            values.append(float(row[2]))
        except ValueError:
            pass  # skip headers and other non-numeric cells
print(mean(values))

Replace '2' with the index of the column you wish to average.
