Reading data with NA values into Python and calculating the row-wise mean

Question

I am reading data from a CSV file and trying to calculate the column-wise mean. While the number of columns is fixed, the number of rows isn't. I therefore first read in the rows I need, collect them in a list, and then build a NumPy array from that list. But it doesn't work.

import csv
import numpy

Reading in (this loops through every file and finds matches, which are then appended):

with open(input_file, mode='r') as f:
    reader = csv.reader(f, delimiter=';')
    for row in reader:
        # matching algorithm omitted
        found_line = row
        del found_line[0]  # remove the name in the first entry

input_file looks like this:

Weihnachtsmann;16;30.3125;0.00677830307346;0.000491988890358;0.2796728754;0.00371057513915;0.000667111407605;0.00177896375361
Tannenbaum;6;33.5;0.032918005099;0.00312809941211;0.308224811515;0.0124857679873;0.00644874360685;0.000667111407605
Heilier Klaus;1;NA;NA;NA;NA;NA;NA;NA

Then I build a list from the entries that match:

author_list.append(','.join(found_line))
author_array = numpy.array(author_list)

I am not creating the NumPy array in the first place because I heard that appending to NumPy arrays is unpythonic and slow.

print(author_array)

yields

['1,NA,NA,NA,NA,NA,NA' '6;33.5;0.032918005099;0.00312809941211;0.308224811515;0.0124857679873;0.00644874360685;0.000667111407605' '16;30.3125;0.00677830307346;0.000491988890358;0.2796728754;0.00371057513915;0.000667111407605;0.00177896375361']

but I am not even sure whether that's an array with the dimensions I want (it should have exactly eight columns) or just one column and three rows.

Afterwards, I have to convert the NA values that come from R into NumPy's NaN (if I am correct), and I don't know how to do that. I tried

[author_entry.replace('NA','nan') for author_entry in author_list]

but I get an error.

Answer 1

There are a number of different ways you could read in the data from the file using NumPy. Here's one way using np.genfromtxt. The names in the first column become NumPy nan values, as do any other non-float strings in your file:

>>> import numpy as np
>>> arr = np.genfromtxt(input_file, delimiter=';', dtype=np.float64)
>>> arr
array([[             nan,   1.60000000e+01,   3.03125000e+01,
          6.77830307e-03,   4.91988890e-04,   2.79672875e-01,
          3.71057514e-03,   6.67111408e-04,   1.77896375e-03],
       [             nan,   6.00000000e+00,   3.35000000e+01,
          3.29180051e-02,   3.12809941e-03,   3.08224812e-01,
          1.24857680e-02,   6.44874361e-03,   6.67111408e-04],
       [             nan,   1.00000000e+00,              nan,
                     nan,              nan,              nan,
                     nan,              nan,              nan]])

This is an array with 3 rows and 9 columns. To remove the first entry on each line, you could just slice and reassign with arr = arr[:, 1:].
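
As an alternative to slicing afterwards, the name column can probably be skipped while reading. The following is only a sketch: it assumes the three sample rows from the question, wrapped in an in-memory buffer (the name sample is made up for illustration) so that it runs without a file on disk, and it uses the usecols argument of np.genfromtxt to keep only the eight numeric columns.

```python
import io

import numpy as np

# Sample rows copied from the question; a real script would pass the file
# path (input_file) to genfromtxt instead of this in-memory buffer.
sample = io.StringIO(
    "Weihnachtsmann;16;30.3125;0.00677830307346;0.000491988890358;"
    "0.2796728754;0.00371057513915;0.000667111407605;0.00177896375361\n"
    "Tannenbaum;6;33.5;0.032918005099;0.00312809941211;0.308224811515;"
    "0.0124857679873;0.00644874360685;0.000667111407605\n"
    "Heilier Klaus;1;NA;NA;NA;NA;NA;NA;NA\n"
)

# usecols skips column 0 (the name), so only the numeric columns are parsed.
# Strings that cannot be converted to float, such as 'NA', become nan.
values = np.genfromtxt(sample, delimiter=";", usecols=range(1, 9))

print(values.shape)
```

With this variant there is no leading nan column to remove, so the array should have one row per line and eight columns.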

You can calculate the row-wise mean using np.nanmean (which ignores the nan values when calculating the mean):

>>> np.nanmean(arr, axis=1)
array([ 5.82569998,  4.98298407,  1.        ])
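
If you would rather keep the csv.reader loop from the question, one possible approach is to convert each cell to a float while collecting the rows, so that the list becomes a proper two-dimensional array. This is a sketch under assumptions: the helper to_float and the in-memory sample are made-up names for illustration, the matching step is left out, and every cell other than 'NA' is assumed to be a valid number. Since the question asks for the column-wise mean, both axes are shown.

```python
import csv
import io

import numpy as np

# Sample rows copied from the question; a real script would open the file.
sample = io.StringIO(
    "Weihnachtsmann;16;30.3125;0.00677830307346;0.000491988890358;"
    "0.2796728754;0.00371057513915;0.000667111407605;0.00177896375361\n"
    "Tannenbaum;6;33.5;0.032918005099;0.00312809941211;0.308224811515;"
    "0.0124857679873;0.00644874360685;0.000667111407605\n"
    "Heilier Klaus;1;NA;NA;NA;NA;NA;NA;NA\n"
)


def to_float(cell):
    """Hypothetical helper: convert one CSV cell, mapping R's 'NA' to nan."""
    return float("nan") if cell == "NA" else float(cell)


rows = []
for row in csv.reader(sample, delimiter=";"):
    # row[0] is the name; keep the numeric cells as a list of floats
    # instead of joining them back into a single string.
    rows.append([to_float(cell) for cell in row[1:]])

author_array = np.array(rows)  # 2-D float array, one row per matched line

column_means = np.nanmean(author_array, axis=0)  # one mean per column
row_means = np.nanmean(author_array, axis=1)     # one mean per row

print(author_array.shape)
print(column_means)
print(row_means)
```

Appending plain lists and calling np.array once at the end keeps the "don't append to NumPy arrays" advice from the question intact. Note that np.nanmean emits a RuntimeWarning and returns nan for any column or row that contains only nan values.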
