Read data with NAs into Python and calculate mean row-wise

I am reading in data from a CSV file and trying to calculate the mean column-wise. While the number of columns is fixed, the number of rows isn't. Therefore I first read in the rows I need, collect them in a list, and then build a NumPy array from that list. But it doesn't work.

import csv
import numpy

Reading in (the loop goes through every file and finds matches, which are then appended):

with open(input_file, mode='r') as f:
    reader = csv.reader(f, delimiter=';')
    for row in reader:
        pass
        # matching algorithm omitted
        found_line = row
        del found_line[0]  # remove the first entry (the name)

input_file looks like this:

Weihnachtsmann;16;30.3125;0.00677830307346;0.000491988890358;0.2796728754;0.00371057513915;0.000667111407605;0.00177896375361
Tannenbaum;6;33.5;0.032918005099;0.00312809941211;0.308224811515;0.0124857679873;0.00644874360685;0.000667111407605
Heilier Klaus;1;NA;NA;NA;NA;NA;NA;NA

Then, I make a list out of the entries that match:

author_list.append(','.join(found_line))
author_array = numpy.array(author_list)

I am not creating the NumPy array in the first place because I heard that appending to NumPy arrays is unpythonic and slow.

print(author_array)

yields

['1,NA,NA,NA,NA,NA,NA' '6;33.5;0.032918005099;0.00312809941211;0.308224811515;0.0124857679873;0.00644874360685;0.000667111407605' '16;30.3125;0.00677830307346;0.000491988890358;0.2796728754;0.00371057513915;0.000667111407605;0.00177896375361']

but I am not even sure whether that's an array with the dimensions I want (it should have exactly eight columns) or just one column and three rows.

Afterwards, I have to convert the NAs that come from R into NumPy's NaN (if I am correct), and I don't know how to do that. I tried

[author_entry.replace('NA','nan') for author_entry in author_list]

but I get an error.

There are a number of different ways you could read in the data from the file using NumPy. Here's one way using np.genfromtxt. The names in the first column become NumPy nan values, as do any other non-float strings in your file:

>>> import numpy as np
>>> arr = np.genfromtxt(input_file, delimiter=';', dtype=np.float64)
>>> arr
array([[             nan,   1.60000000e+01,   3.03125000e+01,
          6.77830307e-03,   4.91988890e-04,   2.79672875e-01,
          3.71057514e-03,   6.67111408e-04,   1.77896375e-03],
       [             nan,   6.00000000e+00,   3.35000000e+01,
          3.29180051e-02,   3.12809941e-03,   3.08224812e-01,
          1.24857680e-02,   6.44874361e-03,   6.67111408e-04],
       [             nan,   1.00000000e+00,              nan,
                     nan,              nan,              nan,
                     nan,              nan,              nan]])

This is an array with 3 rows and 9 columns. To remove the first entry on each line, you could just slice and reassign with arr = arr[:, 1:].
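As an alternative to slicing afterwards, the name column can be skipped at read time. The sketch below is only an illustration: it assumes the three sample rows from the question, fed through io.StringIO in place of the real input_file, and uses the usecols parameter of np.genfromtxt to keep columns 1 to 8.

```python
import io

import numpy as np

# The three sample rows from the question; io.StringIO stands in for the
# real file (an assumption made so the example is self-contained).
data = io.StringIO(
    "Weihnachtsmann;16;30.3125;0.00677830307346;0.000491988890358;"
    "0.2796728754;0.00371057513915;0.000667111407605;0.00177896375361\n"
    "Tannenbaum;6;33.5;0.032918005099;0.00312809941211;"
    "0.308224811515;0.0124857679873;0.00644874360685;0.000667111407605\n"
    "Heilier Klaus;1;NA;NA;NA;NA;NA;NA;NA\n"
)

# usecols skips column 0 (the name), so only the eight numeric columns
# are parsed. Fields that cannot be parsed as floats, such as "NA",
# come back as nan.
arr = np.genfromtxt(data, delimiter=";", usecols=list(range(1, 9)))

print(arr.shape)
```

With this variant the array should already have 3 rows and 8 columns, so no reassignment is needed before computing means.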

You can calculate the row-wise mean using np.nanmean (to ignore the nan values when calculating the mean):

>>> np.nanmean(arr, axis=1)
array([ 5.82569998,  4.98298407,  1.        ])
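If you would rather keep the csv.reader loop from the question, a minimal sketch could look like the following. It assumes shortened sample rows (only three numeric fields each) read from io.StringIO instead of a real file, and the helper to_float is a hypothetical name introduced for illustration. Since the question body asks for column-wise means while the title says row-wise, both axes are shown.

```python
import csv
import io

import numpy as np

# Shortened sample rows (assumption: three numeric fields per row);
# io.StringIO stands in for the real file.
sample = io.StringIO(
    "Weihnachtsmann;16;30.3125;0.00677830307346\n"
    "Tannenbaum;6;33.5;0.032918005099\n"
    "Heilier Klaus;1;NA;NA\n"
)


def to_float(field):
    """Hypothetical helper: map R's 'NA' marker to nan, parse the rest."""
    return float("nan") if field == "NA" else float(field)


rows = []
for row in csv.reader(sample, delimiter=";"):
    # row[0] is the name; keep the numeric fields as a list of floats
    # rather than joining them back into one string.
    rows.append([to_float(field) for field in row[1:]])

# Convert once after the loop instead of appending to an array.
author_array = np.array(rows)

col_means = np.nanmean(author_array, axis=0)  # one mean per column
row_means = np.nanmean(author_array, axis=1)  # one mean per row

print(author_array.shape)
print(col_means)
print(row_means)
```

Keeping each row as a list of floats, instead of a single joined string, is what gives the resulting array two dimensions; checking author_array.shape is a quick way to confirm the layout.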
