Read data with NAs into Python and calculate mean row-wise
I am reading in data from a CSV file and attempting to calculate the mean column-wise. While the number of columns is fixed, the number of rows isn't. Therefore I first read in the rows I need, make them a list and then form a numpy array from the list. But it doesn't work.
import csv
import numpy
Reading in (loops through every file and finds matches, which will then be appended):
with open(input_file, mode='r') as f:
    reader = csv.reader(f, delimiter=';')
    for row in reader:
        pass
        # matching algorithm omitted
        found_line = row
        del found_line[0]  # remove first entry (the name)
input_file looks like:
Weihnachtsmann;16;30.3125;0.00677830307346;0.000491988890358;0.2796728754;0.00371057513915;0.000667111407605;0.00177896375361
Tannenbaum;6;33.5;0.032918005099;0.00312809941211;0.308224811515;0.0124857679873;0.00644874360685;0.000667111407605
Heilier Klaus;1;NA;NA;NA;NA;NA;NA;NA
Then, I make a list out of the entries that match:
author_list.append(','.join(found_line))
author_array = numpy.array(author_list)
I am not creating the numpy array in the first place because I heard it's unpythonic and slow to append to numpy arrays.
print author_array
yields
['1,NA,NA,NA,NA,NA,NA' '6;33.5;0.032918005099;0.00312809941211;0.308224811515;0.0124857679873;0.00644874360685;0.000667111407605' '16;30.3125;0.00677830307346;0.000491988890358;0.2796728754;0.00371057513915;0.000667111407605;0.00177896375361']
but I am not even sure if that's an array with the dimensions I want (should be exactly eight columns) or just one column and three rows.
Afterwards, I have to convert the NAs that come from R into numpy's NaN (if I am correct) and I don't know how to do that. I tried
[author_entry.replace('NA','nan') for author_entry in author_list]
but I get an error.
There are a number of different ways you could read in the data from the file using NumPy. Here's one way using np.genfromtxt. The names in the first column become NumPy nan values, as do any other non-float strings in your file:
>>> import numpy as np
>>> arr = np.genfromtxt(input_file, delimiter=';', dtype=np.float64)
>>> arr
array([[ nan, 1.60000000e+01, 3.03125000e+01,
6.77830307e-03, 4.91988890e-04, 2.79672875e-01,
3.71057514e-03, 6.67111408e-04, 1.77896375e-03],
[ nan, 6.00000000e+00, 3.35000000e+01,
3.29180051e-02, 3.12809941e-03, 3.08224812e-01,
1.24857680e-02, 6.44874361e-03, 6.67111408e-04],
[ nan, 1.00000000e+00, nan,
nan, nan, nan,
nan, nan, nan]])
This is an array with 3 rows and 9 columns. To remove the first entry on each line, you could just slice and reassign with arr = arr[:, 1:].
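As an alternative to slicing afterwards, the name column can probably be skipped while reading by passing usecols to np.genfromtxt. The following is only a sketch: the sample rows from the question are inlined through io.StringIO (an assumption made so the snippet runs on its own; in practice you would pass input_file), and the column range 1 to 8 assumes the nine-column layout shown above.

```python
import io

import numpy as np

# Sample data from the question, inlined so the sketch is self-contained.
sample = io.StringIO(
    "Weihnachtsmann;16;30.3125;0.00677830307346;0.000491988890358;"
    "0.2796728754;0.00371057513915;0.000667111407605;0.00177896375361\n"
    "Tannenbaum;6;33.5;0.032918005099;0.00312809941211;"
    "0.308224811515;0.0124857679873;0.00644874360685;0.000667111407605\n"
    "Heilier Klaus;1;NA;NA;NA;NA;NA;NA;NA\n"
)

# usecols=range(1, 9) reads columns 1..8 and skips column 0 (the name).
# Fields that cannot be parsed as floats, such as 'NA', become nan.
arr = np.genfromtxt(sample, delimiter=";", dtype=np.float64, usecols=range(1, 9))

print(arr.shape)
```

With this variant the array should already have three rows and eight columns, so no reassignment is needed before computing the means.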
You can calculate the row-wise mean using np.nanmean (to ignore the nan values when calculating the mean):
>>> np.nanmean(arr, axis=1)
array([ 5.82569998, 4.98298407, 1. ])
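If you would rather keep the approach from the question (collecting matched rows in a list and converting once at the end), a rough sketch could look like the one below. The main change is that each row stays a list of floats instead of being joined into a single string, which is what gives the array two dimensions. The helper name to_float is hypothetical, the sample is truncated to three numeric columns for brevity, and every row is treated as a match because the matching algorithm was omitted in the question. The column-wise mean mentioned in the question body is obtained with axis=0.

```python
import csv
import io

import numpy as np

# Truncated sample rows, inlined for illustration; in practice this
# would be the opened input_file.
sample = io.StringIO(
    "Weihnachtsmann;16;30.3125;0.00677830307346\n"
    "Tannenbaum;6;33.5;0.032918005099\n"
    "Heilier Klaus;1;NA;NA\n"
)


def to_float(field):
    """Hypothetical helper: map R's 'NA' marker to NaN, parse the rest as float."""
    return float("nan") if field == "NA" else float(field)


author_list = []
for row in csv.reader(sample, delimiter=";"):
    # Matching algorithm omitted: every row counts as a match here.
    # row[1:] skips the name in the first field.
    author_list.append([to_float(field) for field in row[1:]])

author_array = np.array(author_list)          # 2-D: one row per matched line
row_means = np.nanmean(author_array, axis=1)  # one mean per row, ignoring nan
col_means = np.nanmean(author_array, axis=0)  # one mean per column, ignoring nan

print(author_array.shape)
print(row_means)
print(col_means)
```

Appending to a plain Python list and calling np.array once at the end keeps the part of the original design that avoids repeated appends to a NumPy array.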