
NumPy genfromtxt TypeError: data type not understood error

I would like to read in this file (test.txt):

01.06.2015;00:00:00;0.000;0;-9.999;0;8;0.00;18951;(SPECTRUM)ZERO(/SPECTRUM)
01.06.2015;00:01:00;0.000;0;-9.999;0;8;0.00;18954;(SPECTRUM)ZERO(/SPECTRUM)
01.06.2015;00:02:00;0.000;0;-9.999;0;8;0.00;18960;(SPECTRUM)ZERO(/SPECTRUM)
01.06.2015;09:23:00;0.327;61;25.831;39;29;0.18;19006;01.06.2015;09:23:00;0.327;61;25.831;39;29;0.18;19006;(SPECTRUM);;;;;;;;;;;;;;1;1;;;1;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;1;;;;;;;;;;;;(/SPECTRUM)
01.06.2015;09:24:00;0.000;0;-9.999;0;29;0.00;19010;(SPECTRUM)ZERO(/SPECTRUM)

I tried it with the numpy function genfromtxt() (see the code excerpt below).

import numpy as np
col_names = ["date", "time", "rain_intensity", "weather_code_1", "radar_ref", "weather_code_2", "val6", "rain_accum", "val8", "val9"]
types = ["object", "object", "float", "uint8", "float", "uint8", "uint8", "float", "uint8","|S10"]
# Read in the file with np.genfromtxt
mydata = np.genfromtxt("test.txt", delimiter=";", names=col_names, dtype=types)

Now when I execute the code I get the following error:

    raise ValueError(errmsg)
ValueError: Some errors were detected !
    Line #4 (got 79 columns instead of 10)

I think the difficulties come from the last column (val9), which contains many ;;;;;;; characters.
The delimiter and the character that appears inside the last column (;) are the same!

How can I read in the file without an error? Maybe there is a way to skip the last column, or to replace the ; only in the last column?

From the numpy documentation:

invalid_raise : bool, optional
If True, an exception is raised if an inconsistency is detected in the number of columns. If False, a warning is emitted and the offending lines are skipped.

mydata = np.genfromtxt("test.txt", delimiter=";", names=col_names, dtype=types, invalid_raise = False)

Note that there were errors in your code which I have corrected (delimiter was spelled incorrectly, and the types list was referred to as dtypes in the function call).
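To make the effect of invalid_raise=False concrete, here is a minimal sketch on made-up in-memory data (the three sample lines are an assumption for illustration, not the question's file). The line with the wrong number of columns is dropped with a warning, so the result has fewer rows than the input.

```python
import warnings
import numpy as np

# Hypothetical sample: the second line has 5 fields instead of 3.
lines = ["1;2;3", "4;5;6;7;8", "7;8;9"]

with warnings.catch_warnings():
    # genfromtxt warns about the skipped line; silence it for this demo.
    warnings.simplefilter("ignore")
    data = np.genfromtxt(lines, delimiter=";", invalid_raise=False)

# Only the two well-formed lines survive.
print(data.shape)
```

This is why the approach loses the whole offending row rather than just its extra columns.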

Edit: From your comment, I see I slightly misunderstood. You meant that you want to skip the last column, not the last row.

Take a look at the following code. I have defined a generator that only returns the first ten elements of each row. This allows genfromtxt() to complete without error, and you now get column #3 from all rows.

Note, though, that you are still going to lose some data: if you look carefully, you will see that the problem line is actually two lines concatenated together, with garbage where the other lines have ZERO. So you are still going to lose this second line. You could maybe modify the generator to parse each line and deal with this differently, but I'll leave that as a fun exercise :)

import numpy as np

def filegen(filename):
    # Yield each line truncated to its first ten ';'-separated fields,
    # so every row has the column count genfromtxt expects.
    with open(filename, 'r') as infile:
        for line in infile:
            yield ';'.join(line.split(';')[:10])

col_names = ["date", "time", "rain_intensity", "weather_code_1", "radar_ref", "weather_code_2", "val6", "rain_accum", "val8", "val9"]
dtypes = ["object", "object", "float", "uint8", "float", "uint8", "uint8", "float", "uint8","|S10"]
# Read in the file with np.genfromtxt, feeding it the generator instead of the filename
mydata = np.genfromtxt(filegen('test.txt'), delimiter=";", names=col_names, dtype=dtypes)

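Output (val8 shows 7, 10, 16, ... rather than 18951, 18954, 18960, ... because those values do not fit in the uint8 type declared for that column; a wider integer type such as uint32 would avoid this):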

[('01.06.2015', '00:00:00', 0.0, 0, -9.999, 0, 8, 0.0, 7, '(SPECTRUM)')
 ('01.06.2015', '00:01:00', 0.0, 0, -9.999, 0, 8, 0.0, 10, '(SPECTRUM)')
 ('01.06.2015', '00:02:00', 0.0, 0, -9.999, 0, 8, 0.0, 16, '(SPECTRUM)')
 ('01.06.2015', '09:23:00', 0.327, 61, 25.831, 39, 29, 0.18, 62, '01.06.2015')
 ('01.06.2015', '09:24:00', 0.0, 0, -9.999, 0, 29, 0.0, 66, '(SPECTRUM)')]
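The exercise left open above, and the question's idea of replacing the ; only in the last column, could be sketched roughly as follows. This is not the answerer's code: normalize_line and filegen_keep_spectrum are hypothetical helper names, and the sketch assumes that every record starts with nine ordinary fields and that the spectrum block always begins with the literal text (SPECTRUM). The sample line in the usage part is a shortened version of the problem line.

```python
def normalize_line(line, n_fields=9, marker="(SPECTRUM)", sep=";", spectrum_sep=","):
    """Keep the first n_fields fields and re-delimit the spectrum block.

    Assumption: the spectrum block starts at the first occurrence of `marker`.
    """
    line = line.rstrip("\n")
    head, found, tail = line.partition(marker)
    if not found:
        # No spectrum block: leave the line as it is.
        return line
    # Fields before the marker; a duplicated record is cut off after n_fields.
    fields = head.split(sep)[:n_fields]
    # Replace the delimiter inside the spectrum block only.
    spectrum = (marker + tail).replace(sep, spectrum_sep)
    return sep.join(fields + [spectrum])


def filegen_keep_spectrum(filename):
    # Generator in the same spirit as filegen() above, for use with genfromtxt.
    with open(filename, "r") as infile:
        for line in infile:
            yield normalize_line(line)


# Usage on a shortened version of the problem line
raw = ("01.06.2015;09:23:00;0.327;61;25.831;39;29;0.18;19006;"
       "01.06.2015;09:23:00;0.327;61;25.831;39;29;0.18;19006;"
       "(SPECTRUM);;1;1;;(/SPECTRUM)")
fixed = normalize_line(raw)
print(fixed)
```

Every normalized line then has exactly ten fields. The full spectrum text is much longer than ten characters, so a string dtype wider than "|S10" would be needed for the last column to avoid truncation.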

usecols can be used to ignore excess delimiters, e.g.

In [546]: np.genfromtxt([b'1,2,3',b'1,2,3,,,,,,'], dtype=None,
    delimiter=',', usecols=np.arange(3))
Out[546]: 
array([[1, 2, 3],
       [1, 2, 3]])
