Create a numpy structured array from a list of strings

I am working on a Python utility to get data from the Tycho 2 star catalogue. One of the functions I am working on queries the catalogue and returns all the information for a given star id (or set of star ids).

I'm currently doing this by looping through the lines of the catalogue file and then attempting to parse each line into a numpy structured array if it was queried. (Note: if there is a better way to do this, you can let me know even though that is not what this question is about. I'm doing it this way because the catalogue is too big to load into memory all at once.)

Anyway, once I have identified a record that I want to keep, I run into a problem: I can't figure out how to parse it into a structured array.

For instance, say the record I want to keep is:

record = '0002 00038 1| |  3.64121230|  1.08701186|   14.1|  -23.0| 69| 82| 1.8| 1.9|1968.56|1957.30| 3|1.0|3.0|0.9|3.0|12.444|0.213|11.907|0.189|999| |         |  3.64117944|  1.08706861|1.83|1.73| 81.0|104.7| | 0.0'

Now, I am trying to parse this into a numpy structured array with dtype:

dform = [('starid', [('TYC1', int), ('TYC2', int), ('TYC3', int)]),
         ('pflag', str),
         ('starBearing', [('rightAscension', float), ('declination', float)]),
         ('properMotion', [('rightAscension', float), ('declination', float)]),
         ('uncertainty', [('rightAscension', int), ('declination', int), ('pmRA', float), ('pmDc', float)]),
         ('meanEpoch', [('rightAscension', float), ('declination', float)]),
         ('numPos', int),
         ('fitGoodness', [('rightAscension', float), ('declination', float), ('pmRA', float), ('pmDc', float)]),
         ('magnitude', [('BT', [('mag', float), ('err', float)]), ('VT', [('mag', float), ('err', float)])]),
         ('starProximity', int),
         ('tycho1flag', str),
         ('hipparcosNumber', str),
         ('observedPos', [('rightAscension', float), ('declination', float)]),
         ('observedEpoch', [('rightAscension', float), ('declination', float)]),
         ('observedError', [('rightAscension', float), ('declination', float)]),
         ('solutionType', str),
         ('correlation', float)]

This seems like it should be a fairly simple thing to do, but everything I try breaks...

I've tried:

np.genfromtxt(BytesIO(record.encode()),dtype=dform,delimiter=(' ','|'))
np.genfromtxt(BytesIO(record.encode()),dtype=dform,delimiter=(' ','|'),missing_values=' ',filling_values=None)

both of which give me

{TypeError}cannot perform accumulate with flexible type

which makes no sense since it shouldn't be doing any accumulation.

I've also tried

np.array(re.split('\|| ',record),dtype=dform)

which complains

{TypeError}a bytes-like object is required, not 'str'

and another variant

np.array([x.encode() for x in re.split('\|| ',record)],dtype=dform)

which doesn't throw an error but certainly doesn't return the correct results:

[ ((842018864, 0, 0), '', (0.0, 0.0), (0.0, 0.0), (0, 0, 0.0, 0.0), (0.0, 0.0), 0, (0.0, 0.0, 0.0, 0.0), ((0.0, 0.0), (0.0, 0.0)), 0, '', '', (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), '', 0.0)...

So how can I do this? I think the genfromtxt option is the way to go (especially since there may occasionally be missing data), but I don't understand why it isn't working. Is this something I'm just going to have to write my own parser for?

Sorry, this answer is long and rambling, but that's what it took to figure out what is going on. The complexity of the dtype in particular was hidden by its length.


I get the TypeError: cannot perform accumulate with flexible type error when I try your list for delimiter. The details show the error occurs in LineSplitter. Without getting into details, the delimiter should be one character (or the default 'whitespace'). A sequence is read as a list of field widths, so I'd guess the 'accumulate' is genfromtxt trying to take a running sum of your ' ' and '|' strings.

From the genfromtxt docs:

delimiter : str, int, or sequence, optional. The string used to separate values. By default, any consecutive whitespaces act as delimiter. An integer or sequence of integers can also be provided as width(s) of each field.
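The three delimiter forms from that docstring can be tried on tiny inputs. This is only a sketch with made-up lines (the last one borrows the star id from the question), and it assumes a NumPy recent enough to accept a list of str lines:

```python
import numpy as np

# One single-character delimiter: fields are split on every '|'.
by_char = np.genfromtxt(["1|2.5|7"], delimiter="|")

# One integer: every field is exactly 3 characters wide.
by_width = np.genfromtxt(["  1  2  3"], delimiter=3)

# A sequence of integers: one width per field (4, 6 and 2 characters here).
by_widths = np.genfromtxt(["0002 00038 1"], delimiter=(4, 6, 2))

print(by_char, by_width, by_widths)
```

Note that none of these forms means "split on either of two characters", which is what the (' ','|') tuple was presumably meant to do.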

The genfromtxt splitter is a little more powerful than the string .split that loadtxt uses, but not as general as the re splitter.

As for the {TypeError}a bytes-like object is required, not 'str': you specify dtype str for a couple of the fields, and in Py3 byte strings and unicode strings are different things. Your record is a unicode string, so the two have to be made to agree. But you've already realized that with BytesIO(record.encode()).
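When it is unclear which flavour of string a dtype is, inspecting its kind is a quick check. The snippet below is a small sketch, not part of the original exchange. On Python 3, as far as I know, NumPy maps str to the unicode kind U (which is what the '<U' in the outputs further down suggests) and bytes to the byte-string kind S. The field name pflag is borrowed from the question; raw_id and both widths are arbitrary choices of mine.

```python
import numpy as np

print(np.dtype(str).kind)    # kind code for Python str
print(np.dtype(bytes).kind)  # kind code for Python bytes

# Giving each string field an explicit width avoids a zero-length field.
dt = np.dtype([("pflag", "U1"), ("raw_id", "S12")])
row = np.array([(" ", b"0002 00038 1")], dtype=dt)
print(row["pflag"][0] == " ", row["raw_id"][0])
```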

I like to test genfromtxt cases with:

record = b'....'
np.genfromtxt([record], ....)

Or better yet:

records = b"""one line
two line
three line
"""
np.genfromtxt(records.splitlines(), ....)

If I let genfromtxt deduce field types, and just use the one delimiter, I get 32 fields:

In [19]: A=np.genfromtxt([record],dtype=None,delimiter='|')
In [20]: len(A.dtype)
Out[20]: 32
In [21]: A
Out[21]: 
array((b'0002 00038 1', False, 3.6412123, 1.08701186, 14.1, -23.0, 69, 82, 1.8, 1.9, 1968.56, 1957.3, 3, 1.0, 3.0, 0.9, 3.0, 12.444, 0.213, 11.907, 0.189, 999, False, False, 3.64117944, 1.08706861, 1.83, 1.73, 81.0, 104.7, False, 0.0), 
      dtype=[('f0', 'S12'), ('f1', '?'), ('f2', '<f8'), ('f3', '<f8'), ('f4', '<f8'), ... ('f26', '<f8'), ('f27', '<f8'), ('f28', '<f8'), ('f29', '<f8'), ('f30', '?'), ('f31', '<f8')])

Once the bytes and delimiter issues are worked out,

np.array([x for x in re.split(b'\|| ',record)],dtype=dform)

does run. I now see that your dform is complex, with nested compound fields.

But to define a structured array, you have to give it a list of records (tuples), e.g.

np.array([(record1...), (record2...), ....], dtype=[(field1), (field2), ...])

Here you are trying to create one record. I could wrap your list in a tuple, but then I get a mismatch between that length and the dform length, 66 vs. 17. If you count all the subfields, dform might take 66 values, but we can't just do that with one tuple.
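One way around the flat-tuple mismatch is to nest the values so that the tuple structure mirrors the dtype structure. The helper below is a hypothetical sketch (the name nest and the cut-down dtype are mine, not from the question), with "U1" swapped in for str so the string field has a width. It converts each token explicitly instead of relying on NumPy to cast strings.

```python
import numpy as np

def nest(values, dt):
    """Consume tokens from the iterator `values` and build a (nested)
    tuple whose shape matches the (nested) dtype `dt`."""
    if dt.names is None:               # leaf field: convert one token
        text = next(values).strip()
        if dt.kind in "iu":
            return int(text)
        if dt.kind == "f":
            return float(text)
        return text
    return tuple(nest(values, dt[name]) for name in dt.names)

# Cut-down version of the question's dtype, for illustration only.
dt = np.dtype([("starid", [("TYC1", int), ("TYC2", int), ("TYC3", int)]),
               ("pflag", "U1"),
               ("starBearing", [("rightAscension", float), ("declination", float)])])

record = "0002 00038 1| |  3.64121230|  1.08701186"
tokens = record.replace(" ", "|", 2).split("|")   # one token per leaf field
row = nest(iter(tokens), dt)
arr = np.array([row], dtype=dt)
print(arr)
```

This does no handling of blank numeric fields; a real parser for the catalogue would need a policy for those.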

I've never tried to create an array from such a complex dtype, so I'm fishing around for ways to make it work.

In [41]: np.zeros((1,),dform)
Out[41]: 
array([ ((0, 0, 0), '', (0.0, 0.0), (0.0, 0.0), (0, 0, 0.0, 0.0), (0.0, 0.0), 0, (0.0, 0.0, 0.0, 0.0), ((0.0, 0.0), (0.0, 0.0)), 0, '', '', (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), '', 0.0)], 
      dtype=[('starid', [('TYC1', '<i4'), ('TYC2', '<i4'), ('TYC3', '<i4')]), ('pflag', '<U'), ('starBearing', [('rightAscension', '<f8'), ('declination', '<f8')]), ('properMotion', [('rightAscension', '<f8'), ('declination', '<f8')]), ('uncertainty', [('rightAscension', '<i4'), ('declination', '<i4'), ('pmRA', '<f8'), ('pmDc', '<f8')]), ('meanEpoch', ....('solutionType', '<U'), ('correlation', '<f8')])

In [64]: for name in A.dtype.names:
    print(A[name].dtype)
   ....:     
[('TYC1', '<i4'), ('TYC2', '<i4'), ('TYC3', '<i4')]
<U1
[('rightAscension', '<f8'), ('declination', '<f8')]
[('rightAscension', '<f8'), ('declination', '<f8')]
[('rightAscension', '<i4'), ('declination', '<i4'), ('pmRA', '<f8'), ('pmDc', '<f8')]
[('rightAscension', '<f8'), ('declination', '<f8')]
int32
[('rightAscension', '<f8'), ('declination', '<f8'), ('pmRA', '<f8'), ('pmDc', '<f8')]
[('BT', [('mag', '<f8'), ('err', '<f8')]), ('VT', [('mag', '<f8'), ('err', '<f8')])]
int32
<U1
<U1
[('rightAscension', '<f8'), ('declination', '<f8')]
[('rightAscension', '<f8'), ('declination', '<f8')]
[('rightAscension', '<f8'), ('declination', '<f8')]
<U1
float64

I count 34 primitive dtype fields. Some top-level fields are 'scalar', others group 2-4 terms, and one has a further level of nesting.
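The count can also be done mechanically by walking the dtype recursively. This is a sketch: count_leaves and the pair/quad/mag shorthands are hypothetical names of mine, and the str fields are replaced by fixed-width unicode fields ("U1", "U6") as an assumption to keep the dtype well defined.

```python
import numpy as np

pair = [("rightAscension", float), ("declination", float)]
quad = pair + [("pmRA", float), ("pmDc", float)]
mag = [("mag", float), ("err", float)]
dform = [
    ("starid", [("TYC1", int), ("TYC2", int), ("TYC3", int)]),
    ("pflag", "U1"),
    ("starBearing", pair),
    ("properMotion", pair),
    ("uncertainty", [("rightAscension", int), ("declination", int),
                     ("pmRA", float), ("pmDc", float)]),
    ("meanEpoch", pair),
    ("numPos", int),
    ("fitGoodness", quad),
    ("magnitude", [("BT", mag), ("VT", mag)]),
    ("starProximity", int),
    ("tycho1flag", "U1"),
    ("hipparcosNumber", "U6"),
    ("observedPos", pair),
    ("observedEpoch", pair),
    ("observedError", pair),
    ("solutionType", "U1"),
    ("correlation", float),
]

def count_leaves(dt):
    """Number of primitive (non-compound) fields in a possibly nested dtype."""
    dt = np.dtype(dt)
    if dt.names is None:
        return 1
    return sum(count_leaves(dt[name]) for name in dt.names)

n_leaves = count_leaves(dform)
print(len(np.dtype(dform).names), n_leaves)
```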

If I replace the first 2 splitting spaces with |, record.split(b'|') gives me 34 strings.
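That replacement can be done with the count argument of str.replace, since the only spaces that act as separators are the two inside the star id at the start of the line. The sketch below uses a str record rather than bytes, which is my own simplification:

```python
record = ('0002 00038 1| |  3.64121230|  1.08701186|   14.1|  -23.0| 69| 82| 1.8| 1.9|'
          '1968.56|1957.30| 3|1.0|3.0|0.9|3.0|12.444|0.213|11.907|0.189|999| |         |'
          '  3.64117944|  1.08706861|1.83|1.73| 81.0|104.7| | 0.0')

# Replace only the first two spaces, i.e. the ones inside the star id.
normalized = record.replace(' ', '|', 2)
tokens = normalized.split('|')

print(len(record.split('|')), len(tokens))
print(tokens[:3])
```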

Let's try that in genfromtxt:

In [79]: np.genfromtxt([record],delimiter='|',dtype=dform)
Out[79]: 
array(((2, 38, 1), '', (3.6412123, 1.08701186), (14.1, -23.0), 
   (69, 82, 1.8, 1.9), (1968.56, 1957.3), 3, (1.0, 3.0, 0.9, 3.0),
   ((12.444, 0.213), (11.907, 0.189)), 999, '', '', 
   (3.64117944, 1.08706861), (1.83, 1.73), (81.0, 104.7), '', 0.0), 
      dtype=[('starid', [('TYC1', '<i4'), ('TYC2', '<i4'), ('TYC3', '<i4')]), 
 ('pflag', '<U'), 
 ('starBearing', [('rightAscension', '<f8'), ('declination', '<f8')]),  
 ('properMotion', [('rightAscension', '<f8'), ('declination', '<f8')]),
 ('uncertainty', [('rightAscension', '<i4'), ('declination', '<i4'), ('pmRA', '<f8'), ('pmDc', '<f8')]), 
 ('meanEpoch', [('rightAscension', '<f8'), ('declination', '<f8')]),   
 ('numPos', '<i4'), 
 ('fitGoodness', [('rightAscension', '<f8'), ('declination', '<f8'), ('pmRA', '<f8'), ('pmDc', '<f8')]), 
 ('magnitude', [('BT', [('mag', '<f8'), ('err', '<f8')]), ('VT', [('mag', '<f8'), ('err', '<f8')])]), 
 ('starProximity', '<i4'), ('tycho1flag', '<U'), ('hipparcosNumber', '<U'), 
 ('observedPos', [('rightAscension', '<f8'), ('declination', '<f8')]),
 ('observedEpoch', [('rightAscension', '<f8'), ('declination', '<f8')]), 
 ('observedError', [('rightAscension', '<f8'), ('declination', '<f8')]), ('solutionType', '<U'), ('correlation', '<f8')])

That almost looks reasonable. genfromtxt can actually split the values up among the compound fields. That's more than what I'd want to try with np.array().

So if you get the delimiters and byte/unicode worked out, genfromtxt can handle this mess.
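Putting the pieces together on a cut-down case: the sketch below normalizes the delimiters first and then lets genfromtxt distribute the values over a nested dtype. The shortened record and the short_dform name are mine, only numeric fields are included, and a list of str lines is assumed to be acceptable input (true for recent NumPy versions as far as I know).

```python
import numpy as np

short_dform = [
    ("starid", [("TYC1", int), ("TYC2", int), ("TYC3", int)]),
    ("starBearing", [("rightAscension", float), ("declination", float)]),
]

short_record = "0002 00038 1|  3.64121230|  1.08701186"

# Turn the two spaces inside the star id into '|' so one delimiter is enough.
line = short_record.replace(" ", "|", 2)

star = np.genfromtxt([line], delimiter="|", dtype=short_dform)
print(star["starid"]["TYC2"], star["starBearing"]["declination"])
```

For the full catalogue the same normalization could be applied line by line while looping over the file, so only the queried records are ever parsed.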
