Creating a numpy structured array from a list of strings

Question

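I am working on a Python utility to get data from the Tycho 2 star catalogue. One of the functions I am working on queries the catalogue and returns all of the information for a given star id (or set of star ids).

I am currently doing this by iterating through the lines of the catalogue file and trying to parse each line into a numpy structured array if it was queried. (Note: if there is a better way to do this, feel free to let me know, even though that is not what this question is about. I am doing it this way because the catalogue is too big to load into memory all at once.)

Anyway, once I have identified a record that I want to keep, I run into a problem... I can't figure out how to parse it into a structured array.

For instance, say the record I want to keep is: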

record = '0002 00038 1| |  3.64121230|  1.08701186|   14.1|  -23.0| 69| 82| 1.8| 1.9|1968.56|1957.30| 3|1.0|3.0|0.9|3.0|12.444|0.213|11.907|0.189|999| |         |  3.64117944|  1.08706861|1.83|1.73| 81.0|104.7| | 0.0'

Now, I am trying to parse this into a numpy structured array with the dtype:

        dform = [('starid', [('TYC1', int), ('TYC2', int), ('TYC3', int)]),
             ('pflag', str),
             ('starBearing', [('rightAscension', float), ('declination', float)]),
             ('properMotion', [('rightAscension', float), ('declination', float)]),
             ('uncertainty', [('rightAscension', int), ('declination', int), ('pmRA', float), ('pmDc', float)]),
             ('meanEpoch', [('rightAscension', float), ('declination', float)]),
             ('numPos', int),
             ('fitGoodness', [('rightAscension', float), ('declination', float), ('pmRA', float), ('pmDc', float)]),
             ('magnitude', [('BT', [('mag', float), ('err', float)]), ('VT', [('mag', float), ('err', float)])]),
             ('starProximity', int),
             ('tycho1flag', str),
             ('hipparcosNumber', str),
             ('observedPos', [('rightAscension', float), ('declination', float)]),
             ('observedEpoch', [('rightAscension', float), ('declination', float)]),
             ('observedError', [('rightAscension', float), ('declination', float)]),
             ('solutionType', str),
             ('correlation', float)]

It seems like this should be a fairly simple thing to do, but everything I have tried breaks...

I have tried:

np.genfromtxt(BytesIO(record.encode()),dtype=dform,delimiter=(' ','|'))
np.genfromtxt(BytesIO(record.encode()),dtype=dform,delimiter=(' ','|'),missing_values=' ',filling_values=None)

both of which give me

{TypeError}cannot perform accumulate with flexible type

which makes no sense, because it shouldn't be accumulating anything.

I have also tried

np.array(re.split('\|| ',record),dtype=dform)

which complains

{TypeError}a bytes-like object is required, not 'str'

and another variant

np.array([x.encode() for x in re.split('\|| ',record)],dtype=dform)

which doesn't throw an error, but certainly doesn't return the right result:

[ ((842018864, 0, 0), '', (0.0, 0.0), (0.0, 0.0), (0, 0, 0.0, 0.0), (0.0, 0.0), 0, (0.0, 0.0, 0.0, 0.0), ((0.0, 0.0), (0.0, 0.0)), 0, '', '', (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), '', 0.0)...

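So how can I do this? I think the genfromtxt option is the way to go (especially since data may occasionally be missing), but I don't understand why it isn't working. Do I just need to write a parser myself?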

Answer 1

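Sorry, this answer is long and rambling, but that is what it took to figure out what was going on. The complexity of the dtype in particular was hidden by its length.

I get the TypeError: cannot perform accumulate with flexible type error when I try your list for delimiter. The traceback shows that the error occurs in LineSplitter. Without getting into the details, the delimiter should be one character (or the default 'whitespace').

From the genfromtxt docs:

delimiter : str, int, or sequence, optional. The string used to separate values. By default, any consecutive whitespaces act as delimiter. An integer or sequence of integers can also be provided as width(s) of each field.

The genfromtxt splitter is a little more powerful than the string .split that loadtxt uses, but not as general as the re splitter.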
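To make the two accepted forms of delimiter concrete, here is a minimal sketch. The input strings are made up for illustration and are not taken from the catalogue; it assumes a numpy version that accepts a list of str lines.

```python
import numpy as np

# A single-character delimiter splits on every occurrence of that character.
by_char = np.genfromtxt(["1|2|3"], delimiter="|")

# A sequence of integers is read as fixed field widths (4, 3 and 2 characters).
by_width = np.genfromtxt(["123456789"], delimiter=(4, 3, 2))

print(by_char)
print(by_width)
```

A tuple of strings such as (' ', '|') fits neither form, which is consistent with the error coming out of the line splitter.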

As for the {TypeError}a bytes-like object is required, not 'str', you specify the dtype 'str' for several of the fields. That is a byte string, whereas your record is a unicode string (in Py3). But you already realized that with BytesIO(record.encode()).

I like to test genfromtxt cases with:

record = b'....'
np.genfromtxt([record], ....)

or even better

records = b"""one line
two line
three line
"""
np.genfromtxt(records.splitlines(), ....)

If I let genfromtxt deduce the field types, and use just the one delimiter, I get 32 fields:

In [19]: A=np.genfromtxt([record],dtype=None,delimiter='|')
In [20]: len(A.dtype)
Out[20]: 32
In [21]: A
Out[21]: 
array((b'0002 00038 1', False, 3.6412123, 1.08701186, 14.1, -23.0, 69, 82, 1.8, 1.9, 1968.56, 1957.3, 3, 1.0, 3.0, 0.9, 3.0, 12.444, 0.213, 11.907, 0.189, 999, False, False, 3.64117944, 1.08706861, 1.83, 1.73, 81.0, 104.7, False, 0.0), 
      dtype=[('f0', 'S12'), ('f1', '?'), ('f2', '<f8'), ('f3', '<f8'), ('f4', '<f8'), ... ('f26', '<f8'), ('f27', '<f8'), ('f28', '<f8'), ('f29', '<f8'), ('f30', '?'), ('f31', '<f8')])

Once the bytes and delimiter issues are worked out,

np.array([x for x in re.split(b'\|| ',record)],dtype=dform)

does run. Now I see that your dform is complex, with nested compound fields.

But to define a structured array, you give it a list of records, e.g.

np.array([(record1...), (record2...), ....], dtype([(field1),(field2 ),...]))

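Here you are trying to create a single record. I could wrap your list in a tuple, but then its length, 66, does not match the length of dform (17 top-level fields). If you count all of its subfields, dform might require 66 values, but we can't supply them with just one flat tuple.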
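As a rough illustration of what np.array() expects for a nested dtype, the sketch below builds one record by hand. It uses only the first few fields of dform, and 'U1' is an assumed string width chosen for the example; the point is that the tuple nesting has to mirror the dtype nesting.

```python
import numpy as np

# Cut-down version of dform: two compound fields and one string field.
# 'U1' is an assumption made for this example, not part of the original dform.
small_dform = [
    ('starid', [('TYC1', int), ('TYC2', int), ('TYC3', int)]),
    ('pflag', 'U1'),
    ('starBearing', [('rightAscension', float), ('declination', float)]),
]

# One record: an outer tuple with one entry per top-level field,
# and an inner tuple for every compound field.
rec = ((2, 38, 1), '', (3.6412123, 1.08701186))

arr = np.array([rec], dtype=small_dform)
print(arr['starid']['TYC2'])
print(arr['starBearing']['declination'])
```

A flat list of split strings has no such nesting, so it cannot be matched against the compound fields this way.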

I have never tried to create an array with such a complex dtype, so I am looking around for ways to make it work.

In [41]: np.zeros((1,),dform)
Out[41]: 
array([ ((0, 0, 0), '', (0.0, 0.0), (0.0, 0.0), (0, 0, 0.0, 0.0), (0.0, 0.0), 0, (0.0, 0.0, 0.0, 0.0), ((0.0, 0.0), (0.0, 0.0)), 0, '', '', (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), '', 0.0)], 
      dtype=[('starid', [('TYC1', '<i4'), ('TYC2', '<i4'), ('TYC3', '<i4')]), ('pflag', '<U'), ('starBearing', [('rightAscension', '<f8'), ('declination', '<f8')]), ('properMotion', [('rightAscension', '<f8'), ('declination', '<f8')]), ('uncertainty', [('rightAscension', '<i4'), ('declination', '<i4'), ('pmRA', '<f8'), ('pmDc', '<f8')]), ('meanEpoch', ....('solutionType', '<U'), ('correlation', '<f8')])

In [64]: for name in A.dtype.names:
    print(A[name].dtype)
   ....:     
[('TYC1', '<i4'), ('TYC2', '<i4'), ('TYC3', '<i4')]
<U1
[('rightAscension', '<f8'), ('declination', '<f8')]
[('rightAscension', '<f8'), ('declination', '<f8')]
[('rightAscension', '<i4'), ('declination', '<i4'), ('pmRA', '<f8'), ('pmDc', '<f8')]
[('rightAscension', '<f8'), ('declination', '<f8')]
int32
[('rightAscension', '<f8'), ('declination', '<f8'), ('pmRA', '<f8'), ('pmDc', '<f8')]
[('BT', [('mag', '<f8'), ('err', '<f8')]), ('VT', [('mag', '<f8'), ('err', '<f8')])]
int32
<U1
<U1
[('rightAscension', '<f8'), ('declination', '<f8')]
[('rightAscension', '<f8'), ('declination', '<f8')]
[('rightAscension', '<f8'), ('declination', '<f8')]
<U1
float64

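I count 34 primitive dtype fields. The top-level fields are mostly either 'scalar' or compounds of 2-4 terms, and one has a further level of nesting.

If I replace the first 2 splitting spaces with |, record.split(b'|') gives me 34 strings.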
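A minimal sketch of that replacement step, using the record from the question as a bytes literal (whitespace inside the fields is copied as it appears above). Only the first two spaces are replaced, since those are the ones separating the three parts of the star id.

```python
# The record from the question, written as adjacent bytes literals.
record = (
    b'0002 00038 1| |  3.64121230|  1.08701186|   14.1|  -23.0| 69| 82| 1.8| 1.9|'
    b'1968.56|1957.30| 3|1.0|3.0|0.9|3.0|12.444|0.213|11.907|0.189|999| |         |'
    b'  3.64117944|  1.08706861|1.83|1.73| 81.0|104.7| | 0.0'
)

# Turn only the first two spaces into '|' so the star id splits into 3 fields.
normalized = record.replace(b' ', b'|', 2)
fields = normalized.split(b'|')

print(len(fields))
print(fields[:3])
```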

Let's try that in genfromtxt:

In [79]: np.genfromtxt([record],delimiter='|',dtype=dform)
Out[79]: 
array(((2, 38, 1), '', (3.6412123, 1.08701186), (14.1, -23.0), 
   (69, 82, 1.8, 1.9), (1968.56, 1957.3), 3, (1.0, 3.0, 0.9, 3.0),
   ((12.444, 0.213), (11.907, 0.189)), 999, '', '', 
   (3.64117944, 1.08706861), (1.83, 1.73), (81.0, 104.7), '', 0.0), 
      dtype=[('starid', [('TYC1', '<i4'), ('TYC2', '<i4'), ('TYC3', '<i4')]), 
 ('pflag', '<U'), 
 ('starBearing', [('rightAscension', '<f8'), ('declination', '<f8')]),  
 ('properMotion', [('rightAscension', '<f8'), ('declination', '<f8')]),
 ('uncertainty', [('rightAscension', '<i4'), ('declination', '<i4'), ('pmRA', '<f8'), ('pmDc', '<f8')]), 
 ('meanEpoch', [('rightAscension', '<f8'), ('declination', '<f8')]),   
 ('numPos', '<i4'), 
 ('fitGoodness', [('rightAscension', '<f8'), ('declination', '<f8'), ('pmRA', '<f8'), ('pmDc', '<f8')]), 
 ('magnitude', [('BT', [('mag', '<f8'), ('err', '<f8')]), ('VT', [('mag', '<f8'), ('err', '<f8')])]), 
 ('starProximity', '<i4'), ('tycho1flag', '<U'), ('hipparcosNumber', '<U'), 
 ('observedPos', [('rightAscension', '<f8'), ('declination', '<f8')]),
 ('observedEpoch', [('rightAscension', '<f8'), ('declination', '<f8')]), 
 ('observedError', [('rightAscension', '<f8'), ('declination', '<f8')]), ('solutionType', '<U'), ('correlation', '<f8')])

That looks almost reasonable. genfromtxt can actually split the values up among the compound fields. That is more than I would want to try with np.array().
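The same behaviour can be reproduced on a smaller scale. The sketch below uses a shortened, hypothetical record containing only the star id and position, and a dtype cut down to match, so that it stays numeric-only; the names dform_small and short_record are introduced here for illustration.

```python
import numpy as np

# Cut-down dtype: two compound fields, five primitive values in total.
dform_small = [
    ('starid', [('TYC1', int), ('TYC2', int), ('TYC3', int)]),
    ('starBearing', [('rightAscension', float), ('declination', float)]),
]

# Shortened, hypothetical record: star id followed by the two position values.
short_record = '0002 00038 1|  3.64121230|  1.08701186'

# Make '|' the only delimiter by replacing the two spaces inside the star id.
normalized = short_record.replace(' ', '|', 2)

# genfromtxt spreads the five columns over the nested fields.
star = np.genfromtxt([normalized], delimiter='|', dtype=dform_small)
print(star['starid'])
print(star['starBearing']['declination'])
```

String fields such as pflag are left out on purpose here, because how blank strings are filled depends on the string width in the dtype and on the numpy version.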

So if you get the delimiter and the bytes/unicode issue worked out, genfromtxt can handle this.

Answer 1: score 3, accepted, 2015-12-22 01:17:04
