
How to use split() on the Python numpy.bytes_ type? (Reading a dictionary from a file)

I want to read data from a very large, whitespace-separated, two-column text file into a Python dictionary. I tried to do this with a for-loop, but that was too slow. Reading the file with numpy loadtxt into a structured array and then converting it to a dictionary is much faster:

import numpy as np

# 'a20' is the older alias for 'S20': a byte string of up to 20 characters
data = np.loadtxt('filename.txt', dtype=[('field1', 'a20'), ('field2', int)], ndmin=1)
result = dict(data)

But surely this is not the best way? Any advice?

The main reason I need something else is that the following does not work:

data[0]['field1'].split(sep='-')

It leads to the error message:

TypeError: Type str doesn't support the buffer API

If the split() method exists, why can't I use it? Should I use a different dtype? Or is there a different (fast) way to read the text file? Is there anything else I am missing?

Versions: Python 3.3.2, NumPy 1.7.1

Edit: changed data['field1'].split(sep='-') to data[0]['field1'].split(sep='-')

The standard library split returns a variable number of items, depending on how many times the separator occurs in the string, so it is not well suited to array operations. By the way, my NumPy char arrays (I'm running 1.7) do not have a split method.

You do have np.core.defchararray.partition, which is similar but poses no problems for vectorization, as well as all the other string operations:

>>> a = np.array(['a - b', 'c - d', 'e - f'], dtype=np.string_)
>>> a
array(['a - b', 'c - d', 'e - f'], 
      dtype='|S5')
>>> np.core.defchararray.partition(a, '-')
array([['a ', '-', ' b'],
       ['c ', '-', ' d'],
       ['e ', '-', ' f']], 
      dtype='|S2')

Because type(data[0]['field1']) gives <class 'numpy.bytes_'>, the split() method does not work when it is given a "normal" string as its argument (is this a bug?).

The way I solved it: data[0]['field1'].split(sep=b'-') (the key is to put the b in front of '-').
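To make the difference concrete, here is a minimal sketch of the three cases: a str separator, a bytes separator, and decoding to str first. The value b'abc-1' is made up for illustration, and the decode step assumes the data is plain ASCII.

```python
import numpy as np

# Hypothetical value standing in for data[0]['field1']
field = np.bytes_(b'abc-1')

# A str separator on a bytes object raises TypeError in Python 3
try:
    field.split(sep='-')
    str_sep_failed = False
except TypeError:
    str_sep_failed = True

# A bytes separator works and returns a list of bytes objects
parts_bytes = field.split(sep=b'-')

# Alternatively, decode to str first (assumes ASCII content), then split with a str
parts_text = field.decode('ascii').split(sep='-')
```

Which variant is preferable probably depends on whether the rest of the program works with bytes or with str.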

And of course Jaime's suggestion to use np.core.defchararray.partition(a, '-') was very helpful, but in this case b'-' is also needed to make it work.
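A sketch of the partition call on a bytes array under Python 3 might look as follows. It uses np.char, the public alias for the same functions; the sample strings are taken from the example above, and the strip step is an addition for illustration.

```python
import numpy as np

# Bytes array, as produced by an 'S' dtype under Python 3
a = np.array([b'a - b', b'c - d', b'e - f'])

# The separator must be bytes as well, matching the array's dtype
parts = np.char.partition(a, b'-')

# Columns: text before the separator, the separator, text after it
left = np.char.strip(parts[:, 0])
right = np.char.strip(parts[:, 2])
```

Because partition always yields exactly three pieces per element, the result is a regular two-dimensional array, which is what makes it suitable for vectorized use.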

In fact, a related question was answered here: "Type str doesn't support the buffer API", although at first sight I did not realise this was the same issue.
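Since the underlying issue is bytes versus str, another option may be to ask loadtxt for a unicode field ('U20') instead of a bytes field, so that the keys are ordinary strings. This is only a sketch: it assumes a reasonably recent NumPy (it has not been checked against 1.7.1), and the in-memory file contents are made up.

```python
import io
import numpy as np

# Made-up stand-in for the two-column text file
text_file = io.StringIO("abc-1 10\ndef-2 20\n")

# 'U20' requests a unicode (str) field of up to 20 characters
data = np.loadtxt(text_file, dtype=[('field1', 'U20'), ('field2', int)], ndmin=1)

# Keys are now str, so split() accepts a plain str separator
first_key_parts = data[0]['field1'].split(sep='-')

# Build the dictionary from the two columns
result = dict(zip(data['field1'].tolist(), data['field2'].tolist()))
```

Note that unicode fields take more memory per character than bytes fields, which may matter for a very large file.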

