
How do I import a text file with no separators in python, using numpy?

How do I import a file with no separators?

I have a file named text.txt which contains 2 lines of text:

00000000011100000000000000000000
00000000011111110000000000000000

When I use

import numpy as np

f = open("text.txt")
data = np.loadtxt(f)

I get

[ 1.11000000e+22 1.11111100e+22]

Using sep="" changes nothing.

I would like to get this result, in the form of many single-digit integers:

[ [00000000011100000000000000000000]
[00000000011111110000000000000000] ]

Any help is appreciated.

UPDATE: Thank you all for the great answers and the many valid solutions to an awkward question.

I'll take the statement "I would like to get this result, in the form of many single digit integers:" literally, and ignore the format of the sample that follows it (which appears to be just two integers, rather than many single-digit integers). You can do that with genfromtxt by using the arguments delimiter=1 and dtype=int. When delimiter is an integer or a sequence of integers, the values are interpreted as the field widths of a file containing fixed-width fields of data.

For example:

In [15]: genfromtxt('text.txt', delimiter=1, dtype=int)
Out[15]: 
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

If you don't give numpy any guidance, loadtxt falls back on its default dtype, which is float.

So each line is read as one decimal number and stored in a float64. Reading the lines as decimal integers would not help either: 00000000011100000000000000000000 (which is obviously equal to 11100000000000000000000) takes 74 bits, so it won't fit in an int32 or even an int64.
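The bit count is easy to check with Python's arbitrary-precision integers. A small sketch (the variable name n is only illustrative):

```python
# Parse the first line of the file as a decimal integer; leading zeros are ignored.
n = int("00000000011100000000000000000000")

print(n)               # 11100000000000000000000
print(n.bit_length())  # 74
print(n > 2**63 - 1)   # True: too large even for a signed 64-bit integer
```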

If you didn't realize that 1.11E22 means the same thing as 11100000000000000000000, you need to read up on scientific notation. 1.11E22 is Python (and C, and many other programming languages) shorthand for 1.11 * 10**22. Anyway, the reason you're getting scientific notation is that the default printout for an array of float64 is %g-style, meaning something like "simple notation if -4 <= exponent < precision, otherwise exponential".

So, that's why you get [1.11000000e+22 1.11111100e+22].
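To see the rule in action, here is a small sketch using Python's own %g formatting. It follows the same "simple or exponential" rule, but it is not NumPy's printing code, so the number of digits shown differs from the array printout above:

```python
# The text from the file, parsed as a float, is the same value as the literal 1.11e22.
x = float("00000000011100000000000000000000")
print(x == 1.11e22)   # True

# %g switches to exponential form once the exponent reaches the precision (6 by default).
print("%g" % x)       # 1.11e+22
print("%g" % 1234.5)  # 1234.5
```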


The reason you get an array of shape (2,) instead of (2, 1) is that, by default, loadtxt squeezes out length-1 axes. Add ndmin=2 if a two-dimensional result is what you want.
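A quick way to see the squeezing is to compare the shapes with and without ndmin=2. This sketch feeds the two lines from the question through io.StringIO instead of text.txt; that substitution is mine, so the snippet runs on its own:

```python
import io
import numpy as np

# Stand-in for text.txt: the two lines from the question.
text = "00000000011100000000000000000000\n00000000011111110000000000000000\n"

squeezed = np.loadtxt(io.StringIO(text))       # default: the single column is squeezed away
kept = np.loadtxt(io.StringIO(text), ndmin=2)  # keep two dimensions

print(squeezed.shape)  # (2,)
print(kept.shape)      # (2, 1)
```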


If you ask NumPy to treat the data as strings, it will guess the right length, and read them as strings:

>>> np.loadtxt(f, dtype=str, ndmin=2)
array([['00000000011100000000000000000000'],
       ['00000000011111110000000000000000']],
      dtype='|S32')

Or, if you ask it to treat the data as Python objects, it'll leave them as Python str objects:

>>> np.loadtxt(f, dtype=object, ndmin=2)
array([['00000000011100000000000000000000'],
       ['00000000011111110000000000000000']],
      dtype=object)

If you want them to be 128-bit integers… well, you probably don't have int128 support in your build, so you can't have that.

If you were hoping for them to be interpreted as bit strings and stored in 32-bit ints, you have to do that in two steps. I don't think NumPy can vectorize parsing bit strings usefully, so you might as well do that part in Python:

>>> np.fromiter((int(line, 2) for line in f), dtype=int)
array([7340032, 8323072])

If you want them interpreted as single-digit integers, there's no way to do that directly, but you can do that in two steps as well (e.g., read it as an array of 2 strings, treat each string as a sequence of characters, broadcast np.vectorize(int) over it).
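A minimal sketch of that two-step route, assuming a Python 3 / recent NumPy setup. The names strings, chars and digits are illustrative, and io.StringIO stands in for text.txt so the snippet is self-contained:

```python
import io
import numpy as np

# Stand-in for text.txt: the two lines from the question.
text = "00000000011100000000000000000000\n00000000011111110000000000000000\n"

# Step 1: read each line as one string.
strings = np.loadtxt(io.StringIO(text), dtype=str)

# Step 2: split every string into characters, then convert each character to int.
chars = np.array([list(s) for s in strings])  # 2-D array of one-character strings
digits = np.vectorize(int)(chars)

print(digits.shape)  # (2, 32)
print(digits)
```

np.vectorize is a convenience wrapper around a Python-level loop, so this is about readability rather than speed.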

Almost anything you want to do is doable, but you have to actually know what you want to do and be able to explain it to a human before you'll be able to explain it to numpy.

If I understand you correctly, try the following:

a = np.loadtxt('text.txt', dtype=str)  # one string per line
a = np.array([[int(ch) for ch in row] for row in a])  # one int per character

Output:

[[0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]

This solution is a bit dumb, and it defeats the purpose of using np.loadtxt, but sometimes we just want things to work.

