How to load data from a txt file with variable number of columns in each row into a numpy array?
My data file looks like this:
I want to load this data into a numpy array. How do I do that?
If I use loadtxt(filename), it gives the error:
raise ValueError(errmsg)
ValueError: Some errors were detected !
If I use genfromtxt(filename, delimiter=" "), it gives the same error, even though this was supposed to fix it.
If I use the following:
from array import array
N=84 # max number of columns in any row in the data file
with open('C:/Users/hp1/Desktop/ClusterAnalysis/hierarchical_result.txt',"r") as f:
    all_data=[x.split() for x in f.readlines()]
    a=array([map(int,x) for x in all_data[:N]])
I get this error:
TypeError: array() argument 1 must be a unicode character, not list
EDIT: This is all of the data in the data file:
61 81
2 28
13 31
59 64
36 63
45 58
3 73
47 51
33 68
1 72
12 84
3 73 12 84
1 72 3 73 12 84
6 83
27 42
66 6 83
54 77
60 54 77
39 40
10 19
49 79
22 76
61 81 60 54 77
65 61 81 60 54 77
8 65 61 81 60 54 77
66 6 83 8 65 61 81 60 54 77
71 47 51
18 25
59 64 18 25
32 59 64 18 25
11 34
20 26
27 42 20 26
69 27 42 20 26
16 62
43 16 62
30 45 58
85 30 45 58
56 85 30 45 58
17 11 34
22 76 32 59 64 18 25
29 39 40
14 57
44 14 57
7 24
78 2 28
15 37
70 15 37
48 70 15 37
80 29 39 40
4 9
75 43 16 62
13 31 75 43 16 62
74 13 31 75 43 16 62
36 63 17 11 34
53 36 63 17 11 34
46 1 72 3 73 12 84
23 52
38 66 6 83 8 65 61 81 60 54 77
82 38 66 6 83 8 65 61 81 60 54 77
10 19 56 85 30 45 58
33 68 10 19 56 85 30 45 58
5 49 79
78 2 28 4 9
55 80 29 39 40
67 55 80 29 39 40
7 24 67 55 80 29 39 40
35 48 70 15 37
69 27 42 20 26 35 48 70 15 37
41 82 38 66 6 83 8 65 61 81 60 54 77
50 69 27 42 20 26 35 48 70 15 37
33 68 10 19 56 85 30 45 58 41 82 38 66 6 83 8 65 61 81 60 54 77
7 24 67 55 80 29 39 40 50 69 27 42 20 26 35 48 70 15 37
46 1 72 3 73 12 84 33 68 10 19 56 85 30 45 58 41 82 38 66 6 83 8 65 61 81 60 54 77
22 76 32 59 64 18 25 46 1 72 3 73 12 84 33 68 10 19 56 85 30 45 58 41 82 38 66 6 83 8 65 61 81 60 54 77
7 24 67 55 80 29 39 40 50 69 27 42 20 26 35 48 70 15 37 22 76 32 59 64 18 25 46 1 72 3 73 12 84 33 68 10 19 56 85 30 45 58 41 82 38 66 6 83 8 65 61 81 60 54 77
53 36 63 17 11 34 7 24 67 55 80 29 39 40 50 69 27 42 20 26 35 48 70 15 37 22 76 32 59 64 18 25 46 1 72 3 73 12 84 33 68 10 19 56 85 30 45 58 41 82 38 66 6 83 8 65 61 81 60 54 77
78 2 28 4 9 53 36 63 17 11 34 7 24 67 55 80 29 39 40 50 69 27 42 20 26 35 48 70 15 37 22 76 32 59 64 18 25 46 1 72 3 73 12 84 33 68 10 19 56 85 30 45 58 41 82 38 66 6 83 8 65 61 81 60 54 77
74 13 31 75 43 16 62 78 2 28 4 9 53 36 63 17 11 34 7 24 67 55 80 29 39 40 50 69 27 42 20 26 35 48 70 15 37 22 76 32 59 64 18 25 46 1 72 3 73 12 84 33 68 10 19 56 85 30 45 58 41 82 38 66 6 83 8 65 61 81 60 54 77
44 14 57 74 13 31 75 43 16 62 78 2 28 4 9 53 36 63 17 11 34 7 24 67 55 80 29 39 40 50 69 27 42 20 26 35 48 70 15 37 22 76 32 59 64 18 25 46 1 72 3 73 12 84 33 68 10 19 56 85 30 45 58 41 82 38 66 6 83 8 65 61 81 60 54 77
5 49 79 44 14 57 74 13 31 75 43 16 62 78 2 28 4 9 53 36 63 17 11 34 7 24 67 55 80 29 39 40 50 69 27 42 20 26 35 48 70 15 37 22 76 32 59 64 18 25 46 1 72 3 73 12 84 33 68 10 19 56 85 30 45 58 41 82 38 66 6 83 8 65 61 81 60 54 77
71 47 51 5 49 79 44 14 57 74 13 31 75 43 16 62 78 2 28 4 9 53 36 63 17 11 34 7 24 67 55 80 29 39 40 50 69 27 42 20 26 35 48 70 15 37 22 76 32 59 64 18 25 46 1 72 3 73 12 84 33 68 10 19 56 85 30 45 58 41 82 38 66 6 83 8 65 61 81 60 54 77
23 52 71 47 51 5 49 79 44 14 57 74 13 31 75 43 16 62 78 2 28 4 9 53 36 63 17 11 34 7 24 67 55 80 29 39 40 50 69 27 42 20 26 35 48 70 15 37 22 76 32 59 64 18 25 46 1 72 3 73 12 84 33 68 10 19 56 85 30 45 58 41 82 38 66 6 83 8 65 61 81 60 54 77
21 23 52 71 47 51 5 49 79 44 14 57 74 13 31 75 43 16 62 78 2 28 4 9 53 36 63 17 11 34 7 24 67 55 80 29 39 40 50 69 27 42 20 26 35 48 70 15 37 22 76 32 59 64 18 25 46 1 72 3 73 12 84 33 68 10 19 56 85 30 45 58 41 82 38 66 6 83 8 65 61 81 60 54 77
numpy.genfromtxt does not handle variable-length rows. You should parse the txt file yourself.
There is no need to use the array module. In Python 3.x you can do the following:
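import numpy as np
N = 84 # max number of columns in any row in the data file
with open('C:/Users/hp1/Desktop/ClusterAnalysis/hierarchical_result.txt',"r") as f:
    all_data = [x.split() for x in f.readlines()]
# The rows have different lengths, so this is a 1-D array of Python lists.
# Recent NumPy versions raise an error for ragged input unless dtype=object is given.
output = np.array([list(map(int,x))[:N] for x in all_data], dtype=object)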
In [306]: with open('stack44755004.txt') as f:
     ...:     lines = f.readlines()
     ...:
In [307]: strs = [line.split() for line in lines]
In [308]: strs
Out[308]: [['61', '81'], ['2', '28'], ['13', '31'], ['3', '73', '12', '84'], ['6', '83']]
In [309]: nums = [[int(i) for i in line.split()]for line in lines]
In [310]: nums
Out[310]: [[61, 81], [2, 28], [13, 31], [3, 73, 12, 84], [6, 83]]
nums is a list of lists of numbers. Because the inner lists have different lengths, it can't be made directly into a 2d array of numbers.
But with a plain read I get a string with newlines:
In [311]: with open('stack44755004.txt') as f:
     ...:     alldata = f.read()
In [312]: alldata
Out[312]: '61 81\n2 28\n13 31\n3 73 12 84\n6 83\n'
split treats the newlines like spaces, so I get a flat list of strings:
In [313]: alldata.split()
Out[313]: ['61', '81', '2', '28', '13', '31', '3', '73', '12', '84', '6', '83']
np.array can convert that to an array of integers:
In [314]: np.array(alldata.split(),int)
Out[314]: array([61, 81, 2, 28, 13, 31, 3, 73, 12, 84, 6, 83])
This method loses all the line information. Is that important?
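If the line information does matter, one option is to keep the row lengths next to the flat array and split it back on demand. A minimal sketch, assuming the same five sample lines as above held in a string (the names `text`, `lengths` and `rows` are only illustrative):

```python
import numpy as np

# Stand-in for the file contents (assumption: the five sample lines used above).
text = "61 81\n2 28\n13 31\n3 73 12 84\n6 83\n"

# Number of values on each line, recorded before flattening.
lengths = [len(line.split()) for line in text.splitlines()]

# One flat integer array holding every value in file order.
flat = np.array(text.split(), dtype=int)

# Cut the flat array at the cumulative row boundaries to recover the rows.
rows = np.split(flat, np.cumsum(lengths)[:-1])
```

Here `rows` is a plain Python list of 1-D arrays, one per line of the file, so it still is not a 2d array, but nothing is lost.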
There are ways of turning nums into an array. For example, it could be written into a zero-padded array. But if you don't know what you want, I'm not sure that's worth the trouble.
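As a rough sketch of the zero-padded idea, assuming `nums` is the list of lists from the session above, the rows can be copied into a preallocated array (`width` and `padded` are names made up for this example):

```python
import numpy as np

# Same ragged list of lists as in the session above.
nums = [[61, 81], [2, 28], [13, 31], [3, 73, 12, 84], [6, 83]]

# The widest row decides the number of columns.
width = max(len(row) for row in nums)

# Start from zeros and overwrite the leading part of each row.
padded = np.zeros((len(nums), width), dtype=int)
for i, row in enumerate(nums):
    padded[i, :len(row)] = row
```

Zero is only a sensible fill value if it cannot be confused with real data; otherwise pick a sentinel such as -1.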
There have been various padding questions. One tool that I recall off the top of my head is itertools.zip_longest (Python3 version):
In [317]: itertools.zip_longest(*nums, fillvalue=0)
Out[317]: <itertools.zip_longest at 0xa9c46194>
In [318]: list(itertools.zip_longest(*nums, fillvalue=0))
Out[318]: [(61, 2, 13, 3, 6), (81, 28, 31, 73, 83), (0, 0, 0, 12, 0), (0, 0, 0, 84, 0)]
In [319]: np.array(_)
Out[319]:
array([[61, 2, 13, 3, 6],
[81, 28, 31, 73, 83],
[ 0, 0, 0, 12, 0],
[ 0, 0, 0, 84, 0]])
In [320]: _.T
Out[320]:
array([[61, 81, 0, 0],
[ 2, 28, 0, 0],
[13, 31, 0, 0],
[ 3, 73, 12, 84],
[ 6, 83, 0, 0]])
I have used pandas for that problem, where you can specify the desired columns. If a row has fewer columns, the missing ones will be set to NaN. You have to know the maximum number of columns, but that is easily detected using readlines, split and a list comprehension.
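The answer does not show its code, so the following is only a guess at what it might look like: a sketch that counts the widest row first and then passes that many column names to `pandas.read_csv`. The sample string and the names `text`, `n_cols`, `df` and `arr` are assumptions for illustration.

```python
import io
import pandas as pd

# Stand-in for the file contents (assumption: five sample lines).
text = "61 81\n2 28\n13 31\n3 73 12 84\n6 83\n"

# Detect the maximum number of columns with split and a comprehension.
n_cols = max(len(line.split()) for line in text.splitlines())

# Naming n_cols columns up front lets shorter rows be filled with NaN.
df = pd.read_csv(io.StringIO(text), sep=r"\s+", header=None,
                 names=list(range(n_cols)))

# Columns that contain NaN become float, so the array is float as well.
arr = df.to_numpy()
```

For a real file, pass its path instead of the `io.StringIO` object. Because NaN is a float, the integers come back as floats in the padded columns.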
If you want to pad each row with the max number of columns, you have to implement it yourself. Something to this effect:
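import numpy as np

def pad_list(lst, padding, default=0):
    # extend lst with `default` until it has `padding` elements
    return lst + (padding - len(lst))*[default]

N = 84 # max number of columns in any row in the data file
with open('/path/to/file',"r") as f:
    all_data=(map(int, x.split()) for x in f)
    # all_data is a lazy generator over f, so build the array while the file is still open
    a = np.array([pad_list(list(x), N) for x in all_data])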
However, for this to give you a numeric array instead of an object-type array, you need to know the actual maximum number of columns: if N is smaller than the longest row, that row is not padded and the rows end up with unequal lengths. So be careful when figuring that out.
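One way to avoid guessing N is to measure it from the data before padding. A minimal sketch, assuming the lines fit in memory; `load_ragged` and `sample` are hypothetical names for this example, not part of numpy:

```python
import numpy as np

def load_ragged(lines, fill=0):
    """Parse whitespace-separated integers and pad every row to the widest one."""
    # Skip blank lines so they do not turn into all-fill rows.
    rows = [[int(tok) for tok in line.split()] for line in lines if line.strip()]
    # The widest row is measured from the data rather than hard-coded.
    width = max(len(r) for r in rows)
    return np.array([r + [fill] * (width - len(r)) for r in rows])

# Three lines taken from the data file, standing in for an open file object.
sample = ["61 81\n", "3 73 12 84\n", "66 6 83\n"]
a = load_ragged(sample)
```

With a real file, `load_ragged(f)` inside a `with open(...) as f:` block should work the same way, since iterating a file yields its lines.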