NumPy thinks a 2-D array is 1-D

Question

I have a NumPy array that is constructed from a text file. I've been doing things this way for weeks and never seen this problem before.

print data
print data[:, 1:]

outputs

[['1', '200', '300', '400', '500\n']
 ['3', '500', '400', '200', '1000\n']
 ['14', '900', '200', '300', '100\n'] ...,
 ['999142', '24', '21', '20', '12\n']]
Traceback (most recent call last):
File ...., line ..., in ....
print data[:, 1:]
IndexError:  too many indices

Why is this happening and how can I fix it?

Edit: Big clue. data.shape is (3313869,) with no second value.

data.ndim is 1.

len(data[1]), however, is 5.

Edit: I am constructing it with

data = [re.split(' ', line) for line in f]
f.close()
data = np.array(data)

When I insert

f.close()
print data[0:10]

it gives, for example,

[['1', '200', '300', '400', '500\n'], ['3', .... ]]

Answer 1

The problem occurs because your code is creating a NumPy array of objects (dtype=object) instead of a 2-D array. See this question with a similar issue. When that happens you get something like:

a = numpy.array([list1, list2, list3, ... , listn], dtype=object)

It is a 1-D array whose elements are Python lists, but when you print it, NumPy prints each list it contains, giving:

[[ 1, 2, 3, 4],
 [ 5, 6, 7, 8]]

which looks like a 2-D array.

You can reproduce it with:

import numpy

a = ['aaa' for i in range(10)]
b = numpy.empty(5, dtype=object)
b.fill(a)  # put the list a into every slot of b

Let's check b:

b.shape # (5,)
b.ndim  # 1

But print b gives:

[['aaa', 'aaa', 'aaa', 'aaa', 'aaa', 'aaa', 'aaa', 'aaa', 'aaa', 'aaa']
 ['aaa', 'aaa', 'aaa', 'aaa', 'aaa', 'aaa', 'aaa', 'aaa', 'aaa', 'aaa']
 ['aaa', 'aaa', 'aaa', 'aaa', 'aaa', 'aaa', 'aaa', 'aaa', 'aaa', 'aaa']
 ['aaa', 'aaa', 'aaa', 'aaa', 'aaa', 'aaa', 'aaa', 'aaa', 'aaa', 'aaa']
 ['aaa', 'aaa', 'aaa', 'aaa', 'aaa', 'aaa', 'aaa', 'aaa', 'aaa', 'aaa']]

Quite tricky...
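The same effect can be reproduced with rows of unequal length, which is closer to what happens when a file is parsed. The following is only a minimal sketch: the two rows are made-up sample data, and dtype=object is passed explicitly because recent NumPy releases raise an error for ragged input instead of silently building an object array as older releases did.

```python
import numpy as np

# Hypothetical rows: the second one has an extra token, as if the line
# had a trailing space before the newline.
rows = [['1', '200', '300', '400', '500\n'],
        ['3', '500', '400', '200', '1000', '\n']]

# Ragged input cannot form a rectangle, so the result is a 1-D array
# whose elements are Python lists.
ragged = np.array(rows, dtype=object)
print(ragged.shape)   # (2,)
print(ragged.ndim)    # 1

# Two-axis indexing fails because there is only one axis.
try:
    ragged[:, 1:]
    failed = False
except IndexError as exc:
    failed = True
    print("IndexError:", exc)

# With equal-length rows the same call produces a real 2-D array.
fixed = np.array([row[:5] for row in rows])
print(fixed.shape)    # (2, 5)
```

Checking shape, ndim and dtype right after building the array is usually the quickest way to tell the two cases apart.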

Answer 2

I solved this with

for line in data:
    if len(line) != 5:
        print len(line)
        print line

A few of the lines in my data had a space at the end, so 500 and \n were split into separate tokens. This crept in on Friday, the last time I touched this code: I added a default value to the Python script that builds the input files for this script, for rows that were missing a particular value, and Vim inserted a space at the line wrap, which happened to fall on the character right before \n.

[re.split(' ', line.replace('\n', '').rstrip()) for line in f] gives the desired result.
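A variation on the same fix, sketched here under some assumptions: it uses str.split() with no arguments instead of re.split, and it checks the field count before building the array so a bad row fails loudly. The in-memory file and the expected count of 5 are stand-ins for the real input.

```python
import io
import numpy as np

# Hypothetical stand-in for the input file; the second line has a
# trailing space before the newline, like the problem rows.
f = io.StringIO("1 200 300 400 500\n3 500 400 200 1000 \n")

# split() with no arguments splits on runs of whitespace and ignores
# leading and trailing whitespace, so no '\n' or empty tokens survive.
rows = [line.split() for line in f]

# Fail early if any row still has an unexpected number of fields.
bad = [(i, row) for i, row in enumerate(rows) if len(row) != 5]
if bad:
    raise ValueError("rows with wrong field count: %r" % (bad[:10],))

data = np.array(rows)
print(data.shape)      # (2, 5)
print(data[:, 1:])     # 2-D slicing now works
```

Raising on the first inconsistency keeps the problem from surfacing later as a confusing indexing error.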

It is a little strange, I think, that NumPy seems to treat the array as both 1-D and 2-D (allowing me to select data[1] as a row). I guess that if the rows aren't of consistent length, it sees an array of lists rather than a 2-D array, and it makes a distinction between the two.
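The distinction shows up in what a single index returns. This is a small illustration with made-up values, not the original data: indexing the object array hands back the stored Python list, while indexing a true 2-D array hands back a NumPy row.

```python
import numpy as np

# Hypothetical object array holding two lists of different lengths.
obj = np.array([['1', '200'], ['3', '500', '400']], dtype=object)
row = obj[1]
print(type(row), len(row))            # a plain list, so len() works

# A rectangular input gives a real 2-D array instead.
grid = np.array([['1', '200'], ['3', '500']])
print(type(grid[1]), grid[1].shape)   # an ndarray row
```

So data[1] working is ordinary list retrieval from a 1-D array, not evidence of a second axis.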

2 answers

solution1
1 ACCEPTED 2013-06-11 18:17:13

solution2
0 2013-06-10 20:44:20
