Load .dat and .npy into Python

Question

How can I read and store an 8D array in Python from a .dat file? My binary file looks like this. I want each string to be a row:

['r 11 1602 24 1622 0\n', 'i 26 1602 36 1631 0\n', 
'v 37 1602 57 1621 0\n', 'e 59 1602 76 1622 0\n', 
'r 77 1602 91 1622 1\n', 'h 106 1602 127 1631 0\n', 
'e 127 1602 144 1622 1\n', 'h 160 1602 181 1631 0\n',
'e 181 1602 198 1622 0\n', 'a 200 1602 218 1622 0\n',
'r 218 1602 232 1622 0\n', 'd 234 1602 254 1631 1\n',
't 268 1602 280 1627 0\n', 'h 280 1602 301 1631 0\n',
'e 302 1602 319 1622 1\n', 'd 335 1602 355 1631 0\n']

When I tried this:

file1 = open('data/train1.dat', 'rb')
train1_dat = np.loadtxt(file1.readlines(), delimiter=',')  
print train1_dat

I got this error:

ValueError: could not convert string to float: r 11 1602 24 1622 0

Answer 1

Assuming your .dat file is exactly as in your question, we first create a data string that mimics this format. We then read it in and munge it into a format suitable for loading into numpy:

from io import StringIO  # Python 3 (on Python 2: from StringIO import StringIO)

d = StringIO("""['r 11 1602 24 1622 0\n', 'i 26 1602 36 1631 0\n', 
'v 37 1602 57 1621 0\n', 'e 59 1602 76 1622 0\n', 
'r 77 1602 91 1622 1\n', 'h 106 1602 127 1631 0\n', 
'e 127 1602 144 1622 1\n', 'h 160 1602 181 1631 0\n',
'e 181 1602 198 1622 0\n', 'a 200 1602 218 1622 0\n',
'r 218 1602 232 1622 0\n', 'd 234 1602 254 1631 1\n',
't 268 1602 280 1627 0\n', 'h 280 1602 301 1631 0\n',
'e 302 1602 319 1622 1\n', 'd 335 1602 355 1631 0\n'] """)

data = d.read()  # read contents of .dat file
data = data.strip()  # remove leading and trailing whitespace
data = data.replace('\n', '')  # remove all newlines
data = data.replace("', '", "','")  # normalise the separators
data = data[2:-2]  # drop the leading [' and the trailing ']
data = data.split("','")  # convert into a clean list
data = '\n'.join(data)  # re-combine into a string to load into numpy

print(data)  # have a look at the new string format

The resulting .dat string looks like this:

r 11 1602 24 1622 0
i 26 1602 36 1631 0
v 37 1602 57 1621 0
e 59 1602 76 1622 0
r 77 1602 91 1622 1
h 106 1602 127 1631 0
e 127 1602 144 1622 1
h 160 1602 181 1631 0
e 181 1602 198 1622 0
a 200 1602 218 1622 0
r 218 1602 232 1622 0
d 234 1602 254 1631 1
t 268 1602 280 1627 0
h 280 1602 301 1631 0
e 302 1602 319 1622 1
d 335 1602 355 1631 0
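As an alternative to the manual string clean-up, here is a minimal sketch using ast.literal_eval from the standard library. It assumes the file text really is a valid Python list literal, with the two-character sequence backslash-n inside each quoted item (mimicked below with a raw string). The variable names and the shortened sample are only for illustration.

```python
import ast

# Assumption: the text is a valid Python list literal. The raw string keeps
# each \n as two characters, as it would appear in a file on disk.
raw = r"""['r 11 1602 24 1622 0\n', 'i 26 1602 36 1631 0\n',
'v 37 1602 57 1621 0\n', 'e 59 1602 76 1622 0\n']"""

rows = ast.literal_eval(raw)  # list of str, each ending in a real newline
text = ''.join(rows)          # one record per line, ready for np.loadtxt

print(text)
```

If the file does not parse as a list literal, literal_eval raises an exception, in which case the step-by-step replace/split approach is the safer route.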

Silly footnote: I find it intriguing that the first column seems to be an acrostic ("river he heard the d...") and that the 1 in the last column marks the end of each word :-) Anyway, none of my business.

More seriously, if you could arrange to have your .dat file in this format from the beginning, then all the steps above would be unnecessary. Now we are ready to import easily into a numpy array:

import numpy as np

d = StringIO(data)
# The column names 'a' to 'f' are arbitrary and can be changed to suit.
# The numbers are all arbitrarily imported as floats.
# Note: on Python 3, 'S1' yields bytes (b'r'); use 'U1' to get str.
data = np.loadtxt(d, dtype={'names': ('a', 'b', 'c', 'd', 'e', 'f'),
                            'formats': ('S1', 'f', 'f', 'f', 'f', 'f')})
print(data)

Here is the result:

[('r', 11.0, 1602.0, 24.0, 1622.0, 0.0)
 ('i', 26.0, 1602.0, 36.0, 1631.0, 0.0)
 ('v', 37.0, 1602.0, 57.0, 1621.0, 0.0)
 ('e', 59.0, 1602.0, 76.0, 1622.0, 0.0)
 ('r', 77.0, 1602.0, 91.0, 1622.0, 1.0)
 ('h', 106.0, 1602.0, 127.0, 1631.0, 0.0)
 ('e', 127.0, 1602.0, 144.0, 1622.0, 1.0)
 ('h', 160.0, 1602.0, 181.0, 1631.0, 0.0)
 ('e', 181.0, 1602.0, 198.0, 1622.0, 0.0)
 ('a', 200.0, 1602.0, 218.0, 1622.0, 0.0)
 ('r', 218.0, 1602.0, 232.0, 1622.0, 0.0)
 ('d', 234.0, 1602.0, 254.0, 1631.0, 1.0)
 ('t', 268.0, 1602.0, 280.0, 1627.0, 0.0)
 ('h', 280.0, 1602.0, 301.0, 1631.0, 0.0)
 ('e', 302.0, 1602.0, 319.0, 1622.0, 1.0)
 ('d', 335.0, 1602.0, 355.0, 1631.0, 0.0)]
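If the file already holds one record per line (which the readlines() output in the question hints at), the lines can probably be passed to np.loadtxt directly. The sketch below assumes Python 3 and a recent numpy; the sample lines stand in for the file contents, and the choice of 'U1' and 'i4' instead of 'S1' and 'f' is my own, not something the data demands.

```python
import numpy as np

# Stand-in for open('data/train1.dat').readlines(), assuming the file
# has one space-separated record per line.
lines = ['r 11 1602 24 1622 0\n', 'i 26 1602 36 1631 0\n',
         'v 37 1602 57 1621 0\n', 'e 59 1602 76 1622 0\n',
         'r 77 1602 91 1622 1\n']

# Field names 'a' to 'f' are arbitrary. 'U1' is a one-character str,
# 'i4' a 32-bit integer (an assumption; use 'f' if floats are wanted).
dtype = {'names': ('a', 'b', 'c', 'd', 'e', 'f'),
         'formats': ('U1', 'i4', 'i4', 'i4', 'i4', 'i4')}
data = np.loadtxt(lines, dtype=dtype)

letters = data['a']        # a whole column, selected by field name
word = ''.join(letters)    # join the first column back into text
first_row = data[0]        # a single record
```

Each field of the structured array behaves like an ordinary 1-D array, so expressions such as data['b'].max() or data[data['f'] == 1] work as usual.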

solution1
1 ACCEPTED 2015-11-07 14:31:33
