Save data from separate columns in a file into a variable in Python 2.7

Question

I have some sample data in a file, arranged like this:

  u   v   w   p
 100 200 300 400 
 101 201 301 401
 102 202 302 402
 103 203 303 403 
 104 204 304 404
 105 205 305 405
 106 206 306 406
 107 207 307 407

I want to read the first column into a list 'u', the second column into a list 'v', and so on for every column up to 'p'. This is what I have so far:

import numpy as np
u  = []
v  = []
w  = []
p  = []

with open('testdata.dat') as f:
   for line in f:
       for x in line.split():
           u.append([int(x)])
           v.append([int(x)+1])
           w.append([int(x)+2])
           p.append([int(x)+3]) 

print 'u is'
print(u)
print 'v is'
print(v)
print 'w is'
print(w)
print 'p is'
print(p)

I have tried varying the indices, but it is obviously wrong, since I get this output:

u is
[[100], [200], [300], [400], [101], [201], [301], [401], [102], [202], [302], 
 [402], [103], [203], [303], [403], [104], [204], [304], [404], [105], [205], 
 [305], [405], [106], [206], [306], [406], [107], [207], [307], [407]]

v is
[[101], [201], [301], [401], [102], [202], [302], [402], [103], [203], [303], 
 [403], [104], [204], [304], [404], [105], [205], [305], [405], [106], [206], 
 [306], [406], [107], [207], [307], [407], [108], [208], [308], [408]]

w is
[[102], [202], [302], [402], [103], [203], [303], [403], [104], [204], [304], 
 [404], [105], [205], [305], [405], [106], [206], [306], [406], [107], [207], 
 [307], [407], [108], [208], [308], [408], [109], [209], [309], [409]]

p is
[[103], [203], [303], [403], [104], [204], [304], [404], [105], [205], [305], 
 [405], [106], [206], [306], [406], [107], [207], [307], [407], [108], [208], 
 [308], [408], [109], [209], [309], [409], [110], [210], [310], [410]]

It just adds the index to each value and reads through the entire row, whereas I want the data from each column written to a separate variable, i.e. corresponding to the names given in the sample data: u = 100 --> 107, v = 200 --> 207, etc.

Any ideas on how to do this in Python? (I have to perform this operation on really large datasets in an iterative manner, so fast and efficient code would be of great benefit.)

Answer 1

Please change the inner loop:

   for x in line.split():
       u.append([int(x)])
       v.append([int(x)+1])
       w.append([int(x)+2])
       p.append([int(x)+3])

to

   x = line.split()
   u.append([int(x[0])])
   v.append([int(x[1])])
   w.append([int(x[2])])
   p.append([int(x[3])])

In your original implementation, the statements inside the loop "for x in line.split():" are executed four times per line (once for each column), so every value ends up in all four lists.
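
Put together, the corrected loop could look like the sketch below. It appends plain integers rather than one-element lists, and it skips a non-numeric header line such as "u v w p" in case the real file has one (an assumption, since the posted output suggests it may not). The read_columns name and the in-memory sample rows are hypothetical, used only so the snippet runs on its own.

```python
def read_columns(lines):
    """Split whitespace-separated rows into four column lists."""
    u, v, w, p = [], [], [], []
    for line in lines:
        fields = line.split()
        # Skip blank lines and a non-numeric header such as "u v w p".
        if not fields or not fields[0].lstrip('-').isdigit():
            continue
        u.append(int(fields[0]))
        v.append(int(fields[1]))
        w.append(int(fields[2]))
        p.append(int(fields[3]))
    return u, v, w, p


# In-memory stand-in for the file; with a real file, pass the open file object.
sample = [
    "  u   v   w   p",
    " 100 200 300 400 ",
    " 101 201 301 401",
    " 102 202 302 402",
]
u, v, w, p = read_columns(sample)
print(u)  # [100, 101, 102]
print(p)  # [400, 401, 402]
```

With a real file this would be called as read_columns(f) inside the "with open('testdata.dat') as f:" block.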

Answer 2

x.append([int(y)+c]) appends a list containing one element, int(y)+c.

You need x.append(int(y)+c) to get a list of numbers instead of a list of singletons.
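
A minimal illustration of the difference, with made-up values:

```python
nested = []
flat = []
for y in ["100", "101", "102"]:
    nested.append([int(y)])  # appends a one-element list each time
    flat.append(int(y))      # appends the number itself

print(nested)  # [[100], [101], [102]]
print(flat)    # [100, 101, 102]
```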

Here is also a rather neat solution:

from itertools import izip

a="""1 2 3 4
10 20 30 40"""

lines= ([int(y) for y in x.split()] for x in a.split("\n"))
cols = izip(*lines)

print list(cols)

prints

[(1, 10), (2, 20), (3, 30), (4, 40)]

In your case, a.split("\n") would be replaced by something like open("data").readlines(), or simply by the open file object.
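
Applied to a file, that could look roughly like the sketch below. It is written to run on both Python 2.7 and Python 3: izip is used where available and the built-in zip otherwise. The temporary file and its contents are stand-ins for your data file, and the sketch assumes the file contains only numeric rows (no header line).

```python
import os
import tempfile

try:
    from itertools import izip  # Python 2: lazy zip
except ImportError:
    izip = zip                  # Python 3: zip is already lazy

# Write a small stand-in data file so the example is self-contained.
fd, path = tempfile.mkstemp(suffix=".dat")
with os.fdopen(fd, "w") as f:
    f.write("100 200 300 400\n101 201 301 401\n102 202 302 402\n")

with open(path) as f:
    # Iterating over the file object yields one line at a time.
    rows = ([int(y) for y in line.split()] for line in f)
    u, v, w, p = izip(*rows)

os.remove(path)

print(u)  # (100, 101, 102)
print(p)  # (400, 401, 402)
```

Each column comes back as a tuple; wrap it in list(...) if you need to append to it later.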

This should give you better memory performance, since the lines are parsed one at a time instead of being held as raw text, provided you iterate over the file object rather than calling readlines(). Keep in mind, though, that the *lines unpacking in izip(*lines) consumes the whole generator up front, so the parsed rows themselves do all end up in memory.

However, I don't know how it will perform CPU-wise; my guess is that it will be a bit better than, or about the same as, your original code.

If you are going to benchmark this, it would also be interesting to use plain lists instead of generators and try it on PyPy, if the data fits into memory (see the generators section of https://bitbucket.org/pypy/pypy/wiki/JitFriendliness).

Considering your data set

  (10**4 * 8 * 12)/1024.0

Assuming your numbers are relatively small and take 12 bytes each (see "Python: How much space does each element of a list take?"), that gives me a little under 1 MB of memory to hold all the data at once, which is a pretty tiny data set in terms of memory consumption.
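
The arithmetic behind that estimate, written out. The 12 bytes per number is the assumption stated above; the real per-object size depends on the interpreter and platform, and sys.getsizeof can be used to check it on yours.

```python
import sys

n_values = 10**4 * 8      # count of numbers used in the estimate above
bytes_per_value = 12      # assumed size of one small integer

estimate_kb = (n_values * bytes_per_value) / 1024.0
print(estimate_kb)        # 937.5, i.e. a little under 1 MB

# Actual size of one small int object on this interpreter (varies by build).
print(sys.getsizeof(100))
```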

Answer 3

If I understand correctly, by using the Python built-in functions zip and map, you only need one line to do that:

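from __future__ import print_function  # makes print() behave as a function on Python 2.7
from itertools import izip

u,v,w,p = izip(*(map(int,line.split()) for line in open('data.txt')))

# Usage (each column is a tuple; list() converts it for display)
print("u is", list(u))
print("v is", list(v))
print("w is", list(w))
print("p is", list(p))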

Producing the following result:

u is [100, 101, 102, 103, 104, 105, 106, 107]
v is [200, 201, 202, 203, 204, 205, 206, 207]
w is [300, 301, 302, 303, 304, 305, 306, 307]
p is [400, 401, 402, 403, 404, 405, 406, 407]

Since performance is your concern: the implicit looping done by zip and map should perform better than an explicit loop written in Python (even if loops are really fast). I'm not sure this solution has a better memory footprint, though...
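
Rather than guessing, the two approaches can be timed with the standard timeit module. The sketch below is only a rough harness: the synthetic in-memory rows, the function names and the repeat count are all made up for illustration, and the numbers will differ by machine and interpreter.

```python
import timeit

# Synthetic data: 10000 rows of four integers as text (stand-in for a file).
lines = ["%d %d %d %d" % (i, i + 1, i + 2, i + 3) for i in range(10000)]


def with_loop():
    """Explicit Python loop with one append per column."""
    u, v, w, p = [], [], [], []
    for line in lines:
        a, b, c, d = line.split()
        u.append(int(a))
        v.append(int(b))
        w.append(int(c))
        p.append(int(d))
    return u, v, w, p


def with_zip_map():
    """Transpose the parsed rows with zip and map."""
    u, v, w, p = zip(*(map(int, line.split()) for line in lines))
    return list(u), list(v), list(w), list(p)


# Both functions should produce the same columns.
assert with_loop() == with_zip_map()

print("loop:    %.4f s" % timeit.timeit(with_loop, number=20))
print("zip/map: %.4f s" % timeit.timeit(with_zip_map, number=20))
```

This times only the parsing, not the file I/O, so it is best read as a comparison of the two looping styles rather than of end-to-end load time.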

EDIT: replaced zip with izip to use a generator even on Python 2.x

3 answers

solution1
2 ACCEPTED 2013-06-09 09:37:35

solution2
1 2013-06-09 09:37:28

solution3
1 2013-06-09 10:03:34
