Saving data from different columns of a file into variables in Python 2.7

Question

I have some sample data in a file, arranged like this:

  u   v   w   p
 100 200 300 400 
 101 201 301 401
 102 202 302 402
 103 203 303 403 
 104 204 304 404
 105 205 305 405
 106 206 306 406
 107 207 307 407

Now I want to read the 1st column and save it into a list 'u', the 2nd column into a list 'v', and so on for every column up to 'p'. This is what I have so far:

import numpy as np
u  = []
v  = []
w  = []
p  = []

with open('testdata.dat') as f:
   for line in f:
       for x in line.split():
           u.append([int(x)])
           v.append([int(x)+1])
           w.append([int(x)+2])
           p.append([int(x)+3]) 

print 'u is'
print(u)
print 'v is'
print(v)
print 'w is'
print(w)
print 'p is'
print(p)

I have tried varying the indices, but it is obviously wrong, since I get this output:

u is
[[100], [200], [300], [400], [101], [201], [301], [401], [102], [202], [302], 
 [402], [103], [203], [303], [403], [104], [204], [304], [404], [105], [205], 
 [305], [405], [106], [206], [306], [406], [107], [207], [307], [407]]

v is
[[101], [201], [301], [401], [102], [202], [302], [402], [103], [203], [303], 
 [403], [104], [204], [304], [404], [105], [205], [305], [405], [106], [206], 
 [306], [406], [107], [207], [307], [407], [108], [208], [308], [408]]

w is
[[102], [202], [302], [402], [103], [203], [303], [403], [104], [204], [304], 
 [404], [105], [205], [305], [405], [106], [206], [306], [406], [107], [207], 
 [307], [407], [108], [208], [308], [408], [109], [209], [309], [409]]

p is
[[103], [203], [303], [403], [104], [204], [304], [404], [105], [205], [305], 
 [405], [106], [206], [306], [406], [107], [207], [307], [407], [108], [208], 
 [308], [408], [109], [209], [309], [409], [110], [210], [310], [410]]

It just increments the row number by the index and reads the entire row, whereas I want the data from each column written to a separate variable, i.e. corresponding to the names given in the sample data: u = 100 --> 107, v = 200 --> 207, etc.

Any ideas on how to do this in Python? (I have to perform this operation on really large datasets in an iterative manner, so fast and efficient code would be of great benefit.)

Answer 1

Please change the inner loop:

   for x in line.split():
       u.append([int(x)])
       v.append([int(x)+1])
       w.append([int(x)+2])
       p.append([int(x)+3])

to

   x = line.split()
   u.append([int(x[0])])
   v.append([int(x[1])])
   w.append([int(x[2])])
   p.append([int(x[3])])

In your original implementation, the statements inside the loop "for x in line.split():" are executed four times per line (once for each column), so every value in the row ends up in all four lists.
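Putting the change together, the whole loop might look like the sketch below. It assumes four whitespace-separated integer columns; the helper name read_columns, the header-skipping check, and the in-memory sample standing in for open('testdata.dat') are illustrative additions, not part of the original code. It also appends plain integers rather than one-element lists, so that u comes out as [100, 101, ...].

```python
import io

def read_columns(f):
    """Read four whitespace-separated integer columns from an open file."""
    u, v, w, p = [], [], [], []
    for line in f:
        x = line.split()
        # Skip blank lines and a non-numeric header such as "u v w p"
        # (assumption: data rows start with a digit or a minus sign).
        if not x or not x[0].lstrip('-').isdigit():
            continue
        u.append(int(x[0]))
        v.append(int(x[1]))
        w.append(int(x[2]))
        p.append(int(x[3]))
    return u, v, w, p

# Stand-in for open('testdata.dat'), so the sketch is self-contained.
sample = io.StringIO(u"  u   v   w   p\n 100 200 300 400 \n 101 201 301 401\n")
u, v, w, p = read_columns(sample)
print(u)
print(p)
```

With a real file, the call would be wrapped in "with open('testdata.dat') as f:" as in the question.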

Answer 2

x.append([int(y)+c]) appends a list containing one element, int(y)+c.

You need x.append(int(y)+c) to get a list of numbers instead of a list of singletons.

Here is also a pretty nice solution:

from itertools import izip

a="""1 2 3 4
10 20 30 40"""

lines= ([int(y) for y in x.split()] for x in a.split("\n"))
cols = izip(*lines)

print list(cols)

prints

[(1, 10), (2, 20), (3, 30), (4, 40)]

In your case, a.split("\n") would be something like open("data").readlines().
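As a rough sketch of the same idea applied to a file object: iterating over the file directly also works and avoids building the list that readlines() returns. The function name columns_from_file and the in-memory sample are assumptions for illustration, and the input is assumed to contain no header line. The sketch uses the built-in zip so that it also runs on Python 3; on Python 2.7, itertools.izip is the lazy equivalent.

```python
import io

def columns_from_file(f):
    """Transpose whitespace-separated integer rows into column tuples."""
    # Parse each non-blank line lazily into a list of ints.
    rows = ([int(y) for y in line.split()] for line in f if line.strip())
    # zip(*rows) consumes the generator and regroups the values by column.
    return list(zip(*rows))

# Stand-in for open("data"), so the sketch is self-contained.
data = io.StringIO(u"1 2 3 4\n10 20 30 40\n")
cols = columns_from_file(data)
print(cols)
```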

This should give you better memory performance, because the rows are parsed lazily instead of being built up as lists first, unless you go on to turn the generators into lists. Keep in mind, though, that the * in izip(*lines) has to consume the whole generator when izip is called, so all parsed rows are in memory at that point.

However, I don't know how it will perform CPU-wise; my guesstimate is that it might be a bit better than, or about the same as, your original code.

If you are going to benchmark this, it would also be interesting to use plain lists instead of generators and try it on PyPy (see the generators section of https://bitbucket.org/pypy/pypy/wiki/JitFriendliness), if you can fit the data into memory.

Considering your data set:

  (10**4 * 8 * 12)/1024.0

Assuming your numbers are relatively small and take 12 bytes each (see "Python: How much space does each element of a list take?"), that gives a little under 1 MB of memory to hold all the data at once, which is a pretty tiny data set in terms of memory consumption.
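The estimate above can be reproduced with the short sketch below. The function name estimate_kib is made up for illustration, and the 12-byte figure is the assumption stated above; the actual per-integer size depends on the interpreter build, which sys.getsizeof reports for the machine you run it on.

```python
import sys

def estimate_kib(rows, cols, bytes_per_number):
    """Rough size in KiB of rows*cols numbers, ignoring list overhead."""
    return (rows * cols * bytes_per_number) / 1024.0

# The figure used above: 10**4 rows, 8 columns, 12 bytes per number.
estimate = estimate_kib(10**4, 8, 12)
print(estimate)

# Actual size of a small int object on this interpreter (platform-dependent).
print(sys.getsizeof(100))
```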

Answer 3

If I understand correctly, by using the Python built-in functions zip and map, you only need one line to do that:

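from __future__ import print_function  # makes print() behave as in Python 3
from itertools import izip  # Python 2 only; on Python 3 use the built-in zip

# Each line becomes a list of ints; izip(*rows) regroups them by column.
u,v,w,p = izip(*(map(int,line.split()) for line in open('data.txt')))

# Usage (Python3 syntax)
print("u is", list(u))
print("v is", list(v))
print("w is", list(w))
print("p is", list(p))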

Producing the following result:

u is [100, 101, 102, 103, 104, 105, 106, 107]
v is [200, 201, 202, 203, 204, 205, 206, 207]
w is [300, 301, 302, 303, 304, 305, 306, 307]
p is [400, 401, 402, 403, 404, 405, 406, 407]

Since performance is your concern, implicit looping with zip and map should perform better than an explicit Python loop (even if loops are really fast). I'm not sure this solution has a better memory footprint, though...

EDIT: replaced zip with izip to use a generator even on Python 2.x
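If the file really does start with the "u v w p" header shown in the question, int() would fail on that first line. One possible way around it, sketched here under that assumption, is to read the header separately and use it to name the columns. The function name read_named_columns and the in-memory sample are illustrative only; the built-in zip is used so the sketch also runs on Python 3.

```python
import io

def read_named_columns(f):
    """Return a dict mapping each header name to its column as a list of ints."""
    names = next(f).split()  # first line holds the column names
    # Parse the remaining non-blank lines lazily into rows of ints.
    rows = (map(int, line.split()) for line in f if line.strip())
    # zip(*rows) regroups values by column; pair each column with its name.
    return dict(zip(names, (list(col) for col in zip(*rows))))

# Stand-in for open('data.txt'), so the sketch is self-contained.
sample = io.StringIO(u"  u   v   w   p\n 100 200 300 400 \n 101 201 301 401\n")
cols = read_named_columns(sample)
print(cols['u'])
print(cols['p'])
```

Keeping the columns in a dict means the variable names come from the file itself rather than being hard-coded.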
