
Create an array from a txt file

I'm new to Python and I have a problem. I have some measured data saved in a txt file. The data is separated with tabs and has this structure:

0   0   -11.007001  -14.222319  2.336769

I always have 32 data points per simulation (0, 1, 2, ..., 31) and I have 300 simulations (0, 1, 2, ..., 299), so the data is sorted first by simulation number and then by data point number.

The first column is the simulation number, the second column is the data point number and the other 3 columns are the x, y, z coordinates.

I would like to create a 3d array: the first dimension should be the simulation number, the second the number of the data point and the third the three coordinates.

I have already made a start, and here is what I have so far:

## read file
coords = [x.split('\t') for x in
          open(f,'r').read().replace('\r','')[:-1].split('\n')]
## extract the information you want
simnum = [int(x[0]) for x in coords]
npts = [int(x[1]) for x in coords]
xyz = array([map(float,x[2:]) for x in coords])

but I don't know how to combine these 2 lists and this one array.

In the end I would like to have something like this:

array = [simnum][num_dat_point][xyz]

Thanks for your help.

I hope you understand my problem. It's my first post in a Python forum, so if I did anything wrong, I'm sorry about that.

Thanks again

You can combine them with the zip function, like so:

# xyz holds one (x, y, z) row per line, so unpack each row in the loop target
for sim, datapoint, (x, y, z) in zip(simnum, npts, xyz):
    pass  # do your thing

Or you could avoid list comprehensions altogether and just iterate over the lines of the file:

for line in open(fname):
    lst = line.split('\t')
    sim, datapoint = int(lst[0]), int(lst[1])
    x, y, z = [float(i) for i in lst[2:]]
    # do your thing

To split the file into lines you could (and should) iterate over the file object directly, instead of reading the whole file and splitting it by hand:

coords = [x.split('\t') for x in open(fname)]

According to the Zen of Python, flat is better than nested. I'd just use a dict.

import csv
f = csv.reader(open('thefile.csv'), delimiter='\t',
               quoting=csv.QUOTE_NONNUMERIC)

result = {}
for simn, dpoint, c1, c2, c3 in f:
    result[simn, dpoint] = c1, c2, c3

# pretty-prints the result:
from pprint import pprint
pprint(result)

This seems like a good opportunity to use itertools.groupby.

import itertools
import csv
file = open("data.txt")
reader = csv.reader(file, delimiter='\t')
result = []
for simnumberStr, rows in itertools.groupby(reader, key=lambda t: t[0]):
    simData = []
    for row in rows:
        simData.append([float(v) for v in row[2:]])
    result.append(simData)
file.close()

This will create a three-dimensional list named 'result'. The first index is the simulation number, and the second index is the data index within that simulation. The value is a list of floats containing the x, y, and z coordinates.

Note that this assumes the data is already sorted on simulation number and data number.
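If the file cannot be trusted to be sorted, one option is to sort the rows before grouping them. The following is only a sketch of that idea: the in-memory `sample` text is a made-up stand-in for the real file, and its rows are deliberately out of order.

```python
import csv
import io
import itertools

# Hypothetical stand-in for the data file, with the rows shuffled on purpose.
sample = io.StringIO(
    "1\t1\t-8.0\t-11.0\t1.5\n"
    "0\t1\t-10.5\t-13.0\t2.0\n"
    "1\t0\t-9.25\t-12.5\t1.75\n"
    "0\t0\t-11.007001\t-14.222319\t2.336769\n"
)

# Sort numerically by (simulation number, data point number) first,
# because groupby only groups rows that are adjacent.
rows = sorted(csv.reader(sample, delimiter='\t'),
              key=lambda row: (int(row[0]), int(row[1])))

result = [
    [[float(v) for v in row[2:]] for row in group]
    for _sim, group in itertools.groupby(rows, key=lambda row: row[0])
]

print(result[0][0])  # [-11.007001, -14.222319, 2.336769]
print(result[1][1])  # [-8.0, -11.0, 1.5]
```

Sorting reads the whole file into memory, which should be fine for 300 x 32 rows but may not suit very large files.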

Essentially, the difficulty is what happens if different simulations have different numbers of points.

You will therefore need to dimension an array to the appropriate size first. t should be an array of at least (max(simnum) + 1) x (max(npts) + 1) x 3, since both numbers start at 0. To eliminate confusion you should initialise it with not-a-number; this will allow you to see missing points.

Then use something like

for x in coords:
  # x[0] and x[1] are the two indices; x[2], x[3], x[4] are the coordinates
  t[int(x[0])][int(x[1])][0]=float(x[2])
  t[int(x[0])][int(x[1])][1]=float(x[3])
  t[int(x[0])][int(x[1])][2]=float(x[4])

Is this what you meant?
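The pre-sizing and NaN initialisation described in this answer could be sketched as follows. This is an illustration only: it uses plain nested lists rather than a real array type, and the `coords` rows are invented sample data in which simulation 1 has no data point 1.

```python
import math

# Hypothetical sample rows, already split into strings as in `coords`.
# Simulation 1 is missing data point 1 on purpose.
coords = [
    ['0', '0', '-11.007001', '-14.222319', '2.336769'],
    ['0', '1', '-10.5', '-13.0', '2.0'],
    ['1', '0', '-9.25', '-12.5', '1.75'],
]

# Numbering starts at 0, so each dimension needs max + 1 slots.
n_sims = max(int(row[0]) for row in coords) + 1
n_points = max(int(row[1]) for row in coords) + 1

# Pre-fill every slot with NaN so that missing points stay visible.
nan = float('nan')
t = [[[nan, nan, nan] for _ in range(n_points)] for _ in range(n_sims)]

for row in coords:
    sim, point = int(row[0]), int(row[1])
    t[sim][point] = [float(v) for v in row[2:5]]

print(t[0][0])                 # [-11.007001, -14.222319, 2.336769]
print(math.isnan(t[1][1][0]))  # True: this point was never filled in
```

With numpy installed, the same layout could be held in an array created by numpy.full((n_sims, n_points, 3), numpy.nan) instead of nested lists.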

You could be using many different kinds of containers for your purposes, but none of them has array as an unqualified name -- Python has a module array which you can import from the standard library, but the array.array type is too limited for your purposes (1-D only and with elementary types as contents); there's a popular third-party extension known as numpy, which does have a powerful array type (numpy.ndarray, usually created with numpy.array), which you could use if you have downloaded and installed the extension -- but as you never even once mention numpy I doubt that's what you mean; the relevant builtin types are list and dict. I'll assume you want any container whatsoever -- but if you could learn to use precise terminology in the future, that will substantially help you AND anybody who's trying to help you (say list when you mean list, array only when you DO mean array, "container" when you're uncertain about what container to use, and so forth).

I suggest you look at the csv module in the standard library for a more robust way of reading your data, but that's a separate issue. Let's start from when you have the coords list of lists of 5 strings each, each sublist with strings representing two ints followed by three floats. Two more key aspects need to be specified...

One key aspect you don't tell us about: is the list sorted in some significant way? Is there, in particular, some significant order you want to keep? As you don't even mention either issue, I will have to assume one way or another, and I'll assume that there isn't any guaranteed or meaningful order, but also that there is no repetition (each pair of simulation/datapoint numbers occurs at most once).

Second key aspect: are there the same number of datapoints per simulation, in increasing order (0, 1, 2, ...), or is that not necessarily the case (and btw, are the simulations themselves numbered 0, 1, 2, ...)? Again, no clue from you on this indispensable part of the specs -- note how many assumptions you're forcing would-be helpers to make by just not telling us about such obviously crucial aspects. Don't let people who want to help you stumble in the dark: rather, learn to ask questions the smart way -- this will save untold amounts of time for yourself AND would-be helpers, and give you higher-quality and more relevant help, so, why not do it? Anyway, forced to make yet another assumption, I'll have to assume nothing at all is known about the simulation numbers nor about the numbers of datapoints in each simulation.

With these assumptions dict emerges as the only sensible structure to use for the outer container: a dictionary whose key is a tuple with two items, simulation number then datapoint number within the simulation. The values may as well be tuples, too (with three floats each), since it does appear that you have exactly 3 coordinates per line.

With all of these assumptions...:

def make_container(coords):
  result = dict()
  for s, d, x, y, z in coords:
    key = int(s), int(d)
    value = float(x), float(y), float(z)
    result[key] = value
  return result

It's always best, and fastest, to have all significant code within def statements (i.e. as functions to be called, possibly with appropriate arguments), so I'm presenting it this way. make_container returns a dictionary which you can address with the simulation number and datapoint number; for example,

d = make_container(coords)
print(d[0, 0])

will print the x, y, z for dp 0 of sim 0, assuming one exists (you would get a KeyError if such a sim/dp combination did not exist). dicts have many useful methods, e.g. changing the print call above to

print(d.get((0, 0)))

(yes, you do need double parentheses around the argument to get -- inner ones to make a tuple, outer ones to call get with that tuple as its single argument), you'd see None, rather than get an exception, if there was no such sim/dp combination as (0, 0).
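To make the difference between indexing and get concrete, here is a small self-contained sketch. It repeats the make_container idea in condensed form, and the single `coords` row and the missing key (5, 7) are made up for illustration.

```python
def make_container(coords):
    # Same idea as above: map (simulation, datapoint) -> (x, y, z).
    result = {}
    for s, d, x, y, z in coords:
        result[int(s), int(d)] = (float(x), float(y), float(z))
    return result

# Hypothetical input: one row, already split into strings.
coords = [['0', '0', '-11.007001', '-14.222319', '2.336769']]
d = make_container(coords)

print(d[0, 0])        # (-11.007001, -14.222319, 2.336769)
print(d.get((5, 7)))  # None: no exception for a missing key
print((5, 7) in d)    # False

try:
    d[5, 7]
except KeyError as exc:
    print('missing key:', exc)  # missing key: (5, 7)
```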

If you can edit your question to make your specs more precise (perhaps including some indication of ways you plan to use the resulting container, as well as the various key aspects I've listed above), I might well be able to fine-tune this advice to match your need and circumstances much better (and so might every other responder, regarding their own advice!), so I strongly recommend you do so -- thanks in advance for helping us help you!-)

First I'd point out that your first data point appears to be an index, and wonder if the data is therefore important or not, but whichever :-)

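import re

# Compiled once: two integer columns followed by three float columns,
# separated by any whitespace.
LINE_RE = re.compile(r'^(\d+)\s+(\d+)\s+([-\d.]+)\s+([-\d.]+)\s+([-\d.]+)$')

def parse(line):
    m = LINE_RE.match(line)
    if m:
        l = m.groups()
        # Build a real list so the result is the same in Python 2 and 3.
        return (int(l[0]), int(l[1]), [float(v) for v in l[2:]])
    return None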

finaldata = []
file = open("data.txt",'r')
for line in file:
    r = parse(line)
    if r is not None:
        finaldata.append(r)

finaldata should then contain something along the lines of:

[(0, 0, [-11.007001000000001, -14.222319000000001, 2.3367689999999999]),
 (1, 0, [-11.007001000000001, -14.222319000000001, 2.3367689999999999]),
 (2, 0, [-11.007001000000001, -14.222319000000001, 2.3367689999999999]),
 (3, 0, [-11.007001000000001, -14.222319000000001, 2.3367689999999999]),
 (4, 0, [-11.007001000000001, -14.222319000000001, 2.3367689999999999])]

This should be pretty robust about dealing with the whitespace issues (tabs, spaces, whatnot)...

I also wonder how big your data files are; mine are usually large, so being able to process them in chunks or groups becomes more important... Anyway, this will work in Python 2.6.
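For large files, one way to process the data in groups is a generator that hands back one simulation at a time. This is only a sketch under two assumptions that the answer itself does not state: the rows are sorted by simulation number, and every non-blank line has the five expected columns. The name `iter_simulations` and the sample text are invented for the example.

```python
import io
import itertools

def iter_simulations(lines):
    """Yield (simulation_number, [[x, y, z], ...]) one simulation at a time."""
    # Skip blank lines and split the rest on any whitespace.
    rows = (line.split() for line in lines if line.strip())
    # groupby only groups adjacent rows, hence the sorted-input assumption.
    for sim, group in itertools.groupby(rows, key=lambda row: int(row[0])):
        yield sim, [[float(v) for v in row[2:5]] for row in group]

# Hypothetical stand-in for an open file object.
sample = io.StringIO(
    "0\t0\t-11.007001\t-14.222319\t2.336769\n"
    "0\t1\t-10.5\t-13.0\t2.0\n"
    "\n"
    "1\t0\t-9.25\t-12.5\t1.75\n"
)

for sim, points in iter_simulations(sample):
    print(sim, len(points))
```

Only the rows of the current simulation are held in memory at any time, so the same loop works when an open file is passed in place of `sample`.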

Are you sure a 3d array is what you want? It seems more likely that you want a 2d array, where the simulation number is one dimension, the data point is the second, and the value stored at that location is the coordinates.

This code will give you that.

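# Nested dicts keyed by integers, so data[7][13] works without pre-sizing.
data = {}
for coord in coords:
    sim, point = int(coord[0]), int(coord[1])
    if sim not in data:
        data[sim] = {}
    data[sim][point] = (float(coord[2]), float(coord[3]), float(coord[4]))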

To get the coordinates at simulation 7, data point 13, just do data[7][13]
