
Grouping Equal Elements In An Array

I'm writing a program in Python that needs to go through four columns of data in a text file and, for each set of identical numbers in the first column, return the full row (all four numbers) with the largest number in the third column.

For example, I need this input:

1.0     19.3    15.5    0.1
1.0     25.0    25.0    0.1
2.0     4.8     3.1     0.1
2.0     7.1     6.4     0.1
2.0     8.6     9.7     0.1
2.0     11.0    14.2    0.1
2.0     13.5    19.0    0.1
2.0     16.0    22.1    0.1
2.0     19.3    22.7    0.1
2.0     25.0    21.7    0.1
3.0     2.5     2.7     0.1
3.0     3.5     4.8     0.1
3.0     4.8     10.0    0.1
3.0     7.1     18.4    0.1
3.0     8.6     21.4    0.1
3.0     11.0    22.4    0.1
3.0     19.3    15.9    0.1
4.0     4.8     16.5    0.1
4.0     7.1     13.9    0.1
4.0     8.6     11.3    0.1
4.0     11.0    9.3     0.1
4.0     19.3    5.3     0.1
4.0     2.5     12.8    0.1
3.0     25.0    13.2    0.1

To return:

1.0     19.3    15.5    0.1
2.0     19.3    22.7    0.1
3.0     11.0    22.4    0.1
4.0     4.8     16.5    0.1

Here, the row [1.0, 19.3, 15.5, 0.1] is returned because 15.5 is the greatest third-column value among all the rows where 1.0 is the first number. For each set of identical numbers in the first column, the function must return the row with the greatest value in the third column.

I am struggling to actually do this in Python, because my loop iterates over EVERY row and finds a single maximum, rather than one maximum for each 'set' of first-column numbers.

Is there something about for loops that I don't know which could help me do this?

Below is what I have so far.

import numpy as np

C0,C1,C2,C3 = np.loadtxt("FILE.txt",dtype={'names': ('C0', 'C1', 'C2','C3'),'formats': ('f4', 'f4', 'f4','f4')},unpack=True,usecols=(0,1,2,3))

def FUNCTION(C_0, C_1, C_2, C_3):
    for i in range(len(C_1)):
        a = []
        a.append(C_0[i])
        # Collect every first-column value equal to the current one
        for j in range(len(C_0)):
            if C_0[j] == C_0[i]:
                a.append(C_0[j])
        return a


print(FUNCTION(C0, C1, C2, C3))

where C0, C1, C2, and C3 are columns in the text file, loaded as 1-D arrays. Right now I'm just trying to isolate the indexes of the rows with equal C0 values.

One approach is to use a dict keyed by the first-column item, where the value is the best row seen so far for that key. This way you don't have to load the whole text file into memory at once: you can scan line by line and update the dict as you go.
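A minimal sketch of that idea, assuming whitespace-separated columns. The function name `best_rows` and the inline sample lines are illustrative and not part of the original post. It keeps one row per first-column value and replaces it only when a later row has a strictly larger third column:

```python
def best_rows(lines):
    """Return {first-column value: row with the largest third column}."""
    best = {}
    for line in lines:
        fields = line.split()
        if not fields:          # skip blank lines
            continue
        row = [float(x) for x in fields]
        key = row[0]
        # Strict '>' keeps the first row seen when third-column values tie.
        if key not in best or row[2] > best[key][2]:
            best[key] = row
    return best


# Hypothetical sample lines; any iterable of strings works here.
sample = [
    "2.0     19.3    22.7    0.1",
    "2.0     25.0    21.7    0.1",
    "3.0     11.0    22.4    0.1",
    "3.0     19.3    15.9    0.1",
]
result = best_rows(sample)
for row in result.values():
    print(row)
```

Because the function only iterates over `lines`, an open file object could be passed in place of the list, so the file would be consumed one line at a time.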

I was a bit confused by the first and second rows. I believe the 25.0 at row 2, column 3 is a mistake in your example.

My code is not a mathematical solution, but it works.

import collections

with open("INPUT.txt", "r") as datasheet:
    data = datasheet.read().splitlines()

# Maps first-column value -> [second, third, fourth] of the best row so far
dataset = collections.OrderedDict()

for dataitem in data:
    # Split on any run of whitespace, since the column spacing varies
    temp = dataitem.split()
    if not temp:
        continue  # skip blank lines
    if temp[0] in dataset:
        # Replace the stored row only if this row's third column is larger
        if float(dataset[temp[0]][1]) < float(temp[2]):
            dataset[temp[0]] = [temp[1], temp[2], temp[3]]
    else:
        dataset[temp[0]] = [temp[1], temp[2], temp[3]]

# Some sort code here

with open("OUTPUT.txt", "w") as outputsheet:
    for datakey, datavalue in dataset.items():
        outputsheet.write("%s    %s    %s    %s\n" % (datakey, datavalue[0], datavalue[1], datavalue[2]))

Using NumPy and lambda

Using the properties of a dict with a lambda sort key does the trick:

import numpy as np

data = np.loadtxt("FILE.txt",dtype={'names': ('a', 'b', 'c','d'),'formats': ('f4', 'f4', 'f4','f4')},usecols=(0,1,2,3))

# ordering by columns 1 and 3
sorted_data = sorted(data, key=lambda x: (x[0],x[2]))

# dict comprehension mapping the value of the first column to a row;
# later entries overwrite earlier ones, so each key ends up with the
# last row in its group, i.e. the one with the largest third column
ret = {d[0]: list(d) for d in sorted_data}.values()

Alternatively, you can make it an (ugly) one-liner:

ret = {
    d[0]: list(d)
    for d in sorted(np.loadtxt("FILE.txt",dtype={'names': ('a', 'b', 'c','d'),
                                                 'formats': ('f4', 'f4', 'f4','f4')},
                                          usecols=(0,1,2,3)),
                    key=lambda x: (x[0],x[2]))
}.values()

As @Fallen pointed out, this is an inefficient method because you need to read in the whole file. However, for the purposes of this example, where the data set is quite small, it's reasonably acceptable.

Reading one line at a time

The more efficient way is to read in one line at a time.

# Build the dict while reading the file one line at a time
d = {}
with open('data', 'r') as f:
    for s in f:
        fields = s.split()
        if not fields:
            continue  # skip blank lines
        data = [float(n) for n in fields]

        if data[0] in d:
            # '>=' means a later row wins when third-column values tie
            if data[2] >= d[data[0]][2]:
                d[data[0]] = data
        else:
            d[data[0]] = data

print(list(d.values()))

The caveat here is that there is no secondary sorting metric. If the row stored for 1.0 is initially [1.0, 2.0, 3.0, 5.0], then any subsequent row starting with 1.0 whose third column is greater than or equal to 3.0 will overwrite it, e.g. [1.0, 1.0, 3.0, 1.0].
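To make the tie-breaking behaviour concrete, here is a small sketch using the two rows from the caveat. The helper `pick` and its `keep_first` flag are hypothetical names introduced only for this illustration:

```python
def pick(rows, keep_first):
    """Select the row with the largest third column from rows sharing a key."""
    chosen = None
    for row in rows:
        if chosen is None:
            chosen = row
        elif keep_first and row[2] > chosen[2]:
            chosen = row        # strict '>': ties keep the earlier row
        elif not keep_first and row[2] >= chosen[2]:
            chosen = row        # '>=': ties are won by the later row
    return chosen


rows = [[1.0, 2.0, 3.0, 5.0], [1.0, 1.0, 3.0, 1.0]]
last_wins = pick(rows, keep_first=False)   # same comparison as the code above
first_wins = pick(rows, keep_first=True)
print(last_wins)
print(first_wins)
```

If the first row in a tie should be kept, switching the comparison from `>=` to `>` is enough. A different secondary rule (for example, preferring the larger second column) would need an explicit extra comparison.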
