
Grouping columns by unique values in Python

I have a data set with two columns and I need to change it from this format:

10  1 
10  5
10  3
11  5
11  4
12  6
12  2

to this:

10  1  5  3
11  5  4
12  6  2

I need every unique value in the first column to be on its own row, followed by all of its second-column values.

I am a beginner with Python and, beyond reading in my text file, I'm at a loss for how to proceed.

You can use pandas dataframes.

import pandas as pd

df = pd.DataFrame({'A':[10,10,10,11,11,12,12],'B':[1,5,3,5,4,6,2]})
print(df)

Output:

    A  B
0  10  1
1  10  5
2  10  3
3  11  5
4  11  4
5  12  6
6  12  2

Let's use groupby and join:

df.groupby('A')['B'].apply(lambda x:' '.join(x.astype(str)))

Output:

A
10    1 5 3
11      5 4
12      6 2
Name: B, dtype: object

An example using only itertools.groupby; this is all in the Python standard library (although the pandas version is much more concise!).

Assuming the keys you want to group are adjacent, this can all be done lazily (no need to hold all your data in memory at any time):

from io import StringIO
from itertools import groupby

text = '''10  1
10  5
10  3
11  5
11  4
12  6
12  2'''

# read and group data:
with StringIO(text) as file:
    keys = []
    res = {}

    data = (line.strip().split() for line in file)

    for k, g in groupby(data, key=lambda x: x[0]):
        keys.append(k)
        res[k] = [item[1] for item in g]

print(keys)  # ['10', '11', '12']
print(res)   # {'10': ['1', '5', '3'], '11': ['5', '4'], '12': ['6', '2']}

# write grouped data:
with StringIO() as out_file:
    for key in keys:
        out_file.write('{:3s}'.format(key))
        out_file.write(' '.join(['{:3s}'.format(item) for item in res[key]]))
        out_file.write('\n')
    print(out_file.getvalue())
    # 10 1   5   3
    # 11 5   4
    # 12 6   2

You can then replace with StringIO(text) as file: with something like with open('infile.txt', 'r') as file: so that the program reads your actual file (and similarly use with open('outfile.txt', 'w') as out_file: for the output file).

Again: you could of course write directly to the output file every time a key is found; this way you never need to hold all the data in memory:

with StringIO(text) as file, StringIO() as out_file:

    data = (line.strip().split() for line in file)

    for k, g in groupby(data, key=lambda x: x[0]):
        out_file.write('{:3s}'.format(k))
        out_file.write(' '.join(['{:3s}'.format(item[1]) for item in g]))
        out_file.write('\n')

    print(out_file.getvalue())
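The adjacency assumption mentioned above can be seen concretely in a small sketch. The rows below are made up for illustration (they are not the asker's data), and the helper name first is hypothetical; sorting first is one possible workaround, at the cost of loading every row into memory:

```python
from itertools import groupby

# Hypothetical input where the key '10' appears in two separate runs.
rows = [('10', '1'), ('11', '5'), ('10', '3')]


def first(row):
    """Return the grouping key (the first column)."""
    return row[0]


# groupby only merges consecutive rows with equal keys, so '10' shows up twice.
unsorted_groups = [(k, [v for _, v in g]) for k, g in groupby(rows, key=first)]
print(unsorted_groups)  # [('10', ['1']), ('11', ['5']), ('10', ['3'])]

# Sorting by key first brings equal keys together. sorted() is stable, so the
# original order of the values within each key is kept.
sorted_groups = [
    (k, [v for _, v in g])
    for k, g in groupby(sorted(rows, key=first), key=first)
]
print(sorted_groups)  # [('10', ['1', '3']), ('11', ['5'])]
```

If the input file is already ordered by the first column, as in the question, the sort is unnecessary and the lazy version works as is.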

Using collections.defaultdict (a dict subclass):

import collections
with open('yourfile.txt', 'r') as f:
    d = collections.defaultdict(list)
    for k,v in (l.split() for l in f.read().splitlines()):  # processing each line
        d[k].append(v)             # accumulating values for the same 1st column
    for k,v in sorted(d.items()):  # outputting grouped sequences
        print('%s  %s' % (k,'  '.join(v)))

The output:

10  1  5  3
11  5  4
12  6  2
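One thing to watch with the sorted(d.items()) call above: the keys are strings, so they sort lexicographically. A minimal sketch, assuming the first column always holds integers (the sample lines here are invented for illustration), converts the keys with int() so that 9 sorts before 10:

```python
import collections

# Hypothetical lines with a one-digit key; as strings, '9' would sort after '10'.
lines = ['10  1', '9  7', '10  5', '9  2']

d = collections.defaultdict(list)
for line in lines:
    k, v = line.split()
    d[int(k)].append(v)  # integer keys sort numerically

grouped_lines = ['%s  %s' % (k, '  '.join(v)) for k, v in sorted(d.items())]
print('\n'.join(grouped_lines))
# 9  7  2
# 10  1  5
```

Unlike itertools.groupby, this approach does not need equal keys to be adjacent in the input.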

Using pandas may be easier. You can use the read_csv function to read a text file in which the values are separated by one or more spaces.

import pandas as pd

df = pd.read_csv("input.txt", header=None, delimiter=r"\s+")  # raw string avoids an invalid-escape warning
# setting column names
df.columns = ['col1', 'col2']
df

This gives the following dataframe:

    col1  col2
0    10     1
1    10     5
2    10     3
3    11     5
4    11     4
5    12     6
6    12     2

After reading the text file into a dataframe, you can use aggregate (agg) and join, similar to the apply used in the other answer:

df_combine = df.groupby('col1')['col2'].agg(lambda col: ' '.join(col.astype('str'))).reset_index()
df_combine

Output:

     col1     col2
0    10       1 5 3
1    11       5 4
2    12       6 2
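To get from the grouped dataframe to the plain-text layout asked for in the question, something like the following sketch might work. It reads the sample data from an in-memory string instead of input.txt so it runs on its own, and the variable names grouped and lines are illustrative choices, not part of the answer above:

```python
import io

import pandas as pd

text = """10  1
10  5
10  3
11  5
11  4
12  6
12  2"""

# io.StringIO stands in for the real file; names= sets the column labels directly.
df = pd.read_csv(io.StringIO(text), header=None, sep=r"\s+", names=["col1", "col2"])

# Collect the col2 values of each col1 group into a list.
grouped = df.groupby("col1")["col2"].agg(list)

# Build one output line per key: the key followed by its values.
lines = [
    "{}  {}".format(key, "  ".join(str(v) for v in values))
    for key, values in grouped.items()
]
print("\n".join(lines))
# 10  1  5  3
# 11  5  4
# 12  6  2
```

The resulting lines could then be written to a file with an ordinary open(..., 'w') block.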

I found this solution using dictionaries:

with open("data.txt", encoding='utf-8') as data:
    file = data.readlines()

    dic = {}
    for line in file:
        list1 = line.split()
        try:
            dic[list1[0]] += list1[1] + ' '
        except KeyError:
            dic[list1[0]] = list1[1] + ' '

    for k,v in dic.items():
        print(k,v)

Output:

10 1 5 3
11 5 4
12 6 2

Something more functional (the same logic wrapped in a function):

def getdata(datafile):
    with open(datafile, encoding='utf-8') as data:
        file = data.readlines()

    dic = {}
    for line in file:
        list1 = line.split()
        try:
            dic[list1[0]] += list1[1] + ' '
        except KeyError:
            dic[list1[0]] = list1[1] + ' '

    for k,v in dic.items():
        v = v.split()
        print(k, ':',v)

getdata("data.txt")

Output:

10 : ['1', '5', '3']
11 : ['5', '4']
12 : ['6', '2']
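As a possible variation on the function above, the grouping can be returned instead of printed, so the caller decides how to format it. This is only a sketch: the name group_pairs is made up, skipping lines that do not have exactly two fields is an assumption about how malformed input should be handled, and a temporary file stands in for data.txt so the example runs by itself:

```python
import os
import tempfile


def group_pairs(path):
    """Return {first column: [second-column values]} in order of first appearance."""
    groups = {}
    with open(path, encoding='utf-8') as data:
        for line in data:
            parts = line.split()
            if len(parts) != 2:  # assumption: ignore blank or malformed lines
                continue
            # setdefault creates the list the first time a key is seen.
            groups.setdefault(parts[0], []).append(parts[1])
    return groups


# Demo with a temporary file standing in for data.txt.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write('10  1\n10  5\n10  3\n11  5\n11  4\n12  6\n12  2\n')

result = group_pairs(tmp.name)
os.remove(tmp.name)

for key, values in result.items():
    print(key, ' '.join(values))
# 10 1 5 3
# 11 5 4
# 12 6 2
```

Storing lists rather than space-joined strings avoids the later v.split() step and the trailing space in each value.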

