Grouping columns by unique values in Python

Question

I have a dataset with two columns and I need to change it from this format:

10  1
10  5
10  3
11  5
11  4
12  6
12  2

to this:

10  1  5  3
11  5  4
12  6  2

I need each unique value in the first column to be on its own row.

I am a beginner with Python and, beyond reading in my text file, I don't know how to proceed.

Answer 1

You can use a Pandas dataframe.

import pandas as pd

df = pd.DataFrame({'A':[10,10,10,11,11,12,12],'B':[1,5,3,5,4,6,2]})
print(df)

Output:

    A  B
0  10  1
1  10  5
2  10  3
3  11  5
4  11  4
5  12  6
6  12  2

Let's use groupby and join:

df.groupby('A')['B'].apply(lambda x:' '.join(x.astype(str)))

Output:

A
10    1 5 3
11      5 4
12      6 2
Name: B, dtype: object
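
To get from that Series to the text layout asked for in the question, one option is to iterate over its items and build one line per key. This is only a minimal sketch: it assumes two spaces as the separator, and the names grouped and lines are illustrative rather than part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'A': [10, 10, 10, 11, 11, 12, 12],
                   'B': [1, 5, 3, 5, 4, 6, 2]})

# One string of B values per unique A, joined with two spaces (assumed separator).
grouped = df.groupby('A')['B'].apply(lambda x: '  '.join(x.astype(str)))

# Series.items() yields (key, joined_values) pairs; groupby sorts keys by default.
lines = ['{}  {}'.format(key, values) for key, values in grouped.items()]
print('\n'.join(lines))
```

The list of lines could then be written to a file with an ordinary open(..., 'w') block.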

Answer 2

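An example using only itertools.groupby; this is all in the Python standard library (although the pandas version is more concise!).

Assuming the keys you want to group by are adjacent, this can all be done lazily (without ever needing to hold all the data in memory at once):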

from io import StringIO
from itertools import groupby

text = '''10  1
10  5
10  3
11  5
11  4
12  6
12  2'''

# read and group data:
with StringIO(text) as file:
    keys = []
    res = {}

    data = (line.strip().split() for line in file)

    for k, g in groupby(data, key=lambda x: x[0]):
        keys.append(k)
        res[k] = [item[1] for item in g]

print(keys)  # ['10', '11', '12']
print(res)   # {'10': ['1', '5', '3'], '11': ['5', '4'], '12': ['6', '2']}

# write grouped data:
with StringIO() as out_file:
    for key in keys:
        out_file.write('{:3s}'.format(key))
        out_file.write(' '.join(['{:3s}'.format(item) for item in res[key]]))
        out_file.write('\n')
    print(out_file.getvalue())
    # 10 1   5   3
    # 11 5   4
    # 12 6   2

You can then replace with StringIO(text) as file: with with open('infile.txt', 'r') as file: so that the program reads your actual file (and likewise use open('outfile.txt', 'w') for the output file).

Again: you could of course write to the output file directly every time a key is found; that way you never need to hold all the data in memory:

with StringIO(text) as file, StringIO() as out_file:

    data = (line.strip().split() for line in file)

    for k, g in groupby(data, key=lambda x: x[0]):
        out_file.write('{:3s}'.format(k))
        out_file.write(' '.join(['{:3s}'.format(item[1]) for item in g]))
        out_file.write('\n')

    print(out_file.getvalue())
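
Both versions rely on equal keys being adjacent, because itertools.groupby starts a new group whenever the key changes. If the input is not ordered that way, one option is to sort the rows by key first, at the cost of loading everything into memory. A minimal sketch with made-up unsorted sample data (rows and grouped are illustrative names):

```python
from itertools import groupby
from operator import itemgetter

# Made-up sample: same values as in the question, but with shuffled rows.
text = '''11  5
10  1
12  6
10  5
11  4
10  3
12  2'''

rows = [line.split() for line in text.splitlines()]

# list.sort() is stable, so values keep their original order within each key.
# Keys are compared as integers so that e.g. 9 would sort before 10.
rows.sort(key=lambda row: int(row[0]))

# After sorting, equal keys are adjacent and groupby sees one group per key.
grouped = {k: [row[1] for row in g] for k, g in groupby(rows, key=itemgetter(0))}

for key, values in grouped.items():
    print(key, *values)
```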

Answer 3

Using the collections.defaultdict subclass:

import collections
with open('yourfile.txt', 'r') as f:
    d = collections.defaultdict(list)
    for k,v in (l.split() for l in f.read().splitlines()):  # processing each line
        d[k].append(v)             # accumulating values for the same 1st column
    for k,v in sorted(d.items()):  # outputting grouped sequences
        print('%s  %s' % (k,'  '.join(v)))

Output:

10  1  5  3
11  5  4
12  6  2
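
Two details of the snippet above may matter for real files: a blank line makes the k, v unpacking fail, and sorted(d.items()) compares the keys as strings, so '10' would come before '9'. The following is a sketch of one way to handle both, not the answer's own code; group_lines is a hypothetical helper and the sample data is made up.

```python
import collections
from io import StringIO

def group_lines(lines):
    """Group second-column values by first-column key (hypothetical helper)."""
    d = collections.defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:   # skip blank or malformed lines
            continue
        k, v = parts
        d[k].append(v)
    return d

# StringIO stands in for the opened file; note the blank line in the middle.
sample = StringIO('9  7\n10  1\n10  5\n\n9  8\n10  3\n')
d = group_lines(sample)

# Sort keys numerically; compared as strings, '10' would sort before '9'.
ordered = sorted(d.items(), key=lambda kv: int(kv[0]))
for k, v in ordered:
    print('%s  %s' % (k, '  '.join(v)))
```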

Answer 4

It may be easier with pandas. You can use the read_csv function to read a txt file whose data is separated by one or more whitespace characters.

import pandas as pd

df = pd.read_csv("input.txt", header=None, delimiter=r"\s+")
# setting column names
df.columns = ['col1', 'col2']
df

This gives the dataframe output:

   col1  col2
0    10     1
1    10     5
2    10     3
3    11     5
4    11     4
5    12     6
6    12     2

After reading the txt file into a dataframe, and similar to the apply in the other answer above, you can also use aggregate and join:

df_combine = df.groupby('col1')['col2'].agg(lambda col: ' '.join(col.astype('str'))).reset_index()
df_combine

Output:

   col1   col2
0    10  1 5 3
1    11    5 4
2    12    6 2

Answer 5

I found this solution using dictionaries:

with open("data.txt", encoding='utf-8') as data:
    file = data.readlines()

    dic = {}
    for line in file:
        list1 = line.split()
        try:
            dic[list1[0]] += list1[1] + ' '
        except KeyError:
            dic[list1[0]] = list1[1] + ' '

    for k,v in dic.items():
        print(k,v)

OUTPUT

10 1 5 3
11 5 4
12 6 2

Something more functional:

def getdata(datafile):
    with open(datafile, encoding='utf-8') as data:
        file = data.readlines()

    dic = {}
    for line in file:
        list1 = line.split()
        try:
            dic[list1[0]] += list1[1] + ' '
        except KeyError:
            dic[list1[0]] = list1[1] + ' '

    for k,v in dic.items():
        v = v.split()
        print(k, ':',v)

getdata("data.txt")

OUTPUT

10 : ['1', '5', '3']
11 : ['5', '4']
12 : ['6', '2']
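
Because the code above stores each group as a space-terminated string, the second version has to split it again before printing. A possible variation keeps a list per key from the start using dict.setdefault. This is a sketch only: group_pairs is a hypothetical name, and the in-memory sample lines stand in for the contents of data.txt.

```python
def group_pairs(lines):
    """Return {first column: [second-column values]} (hypothetical helper)."""
    dic = {}
    for line in lines:
        list1 = line.split()
        if not list1:          # ignore empty lines
            continue
        # setdefault returns the existing list, or stores and returns a new one
        dic.setdefault(list1[0], []).append(list1[1])
    return dic

# Sample lines as readlines() would return them for the question's data.
sample = ['10  1\n', '10  5\n', '10  3\n', '11  5\n', '11  4\n', '12  6\n', '12  2\n']
dic = group_pairs(sample)

for k, v in dic.items():
    print(k, ' '.join(v))
```

On Python 3.7 and later, a dict keeps insertion order, so the keys are printed in the order they first appear in the input.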

5 solutions

Solution 1
3 Accepted 2017-06-17 15:52:56

Solution 2
1 2017-06-17 16:18:07

Solution 3
1 2017-06-17 16:21:13

Solution 4
0 2017-06-17 16:42:03

Solution 5
0 2017-06-18 04:45:39
