Grouping columns by unique values in Python

I have a data set with two columns and I need to change it from this format:

10  1 
10  5
10  3
11  5
11  4
12  6
12  2

to this

10  1  5  3
11  5  4
12  6  2

I need every unique value in the first column to be on its own row.

I am a beginner with Python and beyond reading in my text file, I'm at a loss for how to proceed.

You can use a pandas DataFrame.

import pandas as pd

df = pd.DataFrame({'A':[10,10,10,11,11,12,12],'B':[1,5,3,5,4,6,2]})
print(df)

Output:

    A  B
0  10  1
1  10  5
2  10  3
3  11  5
4  11  4
5  12  6
6  12  2

Then use groupby and join:

df.groupby('A')['B'].apply(lambda x:' '.join(x.astype(str)))

Output:

A
10    1 5 3
11      5 4
12      6 2
Name: B, dtype: object

An example using only itertools.groupby, which is part of the Python standard library (although the pandas version is far more concise).

Assuming the keys you want to group are adjacent, this can all be done lazily, without ever holding the whole data set in memory:

from io import StringIO
from itertools import groupby

text = '''10  1
10  5
10  3
11  5
11  4
12  6
12  2'''

# read and group data:
with StringIO(text) as file:
    keys = []
    res = {}

    data = (line.strip().split() for line in file)

    for k, g in groupby(data, key=lambda x: x[0]):
        keys.append(k)
        res[k] = [item[1] for item in g]

print(keys)  # ['10', '11', '12']
print(res)   # {'10': ['1', '5', '3'], '11': ['5', '4'], '12': ['6', '2']}

# write grouped data:
with StringIO() as out_file:
    for key in keys:
        out_file.write('{:3s}'.format(key))
        out_file.write(' '.join(['{:3s}'.format(item) for item in res[key]]))
        out_file.write('\n')
    print(out_file.getvalue())
    # 10 1   5   3
    # 11 5   4
    # 12 6   2

You can then replace with StringIO(text) as file: with something like with open('infile.txt', 'r') as file: so that the program reads your actual file (and similarly use with open('outfile.txt', 'w') for the output file).

Alternatively, you can write directly to the output file each time a key is found; that way the data never has to be held in memory all at once:

with StringIO(text) as file, StringIO() as out_file:

    data = (line.strip().split() for line in file)

    for k, g in groupby(data, key=lambda x: x[0]):
        out_file.write('{:3s}'.format(k))
        out_file.write(' '.join(['{:3s}'.format(item[1]) for item in g]))
        out_file.write('\n')

    print(out_file.getvalue())
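
The adjacency assumption above matters, because itertools.groupby only merges consecutive items that share a key. Here is a minimal sketch of what happens with unsorted input, and of sorting first as one possible workaround (at the cost of holding all rows in memory). The rows and the helper name first are made up for illustration:

```python
from itertools import groupby

# hypothetical rows in which the key '10' appears in two separate runs
rows = [('10', '1'), ('11', '5'), ('10', '3')]

def first(row):
    return row[0]

# groupby starts a new group every time the key changes
unsorted_groups = [(k, [v for _, v in g]) for k, g in groupby(rows, key=first)]

# sorting by key first brings equal keys together (sorted is stable,
# so the values keep their original relative order)
sorted_groups = [(k, [v for _, v in g])
                 for k, g in groupby(sorted(rows, key=first), key=first)]

print(unsorted_groups)  # key '10' shows up twice
print(sorted_groups)    # key '10' shows up once
```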

Using collections.defaultdict (a dict subclass):

import collections
with open('yourfile.txt', 'r') as f:
    d = collections.defaultdict(list)
    for k,v in (l.split() for l in f.read().splitlines()):  # processing each line
        d[k].append(v)             # accumulating values for the same 1st column
    for k,v in sorted(d.items()):  # outputting grouped sequences
        print('%s  %s' % (k,'  '.join(v)))

The output:

10  1  5  3
11  5  4
12  6  2
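
One thing to watch with the defaultdict version: the keys are strings, so sorted(d.items()) orders them lexicographically. That works for the sample data, but a key such as '9' would sort after '10'. A small sketch of the difference, assuming every key is an integer written as text (the sample lines are invented for illustration):

```python
import collections

# hypothetical input lines with keys of different lengths
lines = ['9  7', '10  1', '10  5', '100  2']

d = collections.defaultdict(list)
for k, v in (line.split() for line in lines):
    d[k].append(v)

# default string ordering versus ordering by integer value
lexicographic = [k for k, _ in sorted(d.items())]
numeric = [k for k, _ in sorted(d.items(), key=lambda kv: int(kv[0]))]

print(lexicographic)  # ['10', '100', '9']
print(numeric)        # ['9', '10', '100']
```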

Using pandas may be easier. You can use the read_csv function to read a text file whose columns are separated by one or more spaces.

import pandas as pd

df = pd.read_csv("input.txt", header=None, sep=r"\s+")  # raw string avoids an invalid-escape warning
# setting column names
df.columns = ['col1', 'col2']
df

This gives the following DataFrame:

   col1  col2
0    10     1
1    10     5
2    10     3
3    11     5
4    11     4
5    12     6
6    12     2

After reading the text file into a DataFrame, you can also use agg with join, similar to the apply approach in the other answer:

df_combine = df.groupby('col1')['col2'].agg(lambda col: ' '.join(col.astype('str'))).reset_index()
df_combine

Output:

   col1   col2
0    10  1 5 3
1    11    5 4
2    12    6 2
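
If the goal is the text layout from the question rather than a DataFrame, the grouped values can be joined into lines. This is a sketch that assumes the input looks like the sample data. Here io.StringIO stands in for the real file, the names sample, grouped and lines are chosen for illustration, and sort=False keeps the keys in order of first appearance:

```python
import io
import pandas as pd

# stand-in for the contents of input.txt
sample = "10  1\n10  5\n10  3\n11  5\n11  4\n12  6\n12  2\n"

df = pd.read_csv(io.StringIO(sample), header=None, sep=r"\s+",
                 names=["col1", "col2"])

# collect the col2 values of each col1 key into a list
grouped = df.groupby("col1", sort=False)["col2"].agg(list)

# one output line per key: the key followed by its values
lines = ["  ".join(str(x) for x in [key, *values])
         for key, values in grouped.items()]

print("\n".join(lines))
```

The resulting lines could then be written out with an ordinary open(..., 'w') and write call.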

I found this solution using dictionaries:

with open("data.txt", encoding='utf-8') as data:
    file = data.readlines()

    dic = {}
    for line in file:
        list1 = line.split()
        try:
            dic[list1[0]] += list1[1] + ' '
        except KeyError:
            dic[list1[0]] = list1[1] + ' '

    for k,v in dic.items():
        print(k,v)

Output:

10 1 5 3
11 5 4
12 6 2

The same logic wrapped in a function:

def getdata(datafile):
    with open(datafile, encoding='utf-8') as data:
        file = data.readlines()

    dic = {}
    for line in file:
        list1 = line.split()
        try:
            dic[list1[0]] += list1[1] + ' '
        except KeyError:
            dic[list1[0]] = list1[1] + ' '

    for k,v in dic.items():
        v = v.split()
        print(k, ':',v)

getdata("data.txt")

Output (on Python 3.7+, where dictionaries keep insertion order):

10 : ['1', '5', '3']
11 : ['5', '4']
12 : ['6', '2']
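
A possible variation on the function above returns the grouped data instead of printing it, and stores lists rather than space-terminated strings. This is only a sketch: group_lines is a name chosen here for illustration, and it accepts any iterable of lines (an open file works too) so that it can be tried without data.txt:

```python
def group_lines(lines):
    """Group the second column of each line by the first column."""
    groups = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:   # skip blank or incomplete lines
            continue
        # setdefault creates the list the first time a key is seen
        groups.setdefault(parts[0], []).append(parts[1])
    return groups

# sample data from the question, plus one blank line to show it is skipped
sample = ['10  1', '10  5', '10  3', '', '11  5', '11  4', '12  6', '12  2']
groups = group_lines(sample)

for key, values in groups.items():
    print(key, *values, sep='  ')
```

With a real file, the call would look like with open('data.txt', encoding='utf-8') as data: groups = group_lines(data).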
