Grouping columns by unique values in Python
I have a dataset with two columns, and I need to change it from this format:
10 1
10 5
10 3
11 5
11 4
12 6
12 2
to this:
10 1 5 3
11 5 4
12 6 2
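I need each unique value in the first column to be on its own row.
I am a beginner with Python, and beyond reading in my text file I don't know how to proceed.
You can use a Pandas dataframe.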
import pandas as pd
df = pd.DataFrame({'A':[10,10,10,11,11,12,12],'B':[1,5,3,5,4,6,2]})
print(df)
Output:
A B
0 10 1
1 10 5
2 10 3
3 11 5
4 11 4
5 12 6
6 12 2
Let's use groupby and join:
df.groupby('A')['B'].apply(lambda x:' '.join(x.astype(str)))
Output:
A
10 1 5 3
11 5 4
12 6 2
Name: B, dtype: object
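Here is an example using only itertools.groupby; everything in it is in the Python standard library (although the pandas version is more concise!).
Assuming the keys you want to group by are adjacent, this can all be done lazily (without ever holding all of the data in memory at once):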
from io import StringIO
from itertools import groupby
text = '''10 1
10 5
10 3
11 5
11 4
12 6
12 2'''
# read and group data:
with StringIO(text) as file:
    keys = []
    res = {}
    data = (line.strip().split() for line in file)
    for k, g in groupby(data, key=lambda x: x[0]):
        keys.append(k)
        res[k] = [item[1] for item in g]
print(keys)  # ['10', '11', '12']
print(res)   # {'10': ['1', '5', '3'], '11': ['5', '4'], '12': ['6', '2']}
# write grouped data:
with StringIO() as out_file:
    for key in keys:
        out_file.write('{:3s}'.format(key))
        out_file.write(' '.join(['{:3s}'.format(item) for item in res[key]]))
        out_file.write('\n')
    print(out_file.getvalue())
    # 10 1   5   3
    # 11 5   4
    # 12 6   2
You can then replace with StringIO(text) as file: with with open('infile.txt', 'r') as file: so that the program reads your actual file (and similarly use open('outfile.txt', 'w') for the output file).
Again: you could of course write to the output file directly each time a key is found; that way you never need to hold all of the data in memory at once:
with StringIO(text) as file, StringIO() as out_file:
    data = (line.strip().split() for line in file)
    for k, g in groupby(data, key=lambda x: x[0]):
        out_file.write('{:3s}'.format(k))
        out_file.write(' '.join(['{:3s}'.format(item[1]) for item in g]))
        out_file.write('\n')
    print(out_file.getvalue())
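The adjacency assumption mentioned above matters in practice. The following is a minimal sketch, using made-up rows in which the key '10' appears in two separate runs, of what itertools.groupby does with unsorted keys and how sorting first fixes it. Note that sorting reads everything into memory, so the lazy behaviour is lost.

```python
from itertools import groupby

# Hypothetical input: the key '10' appears in two separate runs.
rows = [('10', '1'), ('11', '5'), ('10', '3'), ('11', '4')]

def first(row):
    return row[0]

# Without sorting, groupby starts a new group every time the key changes.
unsorted_groups = [(k, [v for _, v in g]) for k, g in groupby(rows, key=first)]

# Sorting by key makes equal keys adjacent. sorted() is stable, so the
# values keep their original relative order within each key.
sorted_rows = sorted(rows, key=first)
sorted_groups = [(k, [v for _, v in g]) for k, g in groupby(sorted_rows, key=first)]

print(unsorted_groups)  # four groups, '10' and '11' each appear twice
print(sorted_groups)    # two groups, one per key
```

If the input file is already ordered by its first column, as in the question, the sort is unnecessary.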
Another option is a collections.defaultdict:
import collections
with open('yourfile.txt', 'r') as f:
    d = collections.defaultdict(list)
    for k, v in (l.split() for l in f.read().splitlines()):  # processing each line
        d[k].append(v)  # accumulating values for the same 1st column
    for k, v in sorted(d.items()):  # outputting grouped sequences
        print('%s %s' % (k, ' '.join(v)))
Output:
10 1 5 3
11 5 4
12 6 2
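One detail of the defaultdict approach is worth spelling out: the keys are strings, so a plain sorted() orders them lexicographically and would place '9' after '10'. Below is a small sketch of the same idea wrapped in a function that sorts numerically instead. The name group_lines and the sample data are hypothetical, and it assumes every key is an integer.

```python
from collections import defaultdict

def group_lines(lines):
    """Group second-column values by first-column key (hypothetical helper)."""
    groups = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:  # skip blank or malformed lines
            continue
        key, value = parts
        groups[key].append(value)
    # Assumption: keys are integers, so sort them numerically.
    return sorted(groups.items(), key=lambda kv: int(kv[0]))

# Made-up sample with a one-digit key, a blank line and non-adjacent keys.
sample = ['9 7', '10 1', '10 5', '', '9 8']
grouped = group_lines(sample)
for key, values in grouped:
    print(key, ' '.join(values))
```

Because it accepts any iterable of lines, the function can also be passed an open file object directly.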
Using pandas may be easier. You can use the read_csv function to read the txt file, where the data is separated by one or more spaces.
import pandas as pd
df = pd.read_csv("input.txt", header=None, delimiter=r"\s+")
# setting column names
df.columns = ['col1', 'col2']
df
This gives the dataframe.
Output:
col1 col2
0 10 1
1 10 5
2 10 3
3 11 5
4 11 4
5 12 6
6 12 2
After reading the txt file into a dataframe, you can also use aggregate and join, similar to the apply approach in the earlier answer:
df_combine = df.groupby('col1')['col2'].agg(lambda col: ' '.join(col.astype('str'))).reset_index()
df_combine
Output:
col1 col2
0 10 1 5 3
1 11 5 4
2 12 6 2
I found this solution using dictionaries:
with open("data.txt", encoding='utf-8') as data:
    file = data.readlines()
dic = {}
for line in file:
    list1 = line.split()
    try:
        dic[list1[0]] += list1[1] + ' '
    except KeyError:
        dic[list1[0]] = list1[1] + ' '
for k, v in dic.items():
    print(k, v)
OUTPUT
10 1 5 3
11 5 4
12 6 2
Something more functional:
def getdata(datafile):
    with open(datafile, encoding='utf-8') as data:
        file = data.readlines()
    dic = {}
    for line in file:
        list1 = line.split()
        try:
            dic[list1[0]] += list1[1] + ' '
        except KeyError:
            dic[list1[0]] = list1[1] + ' '
    for k, v in dic.items():
        v = v.split()
        print(k, ':', v)
getdata("data.txt")
OUTPUT
10 : ['1', '5', '3']
11 : ['5', '4']
12 : ['6', '2']
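The dictionary approach above builds a space-separated string and splits it again afterwards. As a sketch of a possible simplification, the values can be stored in lists from the start with dict.setdefault and returned to the caller instead of printed. The name group_pairs is hypothetical, and the sample lines stand in for the contents of data.txt.

```python
def group_pairs(lines):
    """Return {key: [values]} in first-seen key order (hypothetical helper)."""
    groups = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:  # ignore empty or incomplete lines
            continue
        # setdefault creates the list the first time a key is seen.
        groups.setdefault(fields[0], []).append(fields[1])
    return groups

# Sample lines taken from the question, standing in for the file contents.
sample = ['10 1', '10 5', '10 3', '11 5', '11 4', '12 6', '12 2']
result = group_pairs(sample)
for key, values in result.items():
    print(key, ' '.join(values))
```

Returning the dictionary keeps the grouping separate from the output format, so the same result can be printed, written to a file, or processed further.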