Grouping columns by unique values in Python
I have a dataset with two columns, and I need to change it from this format:
10 1
10 5
10 3
11 5
11 4
12 6
12 2
to this:
10 1 5 3
11 5 4
12 6 2
I need each unique value in the first column to be on its own row.
I am a beginner with Python, and beyond reading in my text file I don't know how to proceed.
You can use a Pandas DataFrame.
import pandas as pd
df = pd.DataFrame({'A':[10,10,10,11,11,12,12],'B':[1,5,3,5,4,6,2]})
print(df)
Output:
    A  B
0  10  1
1  10  5
2  10  3
3  11  5
4  11  4
5  12  6
6  12  2
Let's use groupby and join:
df.groupby('A')['B'].apply(lambda x:' '.join(x.astype(str)))
Output:
A
10 1 5 3
11 5 4
12 6 2
Name: B, dtype: object
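An example using only itertools.groupby; this is all in the Python standard library (although the pandas version is far more concise!).
Assuming the keys you want to group by are adjacent, this can all be done lazily (without ever having to hold all the data in memory at once):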
from io import StringIO
from itertools import groupby

text = '''10 1
10 5
10 3
11 5
11 4
12 6
12 2'''

# read and group data:
with StringIO(text) as file:
    keys = []
    res = {}
    data = (line.strip().split() for line in file)
    for k, g in groupby(data, key=lambda x: x[0]):
        keys.append(k)
        res[k] = [item[1] for item in g]

print(keys)  # ['10', '11', '12']
print(res)   # {'10': ['1', '5', '3'], '11': ['5', '4'], '12': ['6', '2']}

# write grouped data:
with StringIO() as out_file:
    for key in keys:
        out_file.write('{:3s}'.format(key))
        out_file.write(' '.join(['{:3s}'.format(item) for item in res[key]]))
        out_file.write('\n')
    print(out_file.getvalue())
    # 10 1   5   3
    # 11 5   4
    # 12 6   2
You can then replace the line "with StringIO(text) as file:" with "with open('infile.txt', 'r') as file:" to read your actual file (and similarly use open('outfile.txt', 'w') for the output file).
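Again: of course, you could write directly to the output file each time a key is found; that way you never need to hold all the data in memory at once: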
with StringIO(text) as file, StringIO() as out_file:
    data = (line.strip().split() for line in file)
    for k, g in groupby(data, key=lambda x: x[0]):
        out_file.write('{:3s}'.format(k))
        out_file.write(' '.join(['{:3s}'.format(item[1]) for item in g]))
        out_file.write('\n')
    print(out_file.getvalue())
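The adjacency assumption mentioned above matters in practice. Here is a minimal sketch of what happens when equal keys are not next to each other, and one possible workaround (sorting first). The sample rows and the helper name group_adjacent are made up for illustration only.

```python
from itertools import groupby

# Hypothetical sample: the key '10' appears in two separate runs.
rows = [("10", "1"), ("11", "5"), ("10", "3")]

def group_adjacent(pairs):
    """Group (key, value) pairs the way itertools.groupby does: by runs of equal keys."""
    return [(k, [v for _, v in g]) for k, g in groupby(pairs, key=lambda p: p[0])]

# Without sorting, '10' shows up as two separate groups.
unsorted_result = group_adjacent(rows)
print(unsorted_result)

# Sorting by key first makes equal keys adjacent (sorted() is stable,
# so values keep their original relative order within each key).
sorted_result = group_adjacent(sorted(rows, key=lambda p: p[0]))
print(sorted_result)
```

Note that sorting has to load all rows into memory, so it gives up the lazy behaviour; if the input is not already grouped, a dictionary-based approach may be the simpler choice.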
import collections

with open('yourfile.txt', 'r') as f:
    d = collections.defaultdict(list)
    for k,v in (l.split() for l in f.read().splitlines()):  # processing each line
        d[k].append(v)  # accumulating values for the same 1st column
    for k,v in sorted(d.items()):  # outputting grouped sequences
        print('%s %s' % (k,' '.join(v)))
Output:
10 1 5 3
11 5 4
12 6 2
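One thing to watch with sorted(d.items()): the keys are strings here, so they sort lexicographically rather than numerically. That makes no difference for 10, 11 and 12, but it would if the keys had different numbers of digits. A small sketch with made-up sample lines, assuming the first column always holds integers:

```python
from collections import defaultdict

# Hypothetical input where the keys have different lengths.
lines = ["9 7", "10 1", "10 5", "9 2"]

d = defaultdict(list)
for k, v in (l.split() for l in lines):
    d[k].append(v)

# String comparison puts '10' before '9'.
string_order = [k for k, _ in sorted(d.items())]

# Converting the key to int for the comparison gives numeric order.
numeric_order = [k for k, _ in sorted(d.items(), key=lambda kv: int(kv[0]))]

print(string_order)
print(numeric_order)
```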
Using pandas may be easier. You can use the read_csv function to read a txt file in which the data is separated by one or more spaces.
import pandas as pd
df = pd.read_csv("input.txt", header=None, delimiter=r"\s+")
# setting column names
df.columns = ['col1', 'col2']
df
This gives the following dataframe as output:
   col1  col2
0    10     1
1    10     5
2    10     3
3    11     5
4    11     4
5    12     6
6    12     2
After reading the txt file into a dataframe, similar to the apply used in the other answer above, you can also use aggregate and join:
df_combine = df.groupby('col1')['col2'].agg(lambda col: ' '.join(col.astype('str'))).reset_index()
df_combine
Output:
   col1   col2
0    10  1 5 3
1    11    5 4
2    12    6 2
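If you want the result as plain text lines in the format from the question, rather than as a dataframe, something like the following sketch might work. It reads from an in-memory string (via StringIO) in place of input.txt so that it is self-contained, and it aggregates to lists instead of joined strings; both choices are just for illustration.

```python
from io import StringIO

import pandas as pd

# Stand-in for the contents of input.txt (sample data from the question).
text = """10 1
10 5
10 3
11 5
11 4
12 6
12 2"""

df = pd.read_csv(StringIO(text), header=None, sep=r"\s+", names=["col1", "col2"])

# Collect the col2 values of each col1 group into a list;
# sort=False keeps the groups in order of first appearance.
grouped = df.groupby("col1", sort=False)["col2"].agg(list)

# Build one text line per group: key followed by its values.
lines = [f"{key} {' '.join(str(v) for v in values)}" for key, values in grouped.items()]
print("\n".join(lines))
```

The lines list can then be written to a file with an ordinary open(..., 'w') and write call.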
I found this solution using dictionaries:
with open("data.txt", encoding='utf-8') as data:
    file = data.readlines()

dic = {}
for line in file:
    list1 = line.split()
    try:
        dic[list1[0]] += list1[1] + ' '
    except KeyError:
        dic[list1[0]] = list1[1] + ' '

for k,v in dic.items():
    print(k,v)
OUTPUT
10 1 5 3
11 5 4
12 6 2
Something more functional:
def getdata(datafile):
    with open(datafile, encoding='utf-8') as data:
        file = data.readlines()
    dic = {}
    for line in file:
        list1 = line.split()
        try:
            dic[list1[0]] += list1[1] + ' '
        except KeyError:
            dic[list1[0]] = list1[1] + ' '
    for k,v in dic.items():
        v = v.split()
        print(k, ':',v)

getdata("data.txt")
OUTPUT
11 : ['5', '4']
12 : ['6', '2']
10 : ['1', '5', '3']
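A possible variation on the same idea: store the values in lists from the start (so there is no string concatenation followed by a split), and return the dictionary instead of printing inside the function. This is only a sketch; the function name group_pairs is made up, and skipping blank or incomplete lines is an assumption about what you would want.

```python
def group_pairs(lines):
    """Map each first-column value to the list of its second-column values."""
    groups = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # assumption: silently skip blank or incomplete lines
        # setdefault creates the empty list the first time a key is seen
        groups.setdefault(parts[0], []).append(parts[1])
    return groups

# Sample data from the question, with one blank line added for illustration.
sample = ["10 1", "10 5", "", "10 3", "11 5", "11 4", "12 6", "12 2"]
grouped = group_pairs(sample)

for key, values in grouped.items():
    print(key, " ".join(values))
```

Because the function takes any iterable of lines, you can pass it an open file object directly, for example group_pairs(open("data.txt", encoding="utf-8")) inside a with block.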