How to merge elements from multiple lists with the same ID in Python?
I have a text file with more than 670,000 lines that I need to process. Each line has the format:
uid, a, b, c, d, x, y, x1, y1, t, 0,
I did some cleaning and converted each line to a list:
[uid,(x,y,t)]
My question is: how can I efficiently merge the (x,y,t) tuples from different lists that share a common uid?
For example, I have multiple lists:
[uid1,(x1,y1,t1)]
[uid1,(x2,y2,t2)]
[uid2,(x3,y3,t3)]
[uid3,(x4,y4,t4)]
[uid2,(x5,y5,t5)]
......
And I want to transform them into:
[uid1,(x1,y1,t1), (x2,y2,t2)]
[uid2,(x3,y3,t3), (x5,y5,t5)]
[uid3,(x4,y4,t4)]
......
Any help would be really appreciated.
You can use the groupby function from itertools. Assuming your original lists are in a variable called lists:
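from itertools import groupby
# groupby only groups consecutive items, so sort by uid first
lists = sorted(lists, key=lambda x: x[0])
grouped_list = groupby(lists, key=lambda x: x[0])
# Build a (uid, [tuples...]) pair from each group
grouped_list = [(uid, [item[1] for item in group]) for uid, group in grouped_list]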
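As a quick check of the groupby approach, here is a minimal sketch that runs end to end. The uids and numbers are made-up sample values (an assumption, not data from the question), and the rows are built in the [uid, tuple, tuple, ...] shape that the question asks for.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample data in the [uid, (x, y, t)] shape from the question.
lists = [
    ["uid1", (1, 2, 10)],
    ["uid1", (3, 4, 20)],
    ["uid2", (5, 6, 30)],
    ["uid3", (7, 8, 40)],
    ["uid2", (9, 0, 50)],
]

# groupby only merges adjacent items, so sort by uid first.
# sorted() is stable, so tuples keep their original order within each uid.
by_uid = sorted(lists, key=itemgetter(0))

merged = [
    [uid] + [item[1] for item in group]
    for uid, group in groupby(by_uid, key=itemgetter(0))
]

print(merged)
```

Note that the sort makes this O(n log n), which should still be fine for a few hundred thousand rows.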
Just use a defaultdict.
import collections

def group_items(items):
    grouped_dict = collections.defaultdict(list)
    for item in items:
        uid = item[0]
        t = item[1]
        grouped_dict[uid].append(t)
    grouped_list = []
    # .items() works on Python 3; the original .iteritems() is Python 2 only
    for uid, tuples in grouped_dict.items():
        grouped_list.append([uid] + tuples)
    return grouped_list
items is a list of your initial lists. grouped_list will be a list of the lists grouped by uid.
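For illustration, here is a compact, self-contained sketch of the same defaultdict idea together with a usage example. The function name group_by_uid and the sample values are assumptions made for this example, not part of the original answer.

```python
from collections import defaultdict


def group_by_uid(items):
    """Merge [uid, (x, y, t)] rows into [uid, (x, y, t), ...] rows in one pass."""
    grouped = defaultdict(list)
    for uid, point in items:
        grouped[uid].append(point)
    # dicts preserve insertion order on Python 3.7+, so uids come out
    # in the order in which they were first seen.
    return [[uid, *points] for uid, points in grouped.items()]


# Hypothetical sample data in the shape described in the question.
items = [
    ["uid1", (1, 2, 10)],
    ["uid1", (3, 4, 20)],
    ["uid2", (5, 6, 30)],
    ["uid3", (7, 8, 40)],
    ["uid2", (9, 0, 50)],
]

result = group_by_uid(items)
print(result)
```

Because this makes a single pass and needs no sort, it should scale roughly linearly with the number of rows.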
If your data is stored in a dataframe, you can use .groupby to group by the 'uid', and if you wrap each (x,y,t) value in a one-element tuple ((x,y,t),), you can .sum them (i.e. concatenate them).
Here's an example:
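import pandas as pd

df = pd.DataFrame.from_records(
    [['a', (1, 2, 3)],
     ['b', (1, 2, 3)],
     ['a', (10, 9, 8)]], columns=['uid', 'foo']
)
# Wrap each 'foo' tuple in a one-element tuple so that .sum() concatenates them
df.apply({'uid': lambda x: x, 'foo': lambda x: (x,)}).groupby('uid').sum()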
On my end, it produced:
uid foo
a ((1, 2, 3), (10, 9, 8))
b ((1, 2, 3),)
How about using defaultdict, like this:
L = [['uid1',(x1,y1,t1)],
['uid1',(x2,y2,t2)],
['uid2',(x3,y3,t3)],
['uid3',(x4,y4,t4)],
['uid2',(x5,y5,t5)]]
from collections import defaultdict
dd = defaultdict(list)
for i in L:
    dd[i[0]].append(i[1])
The output of print(dd):
defaultdict(list,
{'uid1': [(x1, y1, t1), (x2, y2, t2)],
'uid2': [(x3, y3, t3), (x5, y5, t5)],
'uid3': [(x4, y4, t4)]})
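Since the question starts from a large text file, the same defaultdict idea can be applied while reading, without building the intermediate [uid,(x,y,t)] lists first. The sketch below is only an illustration under stated assumptions: the field positions (uid at index 0, x at 5, y at 6, t at 9) are inferred from the line format given in the question, the sample lines are made up, and io.StringIO stands in for the real file.

```python
import io
from collections import defaultdict

# Stand-in for the real text file; the values are made up for illustration.
sample_file = io.StringIO(
    "u1, a, b, c, d, 1.0, 2.0, 9, 9, 100, 0,\n"
    "u2, a, b, c, d, 3.0, 4.0, 9, 9, 101, 0,\n"
    "u1, a, b, c, d, 5.0, 6.0, 9, 9, 102, 0,\n"
)

dd = defaultdict(list)
for line in sample_file:
    fields = [f.strip() for f in line.split(",")]
    if len(fields) < 10:
        continue  # skip blank or malformed lines
    # Assumed positions: uid=0, x=5, y=6, t=9 (from the stated line format).
    uid, x, y, t = fields[0], fields[5], fields[6], fields[9]
    dd[uid].append((x, y, t))  # values are kept as strings here

# Convert the dict into the [uid, (x, y, t), ...] rows the question asks for.
merged = [[uid, *points] for uid, points in dd.items()]
print(merged)
```

With a real file, the StringIO object would be replaced by an open() call on the file path, and the fields could be converted to numbers if needed.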