Row-wise difference between two lists in pandas
I am using pandas to incrementally find new elements, i.e. for every row, I check whether the values in its list have been seen before. If they have, we ignore them. If not, we select them.
I was able to do this using iterrows(), but I have >1M rows, so I believe a vectorized apply might be better.
Here's sample data and code. Once you run this code, you will get the expected output:
import collections.abc
import pandas as pd
from numpy import nan as NA

df = pd.DataFrame({'ID':['A','B','C','A','B','A','A','A','D','E','E','E'],
                   'Value': [1,2,3,4,3,5,2,3,7,2,3,9]})
#wrap all elements by group in a list
Changed_df=df.groupby('ID')['Value'].apply(list).reset_index()
Changed_df=Changed_df.rename(columns={'Value' : 'Elements'})
Changed_df=Changed_df.reset_index(drop=True)
def flatten(l):
    for el in l:
        #collections.abc.Iterable replaces the deprecated collections.Iterable alias
        if isinstance(el, collections.abc.Iterable) and not isinstance(el, (str, bytes)):
            yield from flatten(el)
        else:
            yield el
Changed_df["Elements_s"]=Changed_df['Elements'].shift()
#attempt 1: For loop
Changed_df["Diff"]=NA
Changed_df["count"]=0
Elements_so_far = []
#replace NA with empty list in columns that will go through list operations
for col in ["Elements","Elements_s","Diff"]:
    Changed_df[col] = Changed_df[col].apply(lambda d: d if isinstance(d, list) else [])
for idx,row in Changed_df.iterrows():
    diff = list(set(row['Elements']) - set(Elements_so_far))
    Changed_df.at[idx, "Diff"] = diff
    Elements_so_far.append(row['Elements'])
    Elements_so_far = flatten(Elements_so_far)
    Elements_so_far = list(set(Elements_so_far)) #keep unique elements
    Changed_df.loc[idx,"count"]=len(diff)
Commentary about the code: One source of inefficiency is the Elements_s column, which holds shifted values. Another is the for loop through rows. Elements_so_far keeps track of all the elements we have discovered for every row. If a new element shows up, we record it in the Diff column. The count column holds the number of new elements discovered. I'd appreciate it if an expert could help me with a vectorized version of the code.
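For reference, the bookkeeping described in the commentary can be sketched without flatten or the shifted column by keeping a running set. This is only a minimal sketch in plain Python: the function name new_elements_per_row and the hand-written rows list (the per-ID lists from the sample data) are assumptions for illustration, and unlike the set-difference version it keeps new values in the order they appear within the row.

```python
def new_elements_per_row(rows):
    """Return, for each row, the values not seen in any earlier row."""
    seen = set()   # every value discovered so far
    diffs = []
    for elements in rows:
        fresh = []
        for value in elements:
            if value not in seen:
                seen.add(value)       # remember it for later rows
                fresh.append(value)   # first time we meet this value
        diffs.append(fresh)
    return diffs


# Per-ID lists from the sample data, in ID order (A, B, C, D, E).
rows = [[1, 4, 5, 2, 3], [2, 3], [3], [7], [2, 3, 9]]
diffs = new_elements_per_row(rows)
counts = [len(d) for d in diffs]
```

This is still a Python-level loop, so it is not a vectorized answer; it only removes the repeated flatten/set rebuilding from each iteration.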
I did try a vectorized version, but I couldn't get very far.
#attempt 2:
Changed_df.apply(lambda x: [i for i in x['Elements'] if i in x['Elements_s']], axis=1)
I was inspired by "How to compare two columns both with list of strings and create a new column with unique items?" to try the above, but I couldn't make it work. The linked SO thread does a row-wise difference among columns.
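To illustrate what a row-wise difference between two list columns looks like, and why it is not enough here, the following is a rough sketch. The frame is hand-built from the sample data, and the RowDiff column name is hypothetical. Each row is compared only with the previous row's list (Elements_s), not with everything seen so far.

```python
import pandas as pd

frame = pd.DataFrame({
    'Elements':   [[1, 4, 5, 2, 3], [2, 3], [3], [7], [2, 3, 9]],
    'Elements_s': [[], [1, 4, 5, 2, 3], [2, 3], [3], [7]],
})

# Row-wise difference: values in Elements that are absent from Elements_s.
row_diff = [
    [v for v in cur if v not in set(prev)]
    for cur, prev in zip(frame['Elements'], frame['Elements_s'])
]
frame['RowDiff'] = pd.Series(row_diff, index=frame.index)
```

On the last row this yields [2, 3, 9] rather than [9], because 2 and 3 were seen in earlier rows but not in the immediately preceding one.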
I am using Python 3.6.7 by Anaconda. Pandas version is 0.23.4.
You could sort the frame, then use numpy's unique to get the indexes of first occurrences, and then construct your groupings, e.g.:
In []:
import numpy as np
df = df.sort_values(by='ID').reset_index(drop=True)
_, i = np.unique(df.Value.values, return_index=True)
df.iloc[i].groupby(df.ID).Value.apply(list)
Out[]:
ID
A [1, 2, 3, 4, 5]
D [7]
E [9]
Name: Value, dtype: object
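The key step is np.unique with return_index=True, which returns the position of the first occurrence of each distinct value. Here is a small standalone sketch. The values array is typed by hand and assumes the sort by ID kept the original row order within each ID.

```python
import numpy as np

# Values as they appear once the sample frame is sorted by ID
# (A: 1,4,5,2,3  B: 2,3  C: 3  D: 7  E: 2,3,9).
values = np.array([1, 4, 5, 2, 3, 2, 3, 3, 7, 2, 3, 9])

# uniq holds the sorted distinct values; first_idx holds, for each of
# them, the position where it first appears in `values`.
uniq, first_idx = np.unique(values, return_index=True)
# uniq      -> [1 2 3 4 5 7 9]
# first_idx -> [0 3 4 1 2 8 11]
```

Positions 0 to 4 belong to A, 8 to D and 11 to E, which is why only those three IDs show up in the grouped result.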
Or to get close to your current output:
In []:
df = df.sort_values(by='ID').reset_index(drop=True)
_, i = np.unique(df.Value.values, return_index=True)
s1 = df.groupby(df.ID).Value.apply(list).rename('Elements')
s2 = df.iloc[i].groupby(df.ID).Value.apply(list).rename('Diff').reindex(s1.index, fill_value=[])
pd.concat([s1, s2, s2.apply(len).rename('Count')], axis=1)
Out[]:
          Elements             Diff  Count
ID
A   [1, 4, 5, 2, 3]  [1, 2, 3, 4, 5]      5
B            [2, 3]               []      0
C               [3]               []      0
D               [7]              [7]      1
E         [2, 3, 9]              [9]      1
One alternative using drop_duplicates and groupby:
# Groupby and apply list func.
df1 = df.groupby('ID')['Value'].apply(list).to_frame('Elements')
# Sort values , drop duplicates by Value column then use groupby.
df1['Diff'] = df.sort_values(['ID','Value']).drop_duplicates('Value').groupby('ID')['Value'].apply(list)
# Use str.len for count.
df1['Count'] = df1['Diff'].str.len().fillna(0).astype(int)
# To fill NaN with empty list
df1['Diff'] = df1.Diff.apply(lambda x: x if type(x)==list else [])
          Elements             Diff  Count
ID
A   [1, 4, 5, 2, 3]  [1, 2, 3, 4, 5]      5
B            [2, 3]               []      0
C               [3]               []      0
D               [7]              [7]      1
E         [2, 3, 9]              [9]      1
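A possible variant of the same idea, sketched here under some assumptions, uses a duplicated() mask instead of drop_duplicates and a stable sort, so that Diff keeps the original order of values within each ID rather than sorted order. The helper name first_seen_by_group is hypothetical, and the sketch assumes IDs should be processed in sorted order, as groupby does in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['A', 'B', 'C', 'A', 'B', 'A', 'A', 'A', 'D', 'E', 'E', 'E'],
    'Value': [1, 2, 3, 4, 3, 5, 2, 3, 7, 2, 3, 9],
})


def first_seen_by_group(frame):
    # Stable sort: rows keep their original relative order inside each ID.
    ordered = frame.sort_values('ID', kind='mergesort')
    out = ordered.groupby('ID')['Value'].apply(list).to_frame('Elements')
    # duplicated() is False only on the first occurrence of each Value.
    fresh = ordered[~ordered['Value'].duplicated()]
    # Index-aligned assignment; IDs with no new values become NaN.
    out['Diff'] = fresh.groupby('ID')['Value'].apply(list)
    # Replace those NaN entries with empty lists.
    out['Diff'] = out['Diff'].apply(lambda x: x if isinstance(x, list) else [])
    out['Count'] = out['Diff'].apply(len)
    return out


result = first_seen_by_group(df)
```

With the sample data, Diff for A comes out as [1, 4, 5, 2, 3] instead of [1, 2, 3, 4, 5]; the counts are the same as in the output above.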