
Row-wise difference in two lists in pandas

I am using pandas to incrementally find out new elements, i.e., for every row, I check whether values in a list have been seen before. If they have, we ignore them. If not, we select them.

I was able to do this using iterrows(), but I have >1M rows, so I believe a vectorized apply might be better.

Here's sample data and code. Once you run this code, you will get the expected output:

import pandas as pd
from numpy import nan as NA
import collections

df = pd.DataFrame({'ID':['A','B','C','A','B','A','A','A','D','E','E','E'],
                   'Value': [1,2,3,4,3,5,2,3,7,2,3,9]})
#wrap all elements by group in a list
Changed_df=df.groupby('ID')['Value'].apply(list).reset_index() 
Changed_df=Changed_df.rename(columns={'Value' : 'Elements'})
Changed_df=Changed_df.reset_index(drop=True)



def flatten(l):
    for el in l:
        if isinstance(el, collections.abc.Iterable) and not isinstance(el, (str, bytes)):
            yield from flatten(el)
        else:
            yield el

Changed_df["Elements_s"]=Changed_df['Elements'].shift()

#attempt 1: For loop
Changed_df["Diff"]=NA
Changed_df["count"]=0
Elements_so_far = []

#replace NA with empty list in columns that will go through list operations
for col in ["Elements","Elements_s","Diff"]:
    Changed_df[col] = Changed_df[col].apply(lambda d: d if isinstance(d, list) else [])

for idx,row in Changed_df.iterrows():
    diff = list(set(row['Elements']) - set(Elements_so_far))
    Changed_df.at[idx, "Diff"] = diff
    Elements_so_far.append(row['Elements'])
    Elements_so_far = flatten(Elements_so_far)
    Elements_so_far = list(set(Elements_so_far)) #keep unique elements
    Changed_df.loc[idx, "count"] = len(diff)

Commentary about the code:

  • I am not a fan of this code because it's clunky and inefficient.
    • I say inefficient because I created Elements_s, which holds shifted values. Another reason for the inefficiency is the for loop through the rows.
  • Elements_so_far keeps track of all the elements we have discovered for every row. If a new element shows up, we count it in the Diff column.
  • We also keep track of the number of new elements discovered in the count column.

I'd appreciate it if an expert could help me with a vectorized version of the code.


I did try a vectorized version, but I couldn't get very far.

#attempt 2:
Changed_df.apply(lambda x: [i for i in x['Elements'] if i in x['Elements_s']], axis=1)
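One way to finish this attempt without an explicit row loop over the frame is to precompute the running union of previously seen elements with itertools.accumulate, then take a per-group set difference. This is only a sketch against the sample data above; the names grouped, cums, and seen_before are introduced here, not from the question:

```python
from itertools import accumulate

import pandas as pd

df = pd.DataFrame({'ID': ['A', 'B', 'C', 'A', 'B', 'A', 'A', 'A', 'D', 'E', 'E', 'E'],
                   'Value': [1, 2, 3, 4, 3, 5, 2, 3, 7, 2, 3, 9]})

# One row per ID, with that ID's values collected into a list
grouped = df.groupby('ID')['Value'].apply(list).reset_index()
grouped = grouped.rename(columns={'Value': 'Elements'})

# Cumulative union of elements, group by group
cums = list(accumulate((set(el) for el in grouped['Elements']), set.union))

# Elements seen strictly before each row: empty set for the first row,
# then the cumulative union up to the previous row
seen_before = [set()] + cums[:-1]

grouped['Diff'] = [sorted(set(el) - s)
                   for el, s in zip(grouped['Elements'], seen_before)]
grouped['count'] = grouped['Diff'].str.len()
```

This still builds one Python set per group, but it avoids both the shifted Elements_s column and per-row DataFrame writes.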

I was inspired by How to compare two columns both with list of strings and create a new column with unique items? to do the above, but I couldn't make it work. The linked SO thread does a row-wise difference among columns.

I am using Python 3.6.7 via Anaconda. The pandas version is 0.23.4.

You could sort the frame, then use numpy to get the unique indexes, and then construct your groupings, e.g.:

In []:
df = df.sort_values(by='ID').reset_index(drop=True)
_, i = np.unique(df.Value.values, return_index=True)
df.iloc[i].groupby(df.ID).Value.apply(list)

Out[]:
ID
A    [1, 2, 3, 4, 5]
D                [7]
E                [9]
Name: Value, dtype: object

Or, to get closer to your current output:

In []:
df = df.sort_values(by='ID').reset_index(drop=True)
_, i = np.unique(df.Value.values, return_index=True)
s1 = df.groupby(df.ID).Value.apply(list).rename('Elements')
s2 = df.iloc[i].groupby(df.ID).Value.apply(list).rename('Diff').reindex(s1.index, fill_value=[])

pd.concat([s1, s2, s2.apply(len).rename('Count')], axis=1)

Out[]:
           Elements             Diff  Count
ID
A   [1, 4, 5, 2, 3]  [1, 2, 3, 4, 5]      5
B            [2, 3]               []      0
C               [3]               []      0
D               [7]              [7]      1
E         [2, 3, 9]              [9]      1
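To see why the np.unique trick matches the "seen before" semantics: after sorting by ID, each distinct value's first occurrence necessarily falls inside the alphabetically first group that contains it, which is exactly the group where the question's loop would have counted it as new. A self-contained re-run of the snippet above (the names values, first_idx, and firsts are mine):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': ['A', 'B', 'C', 'A', 'B', 'A', 'A', 'A', 'D', 'E', 'E', 'E'],
                   'Value': [1, 2, 3, 4, 3, 5, 2, 3, 7, 2, 3, 9]})
df = df.sort_values(by='ID').reset_index(drop=True)

# For each distinct value, return_index gives the position of its first
# occurrence in the ID-sorted frame
values, first_idx = np.unique(df.Value.values, return_index=True)

# Keep only those first occurrences and group them back by ID;
# IDs whose values all appeared earlier simply drop out
firsts = df.iloc[first_idx]
result = firsts.groupby(firsts.ID).Value.apply(list)
```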

One alternative using drop_duplicates and groupby:

# Groupby and apply list func.
df1 = df.groupby('ID')['Value'].apply(list).to_frame('Elements')

# Sort values , drop duplicates by Value column then use groupby.
df1['Diff'] = df.sort_values(['ID','Value']).drop_duplicates('Value').groupby('ID')['Value'].apply(list)

# Use str.len for count.
df1['Count'] = df1['Diff'].str.len().fillna(0).astype(int)

# To fill NaN with empty list
df1['Diff'] = df1.Diff.apply(lambda x: x if type(x)==list else []) 


           Elements             Diff  Count
ID
A   [1, 4, 5, 2, 3]  [1, 2, 3, 4, 5]      5
B            [2, 3]               []      0
C               [3]               []      0
D               [7]              [7]      1
E         [2, 3, 9]              [9]      1
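A note on the ordering in the snippet above: Count is computed before the NaN-filling step because Series.str.len() returns NaN for groups whose Diff slot is still NaN, hence the fillna(0).astype(int). A minimal illustration of that behavior (assuming pandas):

```python
import numpy as np
import pandas as pd

# An object Series mixing list values and a missing entry,
# like the Diff column before the fill step
s = pd.Series([[1, 2, 3], np.nan, [7]])

# str.len() works element-wise on list values; NaN stays NaN
lengths = s.str.len()                   # 3.0, NaN, 1.0
counts = lengths.fillna(0).astype(int)  # 3, 0, 1
```

Filling Diff with empty lists first would also work, since str.len() of [] is 0, but then the fillna(0) for Count would be unnecessary.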
