Pandas DataFrame column containing lists: get the intersection of two consecutive rows

Question

I have the following dataframe:

timestmp              coulmnis                                                
2015-10-15 18:24:00  set([a,b,c,d,e,f])
2015-10-15 18:27:00  set([a,b,g,h,i])
2015-10-15 18:30:00  set([g,h,j,k,l])
2015-10-15 18:33:00  set([a,b,g,h,j,k,l])
2015-10-15 18:36:00  set([d,e,j,k])

I want to check how many elements in a row are the same as in the previous row. My output should look like this:

timestmp              coulmnis                   count_sameAsPrevious                          
2015-10-15 18:24:00  set([a,b,c,d,e,f])          0
2015-10-15 18:27:00  set([a,b,g,h,i])            2
2015-10-15 18:30:00  set([g,h,j,k,l])            2
2015-10-15 18:33:00  set([a,b,g,h,j,k,l])        5
2015-10-15 18:36:00  set([d,e,j,k])              2

What is the most efficient way to do this, so that I can avoid a for loop? Any help appreciated!

EDIT:

df['shiftedColumn'] = df.coulmnis.shift(1)
df = df.dropna()

Now I want to use len(filter(y.__contains__,x)) to get the number of elements shared by the two columns, both of which contain sets.
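A minimal sketch of that counting step on two of the sets from the table, assuming Python 3. There `filter` returns a lazy iterator, so `len(filter(...))` as written only works on Python 2; wrapping it in `list()` or using the set intersection operator gives the same count. The names `x`, `y`, `count_filter` and `count_intersection` are illustrative only.

```python
# Two consecutive rows from the example table
y = {"a", "b", "c", "d", "e", "f"}   # previous row
x = {"a", "b", "g", "h", "i"}        # current row

# Python 3: filter() is lazy, so materialise it before calling len()
count_filter = len(list(filter(y.__contains__, x)))

# Equivalent count using set intersection
count_intersection = len(x & y)

print(count_filter, count_intersection)
```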

Answer 1

You can do this by using DataFrame.shift() to shift the rows down by one, renaming the coulmnis column to something else, resetting the index, merging the two dataframes on timestmp, and then using apply() on the result. Example (in one line) -

df['count'] = df.reset_index().merge(df.shift(1).reset_index().rename(columns={'coulmnis':'newcol'})) \
                .set_index('timestmp').apply((lambda x: len(x['coulmnis'] & x['newcol']) if pd.notnull(x['newcol']) else 0),axis=1)

Example in a more readable way -

mergedf = df.shift(1).reset_index().rename(columns={'coulmnis':'newcol'})
# df has timestmp as its index, so reset it before merging on the timestmp column
newdf = df.reset_index().merge(mergedf).set_index('timestmp')
df['count'] = newdf.apply((lambda x: len(x['coulmnis'] & x['newcol']) if pd.notnull(x['newcol']) else 0),axis=1)

Demo -

In [36]: df
Out[36]:
                                       coulmnis
timestmp
2015-10-15 18:24:00     set([f, b, c, e, d, a])
2015-10-15 18:27:00        set([g, b, i, a, h])
2015-10-15 18:30:00        set([l, g, k, j, h])
2015-10-15 18:33:00  set([b, j, h, k, a, l, g])
2015-10-15 18:36:00           set([d, e, k, j])

In [38]: df['count'] = df.reset_index().merge(df.shift(1).reset_index().rename(columns={'coulmnis':'newcol'})) \
   ....:                 .set_index('timestmp').apply((lambda x: len(x['coulmnis'] & x['newcol']) if pd.notnull(x['newcol']) else 0),axis=1)

In [39]: df
Out[39]:
                                       coulmnis  count
timestmp
2015-10-15 18:24:00     set([f, b, c, e, d, a])      0
2015-10-15 18:27:00        set([g, b, i, a, h])      2
2015-10-15 18:30:00        set([l, g, k, j, h])      2
2015-10-15 18:33:00  set([b, j, h, k, a, l, g])      5
2015-10-15 18:36:00           set([d, e, k, j])      2
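The shift-then-intersect idea can also be sketched without the merge step. This is not the answer's method, just an alternative under the assumption that the column holds Python sets and timestmp is the index; `prev` is a hypothetical helper name. The first row has no predecessor, so shift() leaves a missing value there and the count falls back to 0.

```python
import pandas as pd

# Rebuild the example frame (column and index names as in the question)
df = pd.DataFrame(
    {"coulmnis": [set("abcdef"), set("abghi"), set("ghjkl"),
                  set("abghjkl"), set("dejk")]},
    index=pd.to_datetime([
        "2015-10-15 18:24:00", "2015-10-15 18:27:00", "2015-10-15 18:30:00",
        "2015-10-15 18:33:00", "2015-10-15 18:36:00",
    ]),
)
df.index.name = "timestmp"

# Previous row's set aligned with the current row; first entry is missing
prev = df["coulmnis"].shift(1)

# Size of the intersection with the previous row, 0 where there is none
df["count"] = [
    len(cur & p) if isinstance(p, set) else 0
    for cur, p in zip(df["coulmnis"], prev)
]

print(df["count"].tolist())
```

This still loops in Python, but only once over the column and without building an intermediate merged frame.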

Answer 2

My solution:

import pandas

df = pandas.DataFrame({'sets': [set(['a','b','c','d','e','f']), set(['a','b','g','h','i']), set(['g','h','j','k','l']), set(['a','b','g','h','j','k','l'])]})
df['sets_temp'] = pandas.Series([])
df['sets_temp'][1:] = df['sets'][:-1]
df['count'] = pandas.Series([])
df['count'][1:] = df[1:].apply(lambda row: len(row['sets'] & row['sets_temp']), axis=1)
df['count'][:1] = 0
df = df.drop('sets_temp', axis=1)

Output:

>>> df
                         sets  count
0     set([b, c, d, e, a, f])      0
1        set([b, h, i, a, g])      2
2        set([j, h, l, k, g])      2
3  set([j, b, h, k, l, a, g])      5

Actually, apply() is a wrapper around a for loop, so its efficiency is about the same as an explicit loop. It looks like there is no way to avoid a for-loop-like method here.
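To make that point concrete, here is a rough sketch that writes the loop out explicitly with zip over plain Python lists, assuming the same 'sets' column as above. The names `sets` and `counts` are illustrative; this is one possible way to pair each row with its predecessor, not a claim about how apply() is implemented internally.

```python
import pandas

df = pandas.DataFrame({'sets': [set('abcdef'), set('abghi'),
                                set('ghjkl'), set('abghjkl')]})

# Pull the sets out once so the loop runs over a plain Python list
sets = df['sets'].tolist()

# Pair each row with the one before it; the first row has no predecessor
counts = [0] + [len(cur & prev) for prev, cur in zip(sets[:-1], sets[1:])]

df['count'] = counts
print(df['count'].tolist())
```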

2 answers

Answer 1: 3, accepted, 2015-10-20 09:50:26

Answer 2: 2, 2015-10-20 09:10:31
