Pandas dataframe column containing list, Get intersection of two consecutive rows
I have the following dataframe:
timestmp coulmnis
2015-10-15 18:24:00 set([a,b,c,d,e,f])
2015-10-15 18:27:00 set([a,b,g,h,i])
2015-10-15 18:30:00 set([g,h,j,k,l])
2015-10-15 18:33:00 set([a,b,g,h,j,k,l])
2015-10-15 18:36:00 set([d,e,j,k])
I want to check how many elements in a row are the same as in the previous row. My output should look like this:
timestmp coulmnis count_sameAsPrevious
2015-10-15 18:24:00 set([a,b,c,d,e,f]) 0
2015-10-15 18:27:00 set([a,b,g,h,i]) 2
2015-10-15 18:30:00 set([g,h,j,k,l]) 2
2015-10-15 18:33:00 set([a,b,g,h,j,k,l]) 5
2015-10-15 18:36:00 set([d,e,j,k]) 2
What is the most efficient way to do this so that I can avoid a for loop? Any help appreciated!
EDIT:
df['shiftedColumn'] = df.coulmnis.shift(1)
df = df.dropna()
Now I want to use len(filter(y.__contains__,x)) to get the number of elements shared by the two columns, both of which contain sets.
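One caveat on the len(filter(...)) idea: under Python 3, filter() returns an iterator, so calling len() on it raises a TypeError. Since the values are sets, a set intersection gives the same count. A minimal sketch, assuming the column holds Python sets and using a small made-up three-row frame with the question's column names:

```python
import pandas as pd

# Hypothetical three-row frame mirroring the first rows of the question's data
df = pd.DataFrame({"coulmnis": [set("abcdef"), set("abghi"), set("ghjkl")]})
df["shiftedColumn"] = df["coulmnis"].shift(1)
df = df.dropna().copy()  # drops the first row, which has no previous row

# len(x & y) counts the elements common to both sets
df["count_sameAsPrevious"] = [
    len(x & y) for x, y in zip(df["coulmnis"], df["shiftedColumn"])
]
print(df["count_sameAsPrevious"].tolist())  # [2, 2]
```

Because of the dropna() step the first row disappears from the result; keeping it with a count of 0 needs an explicit check for the missing previous value.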
You can do this using DataFrame.shift() to shift the rows down by one position, rename the coulmnis column to something else, reset the index and merge the dataframes on timestmp, and then use apply() on the merged DataFrame.
Example (in one line) -
df['count'] = df.reset_index().merge(df.shift(1).reset_index().rename(columns={'coulmnis':'newcol'})) \
.set_index('timestmp').apply((lambda x: len(x['coulmnis'] & x['newcol']) if pd.notnull(x['newcol']) else 0),axis=1)
Example in a more readable way -
mergedf = df.shift(1).reset_index().rename(columns={'coulmnis':'newcol'})
newdf = df.reset_index().merge(mergedf).set_index('timestmp')
df['count'] = newdf.apply((lambda x: len(x['coulmnis'] & x['newcol']) if pd.notnull(x['newcol']) else 0),axis=1)
Demo -
In [36]: df
Out[36]:
coulmnis
timestmp
2015-10-15 18:24:00 set([f, b, c, e, d, a])
2015-10-15 18:27:00 set([g, b, i, a, h])
2015-10-15 18:30:00 set([l, g, k, j, h])
2015-10-15 18:33:00 set([b, j, h, k, a, l, g])
2015-10-15 18:36:00 set([d, e, k, j])
In [38]: df['count'] = df.reset_index().merge(df.shift(1).reset_index().rename(columns={'coulmnis':'newcol'})) \
....: .set_index('timestmp').apply((lambda x: len(x['coulmnis'] & x['newcol']) if pd.notnull(x['newcol']) else 0),axis=1)
In [39]: df
Out[39]:
coulmnis count
timestmp
2015-10-15 18:24:00 set([f, b, c, e, d, a]) 0
2015-10-15 18:27:00 set([g, b, i, a, h]) 2
2015-10-15 18:30:00 set([l, g, k, j, h]) 2
2015-10-15 18:33:00 set([b, j, h, k, a, l, g]) 5
2015-10-15 18:36:00 set([d, e, k, j]) 2
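The merge in this answer only serves to line each row up with its predecessor. Since shift() preserves the index, the shifted column can probably be assigned directly instead. A sketch under that assumption, with the demo data rebuilt by hand; the isinstance check stands in for pd.notnull and is my own choice, not part of the answer:

```python
import pandas as pd

# Rebuild the demo frame: sets in 'coulmnis', indexed by 'timestmp'
idx = pd.to_datetime([
    "2015-10-15 18:24:00", "2015-10-15 18:27:00", "2015-10-15 18:30:00",
    "2015-10-15 18:33:00", "2015-10-15 18:36:00",
])
df = pd.DataFrame(
    {"coulmnis": [set("abcdef"), set("abghi"), set("ghjkl"),
                  set("abghjkl"), set("dejk")]},
    index=pd.Index(idx, name="timestmp"),
)

# Previous row's set, aligned on the same index (NaN for the first row)
df["newcol"] = df["coulmnis"].shift(1)
df["count"] = df.apply(
    lambda x: len(x["coulmnis"] & x["newcol"]) if isinstance(x["newcol"], set) else 0,
    axis=1,
)
df = df.drop(columns="newcol")
print(df["count"].tolist())  # [0, 2, 2, 5, 2]
```

This still calls apply() once per row, so it is shorter rather than fundamentally faster.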
My solution:
import pandas

df = pandas.DataFrame({'sets': [set(['a','b','c','d','e','f']), set(['a','b','g','h','i']), set(['g','h','j','k','l']), set(['a','b','g','h','j','k','l'])]})
# Previous row's set, aligned with the current row (NaN for the first row)
df['sets_temp'] = df['sets'].shift(1)
# The first row has no previous row, so its count stays 0
df['count'] = 0
df.loc[df.index[1:], 'count'] = df[1:].apply(lambda row: len(row['sets'] & row['sets_temp']), axis=1)
df = df.drop('sets_temp', axis=1)
Output:
>>> df
sets count
0 set([b, c, d, e, a, f]) 0
1 set([b, h, i, a, g]) 2
2 set([j, h, l, k, g]) 2
3 set([j, b, h, k, l, a, g]) 5
Actually, the apply() function is a wrapper around a for loop, so its efficiency is about the same as an explicit loop. It looks like there is no way to avoid a loop-like method here.
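If a loop of some kind is unavoidable for Python set objects, one option is to run it over plain lists rather than through apply(axis=1), which builds a Series for every row. A minimal sketch, reusing the same four sets as the 'sets' column above; whether it is faster in practice would need measuring on real data:

```python
import pandas as pd

df = pd.DataFrame(
    {"sets": [set("abcdef"), set("abghi"), set("ghjkl"), set("abghjkl")]}
)

# Pair each set with its predecessor; the first row has none, so it gets 0
sets = df["sets"].tolist()
df["count"] = [0] + [len(cur & prev) for prev, cur in zip(sets, sets[1:])]
print(df["count"].tolist())  # [0, 2, 2, 5]
```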