Pandas: check whether one column's values exist in another column

Question

I am new to Python and pandas. I have a dataset with the following structure. It is a pandas DataFrame.

city time1              time2
a    [1991, 1992, 1993] [1993,1994,1995]

time1 and time2 represent the coverage of the data in two sources. I would like to create a new column that indicates whether time1 and time2 have any intersection: True if they do, False otherwise. The task sounds very straightforward. I was thinking about using set operations on the two columns, but it did not work as expected. Would anyone help me figure this out?

Thanks!

I appreciate your help.

Answer 1

You can convert the lists in every column to sets and then check, row by row, whether the intersection contains any values.

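# Wrap every cell in a set: lists become sets, scalars become one-element sets.
df1 = df.applymap(lambda x: set(x) if isinstance(x, list) else {x})
# A non-empty intersection is truthy, so bool() is True when the two sets overlap.
df1.apply(lambda x: bool(x.time1 & x.time2), axis=1)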

This is a semi-vectorized way that should run much faster.

df1 = df[['time1', 'time2']].applymap(lambda x: set(x) if type(x) == list else set([x]))
(df1.time1.values & df1.time2.values).astype(bool)

And this is even a bit faster:

change_to_set = lambda x: set(x) if type(x) == list else set([x])
time1_set = df.time1.map(change_to_set).values
time2_set = df.time2.map(change_to_set).values
(time1_set & time2_set).astype(bool)
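
For readers who want to run the last variant end to end, here is a minimal sketch. The sample rows are borrowed from Answer 2, and the helper name `to_set` and the result column `overlap` are illustrative choices, not part of the original answer. It uses `Series.map` rather than `DataFrame.applymap`, since pandas 2.1 and later deprecate `applymap` in favour of `DataFrame.map`.

```python
import pandas as pd

# Sample data (taken from Answer 2 so the result can be compared).
df = pd.DataFrame({
    "city": ["a", "b", "c", "d"],
    "time1": [[1970], [1991, 1992, 1993], [2000, 2001, 2002], [2015, 2016]],
    "time2": [[1980], [1993, 1994, 1995], [2010, 2011], [2016]],
})


def to_set(x):
    """Return x as a set; a scalar becomes a one-element set."""
    return set(x) if isinstance(x, list) else {x}


# Object arrays holding one Python set per row.
time1_set = df["time1"].map(to_set).to_numpy()
time2_set = df["time2"].map(to_set).to_numpy()

# `&` is applied element by element, giving one intersection set per row;
# astype(bool) turns an empty set into False and a non-empty one into True.
df["overlap"] = (time1_set & time2_set).astype(bool)
print(df[["city", "overlap"]])
```

On these four rows the flags come out as False, True, False, True, which matches the `x` column shown in Answer 2.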

Answer 2

Here is a somewhat ugly but vectorized approach:

In [37]: df
Out[37]:
  city               time1               time2
0    a              [1970]              [1980]
1    b  [1991, 1992, 1993]  [1993, 1994, 1995]
2    c  [2000, 2001, 2002]        [2010, 2011]
3    d        [2015, 2016]              [2016]

In [38]: df['x'] = df.index.isin(
    ...:             pd.DataFrame(df.time1.tolist())
    ...:               .stack().reset_index(name='x')
    ...:               .merge(pd.DataFrame(df.time2.tolist())
    ...:                        .stack().reset_index(name='x'),
    ...:                      on=['level_0','x'])['level_0'])
    ...:

In [39]: df
Out[39]:
  city               time1               time2      x
0    a              [1970]              [1980]  False
1    b  [1991, 1992, 1993]  [1993, 1994, 1995]   True
2    c  [2000, 2001, 2002]        [2010, 2011]  False
3    d        [2015, 2016]              [2016]   True
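
The one-liner above is dense, so here is a step-by-step sketch of the same reshape-and-merge idea. It is a variation and not the answer's exact code: it swaps `pd.DataFrame(...).stack()` for `Series.explode` (which assumes pandas 0.25 or later), and the names `to_long`, `row`, `year`, and `overlap` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["a", "b", "c", "d"],
    "time1": [[1970], [1991, 1992, 1993], [2000, 2001, 2002], [2015, 2016]],
    "time2": [[1980], [1993, 1994, 1995], [2010, 2011], [2016]],
})


def to_long(col):
    """Reshape a column of lists into one (row, year) record per list element."""
    return (
        col.explode()        # one element per line, original index repeated
           .dropna()         # an empty list explodes to NaN; drop it so it never matches
           .rename("year")
           .rename_axis("row")
           .reset_index()
    )


left = to_long(df["time1"])
right = to_long(df["time2"])

# An inner merge keeps only the (row, year) pairs present in both columns.
matches = left.merge(right, on=["row", "year"])

# A row overlaps if its index shows up among the matched pairs.
df["overlap"] = df.index.isin(matches["row"])
print(df[["city", "overlap"]])
```

Splitting the pipeline this way makes it easy to inspect `left`, `right`, and `matches` separately when the result looks wrong.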

Timing:

In [54]: df = pd.concat([df] * 10**4, ignore_index=True)

In [55]: df.shape
Out[55]: (40000, 3)

In [56]: %%timeit
    ...: df.index.isin(
    ...:   pd.DataFrame(df.time1.tolist())
    ...:     .stack().reset_index(name='x')
    ...:     .merge(pd.DataFrame(df.time2.tolist())
    ...:              .stack().reset_index(name='x'),
    ...:            on=['level_0','x'])['level_0'])
    ...:
1 loop, best of 3: 253 ms per loop

In [57]: %timeit df.apply(lambda x: bool(set(x.time1) & set(x.time2)), axis=1)
1 loop, best of 3: 5.36 s per loop
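
The row-wise `apply` in the last timing builds a Series object for every row. A commonly used middle ground, which appears in neither answer and is not timed here, is a plain list comprehension over the two columns. The sketch below assumes every cell holds a list; `overlap` is an illustrative column name.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["a", "b", "c", "d"],
    "time1": [[1970], [1991, 1992, 1993], [2000, 2001, 2002], [2015, 2016]],
    "time2": [[1980], [1993, 1994, 1995], [2010, 2011], [2016]],
})

# set.isdisjoint accepts any iterable and stops at the first shared element,
# so only one of the two lists has to be converted to a set.
df["overlap"] = [
    not set(a).isdisjoint(b)
    for a, b in zip(df["time1"], df["time2"])
]
print(df[["city", "overlap"]])
```

Whether this beats the merge-based version depends on the data, so it is worth timing on your own DataFrame with `%timeit`.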

Answer 1: score 3, accepted, 2017-06-27 20:14:33

Answer 2: score 2, 2017-06-27 20:29:53
