pandas dataframe iterate over cell value that is a list and compare each element to other cell

I have a dataframe with 2 columns, one holding a tuple and one holding a list:

df = t        l
    (1,2) [1,2,3,4,5,6]
    (0,5) [1,4,9]
    (0,4) [9,11]

I want to add a new column with "how many elements from l are in the range of t" (both bounds inclusive). So for example, here it will be:

df =counter  t       l
      2    (1,2) [1,2,3,4,5,6]
      2    (0,5) [1,4,9]
      0    (0,4) [9,11]

What is the best way to do so?

Use a list comprehension with a generator expression and sum:

df['counter'] = [sum(a <= i <= b for i in y) for (a, b), y in df[['t','l']].to_numpy()]
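For anyone who wants to run this end to end, here is a minimal self-contained sketch. It assumes the frame is built from plain Python tuples and lists, as in the question; the construction code is my own and not from the original post.

```python
import pandas as pd

# Sample data from the question: t is an inclusive (low, high) range, l is a list.
df = pd.DataFrame({
    "t": [(1, 2), (0, 5), (0, 4)],
    "l": [[1, 2, 3, 4, 5, 6], [1, 4, 9], [9, 11]],
})

# For each row, unpack the tuple into a and b, then count the elements of the
# list that satisfy a <= i <= b. Each comparison is a bool, and sum() adds them up.
df["counter"] = [
    sum(a <= i <= b for i in y)
    for (a, b), y in df[["t", "l"]].to_numpy()
]

print(df)
```

The chained comparison `a <= i <= b` makes both bounds inclusive, which matches the expected output in the question.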

A slightly faster solution uses set.intersection. Note that it relies on range, so it needs integer bounds, and it counts each distinct matching value once:

df['counter'] = [len(set(range(a, b+1)).intersection(y)) 
                 for (a, b), y in df[['t','l']].to_numpy()]

print (df)
        t                   l  counter
0  (1, 2)  [1, 2, 3, 4, 5, 6]        2
1  (0, 5)           [1, 4, 9]        2
2  (0, 4)             [9, 11]        0
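The two solutions agree on the sample data, but they are not interchangeable in general. The sketch below uses made-up rows (duplicates and a non-integer value, neither of which appears in the question) to show where they diverge.

```python
import pandas as pd

# Hypothetical data: the first list repeats a value, the second holds a float.
df2 = pd.DataFrame({
    "t": [(1, 2), (0, 5)],
    "l": [[1, 1, 2, 3], [0.5, 4, 9]],
})

rows = df2[["t", "l"]].to_numpy()

# Counts every element that falls in the range, duplicates included.
by_sum = [sum(a <= i <= b for i in y) for (a, b), y in rows]

# Counts distinct values only, and only values equal to an integer in the range.
by_set = [len(set(range(a, b + 1)).intersection(y)) for (a, b), y in rows]

print(by_sum)  # [3, 2]
print(by_set)  # [2, 1]
```

So the set version is only a drop-in replacement when the lists hold integers without repeats; otherwise the sum version is the safer choice.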

Performance on test data:

#30k rows
df = pd.concat([df] * 10000, ignore_index=True)

In [67]: %timeit [sum(a <= i <= b for i in y) for (a, b), y in df[['t','l']].to_numpy()]
65.3 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [68]: %timeit [len(set(range(a, b+1)).intersection(y)) for (a, b), y in df[['t','l']].to_numpy()]
60.7 ms ± 520 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
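The %timeit lines above need IPython. As a rough equivalent for a plain Python script, the comparison could be sketched with the standard timeit module. The helper names count_with_sum and count_with_set are my own, and the numbers will differ from machine to machine.

```python
import timeit

import pandas as pd

base = pd.DataFrame({
    "t": [(1, 2), (0, 5), (0, 4)],
    "l": [[1, 2, 3, 4, 5, 6], [1, 4, 9], [9, 11]],
})
# Repeat the 3 sample rows 10000 times to get 30k rows, as in the test above.
big = pd.concat([base] * 10000, ignore_index=True)


def count_with_sum(frame):
    """Count in-range elements with a generator expression and sum."""
    return [sum(a <= i <= b for i in y)
            for (a, b), y in frame[["t", "l"]].to_numpy()]


def count_with_set(frame):
    """Count in-range elements with set.intersection (distinct integers only)."""
    return [len(set(range(a, b + 1)).intersection(y))
            for (a, b), y in frame[["t", "l"]].to_numpy()]


# Both approaches should agree on this data before timing them.
assert count_with_sum(big) == count_with_set(big)

timings = {}
for fn in (count_with_sum, count_with_set):
    # Best of 3 repeats, 3 calls each, reported as seconds per call.
    best = min(timeit.repeat(lambda: fn(big), number=3, repeat=3))
    timings[fn.__name__] = best / 3
    print(f"{fn.__name__}: {timings[fn.__name__] * 1e3:.1f} ms per loop")
```

Given how close the two timings are in the test above, the ranking may flip depending on list lengths and range widths, so it is worth measuring on your own data.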
