How to check whether each value in one column maps to exactly one value in another column?

Question

I have a dataframe like this:

import pandas as pd

df = pd.DataFrame({'A':list('bbcddee'), 'B': list('klmnnoi')})

   A  B
0  b  k
1  b  l
2  c  m
3  d  n
4  d  n
5  e  o
6  e  i

and I would like to create a dictionary from the columns A and B using, e.g.,

dict(zip(df.A, df.B))

Before doing this, I would like to check whether each value in A is mapped to only one value in B; if not, an error should be thrown. That is not the case above, as b is mapped to k and l, and e is mapped to o and i.

One way of approaching it would be:

df[df.groupby('A', sort=False)['B'].transform(lambda x: len(set(x))) > 1]

which returns

   A  B
0  b  k
1  b  l
5  e  o
6  e  i

However, that requires a lambda, which might make it slow. Does anyone see an option to speed it up?

Answer 1

You can use groupby with nunique to count how many unique values in 'B' belong to each unique value in 'A'.

df.groupby('A').B.nunique()
#A
#b    2
#c    1
#d    1
#e    2
#Name: B, dtype: int64

And so you can check whether any of them has more than one mapping:

df.groupby('A').B.nunique().gt(1).any()
#True

The above is conceptually no different from what you proposed. However, there is often a major performance gain if you are able to use a built-in groupby operation, which has been "optimized", as opposed to a slow lambda that requires a loop. We can see that as the DataFrame gets large the lambda can become nearly 100x slower, which is a big deal when things start to take seconds to compute.

import perfplot
import pandas as pd
import numpy as np

def gb_lambda(df):
    return df.groupby('A')['B'].apply(lambda x: len(set(x))).gt(1)

def gb_nunique(df):
    return df.groupby('A').B.nunique().gt(1)

perfplot.show(
    setup=lambda n: pd.DataFrame({'A': np.random.randint(0, n//2, n), 
                                  'B': np.random.randint(0, n//2, n)}),
    kernels=[
        lambda df: gb_lambda(df),
        lambda df: gb_nunique(df),
    ],
    labels=['groupby with lambda', 'Groupby.nunique'],
    n_range=[2 ** k for k in range(2,18)],
    equality_check=np.allclose,  
    xlabel='~len(df)'
)
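To connect the check back to the original goal (raise an error, otherwise build the dictionary), here is a minimal sketch. The helper name `to_unique_mapping` and its `key`/`value` parameters are hypothetical and not part of pandas; the only pandas calls used are `groupby(...).nunique()` and plain column access. Note that `nunique` ignores missing values by default, so this sketch assumes neither column contains NaN.

```python
import pandas as pd


def to_unique_mapping(df, key="A", value="B"):
    """Return {key: value}, or raise if a key maps to several values.

    Hypothetical helper for illustration; not a pandas function.
    """
    # Number of distinct `value` entries per `key`
    counts = df.groupby(key)[value].nunique()
    ambiguous = counts[counts.gt(1)].index.tolist()
    if ambiguous:
        raise ValueError(
            f"{key!r} values mapped to more than one {value!r}: {ambiguous}"
        )
    return dict(zip(df[key], df[value]))


df = pd.DataFrame({"A": list("bbcddee"), "B": list("klmnnoi")})

try:
    to_unique_mapping(df)
except ValueError as err:
    print(err)  # names the offending keys b and e

# After removing the ambiguous keys the mapping can be built safely
clean = df[~df["A"].isin(["b", "e"])]
mapping = to_unique_mapping(clean)
print(mapping)
```

Reporting the offending keys in the error message is a design choice of this sketch; a bare `.gt(1).any()` check is enough if only a yes/no answer is needed.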

Answer 2

You can use pd.Series.duplicated and df.duplicated with the keep parameter set to False:

df[df.A.duplicated(keep=False) & (~df.duplicated(keep=False))]

   A  B
0  b  k
1  b  l
5  e  o
6  e  i

Details

df.A.duplicated(keep=False) # True for `A` values that occur more than once

0     True
1     True
2    False # ----> `c` which has no duplicates 
3     True
4     True
5     True
6     True
Name: A, dtype: bool

~df.duplicated(keep=False) # True for rows whose (A, B) pair is not repeated
0     True
1     True
2     True
3    False # ----> d n
4    False # ----> d n
5     True
6     True
dtype: bool
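If only a yes/no answer or the list of offending keys is needed, a related variation (not from the answer above, just a sketch built on the same `duplicated` idea) is to drop repeated (A, B) pairs first and then look for `A` values that are still duplicated. The variable names here are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": list("bbcddee"), "B": list("klmnnoi")})

# Drop repeated (A, B) pairs first; any A that is still duplicated
# must be paired with at least two different B values.
pairs = df.drop_duplicates()
still_duplicated = pairs["A"].duplicated()

has_conflict = still_duplicated.any()
conflicting_keys = pairs.loc[still_duplicated, "A"].unique().tolist()

print(has_conflict)
print(conflicting_keys)
```

One caveat worth keeping in mind: the boolean mask shown in this answer selects only rows whose (A, B) pair is not repeated, so for a key with rows such as (d, n), (d, n), (d, x) it would return only the (d, x) row. The key is still flagged, but not all of its rows are listed.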

Answer 3

Let us try filter:

df.groupby('A').filter(lambda x : x['B'].nunique()>1)
   A  B
0  b  k
1  b  l
5  e  o
6  e  i
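Since the question was about avoiding a lambda, the same rows can also be selected by passing the string `'nunique'` to `transform`, which uses the built-in groupby reduction instead of a Python function. This is a sketch of an alternative, not part of the answer above, and no timing is claimed for it here.

```python
import pandas as pd

df = pd.DataFrame({"A": list("bbcddee"), "B": list("klmnnoi")})

# Broadcast each group's count of distinct B values back onto its rows
per_row_counts = df.groupby("A")["B"].transform("nunique")

# Keep the rows whose A value maps to more than one B value
conflicts = df[per_row_counts > 1]
print(conflicts)
```

The result keeps the original index and row order, so it can be compared directly with the output of the filter call.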

Answer 1: 5, accepted, 2020-07-05 15:18:20
Answer 2: 4, 2020-07-05 15:07:09
Answer 3: 1, 2020-07-05 15:27:15