How to check whether each value of one column maps to exactly one value in another column?
I have a DataFrame like this:
import pandas as pd
df = pd.DataFrame({'A':list('bbcddee'), 'B': list('klmnnoi')})
   A  B
0  b  k
1  b  l
2  c  m
3  d  n
4  d  n
5  e  o
6  e  i
and I would like to create a dictionary from columns A and B using, e.g.:
dict(zip(df.A, df.B))
Before doing this, I would like to check whether each value in A is mapped to only one value in B; if not, an error should be thrown. Above, that is not the case, as b is mapped to k and l, and e is mapped to o and i.
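To make the need for the check concrete, here is a small sketch of what the plain dict(zip(...)) call does with the example data. It uses only the frame from the question; the variable name mapping is just an illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': list('bbcddee'), 'B': list('klmnnoi')})

# dict() keeps the last value seen for a repeated key, so the
# conflicting pairs (b, k) and (e, o) disappear without any warning.
mapping = dict(zip(df.A, df.B))
print(mapping)
# {'b': 'l', 'c': 'm', 'd': 'n', 'e': 'i'}
```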
One way of approaching it would be:
df[df.groupby('A', sort=False)['B'].transform(lambda x: len(set(x))) > 1]
which returns
   A  B
0  b  k
1  b  l
5  e  o
6  e  i
However, that requires a lambda, which might make it slow. Does anyone see an option to speed it up?
You can use groupby with nunique to count how many unique values in 'B' belong to each unique value in 'A'.
df.groupby('A').B.nunique()
#A
#b 2
#c 1
#d 1
#e 2
#Name: B, dtype: int64
And so you can check whether any of them has more than one mapping:
df.groupby('A').B.nunique().gt(1).any()
#True
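Since the question asks for an error when the mapping is ambiguous, the check can be wrapped in a small helper. This is only a sketch: to_unique_mapping is a hypothetical name, not a pandas function, and it assumes missing values in the value column can be ignored, because nunique skips NaN by default.

```python
import pandas as pd

def to_unique_mapping(df, key, value):
    """Return {key: value}; raise if any key maps to several values.

    Hypothetical helper for illustration, not part of pandas.
    """
    # Number of distinct `value` entries per `key` (NaN is not counted)
    counts = df.groupby(key)[value].nunique()
    ambiguous = counts[counts.gt(1)].index.tolist()
    if ambiguous:
        raise ValueError(
            f"{key!r} values with more than one {value!r}: {ambiguous}"
        )
    return dict(zip(df[key], df[value]))

df = pd.DataFrame({'A': list('bbcddee'), 'B': list('klmnnoi')})

try:
    to_unique_mapping(df, 'A', 'B')
except ValueError as err:
    print(err)  # reports b and e as ambiguous

# With the ambiguous keys removed, the mapping is built normally
clean = df[~df['A'].isin(['b', 'e'])]
print(to_unique_mapping(clean, 'A', 'B'))
# {'c': 'm', 'd': 'n'}
```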
The above is conceptually no different from what you proposed. However, there is often a major performance gain if you can use a built-in groupby operation, which has been optimized, as opposed to a slow lambda that requires a Python-level loop over the groups. We can see that as the DataFrame gets large, the lambda can become nearly 100x slower, which matters once things start taking seconds to compute.
import perfplot
import pandas as pd
import numpy as np

def gb_lambda(df):
    return df.groupby('A')['B'].apply(lambda x: len(set(x))).gt(1)

def gb_nunique(df):
    return df.groupby('A').B.nunique().gt(1)

perfplot.show(
    setup=lambda n: pd.DataFrame({'A': np.random.randint(0, n//2, n),
                                  'B': np.random.randint(0, n//2, n)}),
    kernels=[
        lambda df: gb_lambda(df),
        lambda df: gb_nunique(df),
    ],
    labels=['groupby with lambda', 'Groupby.nunique'],
    n_range=[2 ** k for k in range(2, 18)],
    equality_check=np.allclose,
    xlabel='~len(df)'
)
You can use pd.Series.duplicated and df.duplicated with the keep parameter set to False:
df[df.A.duplicated(keep=False) & (~df.duplicated(keep=False))]
   A  B
0  b  k
1  b  l
5  e  o
6  e  i
df.A.duplicated(keep=False)  # Eliminates `A` values that occur only once
0     True
1     True
2    False   # ----> `c`, which has no duplicates
3     True
4     True
5     True
6     True
Name: A, dtype: bool
~df.duplicated(keep=False)  # Keeps rows whose (A, B) pair is not repeated
0     True
1     True
2     True
3    False   # ----> d n
4    False   # ----> d n
5     True
6     True
dtype: bool
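One caveat worth checking: the mask above relies on the conflicting rows being unique. If every conflicting (A, B) pair happens to be repeated, no row survives `~df.duplicated(keep=False)`. The sketch below shows that case on a made-up frame (df2 is an assumption, not data from the question) and a variant that collapses identical rows with drop_duplicates first.

```python
import pandas as pd

# Made-up data: (b, k) and (b, l) each appear twice, so b is ambiguous
df2 = pd.DataFrame({'A': list('bbbbc'), 'B': list('kkllm')})

# The row-level mask finds nothing, because no row with a repeated A
# has a unique (A, B) pair in this frame.
mask = df2.A.duplicated(keep=False) & ~df2.duplicated(keep=False)
print(mask.any())
# False

# Collapsing identical rows first leaves one row per distinct pair,
# so any A that is still duplicated has more than one B.
pairs = df2.drop_duplicates()
conflicts = pairs[pairs.A.duplicated(keep=False)]
print(conflicts)
```

Here conflicts holds the rows (b, k) and (b, l), so testing `not conflicts.empty` would serve as the ambiguity check.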
Let us try filter:
df.groupby('A').filter(lambda x : x['B'].nunique()>1)
   A  B
0  b  k
1  b  l
5  e  o
6  e  i
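Because filter still calls a lambda once per group, the same rows can presumably be selected without one by passing the string 'nunique' to transform. This is a sketch of that alternative, not a benchmarked claim; the names n_unique and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': list('bbcddee'), 'B': list('klmnnoi')})

# transform('nunique') broadcasts each group's distinct-B count
# back onto the rows of that group, with no Python lambda involved.
n_unique = df.groupby('A')['B'].transform('nunique')

# Keep only rows whose A value maps to more than one B value
result = df[n_unique > 1]
print(result)
```

The result has the same rows (index 0, 1, 5, 6) as the filter version above.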