Check if all values in dataframe column are the same
I want to do a quick and easy check if all column values for counts are the same in a dataframe:

In:
import pandas as pd
d = {'names': ['Jim', 'Ted', 'Mal', 'Ted'], 'counts': [3, 4, 3, 3]}
pd.DataFrame(data=d)
Out:
names counts
0 Jim 3
1 Ted 4
2 Mal 3
3 Ted 3
I want just a simple condition: if all counts are the same value, then print('True').

Is there a fast way to do this?
An efficient way to do this is by comparing the first value with the rest, and using all:
def is_unique(s):
    a = s.to_numpy()  # s.values (pandas<0.24)
    return (a[0] == a).all()

is_unique(df['counts'])
# False
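For instance, a quick sanity check of is_unique on the question's data (a self-contained sketch; the all-3s series is made up for illustration):

```python
import pandas as pd

def is_unique(s):
    a = s.to_numpy()
    return (a[0] == a).all()

df = pd.DataFrame({'names': ['Jim', 'Ted', 'Mal', 'Ted'],
                   'counts': [3, 4, 3, 3]})
print(is_unique(df['counts']))             # False: row 1 holds a 4
print(is_unique(pd.Series([3, 3, 3, 3])))  # True: every value equals the first
```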
Although the most intuitive idea could possibly be to count the amount of unique values and check if there is only one, this would have a needlessly high complexity for what we're trying to do. NumPy's np.unique, called by pandas' nunique, implements a sorting of the underlying arrays, which has an average complexity of O(n·log(n)) using quicksort (the default). The above approach is O(n).
The difference in performance becomes more obvious when we're applying this to an entire dataframe (see below).
In the case of wanting to perform the same task on an entire dataframe, we can extend the above by setting axis=0 in all:
def unique_cols(df):
    a = df.to_numpy()  # df.values (pandas<0.24)
    return (a[0] == a).all(0)
For the shared example, we'd get:
unique_cols(df)
# array([False, False])
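The broadcasting behind unique_cols can be inspected directly on the question's frame (a small sketch):

```python
import pandas as pd

df = pd.DataFrame({'names': ['Jim', 'Ted', 'Mal', 'Ted'],
                   'counts': [3, 4, 3, 3]})
a = df.to_numpy()
# a[0] (the first row) is compared element-wise against every row of a;
# .all(0) then reduces over the rows, yielding one flag per column
print((a[0] == a).all(0))   # [False False]
```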
Here's a benchmark of the above methods compared with some other approaches, such as using nunique (for a pd.Series):
import numpy as np
import pandas as pd
import perfplot

s_num = pd.Series(np.random.randint(0, 1_000, 1_100_000))

perfplot.show(
    setup=lambda n: s_num.iloc[:int(n)],
    kernels=[
        lambda s: s.nunique() == 1,
        lambda s: is_unique(s)
    ],
    labels=['nunique', 'first_vs_rest'],
    n_range=[2**k for k in range(0, 20)],
    xlabel='N'
)
And below are the timings for a pd.DataFrame. Let's compare too with a numba approach, which is especially useful here since we can take advantage of short-circuiting as soon as we see a differing value in a given column (note: the numba approach will only work with numerical data):
import numpy as np
from numba import njit

@njit
def unique_cols_nb(a):
    n_cols = a.shape[1]
    out = np.zeros(n_cols, dtype=np.int32)
    for i in range(n_cols):
        init = a[0, i]
        for j in a[1:, i]:
            if j != init:
                break
        else:
            out[i] = 1
    return out
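Without numba installed, the same early-exit logic can be traced in plain Python (a sketch; unique_cols_py is an illustrative name, and real speed-ups need the @njit version above):

```python
import numpy as np

def unique_cols_py(a):
    # same loop as unique_cols_nb, minus @njit: stop scanning a
    # column as soon as a value differs from its first entry
    n_cols = a.shape[1]
    out = np.zeros(n_cols, dtype=np.int32)
    for i in range(n_cols):
        init = a[0, i]
        for j in a[1:, i]:
            if j != init:
                break
        else:
            out[i] = 1
    return out

a = np.array([[3, 0], [4, 0], [3, 0], [3, 0]])
print(unique_cols_py(a).astype(bool))   # [False  True]
```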
If we compare the three methods:
df = pd.DataFrame(np.concatenate([np.random.randint(0, 1_000, (500_000, 200)),
                                  np.zeros((500_000, 10))], axis=1))

perfplot.show(
    setup=lambda n: df.iloc[:int(n), :],
    kernels=[
        lambda df: (df.nunique(0) == 1).values,
        lambda df: unique_cols_nb(df.values).astype(bool),
        lambda df: unique_cols(df)
    ],
    labels=['nunique', 'unique_cols_nb', 'unique_cols'],
    n_range=[2**k for k in range(0, 20)],
    xlabel='N'
)
Update using np.unique:
len(np.unique(df.counts)) == 1
# False

Or:

len(set(df.counts.tolist())) == 1

Or:

df.counts.eq(df.counts.iloc[0]).all()
# False

Or:

df.counts.std() == 0
# False
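One caveat worth noting: the std() == 0 variant only works for numeric columns (and compares a float against zero), while the eq-based check also handles strings, e.g.:

```python
import pandas as pd

s = pd.Series(['Ted', 'Ted', 'Ted'])
print(s.eq(s.iloc[0]).all())   # True: works for non-numeric data too
```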
I prefer:

df['counts'].eq(df['counts'].iloc[0]).all()

I find it the easiest to read, and it works across all value types. I have also found it fast enough in my experience.
I think nunique does much more work than necessary. Iteration can stop at the first difference. This simple and generic solution uses itertools:
import itertools

def all_equal(iterable):
    "Returns True if all elements are equal to each other"
    g = itertools.groupby(iterable)
    return next(g, True) and not next(g, False)

all_equal(df.counts)
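Because groupby is lazy, all_equal really does stop at the first difference; a sketch with an instrumented generator (the consumed list is illustrative) shows how little input is read:

```python
import itertools

def all_equal(iterable):
    "Returns True if all elements are equal to each other"
    g = itertools.groupby(iterable)
    return next(g, True) and not next(g, False)

consumed = []
def numbers():
    # record each value as the consumer pulls it
    for x in [3, 3, 4, 3, 3]:
        consumed.append(x)
        yield x

print(all_equal(numbers()))   # False
print(consumed)               # [3, 3, 4] -- iteration stopped at the 4
```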
One can use this even to find all columns with constant contents in one go:
constant_columns = df.columns[df.apply(all_equal)]
A slightly more readable but less performant alternative:
df.counts.min() == df.counts.max()
Add skipna=False here if necessary.
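A sketch of why skipna=False can matter here: with the default skipna=True, NaNs are dropped before taking min/max, so a column of one value plus NaN still reports as constant:

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0, 3.0, np.nan])
print(s.min() == s.max())                          # True: the NaN is skipped
print(s.min(skipna=False) == s.max(skipna=False))  # False: NaN != NaN
```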
One simple and efficient way is to check whether each row contains exactly one unique value. This is accomplished by measuring the length of the unique output of each row. Assuming df is a pd.DataFrame, this can be done like this:
unique = df.apply(lambda row: len(row.unique()) == 1, axis=1)
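On the question's frame, every row mixes a name with a count, so no row is constant (a self-contained sketch):

```python
import pandas as pd

df = pd.DataFrame({'names': ['Jim', 'Ted', 'Mal', 'Ted'],
                   'counts': [3, 4, 3, 3]})
unique = df.apply(lambda row: len(row.unique()) == 1, axis=1)
print(unique.tolist())   # [False, False, False, False]
```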