
Check if all values in dataframe column are the same

I want to do a quick and easy check if all column values for counts are the same in a dataframe:

In:

import pandas as pd

d = {'names': ['Jim', 'Ted', 'Mal', 'Ted'], 'counts': [3, 4, 3, 3]}
df = pd.DataFrame(data=d)
df

Out:

  names  counts
0   Jim       3
1   Ted       4
2   Mal       3
3   Ted       3

I want just a simple condition such that if all counts are the same value, then print('True').

Is there a fast way to do this?

An efficient way to do this is by comparing the first value with the rest, and using all:

def is_unique(s):
    a = s.to_numpy() # s.values (pandas<0.24)
    return (a[0] == a).all()

is_unique(df['counts'])
# False

Although the most intuitive idea might be to count the number of unique values and check whether there is only one, this has a needlessly high complexity for what we're trying to do. NumPy's np.unique, called by pandas' nunique, sorts the underlying array, which has an average complexity of O(n·log(n)) using quicksort (the default). The above approach is O(n).
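As a small sanity check of the first-vs-rest idea (re-defining is_unique from above so the snippet is self-contained), both methods agree on a toy example; the comparison-based check simply avoids the sort:

```python
import numpy as np
import pandas as pd

def is_unique(s):
    # Compare the first element with the whole array: O(n), no sorting.
    a = s.to_numpy()
    return (a[0] == a).all()

constant = pd.Series([7, 7, 7, 7])
varied = pd.Series([3, 4, 3, 3])

print(is_unique(constant), constant.nunique() == 1)  # True True
print(is_unique(varied), varied.nunique() == 1)      # False False
```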

The difference in performance becomes more obvious when we apply this to an entire dataframe (see below).


For an entire dataframe

If we want to perform the same task on an entire dataframe, we can extend the above by setting axis=0 in all:

def unique_cols(df):
    a = df.to_numpy() # df.values (pandas<0.24)
    return (a[0] == a).all(0)

For the shared example, we'd get:

unique_cols(df)
# array([False, False])
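To make the boolean array easier to read, one option (an addition here, not part of the original answer) is to wrap it in a Series indexed by the column names:

```python
import pandas as pd

def unique_cols(df):
    # First row compared against every row, reduced down each column.
    a = df.to_numpy()
    return (a[0] == a).all(0)

df = pd.DataFrame({'names': ['Jim', 'Ted', 'Mal', 'Ted'],
                   'counts': [3, 4, 3, 3],
                   'group': ['A', 'A', 'A', 'A']})

# Wrap the boolean array in a Series to label each result with its column:
result = pd.Series(unique_cols(df), index=df.columns)
print(result['group'])   # True: 'group' is constant
print(result['counts'])  # False
```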

Here's a benchmark of the above methods compared with some other approaches, such as using nunique (for a pd.Series):

import numpy as np
import perfplot

s_num = pd.Series(np.random.randint(0, 1_000, 1_100_000))

perfplot.show(
    setup=lambda n: s_num.iloc[:int(n)], 

    kernels=[
        lambda s: s.nunique() == 1,
        lambda s: is_unique(s)
    ],

    labels=['nunique', 'first_vs_rest'],
    n_range=[2**k for k in range(0, 20)],
    xlabel='N'
)

[benchmark plot: nunique vs first_vs_rest for a pd.Series]


And below are the timings for a pd.DataFrame. Let's also compare with a numba approach, which is especially useful here since we can take advantage of short-circuiting as soon as we see a differing value in a given column (note: the numba approach will only work with numerical data):

import numpy as np
from numba import njit

@njit
def unique_cols_nb(a):
    n_cols = a.shape[1]
    out = np.zeros(n_cols, dtype=np.int32)
    for i in range(n_cols):
        init = a[0, i]
        for j in a[1:, i]:
            if j != init:
                break
        else:
            out[i] = 1
    return out

If we compare the three methods:

df = pd.DataFrame(np.concatenate([np.random.randint(0, 1_000, (500_000, 200)), 
                                  np.zeros((500_000, 10))], axis=1))

perfplot.show(
    setup=lambda n: df.iloc[:int(n),:], 

    kernels=[
        lambda df: (df.nunique(0) == 1).values,
        lambda df: unique_cols_nb(df.values).astype(bool),
        lambda df: unique_cols(df) 
    ],

    labels=['nunique', 'unique_cols_nb', 'unique_cols'],
    n_range=[2**k for k in range(0, 20)],
    xlabel='N'
)

[benchmark plot: nunique vs unique_cols_nb vs unique_cols for a pd.DataFrame]

Update using np.unique

len(np.unique(df.counts)) == 1
# False

Or

len(set(df.counts.tolist()))==1

Or

df.counts.eq(df.counts.iloc[0]).all()
# False

Or

df.counts.std() == 0
# False
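The std() trick only applies to numeric data, and comparing the result to exactly zero can be brittle with floats; a small sketch of the caveat:

```python
import numpy as np
import pandas as pd

# Values that are equal only up to rounding produce a tiny but
# non-zero standard deviation, so an exact-zero test fails:
s = pd.Series([0.1 + 0.2, 0.3, 0.3])   # 0.1 + 0.2 != 0.3 exactly
print(s.std() == 0)             # False, although the values "look" equal
print(np.isclose(s.std(), 0))   # True: tolerance-based check
```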

I prefer:

df['counts'].eq(df['counts'].iloc[0]).all()

I find it the easiest to read, and it works across all value types. I have also found it fast enough in my experience.
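One caveat worth noting with the eq-based check (standard IEEE 754 NaN semantics, not anything specific to pandas): NaN never compares equal to itself, so an all-NaN column is reported as not constant:

```python
import numpy as np
import pandas as pd

# NaN != NaN, so eq() against the first value is False everywhere:
s = pd.Series([np.nan, np.nan])
print(s.eq(s.iloc[0]).all())  # False, even though every value is NaN
```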

I think nunique does much more work than necessary. Iteration can stop at the first difference. This simple and generic solution uses itertools:

import itertools

def all_equal(iterable):
    "Returns True if all elements are equal to each other"
    g = itertools.groupby(iterable)
    return next(g, True) and not next(g, False)

all_equal(df.counts)

One can even use this to find all columns with constant contents in one go:

constant_columns = df.columns[df.apply(all_equal)]

A slightly more readable but less performant alternative:

df.counts.min() == df.counts.max()

Add skipna=False here if necessary.
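To see why skipna matters, here is a small sketch: with the default skipna=True a NaN is ignored, so a column containing NaNs can still look constant:

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0, np.nan, 3.0])

# Default skipna=True ignores the NaN, so the column looks constant:
print(s.min() == s.max())                           # True
# With skipna=False, min and max are both NaN, and NaN != NaN:
print(s.min(skipna=False) == s.max(skipna=False))   # False
```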

One simple and efficient way is to check that each row has only one unique value. This is accomplished by measuring the length of each row's unique() output. Assuming df is a pd.DataFrame, this can be done like this:

unique = df.apply(lambda row: len(row.unique()) == 1, axis=1)
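The snippet above checks rows (axis=1); the same idea applied per column with axis=0 answers the original per-column question. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({'names': ['Jim', 'Ted', 'Mal', 'Ted'],
                   'counts': [3, 4, 3, 3]})

# axis=1 checks whether each row holds a single unique value:
per_row = df.apply(lambda row: len(row.unique()) == 1, axis=1)

# axis=0 applies the same idea per column, matching the original question:
per_col = df.apply(lambda col: len(col.unique()) == 1, axis=0)
print(per_col['counts'])  # False: counts contains both 3 and 4
```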
