
How to get the most and second most frequent values in each row?

Let's say I have a dataframe with 1 million rows and 30 columns. I want to add a column to the dataframe whose value is "the most frequent value of the previous 30 columns". I also want to add a column for "the second most frequent value of the previous 30 columns".

I know that you can do df.mode(axis=1) for "the most frequent value of the previous 30 columns", but it is very slow.

Is there any way to vectorize this so that it is fast?

df.mode(axis=1) is already vectorized. However, you may want to consider how it works. It needs to operate on each row independently, which means you would benefit from row-major order, which NumPy calls C order. A Pandas DataFrame is organized by column (column-major order), so gathering the 30 values needed to compute the mode of one row can mean touching 30 different pages of memory, which is not efficient.
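To make the difference between the two layouts concrete, here is a minimal sketch using a small, hypothetical NumPy array (the 4x3 shape and int64 dtype are assumptions chosen for illustration). The strides show how many bytes separate neighboring elements along each axis:

```python
import numpy as np

# Hypothetical small array: 4 rows, 3 columns, 8-byte integers.
c = np.zeros((4, 3), dtype=np.int64, order="C")  # row-major (C order)
f = np.asfortranarray(c)                         # column-major (Fortran order)

# In C order, the next element of the same row is 8 bytes away.
print(c.strides)  # (24, 8)

# In Fortran order, the next element of the same row is a whole column away.
print(f.strides)  # (8, 32)
```

With 1 million rows, that second stride grows to roughly 8 MB for 8-byte values, which is why reading one row from column-major data jumps across memory.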

So, try loading your data into a plain NumPy 2D array and see if that helps speed things up. It should.
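If the data already lives in a DataFrame, one way to get such an array might look like the sketch below. The DataFrame here is a hypothetical stand-in for the real data, and the sketch assumes all 30 columns share one numeric dtype:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: 1,000 rows, 30 integer columns.
df = pd.DataFrame(np.random.randint(0, 5, (1000, 30)))

# to_numpy() returns the values as a single 2D array. ascontiguousarray()
# copies only if that array is not already in C (row-major) order.
x = np.ascontiguousarray(df.to_numpy())

print(x.shape, x.flags["C_CONTIGUOUS"])
```

Row-wise operations on x then read each row from one contiguous stretch of memory.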

I tried this on my 1.5 GHz laptop:

import numpy as np
import pandas as pd
import scipy.stats

x = np.random.randint(0, 5, (10000, 30))  # 10k rows, 30 columns, values 0-4
df = pd.DataFrame(x)
%timeit df.mode(axis=1)
%timeit scipy.stats.mode(x, axis=1)

The DataFrame way takes 6 seconds (!), whereas the SciPy (row-major) way takes 16 milliseconds for 10k rows. Even SciPy in column-major order is not much slower, which makes me think the Pandas version is less efficient than it could be.
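The mode only covers the most frequent value. For the second most frequent value, one possible approach is to count each candidate value per row and sort the counts. The sketch below is not from the original answer: top_two_per_row and its n_values parameter are hypothetical names, and it assumes the data holds a small set of integers in range(n_values), as in the benchmark above. Ties go to the smaller value, and -1 marks rows that have no second distinct value.

```python
import numpy as np

def top_two_per_row(x, n_values):
    """Return (most, second) frequent value for each row of x.

    Assumes x holds integers in range(n_values) and n_values >= 2.
    """
    x = np.asarray(x)
    # counts[i, v] = number of times value v appears in row i
    counts = np.stack([(x == v).sum(axis=1) for v in range(n_values)], axis=1)
    # Stable sort on negated counts: highest count first, ties to smaller value.
    order = np.argsort(-counts, axis=1, kind="stable")
    most = order[:, 0]
    # A runner-up with zero occurrences is not really present in the row.
    second_count = np.take_along_axis(counts, order[:, 1:2], axis=1)[:, 0]
    second = np.where(second_count > 0, order[:, 1], -1)
    return most, second

x = np.array([[1, 1, 1, 2, 2, 3],
              [0, 0, 0, 0, 0, 0],
              [4, 4, 3, 3, 0, 1]])
most, second = top_two_per_row(x, 5)
print(most.tolist())    # [1, 0, 3]
print(second.tolist())  # [2, -1, 4]
```

This loops over the candidate values rather than the rows, so it suits data with few distinct values. With many distinct values, the counts array becomes large and a different approach would be needed.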
