How to get the most frequent and second most frequent values in each row?

Question

Let's say I have a dataframe with 1 million rows and 30 columns. I want to add a column whose value is "the most frequent value of the previous 30 columns". I also want to add a column for "the second most frequent value of the previous 30 columns".

I know that you can do df.mode(axis=1) to get "the most frequent value of the previous 30 columns", but it is very slow.

Is there any way to vectorize this so that it runs fast?

Answer 1

df.mode(axis=1) is already vectorized. However, you may want to consider how it works. It needs to operate on each row independently, which means you would benefit from "row-major order", which is called C order in NumPy. A Pandas DataFrame generally stores its data column by column (column-major order), which means that fetching the 30 values needed to compute the mode of one row can require touching 30 separate pages of memory, which is not efficient.
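To make the layout difference concrete, here is a small sketch that builds the same 4x3 array in both orders and compares the strides (the number of bytes NumPy steps to reach the next row or the next column). The array contents and the names `col_major` and `row_major` are only for illustration.

```python
import numpy as np

# Column-major (Fortran order): the values of one column are adjacent in memory.
col_major = np.asfortranarray(np.arange(12, dtype=np.int64).reshape(4, 3))

# Row-major (C order): the values of one row are adjacent in memory.
row_major = np.ascontiguousarray(col_major)

# strides = (bytes to the next row, bytes to the next column)
print(col_major.strides)  # (8, 32)
print(row_major.strides)  # (24, 8)
```

In the row-major copy, stepping along a row moves only 8 bytes at a time, so all values of a row sit next to each other. In the column-major array, each step along a row jumps over a whole column.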

So, try loading your data into a plain NumPy 2D array and see whether that speeds things up. It should.
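As a rough sketch of that suggestion, assuming all 30 columns share one numeric dtype: copy the frame into a C-ordered array, compute the row modes with SciPy, and write the result back as a new column. The helper name `add_row_mode` and the column name `most_frequent` are made up for this example.

```python
import numpy as np
import pandas as pd
from scipy import stats

def add_row_mode(df, column="most_frequent"):
    """Return a copy of df with the per-row mode added as a new column."""
    # Force a row-major (C order) copy so each row is contiguous in memory.
    arr = np.ascontiguousarray(df.to_numpy())
    result = stats.mode(arr, axis=1)
    # Older SciPy returns shape (n, 1), newer returns (n,); ravel covers both.
    modes = np.asarray(result.mode).ravel()
    out = df.copy()
    out[column] = modes
    return out

small = pd.DataFrame(np.array([[1, 1, 2, 3], [4, 4, 4, 4], [2, 3, 3, 2]]))
with_mode = add_row_mode(small)
print(with_mode["most_frequent"].tolist())  # [1, 4, 2]
```

When several values tie, scipy.stats.mode reports the smallest one, which is why the last row yields 2 rather than 3.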

I tried this on my 1.5 GHz laptop:

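import numpy as np
import pandas as pd
import scipy.stats
# Run in IPython or Jupyter: %timeit is an IPython magic, not plain Python
x = np.random.randint(0, 5, (10000, 30))
df = pd.DataFrame(x)
%timeit df.mode(axis=1)
%timeit scipy.stats.mode(x, axis=1)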

The DataFrame way takes 6 seconds (!), whereas the SciPy (row-major) way takes 16 milliseconds for 10k rows. Even SciPy on column-major data is not much slower, which makes me think the Pandas version is less efficient than it could be.
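The answer above only covers the most frequent value. For the second most frequent value, one possible approach (not part of the original answer) is to count every distinct value per row with a single np.bincount call and then rank the counts. This sketch assumes the data has a small number of distinct values, since it allocates a rows x distinct-values count table; the function name `top_two_per_row` is hypothetical.

```python
import numpy as np

def top_two_per_row(x):
    """Return (first, second, has_second) for each row of a 2D array."""
    x = np.asarray(x)
    n_rows = x.shape[0]
    # Map every value to a small integer code 0..k-1.
    values, inverse = np.unique(x, return_inverse=True)
    inverse = inverse.reshape(x.shape)  # shape differs across NumPy versions
    k = values.size
    # Shift the codes of each row into their own range, then count all at once.
    offsets = np.arange(n_rows)[:, None] * k
    counts = np.bincount((inverse + offsets).ravel(), minlength=n_rows * k)
    counts = counts.reshape(n_rows, k)
    # Stable sort on negated counts: ties resolve to the smaller value.
    order = np.argsort(-counts, axis=1, kind="stable")
    first = values[order[:, 0]]
    if k < 2:
        # Only one distinct value anywhere, so no row has a runner-up.
        return first, first.copy(), np.zeros(n_rows, dtype=bool)
    second_code = order[:, 1]
    second = values[second_code]
    # A runner-up with count 0 does not actually occur in that row.
    has_second = counts[np.arange(n_rows), second_code] > 0
    return first, second, has_second

x = np.array([[1, 1, 2, 3], [4, 4, 4, 4], [2, 3, 3, 2]])
first, second, has_second = top_two_per_row(x)
print(first.tolist())       # [1, 4, 2]
print(second.tolist())      # [2, 1, 3]
print(has_second.tolist())  # [True, False, True]
```

The `has_second` mask marks rows that really contain a second distinct value. In the second row every entry is 4, so the reported runner-up is a placeholder and should be ignored.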

Answer 1 posted 2016-05-24 04:20:47
