Most efficient way to filter a large dataframe based on percentiles
I have a large dataframe, about 5 million rows and 200 columns. I am running the code below to filter out columns based on percentiles and data types:
col_percentile_filter = 0.98
modeldata_revised_2 = modeldata.loc[:, (modeldata.dtypes!='object') & (modeldata.quantile(col_percentile_filter) >= 1) & (modeldata.min() != modeldata.max())]
The code currently takes a lot of time to run. What is a more efficient way to run this?
When you run
modeldata_revised_2 = modeldata.loc[:, (modeldata.dtypes!='object') & (modeldata.quantile(col_percentile_filter) >= 1) & (modeldata.min() != modeldata.max())]
you're computing three separate column masks (the dtype check, the quantile of every column, and the min/max comparison) and then intersecting them. Note that df.query() filters rows, not columns, so it doesn't apply here. The main cost is that quantile(), min() and max() are evaluated over every column, including the object (string) columns. Restrict the work to numeric columns first and compute each aggregate only once:
model_numeric = modeldata.select_dtypes(exclude='object')
model_min = model_numeric.min()
model_max = model_numeric.max()
mask = (model_numeric.quantile(col_percentile_filter) >= 1) & (model_min != model_max)
modeldata_revised_2 = model_numeric.loc[:, mask]
A more comprehensive explanation of pandas performance with eval() and query() is at: https://jakevdp.github.io/PythonDataScienceHandbook/03.12-performance-eval-and-query.html
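As a minimal, self-contained sketch of the mask-based column filter, here it is on a tiny synthetic dataframe (the column names and values below are invented for illustration, not from the question):

```python
import pandas as pd

col_percentile_filter = 0.98

# Synthetic frame: one object column, one constant column,
# one column whose 98th percentile is below 1, and one that passes.
modeldata = pd.DataFrame({
    "name": ["a", "b", "c", "d"],    # object dtype -> dropped
    "const": [5.0, 5.0, 5.0, 5.0],   # min == max -> dropped
    "small": [0.1, 0.2, 0.3, 0.4],   # 98th percentile < 1 -> dropped
    "keep": [0.5, 1.5, 2.5, 3.5],    # passes all three filters
})

# Restrict to numeric columns first so quantile/min/max never touch the
# string columns, and compute each aggregate exactly once.
numeric = modeldata.select_dtypes(exclude="object")
col_min = numeric.min()
col_max = numeric.max()
mask = (numeric.quantile(col_percentile_filter) >= 1) & (col_min != col_max)
modeldata_revised_2 = numeric.loc[:, mask]

print(list(modeldata_revised_2.columns))  # ['keep']
```

On a frame this small the saving is invisible, but on 5 million rows it avoids three full passes over the object columns and one redundant recomputation of each aggregate.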