
Most efficient way to filter a large dataframe based on percentiles

I have a large dataframe, about 5 million rows and 200 columns. I am running the code below to filter out columns based on percentiles and data types.

Code below:

col_percentile_filter = 0.98
modeldata_revised_2 = modeldata.loc[:, (modeldata.dtypes!='object') & (modeldata.quantile(col_percentile_filter) >= 1) & (modeldata.min() != modeldata.max())]

The code currently takes a lot of time to run. What is a more efficient way to run this?

When you're running

modeldata_revised_2 = modeldata.loc[:, (modeldata.dtypes!='object') & (modeldata.quantile(col_percentile_filter) >= 1) & (modeldata.min() != modeldata.max())]

You're pretty much computing three separate statistics over the whole frame and then intersecting the results, and min() and max() also run over the object columns. df.query() and pd.eval() are good for row-level filter expressions, but here you are selecting columns, so the bigger win is to compute each statistic once on the numeric columns and reuse it. Something like this:

numeric_cols = modeldata.select_dtypes(include='number')        # drop object columns before computing anything
model_min = numeric_cols.min()
model_max = numeric_cols.max()
model_quantile = numeric_cols.quantile(col_percentile_filter)   # each statistic is computed exactly once
keep = model_quantile[(model_quantile >= 1) & (model_min != model_max)].index
modeldata_revised_2 = modeldata[keep]
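Here select_dtypes(include='number') is an assumption that the non-object columns you care about are numeric; use exclude='object' instead if you also need boolean or datetime columns. The point is that min(), max() and quantile() each run once over the numeric data only, rather than the whole 200-column frame being scanned for every term of the boolean expression.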

A more comprehensive explanation I found is at: https://jakevdp.github.io/PythonDataScienceHandbook/03.12-performance-eval-and-query.html
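For completeness, the pattern that chapter covers is row-level filtering and column arithmetic, where query()/eval() evaluate the whole expression in one pass. A minimal sketch of that usage (the example frame, the column names a and b, and the threshold are invented for illustration, not taken from the question):

import numpy as np
import pandas as pd

# Hypothetical data purely to demonstrate query()/eval()
df = pd.DataFrame(np.random.rand(1_000_000, 2), columns=["a", "b"])

threshold = 0.98
# query() evaluates the boolean expression in a single pass (via numexpr when installed)
# instead of building intermediate boolean masks; @threshold refers to the local variable
subset = df.query("a > @threshold and b < @threshold")

# eval() does the same for column-wise arithmetic expressions
df = df.eval("c = a + b")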
