Pythonic Way of Reducing Factor Levels in Large Dataframe

Original post: 2020-03-30 21:10:24 | Tags: python, pandas, categories, bucket, levels

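I am trying to reduce the number of factor levels in a column of a pandas DataFrame: any level whose share of the column's rows falls below a defined threshold (1% by default) should be bucketed into a new level labeled "Other". Below is the function I am using to accomplish this: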

def condenseMe(df, column_name, threshold = 0.01, newLabel = "Other"):

    valDict = dict(df[column_name].value_counts() / len(df[column_name]))
    toCondense = [v for v in valDict.keys() if valDict[v] < threshold]
    if 'Missing' in toCondense:
        toCondense.remove('Missing')
    df[column_name] = df[column_name].apply(lambda x: newLabel if x in toCondense else x)

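The problem is that I am working with a large dataset (about 18 million rows) and am trying to use this function on a column with more than 10,000 levels. As a result, running it on this column takes a very long time. Is there a more pythonic way to reduce the number of factor levels that executes faster? Any help would be much appreciated!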

1 Answer

You can do this with a combination of groupby, transform, and count:

def condenseMe(df, col, threshold = 0.01, newLabel="Other"):
    # For every row, the share of all rows that carry the same level
    counts = df.groupby(col)[col].transform('count') / len(df)
    # Boolean mask of rows in rare levels (leaving "Missing" untouched)
    mask = (counts < threshold) & (df[col] != 'Missing')

    # Relabel the masked rows in place; .loc avoids chained assignment
    df.loc[mask, col] = newLabel
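
A related variant, offered only as a sketch and not taken from the answer above, computes the frequencies once per level with value_counts and then flags rare rows with isin. The function name `condense_levels`, the `keep` parameter, and the sample data are hypothetical and chosen for illustration. It assumes the column has object (string) dtype; a categorical column would first need the new label added to its categories.

```python
import pandas as pd


def condense_levels(series, threshold=0.01, new_label="Other", keep=("Missing",)):
    """Return a copy of `series` with rare levels replaced by `new_label`.

    Hypothetical helper for illustration; assumes an object-dtype column.
    """
    # Share of all rows (including NaN rows in the denominator) per level
    freqs = series.value_counts() / len(series)
    # Levels below the threshold, minus any protected labels
    rare = set(freqs[freqs < threshold].index) - set(keep)
    # isin does a hashed lookup per row instead of scanning a Python list
    is_rare = series.isin(list(rare))
    # Keep values where the row is not rare; otherwise use the new label
    return series.where(~is_rare, new_label)


# Small made-up example: 100 rows, threshold of 5%
df = pd.DataFrame(
    {"city": ["A"] * 50 + ["B"] * 45 + ["C"] * 2 + ["D"] * 2 + ["Missing"]}
)
df["city"] = condense_levels(df["city"], threshold=0.05)
print(df["city"].value_counts())
```

Here "C" and "D" fall below 5% and are merged into "Other", while "Missing" is left alone even though it is also rare. Returning a new Series rather than mutating the DataFrame is a design choice of this sketch, not something the original function does.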


