
Pandas memory usage and memory allocation

I'm trying to analyze a big CSV (~3GB, ~6 million rows) using pandas (my computer has 32GB of RAM), and for a series of reasons I cannot load it in chunks. I can read the CSV without any problems, but as soon as I start to clean the file the whole script crashes. Monitoring my computer's memory usage, I found that just having the CSV stored in a pandas DataFrame uses 50% of my RAM (18GB). As soon as I start modifying the DataFrame, memory usage skyrockets to 100% and the script crashes. Using the DataFrame method memory_usage(deep=True), pandas reports that my DataFrame is 3GB. But how is it possible that pandas tells me my variable is 3GB while my memory usage is at 18GB (maybe 13GB, since some of that is used by the OS)?
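For scale, here is a toy comparison (made-up data, not the real db.csv) of what memory_usage(deep=True) reports for an object-dtype column versus a float column. Each cell of an object column is a separate Python str with its own object header, and the interpreter's allocator does not necessarily return freed memory to the OS, which is part of why process RSS can far exceed what pandas reports:

```python
import sys

import numpy as np
import pandas as pd

# Toy stand-ins: one string (object) column vs one float64 column.
s_obj = pd.Series(["12,34"] * 100_000, dtype="object")
s_f64 = pd.Series(np.full(100_000, 12.34))

# memory_usage(deep=True) counts the Python string objects too...
obj_bytes = s_obj.memory_usage(deep=True)
f64_bytes = s_f64.memory_usage(deep=True)

# ...but it cannot see interpreter-level overhead: allocator slack,
# temporary copies made during operations, and memory the allocator
# keeps rather than returning to the OS. A short str alone costs
# tens of bytes of header on top of its characters:
print(obj_bytes, f64_bytes, sys.getsizeof("12,34"))
```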

This is an example:

raw = pd.read_csv("db.csv", sep="\t", on_bad_lines="skip", dtype="object")

remove_invalid_ateco = lambda df_: df_[df_.ateco.str.contains(r"\.")]
months_diff = lambda a, b: 12 * (a.year - b.dt.year) + (a.month - b.dt.month)
clean = (
    raw.query(
        "~piva.isnull() and "
        "~code.isnull() and "
        "provincia_cd.str.len() == 2"
    )
    .pipe(remove_invalid_ateco)
    .assign(
        # Float
        roe=lambda df_: df_.roe.str.replace(",", "."),
        roi=lambda df_: df_.roi.str.replace(",", "."),
        ros=lambda df_: df_.ros.str.replace(",", "."),
        longitudine_dd=lambda df_: df_.longitudine_dd.str.replace(",", "."),
        latitudine_dd=lambda df_: df_.latitudine_dd.str.replace(",", "."),
        sfin=lambda df_: df_.sfin.str.replace(",", "."),
        cap_del=lambda df_: df_.cap_del.str.replace(",", "."),
        cap_sott=lambda df_: df_.cap_sott.str.replace(",", "."),
        cap_vers=lambda df_: df_.cap_vers.str.replace(",", "."),
        eq_ec_1=lambda df_: df_.eq_ec_1.str.replace(",", "."),
        eq_eff_1=lambda df_: df_.eq_eff_1.str.replace(",", "."),
        eq_fin_1=lambda df_: df_.eq_fin_1.str.replace(",", "."),
        eq_liq_1=lambda df_: df_.eq_liq_1.str.replace(",", "."),
        eq_pat_1=lambda df_: df_.eq_pat_1.str.replace(",", "."),
        # Date
        date_iscr=lambda df_: pd.to_datetime(df_.date_iscr, errors="coerce"),
        date_init=lambda df_: pd.to_datetime(df_.date_init, errors="coerce"),
        delta=lambda df_: months_diff(
            datetime.today(), pd.to_datetime(df_.date_init, errors="coerce")
        ),
    )
)

I'm going to use a library of mine - convtools; it has a Table helper ( docs | github ) which allows processing table data as a stream.

Caveats:

  1. since there's no db.csv sample, I cannot test it properly
  2. pd.read_csv supports a decimal=',' parameter, so it's better to use that. But I'm still replicating your code's behavior so you can use this approach in other cases (e.g. stripping commas used as group separators)
  3. this code doesn't do any on_bad_lines="skip" handling.
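To illustrate caveat 2: decimal=',' makes pd.read_csv parse comma-decimal numbers into float columns directly, so no comma-replacing pass over object columns is needed at all. A minimal sketch with made-up data and assumed column names:

```python
import io

import pandas as pd

# Toy tab-separated data with comma decimal marks (column names assumed).
csv = "roe\tdate_iscr\n1,5\t2020-01-31\n2,25\t2021-06-30\n"

df = pd.read_csv(
    io.StringIO(csv),
    sep="\t",
    decimal=",",                # parse "1,5" as the float 1.5 directly
    parse_dates=["date_iscr"],  # parse dates at load time too
)
print(df.dtypes)
```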
from datetime import date, datetime

from convtools import conversion as c
from convtools.contrib.tables import Table


float_cols = [
    "roe", "roi", "ros", "longitudine_dd", "latitudine_dd", "sfin",
    "cap_del", "cap_sott", "cap_vers", "eq_ec_1", "eq_eff_1",
    "eq_fin_1", "eq_liq_1", "eq_pat_1",
]
date_cols = ["date_iscr", "date_init"]


def months_diff(a, b):
    return 12 * (a.year - b.year) + (a.month - b.month)


def parse_datetime(value, default):
    try:
        return datetime.strptime(value, "%Y-%m-%d")
    except (ValueError, TypeError):
        return default


rows_iter = (
    Table.from_csv(
        "db.csv", header=True, dialect=Table.csv_dialect(delimiter="\t")
    )
    .filter(
        c.and_(
            c.col("piva").is_not(None),
            c.col("code").is_not(None),
            c.col("provincia_cd").pipe(len) == 2,
            c("\.").not_in(c.col("ateco")),
        )
    )
    .update(
        **{
            column: c.col(column).call_method("replace", ",", ".")
            for column in float_cols
        },
        **{
            column: c.call_func(parse_datetime, c.col(column), default=None)
            for column in date_cols
        },
    )
    .update(
        delta=c.col("date_init").and_then(
            c.call_func(months_diff, date.today(), c.this)
        )
    )
    .into_iter_rows(dict)
)

The result is an iterable of dicts, which you can feed directly to pd.DataFrame or polars (whichever you prefer).
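For example (hypothetical rows standing in for the stream produced above), the iterator can be materialized once, with dtypes tightened immediately so no full-size object column lingers:

```python
import pandas as pd

# Stand-in for the rows_iter produced above (values already cleaned).
rows_iter = iter(
    [
        {"piva": "123", "roe": "1.5", "delta": 14},
        {"piva": "456", "roe": "2.0", "delta": 3},
    ]
)

# pd.DataFrame accepts any iterable of dicts.
df = pd.DataFrame(rows_iter)

# Tighten dtypes right away instead of keeping strings around.
df["roe"] = df["roe"].astype("float32")
print(df.dtypes)
```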
