I'm trying to analyze a big CSV (~3GB, ~6 million rows) using pandas (my computer has 32GB of RAM) and, for a series of reasons, I cannot load it in chunks. I can read the CSV without any problems, but as soon as I start to clean the file the whole script crashes. Monitoring my computer's memory usage, I found that just having the CSV stored in a pandas DataFrame uses 50% of my RAM (18GB). As soon as I start modifying the DataFrame, memory usage skyrockets to 100% and crashes my script. Using the DataFrame method memory_usage(deep=True),
I find that my DataFrame is 3GB according to pandas. But how is it possible that pandas tells me my variable is 3GB while my memory usage is at 18GB (maybe 13GB, since 5GB are used by the OS)?
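For context, a minimal sketch of why the two numbers can diverge: with dtype="object" every cell is a separate Python str, which memory_usage(deep=True) counts but the default shallow call does not, and every string operation materializes a whole new column (the column name and values below are made up):

```python
import pandas as pd

# Toy object-dtype column standing in for one of the real CSV columns.
df = pd.DataFrame({"roe": ["1,23"] * 100_000}, dtype="object")

shallow = df.memory_usage().sum()        # counts only the 8-byte pointers
deep = df.memory_usage(deep=True).sum()  # also counts each Python str object

# Every .str.replace builds a brand-new column of brand-new strings, so peak
# RAM during cleaning is roughly the old columns plus the new ones combined.
cleaned = df["roe"].str.replace(",", ".", regex=False)
print(shallow, deep)
```

The intermediate copies created per .assign step, not the DataFrame itself, are usually what pushes memory to the ceiling.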
This is an example:
import pandas as pd
from datetime import datetime

raw = pd.read_csv("db.csv", sep="\t", on_bad_lines="skip", dtype="object")

remove_invalid_ateco = lambda df_: df_[df_.ateco.str.contains(r"\.")]
months_diff = lambda a, b: 12 * (a.year - b.dt.year) + (a.month - b.dt.month)

clean = (
    raw.query(
        "~piva.isnull() and "
        "~code.isnull() and "
        "provincia_cd.str.len() == 2"
    )
    .pipe(remove_invalid_ateco)
    .assign(
        # Float
        roe=lambda df_: df_.roe.str.replace(",", "."),
        roi=lambda df_: df_.roi.str.replace(",", "."),
        ros=lambda df_: df_.ros.str.replace(",", "."),
        longitudine_dd=lambda df_: df_.longitudine_dd.str.replace(",", "."),
        latitudine_dd=lambda df_: df_.latitudine_dd.str.replace(",", "."),
        sfin=lambda df_: df_.sfin.str.replace(",", "."),
        cap_del=lambda df_: df_.cap_del.str.replace(",", "."),
        cap_sott=lambda df_: df_.cap_sott.str.replace(",", "."),
        cap_vers=lambda df_: df_.cap_vers.str.replace(",", "."),
        eq_ec_1=lambda df_: df_.eq_ec_1.str.replace(",", "."),
        eq_eff_1=lambda df_: df_.eq_eff_1.str.replace(",", "."),
        eq_fin_1=lambda df_: df_.eq_fin_1.str.replace(",", "."),
        eq_liq_1=lambda df_: df_.eq_liq_1.str.replace(",", "."),
        eq_pat_1=lambda df_: df_.eq_pat_1.str.replace(",", "."),
        # Date
        date_iscr=lambda df_: pd.to_datetime(df_.date_iscr, errors="coerce"),
        date_init=lambda df_: pd.to_datetime(df_.date_init, errors="coerce"),
        delta=lambda df_: months_diff(
            datetime.today(), pd.to_datetime(df_.date_init, errors="coerce")
        ),
    )
)
I'm going to use a library of mine, convtools; its Table helper (docs | github) allows processing table data as a stream.
Caveats:
- without a db.csv sample, I cannot test it properly
- pd.read_csv supports a decimal="," parameter, so better use it; still, I'm replicating your code's behavior here so you can use it in other cases (e.g. stripping commas used as group separators)
- on_bad_lines="skip" behavior is not replicated here
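To illustrate the decimal caveat, a minimal sketch of pushing the conversion into pd.read_csv itself, so no object-dtype intermediates are created during cleaning (inline sample data and a reduced column set, made up for illustration):

```python
import io
import pandas as pd

# Inline sample standing in for db.csv; "roe" uses a comma decimal separator.
sample = "piva\troe\tdate_init\n123\t1,5\t2020-01-31\n"

df = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    decimal=",",               # parses "1,5" as 1.5, no str.replace pass needed
    dtype={"roe": "float32"},  # half the memory of the default float64
    parse_dates=["date_init"],
)
print(df.dtypes)
```

Doing the conversion at parse time keeps peak memory close to the final DataFrame's size instead of roughly double it.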
from datetime import date, datetime

from convtools import conversion as c
from convtools.contrib.tables import Table
float_cols = [
    "roe", "roi", "ros", "longitudine_dd", "latitudine_dd", "sfin",
    "cap_del", "cap_sott", "cap_vers", "eq_ec_1", "eq_eff_1",
    "eq_fin_1", "eq_liq_1", "eq_pat_1",
]
date_cols = ["date_iscr", "date_init"]

def months_diff(a, b):
    return 12 * (a.year - b.year) + (a.month - b.month)

def parse_datetime(value, default):
    try:
        return datetime.strptime(value, "%Y-%m-%d")
    except (ValueError, TypeError):
        return default

rows_iter = (
    Table.from_csv(
        "db.csv", header=True, dialect=Table.csv_dialect(delimiter="\t")
    )
    .filter(
        c.and_(
            c.col("piva").is_not(None),
            c.col("code").is_not(None),
            c.col("provincia_cd").pipe(len) == 2,
            c(".").in_(c.col("ateco")),  # keep rows whose ateco contains a dot
        )
    )
    .update(
        **{
            column: c.col(column).call_method("replace", ",", ".")
            for column in float_cols
        },
        **{
            column: c.call_func(parse_datetime, c.col(column), default=None)
            for column in date_cols
        },
    )
    .update(
        delta=c.col("date_init").and_then(
            c.call_func(months_diff, date.today(), c.this)
        )
    )
    .into_iter_rows(dict)
)
The result is an iterable of dicts, which you can feed directly to pd.DataFrame or polars (whatever you prefer).
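If convtools isn't an option, the same streaming idea can be sketched with just the standard library: a generator cleans rows one at a time, and only the surviving rows ever reach pandas (the sample data and reduced column set below are made up):

```python
import csv
import io

import pandas as pd

# Made-up three-row sample: row 2 has an empty piva, row 3 an ateco without a dot.
sample = (
    "piva\tateco\troe\n"
    "1\t62.01\t1,5\n"
    "\t62.02\t2,0\n"
    "2\t6201\t3,5\n"
)

def clean_rows(fobj):
    """Yield valid rows one by one, fixing decimal commas on the fly."""
    for row in csv.DictReader(fobj, delimiter="\t"):
        if not row["piva"] or "." not in row["ateco"]:
            continue  # drop invalid rows before they ever reach pandas
        row["roe"] = float(row["roe"].replace(",", "."))
        yield row

df = pd.DataFrame(clean_rows(io.StringIO(sample)))
print(df)  # only the first sample row survives
```

Because filtering and conversion happen per row, the peak memory is just the final DataFrame, never the raw object-dtype copy plus the cleaned one.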