
Python Pandas - Concatenate two Pandas columns efficiently

I am looking for the most memory-efficient way to concatenate an int32 column and a datetime64 column in a DataFrame to create a third column that holds the string concatenation of the two.

The dataframe looks like this:

[image: sample DataFrame with cust_id and tran_dt columns]

What I want is:

[image: desired DataFrame with the combined cust_id----tran_dt key column]

I have created a test data frame as follows:

import pandas as pd
import numpy as np
import datetime as dt

%load_ext memory_profiler

np.random.seed(42)
df_rows = 10**6
todays_date = dt.datetime.now().date()

# Two years of daily dates ending today
dt_array = pd.date_range(todays_date - dt.timedelta(2*365), periods=2*365, freq='D')
# A pool of 100,000 six-digit customer ids
cust_id_array = np.random.randint(100000, 999999, size=(100000, 1))

# One million rows sampled with replacement from the two pools
df = pd.DataFrame({'cust_id': np.random.choice(cust_id_array.flatten(), df_rows, replace=True),
                   'tran_dt': np.random.choice(dt_array, df_rows, replace=True)})
df.info()

The DataFrame statistics as-is before concatenation are: [image: output of df.info()]

I have used both map and astype to concatenate but the memory usage is still quite high:

%memit -r 1 df['comb_key'] = df["cust_id"].map(str) + '----' + df["tran_dt"].map(str)

%memit -r 1 df['comb_key'] = df["cust_id"].astype(str) + '----' + df["tran_dt"].astype(str)

%memit -r 1 df['comb_key'] = df.apply(lambda x: str(x['cust_id']) \
    + '----' + dt.datetime.strftime(x['tran_dt'], '%Y-%m-%d'), axis=1)

The memory usage for the three approaches is: [image: %memit output for the three approaches]

Is there a more memory-efficient way of doing this? My real-life data sets are about 1.8 GB uncompressed, on a machine with 16 GB of RAM.

df['comb_key'] = df["cust_id"].astype(str) + '----' + df["tran_dt"].astype(str)

is computationally the fastest method, as you're effectively performing only one typecast per element of data, and pretty much all of that work takes place in C.
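If you want to confirm this on the test frame, a rough timing comparison along these lines should do; this is just a sketch, and the absolute numbers and relative gaps will vary by machine and pandas version:

# Time the three approaches from the question; astype should come out ahead.
%timeit df["cust_id"].map(str) + '----' + df["tran_dt"].map(str)
%timeit df["cust_id"].astype(str) + '----' + df["tran_dt"].astype(str)
%timeit df.apply(lambda x: str(x['cust_id']) + '----' + dt.datetime.strftime(x['tran_dt'], '%Y-%m-%d'), axis=1)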

So if you're running into memory issues, you'll have to build the column in chunks, for example in two halves:

%%memit
df['comb_key'] = ''
# Series.update aligns on the index, so each half fills in its own rows
df.comb_key.update(df["cust_id"].iloc[:500000].astype(str) + '----' + df["tran_dt"].iloc[:500000].astype(str))
df.comb_key.update(df["cust_id"].iloc[500000:].astype(str) + '----' + df["tran_dt"].iloc[500000:].astype(str))

# peak memory: 253.06 MiB, increment: 63.25 MiB
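The same idea generalizes to any number of chunks. Here is a minimal sketch of that, where chunk_size is an assumed tuning knob rather than a prescribed value: smaller chunks mean a lower peak, at the cost of a little more overhead per pass:

# Build the key in fixed-size chunks so peak memory stays bounded.
chunk_size = 250000  # assumed value; tune to your available memory
df['comb_key'] = ''
for start in range(0, len(df), chunk_size):
    end = start + chunk_size
    df.comb_key.update(df["cust_id"].iloc[start:end].astype(str)
                       + '----'
                       + df["tran_dt"].iloc[start:end].astype(str))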

Note that the new column consumes 65 MB of memory:

df.memory_usage(deep=True)

# Index             72
# cust_id      8000000
# tran_dt      8000000
# comb_key    65000000
# dtype: int64

So make sure that you have enough memory to store the result in the first place! It's also worth noting that if this operation barely fits in memory, chances are you won't have enough left over to do further work on your DataFrame either.
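One way to check up front is to build the key on a small sample and scale up its per-row footprint. This is a rough sketch, and the 10,000-row sample size here is an arbitrary assumption:

# Estimate the final column's footprint from a small sample before
# committing to the full concatenation.
sample = df.head(10000)
sample_key = sample["cust_id"].astype(str) + '----' + sample["tran_dt"].astype(str)
bytes_per_row = sample_key.memory_usage(deep=True) / len(sample_key)
print('estimated result size: %.0f MB' % (bytes_per_row * len(df) / 1e6))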
