
Python Pandas - Concatenate two Pandas columns efficiently

I am looking for the most memory-efficient way to concatenate an int32 and a datetime64 column to create a third column. I have two columns in a DataFrame, an int32 and a datetime64, and I want to create a third column that combines the two into a single key.

The dataframe looks like this:

[image: sample dataframe with cust_id and tran_dt columns]

What I want is:

[image: the same dataframe with a third column combining cust_id and tran_dt]

I have created a test dataframe as follows:

import pandas as pd
import numpy as np
import sys
import datetime as dt
%load_ext memory_profiler
np.random.seed(42)
df_rows = 10**6
todays_date = dt.datetime.now().date()
dt_array = pd.date_range(todays_date - dt.timedelta(2*365), periods=2*365, freq='D')  
cust_id_array = np.random.randint(100000,999999,size=(100000, 1))
df = pd.DataFrame({'cust_id':np.random.choice(cust_id_array.flatten(),df_rows,replace=True)
                  ,'tran_dt':np.random.choice(dt_array,df_rows,replace=True)})
df.info()

The dataframe statistics as-is before concatenation are:

[image: df.info() output showing dtypes and memory usage before concatenation]

I have used both map and astype to concatenate, but the memory usage is still quite high:

%memit -r 1 df['comb_key'] = df["cust_id"].map(str) + '----' + df["tran_dt"].map(str)

%memit -r 1 df['comb_key'] = df["cust_id"].astype(str) + '----' + df["tran_dt"].astype(str)

%memit -r 1 df['comb_key'] = df.apply(lambda x: str(x['cust_id']) + '----' \
    + dt.datetime.strftime(x['tran_dt'], '%Y-%m-%d'), axis=1)

The memory usage for the three approaches is:

[image: %memit output for the three approaches]
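Whichever approach is used, it's worth confirming they produce identical keys. A minimal sketch on a toy frame (the `'----'` separator and `'%Y-%m-%d'` format come from the question; `dt.strftime` is used for the date side here to keep the output format deterministic, since `astype(str)` formatting of datetimes has varied across pandas versions):

```python
import pandas as pd
import datetime as dt

df = pd.DataFrame({
    'cust_id': [123456, 654321],
    'tran_dt': pd.to_datetime(['2020-01-01', '2020-01-02']),
})

# The three concatenation strategies from above, on a 2-row frame
k_map    = df['cust_id'].map(str) + '----' + df['tran_dt'].dt.strftime('%Y-%m-%d')
k_astype = df['cust_id'].astype(str) + '----' + df['tran_dt'].dt.strftime('%Y-%m-%d')
k_apply  = df.apply(lambda x: str(x['cust_id']) + '----'
                    + dt.datetime.strftime(x['tran_dt'], '%Y-%m-%d'), axis=1)

assert k_map.equals(k_astype) and k_map.equals(k_apply)
print(k_map.tolist())  # → ['123456----2020-01-01', '654321----2020-01-02']
```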

Is there a more memory-efficient way of doing this? My real-life data sets are about 1.8 GB uncompressed, on a machine with 16 GB of RAM.

df['comb_key'] = df["cust_id"].astype(str) + '----' + df["tran_dt"].astype(str)

is computationally the fastest method, as you're effectively only performing one type cast per element of data, and pretty much all of this takes place in C.

So if you're running into memory issues, you'll have to do this in sections, for example in two chunks:

%%memit
df['comb_key'] = ''
df.comb_key.update(df["cust_id"].iloc[:500000].astype(str) + '----' + df["tran_dt"].iloc[:500000].astype(str))
df.comb_key.update(df["cust_id"].iloc[500000:].astype(str) + '----' + df["tran_dt"].iloc[500000:].astype(str))

# peak memory: 253.06 MiB, increment: 63.25 MiB
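The two-section update above generalizes to any number of chunks. A sketch of that idea (the function name and chunk count are illustrative, not from the original answer, and `dt.strftime` is used to pin the date format):

```python
import numpy as np
import pandas as pd

def add_comb_key_chunked(df, n_chunks=4, sep='----'):
    # Build the key chunk by chunk, so only one chunk's worth of
    # temporary string Series is alive at any time.
    df['comb_key'] = ''
    for idx in np.array_split(df.index, n_chunks):
        df.loc[idx, 'comb_key'] = (
            df.loc[idx, 'cust_id'].astype(str)
            + sep
            + df.loc[idx, 'tran_dt'].dt.strftime('%Y-%m-%d')
        )
    return df

demo = pd.DataFrame({
    'cust_id': [111111, 222222, 333333],
    'tran_dt': pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03']),
})
add_comb_key_chunked(demo, n_chunks=2)
print(demo['comb_key'].tolist())
```

More chunks trade a little extra indexing overhead for a smaller peak of temporary strings.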

Note that the new column consumes 65 MB of memory:

df.memory_usage(deep=True)

# Index             72
# cust_id      8000000
# tran_dt      8000000
# comb_key    65000000
# dtype: int64
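That per-row cost can be sanity-checked by hand: `memory_usage(deep=True)` counts the 8-byte pointer stored in the object column plus `sys.getsizeof` of each Python string, and the exact string-object overhead varies with the Python version (which is why the figure here may differ slightly from the 65 bytes/row above):

```python
import sys

key = '123456----2020-01-01'   # a typical 20-character combined key
# Cost per row = the string object itself plus the 8-byte pointer
# held in the object column; on CPython 3 an ASCII str costs
# roughly 49 bytes of overhead plus 1 byte per character.
per_row = sys.getsizeof(key) + 8
print(per_row)   # tens of bytes per row, versus 8 bytes for the original int64
```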

So make sure that you have enough memory to store the result in the first place! It's also worth noting that if this operation strains your memory but the result just barely fits, chances are you won't have enough memory left to do further work on the dataframe either.
