Using pandas to count values for each time bin for large dataframe efficiently

I have multiple large dataframes (ca. 3GB csv files with ca. 150 million rows each) that contain Unix-style timestamps and randomly generated observation ids. Each observation can/will occur multiple times at different points in time. They look like this:

    time_utc    obs_id
0   1564617600  aabthssv
1   1564617601  vvvx7ths
2   1564618501  optnhfsa
3   1564619678  aabthssv
4   1564619998  abtzsnwe
         ...

To analyse how the observations develop over time, I now want to get a data frame with one column per observation id and one row per time bin of adjustable size (e.g. 1 hour), like this:

time_bin aabthssv vvvx7ths optnhfsa  ...
1               1        1        1
2               1        0        0
               ...

I have tried to do this by creating a numpy array of timestamp start points and then appending the value_counts of all rows falling into each bin to a new, empty dataframe. This runs into a MemoryError. I have tried more pre-cleaning, but even reducing the raw data by a third (to 2GB, 100 million rows) still produces memory errors.

import numpy as np
import pandas as pd

SLICE_SIZE = 3600  # example value of 1 h
slice_startpoints = np.arange(START_TIME, END_TIME + 1 - SLICE_SIZE, SLICE_SIZE)
agg_df = pd.DataFrame()

for timeslice in slice_startpoints:
    # count obs_id occurrences within the current time slice
    temp_slice = raw_data[raw_data['time_utc'].between(timeslice, timeslice + SLICE_SIZE)]
    temp_counts = temp_slice['obs_id'].value_counts()
    agg_df = agg_df.append(temp_counts)
    # drop the rows already counted to shrink raw_data
    temp_index = raw_data[raw_data['time_utc'].between(timeslice, timeslice + SLICE_SIZE)].index
    raw_data.drop(temp_index, inplace=True)

Is there a way to do this more efficiently, or rather in a way that works at all?

Edit: Based on the suggestion to use crosstab, I found an efficient way to do it. The file size did not need to be reduced. The following code produced exactly the result I was looking for.

# assign each row to a time bin, then pivot the per-bin counts into one column per obs_id
df['binned'] = pd.cut(df['time_utc'], bins=slice_startpoints,
                      include_lowest=True, labels=slice_startpoints[1:])
df.groupby('binned')['obs_id'].value_counts().unstack().fillna(0)
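
For reference, here is a minimal, self-contained sketch of that approach on a few synthetic rows shaped like the sample above; the half-hour bin width and deriving the bin edges from the data itself are assumptions made only for illustration:

import numpy as np
import pandas as pd

# tiny synthetic frame in the same shape as the sample data above
df = pd.DataFrame({
    'time_utc': [1564617600, 1564617601, 1564618501, 1564619678, 1564619998],
    'obs_id':   ['aabthssv', 'vvvx7ths', 'optnhfsa', 'aabthssv', 'abtzsnwe'],
})

SLICE_SIZE = 1800  # half-hour bins so the tiny sample spans more than one bin
slice_startpoints = np.arange(df['time_utc'].min(),
                              df['time_utc'].max() + SLICE_SIZE, SLICE_SIZE)

# assign each row to a bin, then pivot the per-bin counts into one column per obs_id
df['binned'] = pd.cut(df['time_utc'], bins=slice_startpoints,
                      include_lowest=True, labels=slice_startpoints[1:])
print(df.groupby('binned')['obs_id'].value_counts().unstack().fillna(0))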

You can try cut with crosstab:

import numpy as np
import pandas as pd

# bin edges from START_TIME to END_TIME in SLICE_SIZE steps
slice_startpoints = np.arange(START_TIME, END_TIME+SLICE_SIZE, SLICE_SIZE)
print(slice_startpoints)

df['binned'] = pd.cut(df['time_utc'],
                      bins=slice_startpoints,
                      include_lowest=True,
                      labels=slice_startpoints[1:])

# one row per bin, one column per obs_id, cells hold the counts
df = pd.crosstab(df['binned'], df['obs_id'])
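
The bin labels produced above are the Unix-second end points of each slice, so if you want human-readable row labels you can convert the crosstab's index to timestamps afterwards. A small optional sketch, assuming the labels are plain integers as constructed above:

# optional: turn the Unix-second bin labels into readable timestamps
df.index = pd.to_datetime(df.index.astype('int64'), unit='s')
print(df)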

You could read a large .csv with the 'chunk' iterator and perform the calculation on each chunk instead of on the entire .csv file. The chunksize defines the number of rows in a single chunk, which gives you a good handle to control memory usage. The downside is that you will have to add some logic to merge the results of the chunks; a sketch of such merging follows the snippet below.

import pandas as pd

# read the csv in chunks of 1000 rows instead of loading everything at once
df_chunk = pd.read_csv('file.csv', chunksize=1000)
for chunk in df_chunk:
    print(chunk)
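
To combine the chunked reading with the binning approach above, one way to merge the per-chunk results is to sum the aligned count tables. A rough sketch, assuming 'file.csv' holds the time_utc/obs_id columns from the question and that START_TIME and END_TIME are known up front:

import numpy as np
import pandas as pd

SLICE_SIZE = 3600
# START_TIME / END_TIME assumed known up front (e.g. from a first pass over the file)
slice_startpoints = np.arange(START_TIME, END_TIME + SLICE_SIZE, SLICE_SIZE)

total = None
for chunk in pd.read_csv('file.csv', usecols=['time_utc', 'obs_id'], chunksize=1_000_000):
    binned = pd.cut(chunk['time_utc'], bins=slice_startpoints,
                    include_lowest=True, labels=slice_startpoints[1:])
    counts = pd.crosstab(binned, chunk['obs_id'])
    # align on bin/obs_id and sum, treating labels missing from one side as 0
    total = counts if total is None else total.add(counts, fill_value=0)

# cells never observed in any chunk remain NaN, so fill them with 0
print(total.fillna(0))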
