Faster way to create Pandas DataFrame from a Numpy nd array with descriptions?

I would like to transform a numpy nd array with dimension descriptions into a pandas DataFrame. The following solution works, but seems a bit slow for 360000 rows (1.5 s on my machine; your results may differ).

import pandas as pd
import numpy as np
from itertools import product
import time

# preparation of data
nd_data = np.random.random((5, 3, 100, 10, 4, 6))
dimension_descriptions = {
    'floaty': [0.1,0.2,0.3,0.4,0.5],
    'animal': ['ducks', 'horses', 'elephants'],
    'ramp': range(100),
    'another_ramp': range(10),
    'interesting number': [12, 15, 29, 42],
    'because': ['why', 'is', 'six', 'afraid', 'of', 'seven']
}

t_start = time.time()
# create dataframe from list of dictionaries containing data and permuted descriptions
df = pd.DataFrame([{**{'data': data}, **dict(zip(dimension_descriptions.keys(), permuted_description))}
                   for data, permuted_description in zip(nd_data.flatten(), product(*dimension_descriptions.values()))])
print(f'elapsed time: {time.time()- t_start:.1f}s')

Is there a faster way to achieve the same result?

On my machine, I put the original way to create the df in a function and timed it.

def create_df1(nd_data, dimension_descriptions):
    return pd.DataFrame([{**{'data': data}, **dict(zip(dimension_descriptions.keys(), permuted_description))}
                   for data, permuted_description in zip(nd_data.flatten(), product(*dimension_descriptions.values()))])

%timeit create_df1(nd_data, dimension_descriptions)
991 ms ± 37.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
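
The %timeit lines here are IPython magics. Outside IPython or a notebook, a roughly comparable measurement can be sketched with the standard timeit module. The small array and descriptions below are illustrative stand-ins, not the 360000-row case, so the numbers will not match the ones quoted in this answer.

```python
import timeit
from itertools import product

import numpy as np
import pandas as pd


def create_df1(nd_data, dimension_descriptions):
    # Same approach as in the question: one dict per row.
    return pd.DataFrame([{**{'data': data}, **dict(zip(dimension_descriptions.keys(), permuted_description))}
                         for data, permuted_description in zip(nd_data.flatten(), product(*dimension_descriptions.values()))])


# Illustrative small input; swap in the real nd_data to reproduce the timings.
nd_data = np.random.random((2, 3, 4))
dimension_descriptions = {
    'animal': ['ducks', 'horses'],
    'ramp': range(3),
    'interesting number': [12, 15, 29, 42],
}

# 7 repeats of 1 call each, mirroring "7 runs, 1 loop each".
timings = timeit.repeat(lambda: create_df1(nd_data, dimension_descriptions), repeat=7, number=1)
print(f'{np.mean(timings) * 1e3:.2f} ms ± {np.std(timings) * 1e3:.2f} ms per loop')
```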

You can avoid creating a temporary dict and merging it into a new dict for every row by building the DataFrame from the permuted descriptions first and then assigning the flattened nd_data as a column. This gives a slight boost in speed.

def create_df2(nd_data, dimension_descriptions):
    df = pd.DataFrame([dict(zip(dimension_descriptions.keys(), permuted_description))
                       for permuted_description in product(*dimension_descriptions.values())])
    df["data"] = nd_data.flatten()
    return df

%timeit create_df2(nd_data, dimension_descriptions)
822 ms ± 42.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

If you need the data column to be the first column in the dataframe, you can use df.insert(0, "data", nd_data.flatten()) instead, which gets similar speed results on my machine.
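
A minimal sketch of that df.insert variant follows. The function name create_df2_data_first and the small 2 x 3 input are made up for illustration and were not part of the timings above.

```python
from itertools import product

import numpy as np
import pandas as pd


def create_df2_data_first(nd_data, dimension_descriptions):
    # Build the description columns first, one dict per row.
    df = pd.DataFrame([dict(zip(dimension_descriptions.keys(), permuted_description))
                       for permuted_description in product(*dimension_descriptions.values())])
    # Insert the flattened values at position 0 so "data" becomes the first column.
    df.insert(0, "data", nd_data.flatten())
    return df


# Small illustrative input (2 x 3), not the 360000-row case from the question.
small_data = np.arange(6, dtype=float).reshape(2, 3)
small_descriptions = {"animal": ["ducks", "horses"], "ramp": range(3)}
df_small = create_df2_data_first(small_data, small_descriptions)
print(df_small.columns.tolist())
```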

It also seems wasteful to create a dict with the same column names for every row. Pandas offers a way to avoid this by allowing you to pass in the list of columns as a separate argument, and you can pass the data in as a list of rows instead (here, the tuples produced by product). This saves a lot of time.

def create_df3(nd_data, dimension_descriptions):
    df = pd.DataFrame(list(product(*dimension_descriptions.values())), columns=dimension_descriptions.keys())
    df["data"] = nd_data.flatten()
    return df

%timeit create_df3(nd_data, dimension_descriptions)
281 ms ± 9.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
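
One further option, which was not timed for this answer, is to let pandas build the Cartesian product itself with pd.MultiIndex.from_product and then turn the index levels into ordinary columns with reset_index. The name create_df4 is hypothetical, and whether this is actually faster depends on your pandas version and data, so treat it as a sketch to measure with %timeit rather than a guaranteed improvement.

```python
import numpy as np
import pandas as pd


def create_df4(nd_data, dimension_descriptions):
    # Hypothetical variant: pandas builds the product of the descriptions.
    # The first level varies slowest, which matches the C-order of nd_data.ravel().
    index = pd.MultiIndex.from_product(
        list(dimension_descriptions.values()),
        names=list(dimension_descriptions.keys()),
    )
    df = pd.DataFrame({"data": nd_data.ravel()}, index=index)
    # Move the index levels back into regular columns; they end up before "data".
    return df.reset_index()


# Small illustrative input (2 x 3), not the 360000-row case from the question.
small_data = np.arange(6, dtype=float).reshape(2, 3)
small_descriptions = {"animal": ["ducks", "horses"], "ramp": range(3)}
df_small = create_df4(small_data, small_descriptions)
print(df_small)
```

As in create_df3, the description columns come first and data last; this relies on nd_data being laid out with its axes in the same order as the keys of dimension_descriptions.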
