
Fast way of populating a very large dataframe with values

I have a very large dataframe which has 100 years of dates as column headers (i.e. ~36,500 columns) and 100 years of dates as the index (i.e. ~36,500 rows). I have a function that calculates a value for each element of the dataframe, which will need to be run 36500^2 times.

OK, the problem is not the function, which is quite fast, but rather the assignment of values to the dataframe. It takes about 1 second per 6 assignments, even if I assign a constant this way. Obviously I'm being pretty thick, as you can tell:

for i, row in df_mBase.iterrows():
    for idx, val in enumerate(row):
        df_mBase.ix[i][idx] = 1
    print(i)

Ordinarily in C/Java I would simply loop through a 36500x36500 double loop and access the preallocated memory directly via indexing, which takes constant time with virtually no overhead. But this appears not to be an option in Python?

What would be the fastest way to store this data in a dataframe? Pythonic or not, I'm after speed only; I don't care about elegance.

There are a few reasons why this might be slow:

.ix

.ix is a magic-type indexer which can do both label-based and positional indexing, but it is being deprecated in favour of the stricter .loc for label-based and .iloc for position-based indexing. I assume .ix does a lot of magic behind the scenes to figure out whether label-based or position-based indexing is needed.
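For illustration, a minimal sketch of the two stricter indexers on a small made-up frame, where each does exactly one kind of lookup:

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20], 'b': [30, 40]}, index=['x', 'y'])

print(df.loc['x', 'b'])   # label-based: row label 'x', column label 'b' -> 30
print(df.iloc[0, 1])      # position-based: row 0, column 1 -> 30
```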

.iterrows

returns a (new?) Series for each row. Column-based iteration might be faster, using .iteritems to iterate over the columns.
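A quick sketch of column-wise iteration on a small made-up frame; note that .iteritems was later renamed to .items in newer pandas versions, which is what is used here:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# each iteration yields the column label and the whole column as a Series
for name, column in df.items():
    print(name, list(column))
```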

[][]

df_mBase.ix[i][idx] returns a Series, and then takes element idx from it, which gets assigned the value 1.

df_mBase.loc[i, idx] = 1

should improve this
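The difference can be seen on a small made-up frame: the chained form `df.loc['x']['a'] = 1` first materializes an intermediate Series (and may silently modify a copy), while the single .loc call writes straight into the frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 0]}, index=['x', 'y'])

# one call, one lookup, writes directly into df
df.loc['x', 'a'] = 1

print(df.loc['x', 'a'])  # 1
```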

benchmarking

import pandas as pd

import itertools
import timeit


def generate_dummy_data(years=1):
    period = pd.Timedelta(365 * years, unit='D')

    start = pd.Timestamp('19000101')
    offset = pd.Timedelta(10, unit='h')

    dates1 = pd.date_range(start=start, end=start + period, freq='d')
    dates2 = pd.date_range(start=start + offset, end=start + offset + period, freq='d')

    return pd.DataFrame(index=dates1, columns=dates2, dtype=float)


def assign_original(df_orig):
    df_new = df_orig.copy(deep=True)
    for i, row in df_new.iterrows():
        for idx, val in enumerate(row):
            df_new.ix[i][idx] = 1
    return df_new


def assign_other(df_orig):
    df_new = df_orig.copy(deep=True)
    for (i, idx_i), (j, idx_j) in itertools.product(enumerate(df_new.index), enumerate(df_new.columns)):
        df_new[idx_j][idx_i] = 1
    return df_new


def assign_loc(df_orig):
    df_new = df_orig.copy(deep=True)
    for i, row in df_new.iterrows():
        for idx, val in enumerate(row):
            df_new.loc[i][idx] = 1
    return df_new


def assign_loc_product(df_orig):
    df_new = df_orig.copy(deep=True)
    for i, j in itertools.product(df_new.index, df_new.columns):
        df_new.loc[i, j] = 1
    return df_new


def assign_iloc_product(df_orig):
    df_new = df_orig.copy(deep=True)
    for (i, idx_i), (j, idx_j) in itertools.product(enumerate(df_new.index), enumerate(df_new.columns)):
        df_new.iloc[i, j] = 1
    return df_new


def assign_iloc_product_range(df_orig):
    df_new = df_orig.copy(deep=True)
    for i, j in itertools.product(range(len(df_new.index)), range(len(df_new.columns))):
        df_new.iloc[i, j] = 1
    return df_new


def assign_index(df_orig):
    df_new = df_orig.copy(deep=True)
    for (i, idx_i), (j, idx_j) in itertools.product(enumerate(df_new.index), enumerate(df_new.columns)):
        df_new[idx_j][idx_i] = 1
    return df_new


def assign_column(df_orig):
    df_new = df_orig.copy(deep=True)
    for c, column in df_new.iteritems():
        for idx, val in enumerate(column):
            df_new[c][idx] = 1
    return df_new


def assign_column2(df_orig):
    df_new = df_orig.copy(deep=True)
    for c, column in df_new.iteritems():
        for idx, val in enumerate(column):
            column[idx] = 1
    return df_new


def assign_itertuples(df_orig):
    df_new = df_orig.copy(deep=True)
    for i, row in enumerate(df_new.itertuples()):
        for idx, val in enumerate(row[1:]):
            df_new.iloc[i, idx] = 1
    return df_new


def assign_applymap(df_orig):
    df_new = df_orig.copy(deep=True)
    df_new = df_new.applymap(lambda x: 1)
    return df_new


def assign_vectorized(df_orig):
    df_new = df_orig.copy(deep=True)
    for i in df_new:
        df_new[i] = 1
    return df_new


methods = [
    ('assign_original', assign_original),
    ('assign_loc', assign_loc),
    ('assign_loc_product', assign_loc_product),
    ('assign_iloc_product', assign_iloc_product),
    ('assign_iloc_product_range', assign_iloc_product_range),
    ('assign_index', assign_index),
    ('assign_column', assign_column),
    ('assign_column2', assign_column2),
    ('assign_itertuples', assign_itertuples),
    ('assign_vectorized', assign_vectorized),
    ('assign_applymap', assign_applymap),
]


def get_timings(period=1, methods=()):
    print('=' * 10)
    print(f'generating timings for a period of {period} years')
    df_orig = generate_dummy_data(period)
    df_orig.info(verbose=False)
    repeats = 1
    for method_name, method in methods:
        result = pd.DataFrame()

        def my_method():
            """
            This looks a bit icky, but is the best way I found to make sure the values are really changed,
            and not just on a copy of a DataFrame
            """
            nonlocal result
            result = method(df_orig)

        t = timeit.Timer(my_method).timeit(number=repeats)

        assert result.iloc[3, 3] == 1

        print(f'{method_name} took {t / repeats} seconds')
        yield (method_name, {'time': t, 'memory': result.memory_usage(deep=True).sum()/1024})


periods = [0.03, 0.1, 0.3, 1, 3]


results = {period: dict(get_timings(period, methods)) for period in periods}

print(results)

timings_dict = {period: {k: v['time'] for k, v in result.items()} for period, result in results.items()}

df = pd.DataFrame.from_dict(timings_dict)
df.transpose().plot(logy=True).figure.savefig('test.png')
Timing results in seconds (columns are the periods in years):

                                   0.03       0.1       0.3        1.0         3.0
    assign_applymap            0.001989  0.009862  0.018018   0.105569    0.549511
    assign_vectorized          0.002974  0.008428  0.035994   0.162565    3.810138
    assign_index               0.013717  0.137134  1.288852  14.190128  111.102662
    assign_column2             0.026260  0.186588  1.664345  19.204453  143.103077
    assign_column              0.016811  0.212158  1.838733  21.053627  153.827845
    assign_itertuples          0.025130  0.249886  2.125968  24.639593  185.975111
    assign_iloc_product_range  0.026982  0.247069  2.199019  23.902244  186.548500
    assign_iloc_product        0.021225  0.233454  2.437183  25.143673  218.849143
    assign_loc_product         0.018743  0.290104  2.515379  32.778794  258.244436
    assign_loc                 0.029050  0.349551  2.822797  32.087433  294.052933
    assign_original            0.034315  0.337207  2.714154  30.361072  332.327008

Conclusion

[timing plot: runtime per method for each period, log scale]

If you can use vectorization, do so. Depending on the calculation, you can use another method. If you only need the value that is used, applymap seems fastest. If you need the index and/or column too, work with the columns.

If you can't vectorize, df[column][index] = x works fastest, with iterating over the columns with df.iteritems() a close second.
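For the constant-fill case in the question, the per-column loop of assign_vectorized can even be collapsed into a single whole-frame assignment; a small sketch (using a tiny stand-in frame):

```python
import pandas as pd

# a small stand-in for the 36500x36500 frame from the question
df = pd.DataFrame(index=range(4), columns=range(4), dtype=float)

df[:] = 1  # broadcast the scalar into every cell in one operation
```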

You should create the data structure either in native Python or in numpy and pass the data to the DataFrame constructor. If your function can be written using numpy's functions/operations, then you can use the vectorized nature of numpy to avoid looping over all indices.

Here is an example with a made-up function:

import numpy as np
import pandas as pd
import datetime as dt
import dateutil as du

dates = [dt.date(2017, 1, 1) - du.relativedelta.relativedelta(days=i) for i in range(36500)]
data = np.zeros((36500,36500), dtype=np.uint8)

def my_func(i, j):
    return (sum(divmod(i,j)) - sum(divmod(j,i))) % 255

for i in range(1, 36500):
    for j in range(1, 36500):
        data[i,j] = my_func(i,j)

df = pd.DataFrame(data, columns=dates, index=dates)

df.head(5)
#returns:

            2017-08-21  2017-08-20  2017-08-19  2017-08-18  2017-08-17  \
2017-08-21           0           0           0           0           0
2017-08-20           0           0         254         253         252
2017-08-19           0           1           0           0           0
2017-08-18           0           2           0           0           1
2017-08-17           0           3           0         254           0

               ...      1917-09-19  1917-09-18  1917-09-17  1917-09-16
2017-08-21     ...               0           0           0           0
2017-08-20     ...             225         224         223         222
2017-08-19     ...             114         113         113         112
2017-08-18     ...              77          76          77          76
2017-08-17     ...              60          59          58          57
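Since this particular my_func is built from integer division and modulo, the double loop itself can also be replaced by numpy broadcasting; a sketch of that idea (using a smaller n than 36500 so it runs instantly, and relying on sum(divmod(i, j)) == i // j + i % j):

```python
import numpy as np

n = 200  # small stand-in for 36500
i = np.arange(1, n)[:, None]  # column vector of row indices
j = np.arange(1, n)[None, :]  # row vector of column indices

# computes my_func(i, j) for every (i, j) pair in one broadcast expression
data = ((i // j + i % j) - (j // i + j % i)) % 255
```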
