
Speeding up operations on large arrays & datasets (Pandas slow, Numpy better, further improvements?)

I have a large dataset comprising millions of rows and around 6 columns. The data is currently in a Pandas dataframe and I'm looking for the fastest way to operate on it. For example, let's say I want to drop all the rows where the value in one column is "1".

Here's my minimal working example:

import numpy as np
import pandas as pd

# Create dummy data arrays and a pandas DataFrame
array_size = int(5e6)
array1 = np.random.rand(array_size)
array2 = np.random.rand(array_size)
array3 = np.random.rand(array_size)
array_condition = np.random.randint(0, 3, size=array_size)

df = pd.DataFrame({'array_condition': array_condition, 'array1': array1, 'array2': array2, 'array3': array3})

def method1():
    # Look up the index of every row where the condition is 1, then drop those rows
    df_new = df.drop(df[df.array_condition == 1].index)

EDIT: As Henry Yik pointed out in the comments, a faster Pandas approach is this:

def method1b():
    df_new = df[df.array_condition != 1]

I believe that Pandas can be quite slow at this sort of thing, so I also implemented a method using numpy, processing each column as a separate array:

def method2():
    masking = array_condition != 1
    array1_new = array1[masking]
    array2_new = array2[masking]
    array3_new = array3[masking]
    array_condition_new = array_condition[masking]    

And the results:

%timeit method1()
625 ms ± 7.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit method1b()
158 ms ± 7.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit method2()
138 ms ± 3.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

So we do see a slight but significant performance boost using numpy. However, this comes at the cost of much less readable code (i.e. having to create a mask and apply it to each array). The method also doesn't scale well: if I have, say, 30 columns of data, I'll need many lines of code applying the mask to every array. Additionally, it would be useful to allow optional columns, but this method may fail when trying to operate on empty arrays.
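
For illustration, the masking can be generalized over a dict of named arrays (a minimal sketch; filter_columns is a made-up helper, not part of my code):

def filter_columns(columns, condition, drop_value=1):
    # columns: dict mapping name -> 1-D numpy array, all of equal length
    masking = condition != drop_value
    return {name: arr[masking] for name, arr in columns.items()}

new_cols = filter_columns({'array1': array1, 'array2': array2,
                           'array3': array3}, array_condition)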

Therefore, I have 2 questions:

1) Is there a cleaner / more flexible way to implement this in numpy?

2) Or better, is there a higher-performance method I could use here? e.g. JIT (numba?), Cython or something else?

PS: in practice, in-place operations can be used, replacing the old array with the new one once the data is dropped.
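
For example (a sketch), rebinding the old names to the masked arrays lets the original memory be reclaimed:

masking = array_condition != 1
array1 = array1[masking]                    # the old array becomes garbage-collectable
array_condition = array_condition[masking]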

You may find using numpy.where useful here. It converts a Boolean mask to array indices, making life much cheaper. Combining this with numpy.vstack allows for some memory-cheap operations:

def method3():
    # Indices of the rows to keep (condition != 1, i.e. drop the rows equal to 1)
    wh = np.where(array_condition != 1)
    return np.vstack(tuple(col[wh] for col in (array1, array2, array3)))
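
Note that np.vstack stacks the filtered columns as rows, so the result has shape (3, n_kept); transpose it if you need the data back in column layout.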

This gives the following timings:

>>> %timeit method2()
180 ms ± 6.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit method3()
96.9 ms ± 2.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Tuple unpacking keeps the operation fairly light on memory: by the time the object is vstack-ed back together, it is smaller. If you need to get your columns out of a DataFrame directly, the following code snippet may be useful:

def method3b():
    wh = np.where(array_condition != 1)
    col_names = ['array1', 'array2', 'array3']
    # Pull each column out of the DataFrame as a numpy array, then mask it
    return np.vstack(tuple(df[col_name].to_numpy()[wh]
                           for col_name in col_names))

This lets you grab columns by name from the DataFrame and mask them on the fly. The speed is about the same:

>>> %timeit method3b()
96.6 ms ± 3.09 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
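
If you ultimately need a DataFrame rather than a stacked array, the result can be transposed back (a small sketch building on method3b; method3c is a made-up name):

def method3c():
    stacked = method3b()                    # shape (3, n_kept)
    return pd.DataFrame(stacked.T, columns=['array1', 'array2', 'array3'])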

Enjoy!

Part 1: Pandas and (maybe) Numpy

Compare your method1b and method2:

  • method1b generates a DataFrame, which is probably what you want,
  • method2 generates a Numpy array, so to get a fully comparable result, you should subsequently generate a DataFrame from it.

So I changed your method2 to:

def method2():
    masking = array_condition != 1
    array1_new = array1[masking]
    array2_new = array2[masking]
    array3_new = array3[masking]
    array_condition_new = array_condition[masking]
    df_new = pd.DataFrame({'array_condition': array_condition_new,
        'array1': array1_new, 'array2': array2_new, 'array3': array3_new})

and then compared the execution times (using %timeit).

The result was that my (expanded) version of method2 took about 5% longer to execute than method1b (check on your own).

So my opinion is that, as far as a single operation is concerned, it is probably better to stay with Pandas.

But if you want to perform a couple of operations in sequence on your source DataFrame, and/or you are satisfied with the result as a Numpy array, it is worth it to:

  • Call arr = df.values to get the underlying Numpy array.
  • Perform all required operations on it using Numpy methods.
  • (Optionally) create a DataFrame from the final result, as sketched below.
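
A sketch of that three-step pipeline, using the df from the question:

arr = df.values                                 # 1. underlying Numpy array
arr = arr[arr[:, 0] != 1]                       # 2. drop rows where array_condition == 1
df_new = pd.DataFrame(arr, columns=df.columns)  # 3. (optionally) back to a DataFrame;
                                                #    note every column comes back as float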

I tried a Numpy version of method1b:

def method3():
    a = df.values
    # Column 0 is array_condition; keep the rows where it is not 1
    arr = a[a[:, 0] != 1]

but the execution time was about 40% longer.

The reason is probably that a Numpy array holds elements of a single type, so the array_condition column is coerced to float before the whole Numpy array is created, which takes some time.
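
You can see this coercion directly (a quick check, assuming the df from the question):

print(df['array_condition'].dtype)   # an integer dtype, e.g. int64
print(df.values.dtype)               # float64: one common dtype for all columns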

Part 2: Numpy and Numba

An alternative to consider is the Numba package, a Just-In-Time Python compiler.

I ran the following test:

I created a Numpy array (as a preliminary step):

a = df.values

The reason is that JIT-compiled methods are able to use Numpy methods and types, but not those of Pandas.

To perform the test, I used almost the same method as above, but with the @njit decorator (requires from numba import njit):

@njit
def method4():
    # 'a' is captured as a global; Numba freezes globals at compile time
    arr = a[a[:, 0] != 1]

This time:

  • The execution time was about 45% of the time for method1b.
  • But since a = df.values had been executed before the test loop, there are doubts whether this result is comparable with the earlier tests (a sketch addressing this follows below).
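
One way to reduce that doubt (my sketch, not part of the original test) is to pass the array in as an argument, so that the timed call can include the conversion:

from numba import njit

@njit
def method4b(a):
    # Same filtering, but 'a' arrives as a parameter instead of a frozen global
    return a[a[:, 0] != 1]

# %timeit method4b(df.values)   # now includes the df.values conversion cost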

Anyway, try Numba on your own; maybe it will be an interesting option for you.
