
Python shared read memory

I'm working with a data set that is ~8 GB, and I'm using scikit-learn to train various ML models on it. The data set is basically a list of 1D vectors of ints.

How can I make the data set available to multiple Python processes, or how can I encode it so that I can use multiprocessing's classes? I've been reading about ctypes and the multiprocessing documentation, but I'm still confused. I only need the data to be readable by every process so I can train the models with it.

Do I need to have the shared multiprocessing variables as ctypes?

How can I represent the dataset as ctypes?

I am assuming you are able to load the whole dataset into RAM in a numpy array, and that you are working on Linux or a Mac. (If you are on Windows or you can't fit the array into RAM, then you should probably copy the array to a file on disk and use numpy.memmap to access it. Your computer will cache the data from disk into RAM as well as it can, and those caches will be shared between processes, so it's not a terrible solution.)
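A minimal sketch of that memmap fallback, assuming the array has already been dumped to a raw file; the file name, dtype and shape below are placeholders, not from the original question:

import numpy as np

# one-time step (hypothetical): np.asarray(dataset, dtype=np.float32).tofile("data.dat")

# in every process, map the file instead of loading it; mode="r" is read-only,
# and the OS page cache backing the mapping is shared between processes
data = np.memmap("data.dat", dtype=np.float32, mode="r", shape=(1_000_000, 100))

batch = data[:10_000]   # slices behave like ordinary ndarrays for scikit-learn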

Under the assumptions above, if you need read-only access to the dataset in other processes created via multiprocessing, you can simply create the dataset and then launch the other processes. They will have read-only access to data from the original namespace. They can alter data from the original namespace, but those changes won't be visible to other processes (the memory manager copies each page of memory they alter into the child's own memory map, copy-on-write).
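A minimal sketch of that read-only pattern, assuming a fork start method (the default on Linux); the array here is a small random stand-in for the real dataset and the worker function name is illustrative:

import multiprocessing as mp
import numpy as np

dataset = np.random.rand(1000, 100)   # stand-in for the real 8 GB array

def train(model_id):
    # with fork, the child sees the parent's `dataset` without copying it,
    # as long as it only reads it (copy-on-write)
    print(model_id, dataset.mean())

if __name__ == "__main__":
    procs = [mp.Process(target=train, args=(i,)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()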

If your other processes need to alter the original dataset and make those changes visible to the parent process or other processes, you could use something like this:

import multiprocessing
import numpy as np

# create your big dataset
big_data = np.zeros((3, 3))

# create a shared-memory wrapper for big_data's underlying data
# (it doesn't matter what datatype we use, and 'c' is easiest)
# I think if lock=True, you get a serialized object, which you don't want.
# Note: you will need to setup your own method to synchronize access to big_data.
buf = multiprocessing.Array('c', big_data.data, lock=False)

# at this point, buf and big_data.data point to the same block of memory, 
# (try looking at id(buf[0]) and id(big_data.data[0])) but for some reason
# changes aren't propagated between them unless you do the following:
big_data.data = buf

# now you can update big_data from any process:
def add_one_direct():
    big_data[:] = big_data + 1

def add_one(a):
    # People say this won't work, since Process() will pickle the argument.
    # But in my experience Process() seems to pass the argument via shared
    # memory, so it works OK.
    a[:] = a+1

print "starting value:"
print big_data

p = multiprocessing.Process(target=add_one_direct)
p.start()
p.join()

print "after add_one_direct():"
print big_data

p = multiprocessing.Process(target=add_one, args=(big_data,))
p.start()
p.join()

print "after add_one():"
print big_data
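The snippet above relies on an older NumPy; newer releases no longer allow assigning to ndarray.data. A sketch of the same idea for current Python/NumPy (assuming a fork start method, the default on Linux; with spawn the argument would be pickled and copied) is to allocate the shared buffer first and view it with numpy.frombuffer:

import multiprocessing
import numpy as np

shape = (3, 3)

# allocate the shared block first, then view it as a numpy array
# (lock=False as before, so synchronize access yourself if needed)
buf = multiprocessing.Array('d', int(np.prod(shape)), lock=False)
big_data = np.frombuffer(buf, dtype=np.float64).reshape(shape)
big_data[:] = 0.0   # fill with the real dataset here

def add_one(a):
    a[:] = a + 1

if __name__ == "__main__":
    p = multiprocessing.Process(target=add_one, args=(big_data,))
    p.start()
    p.join()
    print(big_data)   # the parent sees the child's update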

Might be a duplicate of Share Large, Read-Only Numpy Array Between Multiprocessing Processes.

You could convert your dataset from its current representation to a new numpy memmap object and use it from every process. It won't be very fast, though: it only gives you the abstraction of working with an array in RAM, while in reality it is a file on disk, partially cached in RAM. So you should prefer scikit-learn algorithms that offer partial_fit methods, and use those.

https://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html
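A minimal sketch of that incremental route, assuming the features and labels have already been written to raw files and using SGDClassifier as one example of an estimator with partial_fit; the file names, shapes and dtypes are placeholders:

import numpy as np
from sklearn.linear_model import SGDClassifier

n_samples, n_features = 1_000_000, 100
X = np.memmap("features.dat", dtype=np.float32, mode="r", shape=(n_samples, n_features))
y = np.memmap("labels.dat", dtype=np.int64, mode="r", shape=(n_samples,))

clf = SGDClassifier()
classes = np.array([0, 1])   # the full set of labels, assumed binary here
for start in range(0, n_samples, 10_000):
    stop = start + 10_000
    clf.partial_fit(X[start:stop], y[start:stop], classes=classes)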

Actually joblib (which scikit-learn uses for parallelization) automatically converts your dataset to a memmap representation so it can be used from different processes (if it's big enough, of course).
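A minimal sketch of leaning on that behaviour directly, assuming joblib is installed; max_nbytes is the size threshold above which joblib memmaps array arguments, lowered here only so this small example triggers it, and the data and function are illustrative:

import numpy as np
from joblib import Parallel, delayed

data = np.random.rand(10_000, 100)    # stand-in for the real dataset

def score_chunk(arr, start, stop):
    # for sufficiently large inputs, joblib hands the workers a read-only memmap of `arr`
    return arr[start:stop].sum()

results = Parallel(n_jobs=4, max_nbytes="1M")(
    delayed(score_chunk)(data, i, i + 1_000) for i in range(0, 10_000, 1_000)
)
print(sum(results))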
