
Python: Sharing large dataframes between multiprocessing processes

I am new to Python and multiprocessing. I have to parse two large XML files (around 6 GB each) into two dataframes. Both files can be processed independently.

As far as I have learnt in Python, I can do that with multiprocessing: process 1 parses the xml1 file and loads it into a dataframe, and process 2 parses the xml2 file and loads it into a dataframe.

Now I want to use the dataframe generated by process 1 in process 2. Can anyone tell me the best way to implement this? My main concern is sharing dataframes between processes.

Regards, Vipul

You are using multiple processes mainly because you want to read the data in parallel. Once it has been read in, there doesn't seem to be much reason to continue with two processes; i.e. your processes should read the data, then terminate and let the main process continue.
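That read-then-terminate model can be sketched with a process pool. This is a minimal sketch, not the answer's actual code: the file names are placeholders, and `parse_xml` is a hypothetical stand-in for the real XML parsing.

```python
import multiprocessing as mp
import pandas as pd

def parse_xml(path):
    # Hypothetical stand-in for parsing a real 6 GB XML file;
    # here it just builds a tiny dummy frame keyed by the path.
    return pd.DataFrame({"source": [path], "rows": [1]})

if __name__ == "__main__":
    # Two worker processes parse the two files in parallel; each resulting
    # dataframe is pickled back into the main process when map() returns.
    with mp.Pool(processes=2) as pool:
        df_1, df_2 = pool.map(parse_xml, ["xml1.xml", "xml2.xml"])
    # Both frames now live in the main process; no cross-process sharing needed.
    print(df_1)
    print(df_2)
```

Note that returning a dataframe from a worker means pickling it back to the parent, which is costly for frames this large; that copy cost is part of why the workers should finish and hand everything to the main process once, rather than keep exchanging data.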

I would, however, recommend using multi-threading instead of multi-processing. The differences are not always noticeable, but multi-threading will make it simpler to share global variables between your main thread and the subthreads (I'll explain this below). Another advantage of multi-threading is that a single failing thread doesn't bring down the whole application, which is not the case with multi-processing. See more here: http://net-informations.com/python/iq/multi.htm

It takes some time to get used to how parallelism works in Python, and one of the main things you have to pay attention to is ensuring that what you are doing is thread-safe. Usually the mechanism for passing data to and from a thread is a queue, which ensures that only one thread accesses a given object at any given time.
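The queue-based hand-off mentioned above can be sketched as follows; `load_frame` and the `"xml1"`/`"xml2"` labels are illustrative placeholders for the real parsing work, assuming only the standard-library `queue.Queue`, which is thread-safe.

```python
import threading
import queue
import pandas as pd

def load_frame(name, out_q):
    # Stand-in for parsing one XML file; the worker hands its result
    # to the main thread through the thread-safe queue.
    df = pd.DataFrame({"file": [name]})
    out_q.put((name, df))

out_q = queue.Queue()
workers = [threading.Thread(target=load_frame, args=(n, out_q))
           for n in ("xml1", "xml2")]
for t in workers:
    t.start()
for t in workers:
    t.join()

# Both puts are guaranteed to have happened after join(), so get() won't block.
frames = dict(out_q.get() for _ in workers)
```

Because only `queue.Queue` is touched from multiple threads, no extra locking is needed around the dataframes themselves.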

That being said, in your simple example you could simply define two global variables and start two threads that each read data into one of these globals (i.e. no variable is shared across threads). You also have to tell your main thread to wait until both threads have completed before continuing, as otherwise the main thread could try to access the data while the subthreads are still working on it. (Again, normally you would adopt a queue-based strategy to avoid this problem, but you don't necessarily need that here.)

Here is some example code:

import threading
import pandas as pd
import time

def get_df_1():
    # declare df_1 as global so the assignment below rebinds the module-level name
    global df_1 

    # read in your xml file here (for the example I simply create some dummy data)
    data = [['tom', 10], ['nick', 15], ['juli', 14]]
    df_1 = pd.DataFrame(data, columns=['Name', 'Age'])

    # sleep five seconds to simulate real parsing time (illustration only)
    time.sleep(5)
    print("df_1 fetched")

def get_df_2():
    global df_2
    data = [['tom', 176], ['nick', 182], ['juli', 167]]
    df_2 = pd.DataFrame(data, columns=['Name', 'Height'])
    time.sleep(5)
    print("df_2 fetched")

df_1 = None
df_2 = None

#define threads
t1 = threading.Thread(target=get_df_1)
t2 = threading.Thread(target=get_df_2)

# start threads
t1.start()
t2.start()

#this will print immediately
print("Threads have been started")

# wait until threads finish
t1.join()
t2.join()

#this will only print after the threads are done
print("Threads have finished")

print(df_1)
print(df_2)
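As a variant of the example above, the same result can be had without global variables by using `concurrent.futures` from the standard library; the dummy data and the `get_df` helper below are illustrative, not part of the original answer.

```python
from concurrent.futures import ThreadPoolExecutor
import pandas as pd

def get_df(data, columns):
    # Stand-in for the real XML-parsing work; returns the frame directly
    # instead of writing to a global.
    return pd.DataFrame(data, columns=columns)

with ThreadPoolExecutor(max_workers=2) as pool:
    fut_1 = pool.submit(get_df, [["tom", 10], ["nick", 15]], ["Name", "Age"])
    fut_2 = pool.submit(get_df, [["tom", 176], ["nick", 182]], ["Name", "Height"])
    df_1 = fut_1.result()  # blocks until the first worker finishes
    df_2 = fut_2.result()

print(df_1)
print(df_2)
```

`Future.result()` plays the role of `join()` here, and returning values from the workers removes the need for the `global` declarations.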

