
Ray is much slower than both plain Python and multiprocessing

I load 130k JSON files.

I do this with plain Python:

import os
import json
import pandas as pd

path = "/my_path/"

filename_ending = '.json'


json_list = []

json_files = [file for file in os.listdir(f"{path}") if file.endswith(filename_ending)]

import time
start = time.time()

for jf in json_files:
    with open(f"{path}/{jf}", 'r') as f:

        json_data = json.load(f)

        json_list.append(json_data)

end = time.time()

and it takes 60 seconds.

I do this with multiprocessing:

import os
import json
import pandas as pd
from multiprocessing import Pool
import time

path = "/my_path/"

filename_ending = '.json'

json_files = [file for file in os.listdir(f"{path}") if file.endswith(filename_ending)]


def read_data(name):
    with open(f"/my_path/{name}", 'r') as f:
        json_data = json.load(f)

    return json_data


if __name__ == '__main__':

    start = time.time()

    pool = Pool(processes=os.cpu_count())                       
    x = pool.map(read_data, json_files)     

    end = time.time()

and it takes 53 seconds.

I do this with ray:

import os
import json
import pandas as pd
from multiprocessing import Pool
import time
import ray


path = "/my_path/"

filename_ending = '.json'

json_files = [file for file in os.listdir(f"{path}") if file.endswith(filename_ending)]

start = time.time()

ray.shutdown()
ray.init(num_cpus=os.cpu_count()-1)

@ray.remote    
def read_data(name):
    with open(f"/my_path/{name}", 'r') as f:
        json_data = json.load(f)

    return json_data

all_data = []
for jf in json_files:
    all_data.append(read_data.remote(jf))


final = ray.get(all_data)

end = time.time()

and it takes 146 seconds.

My question is: why does ray take so much more time?

Is it because:

1) ray is relatively slow for a relatively small amount of data?

2) I am doing something wrong in my code?

3) ray is not that useful?

I have never used ray, but I'm quite confident that my explanation should be right.

The original code does a simple JSON deserialisation. The work is mostly file IO with just a little bit of CPU. (JSON deserialisation is rather quick; that's one of the reasons JSON is a popular exchange format.)

Ray has to push the data from one process to another (over the network, if distributed across multiple machines). To do so it performs some serialisation/deserialisation of its own (it may be using pickle plus a robust TCP protocol to push parameters and collect results), and this overhead is probably bigger than the work the actual task takes.
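One common way to shrink that per-task overhead is to hand each remote task a batch of files instead of a single one, so the scheduling and serialisation cost is paid once per batch rather than once per file. A minimal sketch of the idea (the helper names are mine, and it is plain Python, so it runs without Ray installed):

```python
import json


def chunked(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def read_batch(paths):
    """Read a whole batch of JSON files in one task, amortising per-task overhead."""
    results = []
    for path in paths:
        with open(path, 'r') as f:
            results.append(json.load(f))
    return results
```

With Ray you would decorate `read_batch` with `@ray.remote` and submit one call per batch, e.g. `futures = [read_batch.remote(b) for b in chunked(json_files, 500)]`, turning ~130k tiny tasks into a few hundred larger ones.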

If you did some more calculation with the JSON data (anything more CPU intensive), you would be able to see a difference.
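To illustrate, here is a hypothetical variant of the worker that adds real computation after parsing; the recursive leaf count is just a stand-in for whatever CPU-bound processing your actual program might do:

```python
import json


def read_and_summarise(file_path):
    """Parse a JSON file, then do extra CPU work on the result
    (here: a crude recursive count of all leaf values)."""
    with open(file_path, 'r') as f:
        data = json.load(f)

    def count_leaves(node):
        if isinstance(node, dict):
            return sum(count_leaves(v) for v in node.values())
        if isinstance(node, list):
            return sum(count_leaves(v) for v in node)
        return 1  # a scalar leaf (number, string, bool, null)

    return count_leaves(data)
```

The heavier this per-file work is relative to the IO, the more the parallel versions should pull ahead of the sequential one.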

My guess is that your example problem is too simple, so ray's overhead exceeds the benefit of using multiple workers.

In other words: it costs more time/energy to distribute the tasks and collect the results than it takes to actually compute the result.

I would say that hypothesis 1) is probably the closest to the truth. Ray seems like a powerful library, but all you're doing is reading a bunch of files. Is your code just an example for the sake of benchmarking, or part of some larger program? If it is the latter, it might be interesting to have your benchmark code reflect that.
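If you do re-run the benchmarks, a small timing helper (hypothetical, stdlib only) keeps the measured region consistent across all three variants; note that `ray.init()` should sit outside the timed block unless you deliberately want to count startup cost:

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Print the wall-clock time of the enclosed block, using a monotonic clock."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {time.perf_counter() - start:.2f}s")
```

You would then wrap only the work itself, e.g. `with timed('multiprocessing'): json_list = pool.map(load_json_from_file, json_files)`, in each of the three scripts.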

It's nothing huge, but I tweaked your 3 programs so they should be at least slightly more efficient.


import os
import json


folder_path = "/my_path/"
filename_ending = '.json'

json_files = (os.path.join(folder_path, fp) for fp in os.listdir(f"{folder_path}") if fp.endswith(filename_ending))


def load_json_from_file(file_path):
    with open(file_path, 'r') as file_1:
        return json.load(file_1)


json_list = [load_json_from_file(curr_fp) for curr_fp in json_files]

import os
import json
import multiprocessing as mp


folder_path = "/my_path/"
filename_ending = '.json'

json_files = (os.path.join(folder_path, fp) for fp in os.listdir(f"{folder_path}") if fp.endswith(filename_ending))


def load_json_from_file(file_path):
    with open(file_path, 'r') as file_1:
        return json.load(file_1)


with mp.Pool() as pool:       
    json_list = pool.map(load_json_from_file, json_files)  

import os
import json
import ray

folder_path = "/my_path/"
filename_ending = '.json'


@ray.remote
def load_json_from_file(file_path):
    with open(file_path, 'r') as file_1:
        return json.load(file_1)


json_files = (os.path.join(folder_path, fp) for fp in os.listdir(f"{folder_path}") if fp.endswith(filename_ending))

ray.init()

futures_list = [load_json_from_file.remote(curr_fp) for curr_fp in json_files]

json_list = ray.get(futures_list)

Let me know if you have any questions. If you can run the benchmarks again, I would love to know what difference, if any, there is.
