Parallelized DataFrame Custom Function Dask
I am trying to use Dask's multiprocessing features to speed up a for-loop operation over a Python DataFrame. I am fully aware that for-looping over dataframes is generally not best practice, but in my case it is required. I have read pretty extensively through the documentation and other similar questions, but I cannot seem to figure my problem out.
df.head()
Title Content
0 Lizzibtz @Ontario2020 @Travisdhanraj @fordnation Maybe. They are not adding to the stress of education during Covid. Texas sample. Plus…
1 Jess 🌱🛹🏳️🌈 @BetoORourke So ashamed at how Abbott has not handled COVID in Texas. A majority of our large cities are hot spots with no end in sight.
2 sidi diallo New post (PVC Working Gloves) has been published on Covid-19 News Info - Texas test
3 Kautillya @PandaJay What was the need to go to SC for yatra anyway? Isn't covid cases spiking exponentially? Ambubachi mela o… texas
4 SarahLou♡ RT @BenJolly9: 23rd June 2020 was the day Sir Keir Starmer let the Tories off the hook for their miss-handling of COVID-19. texas
I have a custom Python function defined as:
def locMp(df):
    hitList = []
    for i in range(len(df)):
        print(i)
        string = df.iloc[i]['Content']
        # print(string)
        doc = nlp(string)
        ents = [e.text for e in doc.ents if e.label_ == "GPE"]
        x = np.array(ents)
        print(np.unique(x))
        hitList.append(np.unique(x))
    df['Locations'] = hitList
    return df
This function adds a dataframe column of locations extracted by a library called spacy - I do not think that is important, but I want you to see the whole function.
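For illustration, the per-row work above can be expressed without the index loop by using pandas apply directly. This is a minimal sketch; fake_gpe is a hypothetical stand-in for the spacy call [e.text for e in nlp(text).ents if e.label_ == "GPE"] so the snippet runs without a spacy model:

```python
import numpy as np
import pandas as pd

def fake_gpe(text):
    # Hypothetical stand-in for spacy GPE extraction
    known = {"Texas", "Ontario"}
    return [w.strip(".,") for w in text.split() if w.strip(".,") in known]

df = pd.DataFrame({"Content": ["COVID in Texas.", "Ontario and Texas news"]})
# Same result as the loop: one np.unique array of locations per row
df["Locations"] = df["Content"].apply(lambda s: np.unique(np.array(fake_gpe(s))))
print(df["Locations"].tolist())
```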
Now, based on the documentation and a few other questions out there, the way to use Dask's multiprocessing on a dataframe is to create a Dask dataframe, partition it, call map_partitions, and then .compute(). So, I have tried the following and some other options with no luck:
part = 7
ddf = dd.from_pandas(df, npartitions=part)
location = ddf.map_partitions(lambda df: df.apply(locMp), meta=pd.DataFrame).compute()
# and...
part = 7
ddf = dd.from_pandas(df, npartitions=part)
location = ddf.map_partitions(locMp, meta=pd.DataFrame).compute()
# and simplifying from Dask documentation
part = 7
ddf = dd.from_pandas(df, npartitions=part)
location = ddf.map_partitions(locMp)
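For context, map_partitions hands each partition to the function as a plain pandas DataFrame and concatenates the results, which is why the function should take a DataFrame (not be nested inside df.apply). A pure pandas simulation of that semantics, using a hypothetical add_locations in place of locMp so it runs without dask or spacy:

```python
import pandas as pd

def add_locations(part):
    # Hypothetical stand-in for locMp: tag rows that mention "Texas"
    part = part.copy()
    part["Locations"] = part["Content"].map(
        lambda s: ["Texas"] if "Texas" in s else [])
    return part

df = pd.DataFrame({"Content": ["Texas test", "no places here", "more Texas news"]})
parts = [df.iloc[:2], df.iloc[2:]]                   # like dd.from_pandas(df, npartitions=2)
result = pd.concat(add_locations(p) for p in parts)  # like map_partitions(locMp).compute()
print(result["Locations"].tolist())
```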
I have tried a few other things with dask.delayed, but nothing seems to work. I either get a Dask Series or some other undesired output, or the function takes as long as or longer than just running it regularly. How can I use Dask to speed up custom DataFrame function operations and return a clean Pandas DataFrame?
Thank you
You could try letting Dask handle the application instead of doing the looping yourself:
ddf["Locations"] = ddf["Content"].apply(
    lambda string: [e.text for e in nlp(string).ents if e.label_ == "GPE"],
    meta=("Content", "object"))