Pandas iterrows is too slow, how can I vectorize this code?

Question

I'm a junior data scientist trying to solve a problem that may be simple for experienced programmers. I'm dealing with big data on GCP and I need to optimize my code.

                                      [...]
    def send_to_bq(self, df):
        result = []
        for i, row in df[["id", "vectors", "processing_timestamp"]].iterrows():
            data_dict = {
                "processing_timestamp": str(row["processing_timestamp"]),
                "id": row["id"],
                "embeddings_vector": [str(x) for x in row["vectors"]],
            }
            result.append(data_dict)
                                      [...]

Our DataFrame has the following pattern:

           id                                               name  \
0  3498001704  roupa natal flanela animais estimacao traje ma...   

                                             vectors  \
0  [0.4021441, 0.45425776, 0.3963987, 0.23765437,...   

        processing_timestamp  
0 2021-10-26 23:48:57.315275

Using iterrows on a DataFrame is too slow. I've been studying alternatives and I know that:

I can use apply
I can vectorize it through Pandas Series (better than apply)
I can vectorize it through NumPy (better than Pandas vectorization)
I can use Swifter, which uses the apply method and then picks the better option among Dask, Ray and vectorization

But I don't know how to transform my code to use those solutions.

Can anyone help me by demonstrating a solution for my code? One is enough, but if someone could show more than one solution, it would be really educational.

I would be more than grateful for any help!

Answer 1

So you basically convert everything to strings and then transform your DataFrame into a list of dicts.

For the second part, there is a pandas method, to_dict. For the first part, I would use astype and apply only to convert the types.

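# Convert the timestamp column to strings in one call.
df["processing_timestamp"] = df["processing_timestamp"].astype(str)
# Stringify each element of every vector and store the result in a new column.
df["embeddings_vector"] = df["vectors"].apply(lambda row: [str(x) for x in row])
# Export the converted column (not the original "vectors") as a list of dicts.
result = df[["id", "embeddings_vector", "processing_timestamp"]].to_dict('records')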

It's a bit hard to test without sample data, but hopefully this helps ;) Also, as I did with the lambda function, you could basically wrap your entire loop body inside an apply, but that would create far too many temporary dictionaries to be fast.
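To make the approach above easier to try, here is a minimal sketch that runs it next to the original loop and compares the two results. The one-row sample frame and the function names `rows_with_iterrows` and `rows_with_to_dict` are assumptions made up for illustration, modelled on the DataFrame printed in the question. It builds a new frame instead of modifying `df` in place. Note that `astype(str)` on a datetime column may format some values differently from `str()` on a single timestamp (for example when every value falls exactly on midnight), so it is worth running the comparison on your own data.

```python
import pandas as pd

# Hypothetical one-row sample modelled on the DataFrame shown in the question.
df = pd.DataFrame({
    "id": [3498001704],
    "name": ["roupa natal flanela"],
    "vectors": [[0.4021441, 0.45425776, 0.3963987, 0.23765437]],
    "processing_timestamp": pd.to_datetime(["2021-10-26 23:48:57.315275"]),
})


def rows_with_iterrows(df):
    """Reference implementation: the row-by-row loop from the question."""
    result = []
    for _, row in df[["id", "vectors", "processing_timestamp"]].iterrows():
        result.append({
            "processing_timestamp": str(row["processing_timestamp"]),
            "id": row["id"],
            "embeddings_vector": [str(x) for x in row["vectors"]],
        })
    return result


def rows_with_to_dict(df):
    """Convert each column as a whole, then export once with to_dict."""
    out = pd.DataFrame({
        "processing_timestamp": df["processing_timestamp"].astype(str),
        "id": df["id"],
        # The vectors are Python lists, so this step is still element-wise.
        "embeddings_vector": df["vectors"].apply(lambda v: [str(x) for x in v]),
    })
    return out.to_dict("records")


expected = rows_with_iterrows(df)
actual = rows_with_to_dict(df)
print(actual == expected)
```

Most of the saving comes from dropping iterrows, which builds a Series object for every row. The per-vector string conversion remains a Python-level loop in both versions.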

Answer 2

You can use pandas.DataFrame methods to convert it to other types, such as DataFrame.to_dict() and more.
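As a small illustration of what `DataFrame.to_dict()` can return, the sketch below calls it with three of its `orient` values. The two-row frame is a made-up example, not data from the question. The `"records"` orientation is the one that matches the list of dicts built in the question's loop.

```python
import pandas as pd

# Hypothetical two-row frame, used only to show the shapes to_dict can return.
df = pd.DataFrame({"id": [1, 2], "label": ["a", "b"]})

records = df.to_dict("records")  # list with one dict per row
columns = df.to_dict("list")     # dict with one list per column
by_index = df.to_dict("index")   # nested dict keyed by index label

print(records)
print(columns)
print(by_index)
```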

Answer 3

You can use agg:

>>> df.agg({'id': str, 'vectors': lambda v: [str(i) for i in v], 
            'processing_timestamp': str}).to_dict('records')

[{'id': '3498001704',
  'vectors': ['0.4021441', '0.45425776', '0.3963987', '0.23765437'],
  'processing_timestamp': '2021-10-26 23:48:57.315275'}]
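In the output above, `id` comes back as a string and the key is still `vectors`, whereas the loop in the question leaves `id` unchanged and names the key `embeddings_vector`. The sketch below is a variation on the same per-column converter idea that matches the question's output more closely. It is an assumption-laden illustration, not the answer's method: it uses `Series.map` instead of `agg` (recent pandas versions may warn when `agg` is given element-wise functions), and the helper `to_records`, the `converters` dict and the sample frame are hypothetical names introduced here.

```python
import pandas as pd

# Hypothetical one-row sample modelled on the DataFrame shown in the question.
df = pd.DataFrame({
    "id": [3498001704],
    "vectors": [[0.4021441, 0.45425776, 0.3963987, 0.23765437]],
    "processing_timestamp": pd.to_datetime(["2021-10-26 23:48:57.315275"]),
})

# Per-column converters; columns without an entry are passed through unchanged.
converters = {
    "vectors": lambda v: [str(x) for x in v],
    "processing_timestamp": str,
}


def to_records(df, columns, converters, rename=None):
    """Apply each converter element-wise to its column, then export as dicts."""
    out = pd.DataFrame({
        col: df[col].map(converters[col]) if col in converters else df[col]
        for col in columns
    })
    return out.rename(columns=rename or {}).to_dict("records")


records = to_records(
    df,
    ["id", "vectors", "processing_timestamp"],
    converters,
    rename={"vectors": "embeddings_vector"},
)
print(records)
```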


Answer 1
1 2021-10-26 14:22:45

Answer 2
0 2021-10-26 14:32:12

Answer 3
0 2021-10-27 07:14:51
