Pandas iterrows too slow, how can I vectorize this code?
I'm a Jr. Data Scientist and I'm trying to solve a problem that may be simple for experienced programmers. I'm dealing with Big Data on GCP and I need to optimize my code.
[...]
def send_to_bq(self, df):
    result = []
    for i, row in df[["id", "vectors", "processing_timestamp"]].iterrows():
        data_dict = {
            "processing_timestamp": str(row["processing_timestamp"]),
            "id": row["id"],
            "embeddings_vector": [str(x) for x in row["vectors"]],
        }
        result.append(data_dict)
[...]
Our DataFrame has the following pattern:
id name \
0 3498001704 roupa natal flanela animais estimacao traje ma...
vectors \
0 [0.4021441, 0.45425776, 0.3963987, 0.23765437,...
processing_timestamp
0 2021-10-26 23:48:57.315275
Using iterrows on a DataFrame is too slow. I've been studying alternatives and I know that:
But I don't know how to adapt my code to those solutions.
Can anyone help me by demonstrating a solution for my code? One is enough, but if someone could show more than one solution, that would be really educational.
I would be more than grateful for any help!
So you basically convert everything to strings and then transform your DataFrame into a list of dicts. For the second part, there is a pandas method, to_dict. For the first part, I would use astype and apply only to convert the types.
df["processing_timestamp"] = df["processing_timestamp"].astype(str)
df["embeddings_vector"] = df["vectors"].apply(lambda row: [str(x) for x in row])
# select the converted "embeddings_vector" column so the keys match the original loop
result = df[["id", "embeddings_vector", "processing_timestamp"]].to_dict('records')
A bit hard to test without sample data, but hopefully this helps ;) Also, as I did with the lambda function, you could basically wrap your entire loop body inside an apply, but that would create far too many temporary dictionaries to be fast.
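For anyone who wants to try the approach above on concrete data, here is a minimal sketch. The two sample rows are made up to look like the frame in the question, and using assign (so the original df is not modified) is my own choice, not something the question requires.

```python
import pandas as pd

# Hypothetical sample data shaped like the frame in the question.
df = pd.DataFrame(
    {
        "id": [3498001704, 3498001705],
        "name": ["roupa natal", "outro produto"],
        "vectors": [[0.4021441, 0.45425776], [0.3963987, 0.23765437]],
        "processing_timestamp": pd.to_datetime(
            ["2021-10-26 23:48:57.315275", "2021-10-26 23:48:58.000001"]
        ),
    }
)

# Convert column by column instead of row by row; assign returns a new
# frame, so df itself keeps its original dtypes.
out = df[["processing_timestamp", "id"]].assign(
    processing_timestamp=df["processing_timestamp"].astype(str),
    embeddings_vector=df["vectors"].apply(lambda vec: [str(x) for x in vec]),
)

# One dict per row, with the same keys as the iterrows loop.
result = out.to_dict("records")
print(result[0])
```

The per-element str conversion of the vectors is still a Python loop under the hood, so the gain comes mainly from no longer building a Series for every row.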
You can use pandas.DataFrame methods to convert it to other types, such as DataFrame.to_dict() and more.
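As a quick illustration of to_dict, the sketch below shows two of its orientations on a tiny made-up frame (the column names here are placeholders, not the ones from the question). The "records" orientation is the one that yields a list of per-row dicts.

```python
import pandas as pd

# Tiny hypothetical frame, only to show the output shapes.
df = pd.DataFrame({"id": [1, 2], "label": ["a", "b"]})

records = df.to_dict("records")  # list with one dict per row
columns = df.to_dict("list")     # dict with one list per column

print(records)
print(columns)
```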
You can use agg:
>>> df.agg({'id': str, 'vectors': lambda v: [str(i) for i in v],
'processing_timestamp': str}).to_dict('records')
[{'id': '3498001704',
'vectors': ['0.4021441', '0.45425776', '0.3963987', '0.23765437'],
'processing_timestamp': '2021-10-26 23:48:57.315275'}]
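I believe newer pandas releases warn when agg is given element-wise callables like str, so the call above may behave differently depending on your version. As a hedged alternative that does not rely on agg, a plain comprehension over the zipped columns builds the same records. The one-row frame below is sample data reconstructed from the output above.

```python
import pandas as pd

# Sample row reconstructed from the output shown above.
df = pd.DataFrame(
    {
        "id": [3498001704],
        "vectors": [[0.4021441, 0.45425776, 0.3963987, 0.23765437]],
        "processing_timestamp": pd.to_datetime(["2021-10-26 23:48:57.315275"]),
    }
)

# zip walks the three columns in lockstep, so no per-row Series is created.
result = [
    {
        "id": str(i),
        "vectors": [str(x) for x in vec],
        "processing_timestamp": str(ts),
    }
    for i, vec, ts in zip(df["id"], df["vectors"], df["processing_timestamp"])
]
print(result)
```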