
Fastest way to iterate over 70 million rows in pandas dataframe

I have two pandas dataframes, bookmarks and ratings, whose columns are respectively:

  • id_profile, id_item, time_watched
  • id_profile, id_item, score

I would like to find the score for each (profile, item) pair in the ratings dataframe (set to 0 if it does not exist). The problem is that the bookmarks dataframe has 73 million rows, and the lookup takes a very long time (after 15 minutes the code is still running). I suppose there is a better way to do it.

Here is my code:

def find_rating(val):
  # Scan the entire ratings frame for this one (profile, asset) pair --
  # a full boolean-mask pass over ratings per bookmark row.
  res = ratings.loc[(ratings['id_profile'] == val[0]) & (ratings['id_asset'] == val[1])]
  if res.empty:
    return 0
  return res['score'].values[0]

arr = bookmarks[['id_profile', 'id_asset']].values
rates = [find_rating(i) for i in arr]

I am working on Colab.

Do you think the execution speed can be improved?

Just some thoughts; I have not tried data this large in Pandas.

In pandas, the data is indexed on rows as well as columns, so if you have 1 million rows with 5 columns, 5 million records are indexed.
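One way to act on that indexing (a sketch, not code from the thread): build the (profile, asset) pair into the index of ratings once, then resolve every bookmark row with a single vectorized reindex instead of one scan per row. The column names follow the question's code (id_asset), and the frames below are toy stand-ins for the real data:

```python
import pandas as pd

# Toy stand-ins for the real frames (the real bookmarks has ~73M rows).
ratings = pd.DataFrame({
    'id_profile': [1, 2],
    'id_asset':   [10, 10],
    'score':      [4.0, 2.5],
})
bookmarks = pd.DataFrame({
    'id_profile': [1, 1, 2],
    'id_asset':   [10, 11, 10],
})

# Key the scores on the (profile, asset) pair once...
lookup = ratings.set_index(['id_profile', 'id_asset'])['score']

# ...then resolve all bookmark rows in one vectorized reindex;
# pairs absent from ratings come back as the fill_value 0.
keys = pd.MultiIndex.from_frame(bookmarks[['id_profile', 'id_asset']])
rates = lookup.reindex(keys, fill_value=0).to_numpy()
print(rates.tolist())  # [4.0, 0.0, 2.5]
```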

For a performance boost:

  1. Check whether you can use the Sparse* data structures: https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html

  2. Filter out as much unnecessary data as feasible.

  3. If you can, try to stick to using only numpy. It offers fewer features, but that also sheds some overhead. Worth exploring.

  4. Use a distributed multithreaded/multiprocessing tool such as Dask or Ray. With 4 cores you can run 4 parallel jobs, which in the ideal case cuts the runtime to roughly a quarter.
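Beyond those points, the per-row lookup in the question can be replaced wholesale by a single left merge on both key columns. A minimal sketch on toy data (the column name id_asset is taken from the question's code; the real frames would have millions of rows):

```python
import pandas as pd

# Small stand-in frames; the real bookmarks has ~73M rows.
bookmarks = pd.DataFrame({
    'id_profile':   [1, 1, 2, 3],
    'id_asset':     [10, 11, 10, 12],
    'time_watched': [5, 9, 3, 7],
})
ratings = pd.DataFrame({
    'id_profile': [1, 2],
    'id_asset':   [10, 10],
    'score':      [4.0, 2.5],
})

# One vectorized left merge instead of 73M individual scans; rows of
# bookmarks with no matching (profile, asset) pair get NaN, which we
# then replace with 0 as the question requires.
merged = bookmarks.merge(ratings, on=['id_profile', 'id_asset'], how='left')
rates = merged['score'].fillna(0).to_numpy()
print(rates.tolist())  # [4.0, 0.0, 2.5, 0.0]
```

A left merge also preserves the row order of bookmarks, so the resulting rates line up with the original rows just as the list comprehension did.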

