
Fastest way to iterate over 70 million rows in pandas dataframe

I have two pandas dataframes, bookmarks and ratings, whose columns are respectively:

  • id_profile, id_item, time_watched
  • id_profile, id_item, score

I would like to find the score for each (profile, item) pair in the ratings dataframe (set to 0 if it does not exist). The problem is that the bookmarks dataframe has 73 million rows, and the lookup takes a very long time (after 15 minutes the code is still running). I suppose there is a better way to do it.

Here is my code:

def find_rating(val):
  # Scan the entire ratings frame for this one (profile, asset) pair --
  # a full boolean-mask pass over ratings per bookmark row.
  res = ratings.loc[(ratings['id_profile'] == val[0]) & (ratings['id_asset'] == val[1])]
  if res.empty:
    return 0
  return res['score'].values[0]

arr = bookmarks[['id_profile', 'id_asset']].values
rates = [find_rating(i) for i in arr]

I am working on Colab.

Do you think the execution speed can be improved?

Just some thoughts; I have not tried data this large in Pandas.

In pandas, the data is indexed on rows as well as columns, so if you have 1 million rows with 5 columns, 5 million records are indexed.
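One way to act on that indexing (a sketch, not code from the thread): build the (profile, asset) pair into the index of ratings once, then resolve every bookmark row with a single vectorized reindex instead of one scan per row. The column names follow the question's code (id_asset), and the frames below are toy stand-ins for the real data:

```python
import pandas as pd

# Toy stand-ins for the real frames (the real bookmarks has ~73M rows).
ratings = pd.DataFrame({
    'id_profile': [1, 2],
    'id_asset':   [10, 10],
    'score':      [4.0, 2.5],
})
bookmarks = pd.DataFrame({
    'id_profile': [1, 1, 2],
    'id_asset':   [10, 11, 10],
})

# Key the scores on the (profile, asset) pair once...
lookup = ratings.set_index(['id_profile', 'id_asset'])['score']

# ...then resolve all bookmark rows in one vectorized reindex;
# pairs absent from ratings come back as the fill_value 0.
keys = pd.MultiIndex.from_frame(bookmarks[['id_profile', 'id_asset']])
rates = lookup.reindex(keys, fill_value=0).to_numpy()
print(rates.tolist())  # [4.0, 0.0, 2.5]
```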

For a performance boost:

  1. Check whether you can use the Sparse* data structures: https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html

  2. Filter out as much unnecessary data as feasible.

  3. If you can, try to stick to using only numpy. It offers fewer features, but that also sheds some overhead. Worth exploring.

  4. Use a distributed multithreaded/multiprocessing tool such as Dask or Ray. With 4 cores you can run 4 parallel jobs, which in the ideal case cuts the runtime to roughly a quarter.
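Beyond those points, the per-row lookup in the question can be replaced wholesale by a single left merge on both key columns. A minimal sketch on toy data (the column name id_asset is taken from the question's code; the real frames would have millions of rows):

```python
import pandas as pd

# Small stand-in frames; the real bookmarks has ~73M rows.
bookmarks = pd.DataFrame({
    'id_profile':   [1, 1, 2, 3],
    'id_asset':     [10, 11, 10, 12],
    'time_watched': [5, 9, 3, 7],
})
ratings = pd.DataFrame({
    'id_profile': [1, 2],
    'id_asset':   [10, 10],
    'score':      [4.0, 2.5],
})

# One vectorized left merge instead of 73M individual scans; rows of
# bookmarks with no matching (profile, asset) pair get NaN, which we
# then replace with 0 as the question requires.
merged = bookmarks.merge(ratings, on=['id_profile', 'id_asset'], how='left')
rates = merged['score'].fillna(0).to_numpy()
print(rates.tolist())  # [4.0, 0.0, 2.5, 0.0]
```

A left merge also preserves the row order of bookmarks, so the resulting rates line up with the original rows just as the list comprehension did.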

