Does Dask/Pandas support removing rows in a group based on complex conditions that rely on other rows?

I'm processing a bunch of text-based records in CSV format using Dask, which I'm learning in order to work around data that is too large to fit in memory, and I'm trying to filter the records within each group that best match a complicated set of criteria.

The best approach I've identified so far is basically to use Dask to group records into bite-sized chunks and then write the applicable logic in Python:

import pandas as pd

def reduce_frame(partition):
    records = partition.to_dict('records')
    shortlisted_records = []

    # Use Python to locate promising looking records.
    # Some of the criteria can be cythonized; one criterion
    # revolves around whether a record is a parent or child
    # of records already in shortlisted_records.
    for record in records:
        for other in shortlisted_records:
            if other['path'].startswith(record['path']) \
                    or record['path'].startswith(other['path']):
                ...  # keep one, possibly both
            ...

    return pd.DataFrame.from_dict(shortlisted_records)

df = df.groupby('key').apply(reduce_frame, meta={...})

In case it matters, the complicated criteria revolve around weeding out promising-looking links on a web page based on link URL, link text, and CSS selectors across the entire group. Think: with A and B already in the shortlist and C a new record, keep all three if each is very promising; otherwise prefer C over A and/or B if it is more promising than either or both; otherwise drop C. The resulting Pandas partition objects above are tiny. (The dataset in its entirety is not, hence my using Dask.)

Seeing how Pandas exposes inherently row- and column-based functionality, I'm struggling to imagine any vectorized approach to solving this, so I'm exploring writing the logic in plain Python.

Is the above the correct way to proceed, or are there more Dask/Pandas-idiomatic ways (or simply better ways) to approach this type of problem? Ideally one that allows parallelizing the computation across a cluster, for instance by using Dask.bag or Dask.delayed and/or cytoolz, or something else I might have missed while learning Python?
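For context, the surrounding Dask setup currently looks roughly like the sketch below; the file name, the choice of a local Client, and the meta columns are placeholders for illustration rather than my real values.

import dask.dataframe as dd
from dask.distributed import Client

client = Client()  # or Client("scheduler-address:8786") to attach to a cluster

df = dd.read_csv("records-*.csv")
shortlisted = df.groupby('key').apply(
    reduce_frame,
    meta={'key': 'object', 'path': 'object'},  # placeholder for the columns reduce_frame returns
)
result = shortlisted.compute()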

I know nothing about Dask, but I can tell a little about passing / blocking some rows using Pandas.

It is possible to use groupby(...).apply(...) to "filter" the source DataFrame.

Example: df.groupby('key').apply(lambda grp: grp.head(2)) returns the first 2 rows from each group.

In your case, write a function to be applied to each group, which:

  • contains some logic processing the current group,
  • generates the output DataFrame based on this logic, e.g. returning only some of the input rows.

The returned rows are then concatenated, forming the result of apply.
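A minimal sketch of that route, assuming a toy frame with 'key' and 'path' columns and a made-up rule that drops rows whose path is a child of another path in the same group:

import pandas as pd

df = pd.DataFrame({
    'key':  ['a', 'a', 'a', 'b', 'b'],
    'path': ['/x', '/x/1', '/y', '/z', '/z/2'],
})

def keep_top_level(grp):
    # Hypothetical rule: drop rows whose path is a child of another
    # path in the same group, keep everything else.
    is_child = grp['path'].apply(
        lambda p: any(p != q and p.startswith(q) for q in grp['path']))
    return grp[~is_child]

result = df.groupby('key', group_keys=False).apply(keep_top_level)

Here result keeps '/x', '/y' and '/z'; the per-group frames returned by keep_top_level are concatenated back into a single DataFrame.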

Another possibility is to use groupby(...).filter(...), but in this case the underlying function returns a decision, "passing" or "blocking" each whole group of rows.
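For example, reusing the toy df from the sketch above, the whole group is kept or dropped as a unit:

kept = df.groupby('key').filter(lambda grp: len(grp) >= 3)

Here only the rows of group 'a' survive, so filter suits passing/blocking entire groups rather than selecting individual rows within a group.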

Yet another possibility is to define a "filtering function", say filtFun, which returns True (pass the row) or False (block the row).

Then:

  • Run msk = df.apply(filtFun, axis=1) to generate a mask (which rows passed the filter).
  • In further processing use df[msk], i.e. only those rows which passed the filter.

But in this case the underlying function has access only to the current row, not to the whole group of rows.
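A small sketch of that variant, again with the toy df above and a hypothetical per-row criterion:

def filtFun(row):
    # Only this row's own fields are visible here, not the rest of its group.
    return not row['path'].endswith('/1')

msk = df.apply(filtFun, axis=1)
filtered = df[msk]  # only the rows for which filtFun returned True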
