Does Dask/Pandas support removing rows in a group based on complex conditions that rely on other rows?

I'm processing a bunch of text-based records in CSV format using Dask, which I'm learning in order to work around data that is too large to fit in memory, and I'm trying to filter the records within each group that best match a complicated set of criteria.

The best approach I've identified so far is basically to use Dask to group records into bite-sized chunks and then write the applicable logic in Python:

import pandas as pd

def reduce_frame(partition):
    records = partition.to_dict('records')
    shortlisted_records = []

    # Use Python to locate promising looking records.
    # Some of the criteria can be cythonized; one criterion
    # revolves around whether a record is a parent or child
    # of records already in shortlisted_records.
    for record in records:
        for other in shortlisted_records:
            if other['path'].startswith(record['path']) \
                    or record['path'].startswith(other['path']):
                ...  # keep one, possibly both
            ...

    return pd.DataFrame.from_dict(shortlisted_records)

df = df.groupby('key').apply(reduce_frame, meta={...})

In case it matters, the complicated criteria revolve around weeding out promising-looking links on a web page based on link URL, link text, and CSS selectors across the entire group. Think: with A and B already in the shortlist and C a new record, keep all three if each is very promising; otherwise prefer C over A and/or B if it is more promising than either or both; otherwise drop C. The resulting Pandas partition objects above are tiny. (The dataset in its entirety is not, hence my using Dask.)

Seeing how Pandas exposes inherently row- and column-based functionality, I'm struggling to imagine any vectorized approach to solving this, so I'm exploring writing the logic in plain Python.

Is the above the correct way to proceed, or are there more Dask/Pandas-idiomatic ways (or simply better ways) to approach this type of problem? Ideally one that allows parallelizing the computation across a cluster, for instance by using Dask.bag or Dask.delayed and/or cytoolz, or something else I might have missed while learning Python?
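For context, the surrounding Dask setup currently looks roughly like the sketch below; the file name, the choice of a local Client, and the meta columns are placeholders for illustration rather than my real values.

import dask.dataframe as dd
from dask.distributed import Client

client = Client()  # or Client("scheduler-address:8786") to attach to a cluster

df = dd.read_csv("records-*.csv")
shortlisted = df.groupby('key').apply(
    reduce_frame,
    meta={'key': 'object', 'path': 'object'},  # placeholder for the columns reduce_frame returns
)
result = shortlisted.compute()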

I know nothing about Dask, but I can tell a little about passing / blocking some rows using Pandas.

It is possible to use groupby(...).apply(...) to "filter" the source DataFrame.

Example: df.groupby('key').apply(lambda grp: grp.head(2)) returns the first 2 rows from each group.

In your case, write a function to be applied to each group, which:

  • contains some logic processing the current group,
  • generates the output DataFrame based on this logic, e.g. returning only some of the input rows.

The returned rows are then concatenated, forming the result of apply.
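A minimal sketch of that route, assuming a toy frame with 'key' and 'path' columns and a made-up rule that drops rows whose path is a child of another path in the same group:

import pandas as pd

df = pd.DataFrame({
    'key':  ['a', 'a', 'a', 'b', 'b'],
    'path': ['/x', '/x/1', '/y', '/z', '/z/2'],
})

def keep_top_level(grp):
    # Hypothetical rule: drop rows whose path is a child of another
    # path in the same group, keep everything else.
    is_child = grp['path'].apply(
        lambda p: any(p != q and p.startswith(q) for q in grp['path']))
    return grp[~is_child]

result = df.groupby('key', group_keys=False).apply(keep_top_level)

Here result keeps '/x', '/y' and '/z'; the per-group frames returned by keep_top_level are concatenated back into a single DataFrame.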

Another possibility is to use groupby(...).filter(...), but in this case the underlying function returns a decision, "passing" or "blocking" each whole group of rows.
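For example, reusing the toy df from the sketch above, the whole group is kept or dropped as a unit:

kept = df.groupby('key').filter(lambda grp: len(grp) >= 3)

Here only the rows of group 'a' survive, so filter suits passing/blocking entire groups rather than selecting individual rows within a group.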

Yet another possibility is to define a "filtering function", say filtFun, which returns True (pass the row) or False (block the row).

Then:

  • Run msk = df.apply(filtFun, axis=1) to generate a mask (which rows passed the filter).
  • In further processing use df[msk], i.e. only those rows which passed the filter.

But in this case the underlying function has access only to the current row, not to the whole group of rows.
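A small sketch of that variant, again with the toy df above and a hypothetical per-row criterion:

def filtFun(row):
    # Only this row's own fields are visible here, not the rest of its group.
    return not row['path'].endswith('/1')

msk = df.apply(filtFun, axis=1)
filtered = df[msk]  # only the rows for which filtFun returned True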
