Partitioning a PySpark DataFrame uniformly by the count of non-null elements in each row

Question

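I know there are a thousand questions about how best to partition your DataFrames or RDDs by key and so on, but I think this situation is different enough to warrant its own question.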

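I am building a collaborative filtering recommender engine in PySpark, which means that each user's (row's) unique item ratings need to be compared against one another. Therefore, for a DataFrame of dimensions M (rows) x N (columns), the dataset becomes M x (K choose 2), where K << N is the number of non-null (i.e., rated) elements for a user.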
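To make the growth in (K choose 2) concrete, here is a minimal sketch in plain Python. The rating counts are made-up values for illustration, and `n_pairs` is a hypothetical helper, not part of my actual pipeline:

```python
def n_pairs(k):
    """Number of unordered item pairs among k rated items (k choose 2)."""
    return k * (k - 1) // 2


# Hypothetical rating counts: three typical users and one heavy rater.
rating_counts = [10, 12, 8, 1000]

for k in rating_counts:
    # The pair count grows quadratically with the number of ratings.
    print(k, n_pairs(k))
```

A user with 1,000 ratings produces 499,500 pairs, while a user with 10 ratings produces only 45, so a single heavy rater can dominate the work in a partition.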

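My algorithm works quite well on datasets where users have rated roughly the same number of items. However, when a subset of users have rated a lot of items (orders of magnitude more than other users in the same partition), my data becomes extremely skewed and the last few partitions start to take a large amount of time. As a simple example, consider the following DataFrame: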

cols = ['id', 'Toy Story', 'UP', 'Die Hard', 'MIB', 'The Shining']
ratings = [
    (1, 4.5,  3.5,  None, 1.0,  None),  # user 1
    (2, 2.0,  None, 5.0,  4.0,  3.0),   # user 2
    (3, 3.5,  5.0,  1.0,  None, 1.0),   # user 3
    (4, None, None, 4.5,  3.5,  4.0),   # user 4
    (5, None, None, None, None, 4.5)    # user 5
]

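# assumes an existing SparkContext named `sc`; the DataFrame is bound to `R`,
# which the code below refers to
R = sc.parallelize(ratings, 2).toDF(cols)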

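My situation arises in a much larger variant of this DataFrame (~1,000,000 users and ~10k items), where some users have rated far more movies than others. Initially, I convert my DataFrame into a sparse form as follows: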

def _make_ratings(row):
    import numpy as np
    non_null_mask = ~np.isnan(row)
    idcs = np.where(non_null_mask)[0]  # extract the non-null index mask

    # zip the non-null idcs with the corresponding ratings
    rtgs = row[non_null_mask]
    return list(zip(idcs, rtgs))


def as_array(partition):
    import numpy as np
    for row in partition:
        yield _make_ratings(np.asarray(row, dtype=np.float32))


# drop the id column, get the RDD, and make the copy of np.ndarrays
ratings = R.drop('id').rdd\
           .mapPartitions(as_array)\
           .cache()

I can then check the number of mutual rating pairs required for each partition as follows:

n_choose_2 = (lambda itrbl: (len(itrbl) * (len(itrbl) - 1)) / 2.)
sorted(ratings.map(n_choose_2).glom().map(sum).collect(), reverse=True)

Initially, this was the distribution of mutual rating pairs per partition that I got:

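As you can see, that simply does not scale. So my first attempt to fix this was to partition the DataFrame more intelligently at the source. I came up with the following function, which randomly partitions my DataFrame rows: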

def shuffle_partition(X, n_partitions, col_name='shuffle'):
    from pyspark.sql.functions import rand
    X2 = X.withColumn(col_name, rand())
    return X2.repartition(n_partitions, col_name).drop(col_name)

This worked reasonably well. After applying it, here is the new distribution:

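This is definitely better, but still not to my liking. There must be a way to distribute these "power raters" more uniformly across partitions, but I just can't figure it out. I have been thinking about partitioning by a "rating count per user" column, but that would ultimately lump all of the users with many ratings together rather than splitting them up.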

Am I missing something obvious?

Update

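I implemented igrinis' solution in the following function (I'm sure there is a more elegant way to write this, but I'm not very familiar with the DataFrame API, so I fell back to the RDD for this; critiques welcome), but the distribution was roughly the same as the original, so I'm not sure whether I did something wrong: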

def partition_by_rating_density(X, id_col_name, n_partitions,
                                partition_col_name='partition'):
    """Segment partitions by rating density. Partitions will be more
    evenly distributed based on the number of ratings for each user.

    Parameters
    ----------
    X : PySpark DataFrame
        The ratings matrix

    id_col_name : str
        The ID column name

    n_partitions : int
        The number of partitions in the new DataFrame.

    partition_col_name : str
        The name of the partitioning column

    Returns
    -------
    with_partition_key : PySpark DataFrame
        The partitioned DataFrame
    """
    ididx = X.columns.index(id_col_name)

    def count_non_null(row):
        sm = sum(1 if v is not None else 0
                 for i, v in enumerate(row) if i != ididx)
        return row[ididx], sm

    # add the count as the last element and id as the first
    counted = X.rdd.map(count_non_null)\
               .sortBy(lambda r: r[-1], ascending=False)

    # get the count array out, zip it with the index, and then flatMap
    # it out to get the sorted index
    indexed = counted.zipWithIndex()\
                     .map(lambda ti: (ti[0][0], ti[1] % n_partitions))\
                     .toDF([id_col_name, partition_col_name])

    # join back with indexed, which now has the partition column
    counted_indexed = X.join(indexed, on=id_col_name, how='inner')

    # repartition on the partition key, then drop the helper column
    return counted_indexed.repartition(n_partitions, partition_col_name)\
        .drop(partition_col_name)

Answer 1

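What you can do is get a list of users sorted by their number of ratings, then divide each user's index in that list by the number of partitions. Take the remainder of the division as a column, and then repartition with partitionBy() on that column. This way your partitions will have an almost equal representation of all user rating counts.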

For 3 partitions this will get you:

[1000, 800, 700, 600, 200, 30, 10, 5] - number of ratings
[   0,   1,   2,   3,   4,  5,  6, 7] - position in sorted index
[   0,   1,   2,   0,   1,  2,  0, 1] - group to partition by
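
The assignment in the table above can be sketched in plain Python, outside Spark. This is only an illustration of the modulo step; `assign_groups` is a hypothetical helper name, and the counts are the ones from the example:

```python
from collections import defaultdict


def assign_groups(rating_counts, n_partitions):
    """Sort counts in descending order and deal them out round-robin."""
    ordered = sorted(rating_counts, reverse=True)
    groups = defaultdict(list)
    for position, count in enumerate(ordered):
        # The remainder of position / n_partitions is the group key.
        groups[position % n_partitions].append(count)
    return dict(groups)


counts = [1000, 800, 700, 600, 200, 30, 10, 5]
groups = assign_groups(counts, 3)
for key in sorted(groups):
    print(key, groups[key])
```

Each group receives a mix of heavy and light raters, instead of all the heavy raters landing in the same group.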
