
How to use a print statement to debug a function executed in a partition

import numpy as np
import pandas as pd
import sparkobj as spk  # the poster's own helper module; getsparkobj() below returns the SparkSession

from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest

def train_forest_per_partition_map_step(partition):
    # NOTE: these prints run on the Spark workers, so their output goes to the worker
    # stdout/logs rather than the notebook; also, partition is an iterator, so
    # print(partition) only shows the iterator object, not the rows themselves
    print('partition')
    print(partition)
    get_data = np.asarray(list(partition))
    assert get_data.shape[1] == 2
    return [IsolationForest(n_estimators=100,
                            contamination=0.15,
                            random_state=666).fit(get_data)]

def main():
    spark = spk.getsparkobj()
    n_samples = 300
    outliers_fraction = 0.15
    n_outliers = int(outliers_fraction * n_samples)
    n_inliers = n_samples - n_outliers
    rng = np.random.RandomState(666)

    data = pd.DataFrame(data=np.concatenate([make_blobs(centers=[[0, 0], ...]), ...]),  # skipping some irrelevant details
                        columns=["feat_1", "feat_2"])

    df = spark.createDataFrame(data=data)
    df = df.rdd.repartition(numPartitions=3).toDF()
    forest = df.rdd.mapPartitions(f=train_forest_per_partition_map_step).collect()
    # the line below is what raises AttributeError: 'list' object has no attribute 'foreach'
    lines = df.rdd.collect().foreach(println)

    # Reduce step: Combine scores from partitions.
    forest[0].decision_function(data) # Partition 1 Isolation forest.
    forest[1].decision_function(data) # Partition 2 Isolation forest.
    forest[2].decision_function(data) # Partition 3 Isolation forest.       

if __name__ == '__main__':
    main()

Is there a way to get the print results from the function "train_forest_per_partition_map_step" after the partitions have been executed? I have tried df.rdd.collect().foreach(println) but keep getting an attribute error:

AttributeError: 'list' object has no attribute 'foreach'
AttributeError                            Traceback (most recent call last)
in engine
      1 if __name__ == '__main__':
----> 2     main()

<ipython-input-1-c5cff78d4b35> in main()
     25 
     26     forest = df.rdd.mapPartitions(f=train_forest_per_partition_map_step).collect()
---> 27     lines  = df.rdd.take(100).foreach(println)
     28     
     29     # Reduce step: Combine scores from partitions.

AttributeError: 'list' object has no attribute 'foreach'

I guess foreach(println) is only available in Scala, but I would like to know the Python equivalent.
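For context, collect() and take() on a PySpark RDD return an ordinary Python list on the driver, so there is no foreach method to chain onto the result; the Scala-style foreach(println) becomes a plain loop. A minimal sketch, assuming the df built in the code above:

rows = df.rdd.take(100)   # take()/collect() return a plain Python list on the driver
for row in rows:
    print(row)            # Python equivalent of the Scala foreach(println)

There is also df.rdd.foreach(print), but that runs on the executors, so its output ends up in the worker logs, which is the same problem as the print calls inside the partition function.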

Instead of print, go for the show, collect, or count methods; these are actions, so they will make the process execute at that very point.

df.show()
df.filter("your_clause").collect()
df.count()
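If the goal of the print calls was to see which rows each partition actually received, one option (a sketch, assuming the same df as in the question) is to pull the partition contents back to the driver with glom(), which coalesces each partition into a single list, and print them there:

# glom() turns each partition into one list, so collect() returns one list per partition
for i, part in enumerate(df.rdd.glom().collect()):
    print('partition', i, 'has', len(part), 'rows')
    print(part[:5])  # peek at the first few rows of this partition

Note that this brings all the data to the driver, so it is only suitable for debugging on small datasets.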

Let me know if this is what you were looking for.
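Another way to get the debug information out of train_forest_per_partition_map_step itself is to return it alongside the fitted model instead of printing it, so it comes back to the driver with collect(). A sketch of that variant (renamed here, and assuming the same spark, df, and data as in the question):

import numpy as np
from sklearn.ensemble import IsolationForest

def train_forest_per_partition_with_debug(partition):
    get_data = np.asarray(list(partition))
    forest = IsolationForest(n_estimators=100,
                             contamination=0.15,
                             random_state=666).fit(get_data)
    # ship the debug info (here: the partition's shape) back together with the model
    return [(forest, get_data.shape)]

results = df.rdd.mapPartitions(train_forest_per_partition_with_debug).collect()
for forest, shape in results:
    print('partition data shape:', shape)  # printed on the driver, after execution

forests = [forest for forest, _ in results]  # same list of fitted models as before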
