
How to use a print statement to debug a function executed in a partition

import numpy as np
import pandas as pd
import sparkobj as spk  # the poster's own helper module; getsparkobj() below returns the SparkSession

from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest

def train_forest_per_partition_map_step(partition):
    # NOTE: these prints run on the Spark workers, so their output goes to the worker
    # stdout/logs rather than the notebook; also, partition is an iterator, so
    # print(partition) only shows the iterator object, not the rows themselves
    print('partition')
    print(partition)
    get_data = np.asarray(list(partition))
    assert get_data.shape[1] == 2
    return [IsolationForest(n_estimators=100,
                            contamination=0.15,
                            random_state=666).fit(get_data)]

def main():
    spark = spk.getsparkobj()
    n_samples = 300
    outliers_fraction = 0.15
    n_outliers = int(outliers_fraction * n_samples)
    n_inliers = n_samples - n_outliers
    rng = np.random.RandomState(666)

    data = pd.DataFrame(data=np.concatenate([make_blobs(centers=[[0, 0], ...]), ...]),  # skipping some irrelevant details
                        columns=["feat_1", "feat_2"])

    df = spark.createDataFrame(data=data)
    df = df.rdd.repartition(numPartitions=3).toDF()
    forest = df.rdd.mapPartitions(f=train_forest_per_partition_map_step).collect()
    # the line below is what raises AttributeError: 'list' object has no attribute 'foreach'
    lines = df.rdd.collect().foreach(println)

    # Reduce step: Combine scores from partitions.
    forest[0].decision_function(data) # Partition 1 Isolation forest.
    forest[1].decision_function(data) # Partition 2 Isolation forest.
    forest[2].decision_function(data) # Partition 3 Isolation forest.       

if __name__ == '__main__':
    main()

Is there a way to get the print results from the function "train_forest_per_partition_map_step" after the partitions have been executed? I have tried df.rdd.collect().foreach(println) but keep getting an attribute error:

AttributeError: 'list' object has no attribute 'foreach'
AttributeError                            Traceback (most recent call last)
in engine
      1 if __name__ == '__main__':
----> 2     main()

<ipython-input-1-c5cff78d4b35> in main()
     25 
     26     forest = df.rdd.mapPartitions(f=train_forest_per_partition_map_step).collect()
---> 27     lines  = df.rdd.take(100).foreach(println)
     28     
     29     # Reduce step: Combine scores from partitions.

AttributeError: 'list' object has no attribute 'foreach'

I guess foreach(println) is only available in Scala, but I would like to know the Python equivalent.
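For context, collect() and take() on a PySpark RDD return an ordinary Python list on the driver, so there is no foreach method to chain onto the result; the Scala-style foreach(println) becomes a plain loop. A minimal sketch, assuming the df built in the code above:

rows = df.rdd.take(100)   # take()/collect() return a plain Python list on the driver
for row in rows:
    print(row)            # Python equivalent of the Scala foreach(println)

There is also df.rdd.foreach(print), but that runs on the executors, so its output ends up in the worker logs, which is the same problem as the print calls inside the partition function.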

Instead of print, go for the show, collect, or count methods; these are actions, so they will make the process execute at that very point.

df.show()
df.filter("your_clause").collect()
df.count()
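If the goal of the print calls was to see which rows each partition actually received, one option (a sketch, assuming the same df as in the question) is to pull the partition contents back to the driver with glom(), which coalesces each partition into a single list, and print them there:

# glom() turns each partition into one list, so collect() returns one list per partition
for i, part in enumerate(df.rdd.glom().collect()):
    print('partition', i, 'has', len(part), 'rows')
    print(part[:5])  # peek at the first few rows of this partition

Note that this brings all the data to the driver, so it is only suitable for debugging on small datasets.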

Let me know if this is what you were looking for.
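Another way to get the debug information out of train_forest_per_partition_map_step itself is to return it alongside the fitted model instead of printing it, so it comes back to the driver with collect(). A sketch of that variant (renamed here, and assuming the same spark, df, and data as in the question):

import numpy as np
from sklearn.ensemble import IsolationForest

def train_forest_per_partition_with_debug(partition):
    get_data = np.asarray(list(partition))
    forest = IsolationForest(n_estimators=100,
                             contamination=0.15,
                             random_state=666).fit(get_data)
    # ship the debug info (here: the partition's shape) back together with the model
    return [(forest, get_data.shape)]

results = df.rdd.mapPartitions(train_forest_per_partition_with_debug).collect()
for forest, shape in results:
    print('partition data shape:', shape)  # printed on the driver, after execution

forests = [forest for forest, _ in results]  # same list of fitted models as before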
