Unable to finish the spark job on spark standalone cluster

I am very new to Spark, having used it for only about a week. Below is my PySpark code, running on a standalone Spark cluster with a single master and two worker nodes. I am trying to run a job that reads about a million records, performs some operations on the data, and then dumps the dataframe to an Oracle table, but I am unable to finish the job. The program creates 404 partitions to complete the task; on the console I can see 403/404 tasks completed, but the very last task, on partition 404, takes forever, so the job never finishes. Can anyone tell me what is wrong with my code, or help me optimize the Spark performance, or point me to a guide or something? Any tutorial or guide would help. Thanks in advance.

# creating a spark session
spark = SparkSession \
    .builder \
    .appName("pyspark_testing_29012020") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# target table schema and column order
df_target = spark.read.csv("mycsv path", header = True)
df_local_schema = df_target.schema
df_column_order = df_target.columns

# dataframe with input file/dataset values and schema
df_source = spark.read\
    .format("csv")\
    .option("header", "false")\
    .option("inferschema", "true")\
    .option("delimiter", ",")\
    .schema(df_local_schema)\
    .load("csv path")


# dataframe with the target file/dataset values
df_target = spark.read\
    .format("jdbc") \
    .option("url", "jdbc:oracle:thin:system/oracle123@127.0.0.1:0101:orcl") \
    .option("dbtable", "mydata") \
    .option("user", "system") \
    .option("password", "oracle123") \
    .option("driver", "oracle.jdbc.driver.OracleDriver")\
    .load()


# splitting the target and source tables into upper and lower sections
df_target_upper = df_target.where(df_target['key'] < 5) # set A
df_source_upper = df_source.where(df_source['key'] < 5) # set B
df_source_lower = df_source.where(df_source['key'] > 4) # set D
df_target_lower = df_target.where(df_target['key'] > 4) # set C
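
# NOTE: dropDupeDfCols (used further down) is a user-defined helper that is
# not shown in the question. A minimal sketch, assuming it drops the duplicate
# column names an inner join produces when both sides share non-key columns:
# rename every column to a unique positional name, drop the later duplicates,
# then rename back to the de-duplicated names.
def dropDupeDfCols(df):
    newcols = []
    dupcols = []
    for i, col in enumerate(df.columns):
        if col not in newcols:
            newcols.append(col)
        else:
            dupcols.append(i)
    df = df.toDF(*[str(i) for i in range(len(df.columns))])
    for dupcol in dupcols:
        df = df.drop(str(dupcol))
    return df.toDF(*newcols)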


''' now programming for the upper segment of the data '''

# set operation A-B
A_minus_B = df_target_upper.join(df_source_upper,
                                 on=['key1', 'key2', 'key3', 'key4'],
                                 how='left_anti')
A_minus_B = A_minus_B.select(sorted(df_column_order))


# set operation B-A
B_minus_A = df_source_upper.join(df_target_upper,
                                 on=['key1', 'key2', 'key3', 'key4'],
                                 how='left_anti')
B_minus_A = B_minus_A.select(sorted(df_column_order))


# union of A-B and B-A
AmB_union_BmA = A_minus_B.union(B_minus_A)
AmB_union_BmA = AmB_union_BmA.select(sorted(df_column_order))

# A-B left anti B-A to get the uncommon record in both the dataframes
new_df = A_minus_B.join(B_minus_A, on=['key'], how = 'left_anti')
new_df = new_df.select(sorted(df_column_order))

AmB_union_BmA = AmB_union_BmA.select(sorted(df_column_order))

AnB = df_target_upper.join(df_source_upper,
                           on=['key1', 'key2', 'key3', 'key4'],
                           how='inner')

df_AnB_without_dupes = dropDupeDfCols(AnB)
new_AnB = df_AnB_without_dupes.select(sorted(df_column_order))


final_df = AmB_union_BmA.union(new_AnB)
final_df.show()
result_df = B_minus_A.union(new_df)

df_result_upper_seg = result_df.union(new_AnB)



''' now programming for the lower segment of the data '''

# set operation C-D
C_minus_D = df_target_lower.join(df_source_lower, on=['key'], how='left_anti')
C_minus_D = C_minus_D.select(sorted(df_column_order))


# set operation D-C
D_minus_C = df_source_lower.join(df_target_lower, on=['key'], how = 'left_anti')
D_minus_C = D_minus_C.select(sorted(df_column_order))


# union of C-D union D-C
CmD_union_DmC = C_minus_D.union(D_minus_C)
CmD_union_DmC = CmD_union_DmC.select(sorted(df_column_order))


# C-D left anti D-C to get the uncommon record in both the dataframes
lower_new_df = C_minus_D.join(D_minus_C, on=['key'], how = 'left_anti')
lower_new_df = lower_new_df.select(sorted(df_column_order))


CmD_union_DmC = CmD_union_DmC.select(sorted(df_column_order))

CnD = df_target_lower.join(df_source_lower,
                           on=['key'], how='inner')


new_CnD = dropDupeDfCols(CnD)
new_CnD = new_CnD.select(sorted(df_column_order))

lower_final_df = CmD_union_DmC.union(new_CnD)

result_df_lower = D_minus_C.union(lower_new_df)

df_result_lower_seg = result_df_lower.union(new_CnD)


# df_final_result is not defined anywhere in the snippet; assuming it is the
# union of the upper and lower result segments computed above:
df_final_result = df_result_upper_seg.union(df_result_lower_seg)

df_final_result.write \
    .format("jdbc") \
    .mode("overwrite")\
    .option("url", "jdbc:oracle:thin:system/oracle123@127.0.0.1:1010:orcl") \
    .option("dbtable", "mydata") \
    .option("user", "system") \
    .option("password", "oracle123") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .save()
  1. Take a look at the Spark UI and the monitoring guide; a single straggler task out of 404 usually points at a skewed join key (see the first sketch below).
  2. Try splitting your job into several steps, then find the step that stalls (see the second sketch below).
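
As a hedged illustration of point 1: a minimal sketch that checks the join keys for skew, reusing the question's df_source and its key1..key4 join columns (taken from the question; the real column names may differ). If the heaviest key holds a large share of the rows, the single task that shuffles it will run far longer than the other 403.

from pyspark.sql import functions as F

# rows per join key, heaviest first
df_source.groupBy('key1', 'key2', 'key3', 'key4') \
    .count() \
    .orderBy(F.desc('count')) \
    .show(10, truncate=False)

And for point 2, a minimal sketch that materializes and times each intermediate DataFrame, assuming the cluster has enough memory to cache them; the variable names reuse the question's.

import time

def materialize(df, name):
    # persist + count forces the plan up to this DataFrame to execute,
    # so the printed timing shows which step is actually slow
    df.persist()
    start = time.time()
    print(f"{name}: {df.count()} rows in {time.time() - start:.1f}s")
    return df

A_minus_B = materialize(A_minus_B, "A_minus_B")
B_minus_A = materialize(B_minus_A, "B_minus_A")
new_AnB = materialize(new_AnB, "new_AnB")
df_result_upper_seg = materialize(df_result_upper_seg, "upper segment")
df_result_lower_seg = materialize(df_result_lower_seg, "lower segment")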
