Pyspark to Spark-scala conversion
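Fellow developers,
I am writing a dynamic fixed-width file reader whose schema comes from a JSON file. I am working in Scala, because most of the existing code is already written in Scala.
While browsing, I found exactly the code I need, but it is written in PySpark. Could you please help me convert it to the equivalent Spark-Scala code, especially the dictionary part and the loop part?
Main reference: Read fixed width file using schema from json file in pyspark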
SchemaFile.json
===========================
{"Column":"id","From":"1","To":"3"}
{"Column":"date","From":"4","To":"8"}
{"Column":"name","From":"12","To":"3"}
{"Column":"salary","From":"15","To":"5"}
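A note on how these schema rows are used: the PySpark code in this question passes `From` as the 1-based start position and `To` as the field length, not as an end position. A minimal plain-Scala sketch of that reading, with no Spark dependency, is below. `FieldSpec`, `FixedWidthSketch` and `sliceRecord` are hypothetical names introduced only for illustration, and the sample record is the first row of the data shown in the answer.

```scala
// Hypothetical helper type for illustration; not part of Spark.
final case class FieldSpec(column: String, from: Int, to: Int)

object FixedWidthSketch {
  // The four rows of SchemaFile.json: from = 1-based start, to = field length.
  val specs: Seq[FieldSpec] = Seq(
    FieldSpec("id", 1, 3),
    FieldSpec("date", 4, 8),
    FieldSpec("name", 12, 3),
    FieldSpec("salary", 15, 5)
  )

  // Cut one fixed-width record into (column name -> value) pairs.
  def sliceRecord(line: String, specs: Seq[FieldSpec]): Seq[(String, String)] =
    specs.map { s =>
      // Convert the 1-based start to a 0-based index and clamp both ends,
      // so that a short line yields a shorter (or empty) field, not an exception.
      val start = math.min(s.from - 1, line.length)
      val end   = math.min(s.from - 1 + s.to, line.length)
      s.column -> line.substring(start, end)
    }

  def main(args: Array[String]): Unit =
    sliceRecord("00120181120xyz12341", specs)
      .foreach { case (name, value) => println(s"$name=$value") }
}
```

Running `main` should print one `name=value` line per schema row, for example `id=001` for the first field.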
from pyspark.sql.functions import substring

# Read every line of the fixed-width file into a single column named _c0.
# Raw strings keep the Windows backslashes from being read as escape sequences.
File = spark.read\
    .format("csv")\
    .option("header", "false")\
    .load(r"C:\Temp\samplefile.txt")

# Read the schema definition (one JSON object per line).
SchemaFile = spark.read.json(r"C:\Temp\schemaFile\schema.json")

# One dict per schema row, built as a list so it can be printed and iterated.
sfDict = [row.asDict() for row in SchemaFile.collect()]
print(sfDict)
#[{'Column': 'id', 'From': '1', 'To': '3'},
# {'Column': 'date', 'From': '4', 'To': '8'},
# {'Column': 'name', 'From': '12', 'To': '3'},
# {'Column': 'salary', 'From': '15', 'To': '5'}]

# From is the 1-based start position, To is the field length.
File.select(
    *[
        substring(
            str='_c0',
            pos=int(row['From']),
            len=int(row['To'])
        ).alias(row['Column'])
        for row in sfDict
    ]
).show()
Check the code below.
scala> df.show(false)
+--------------------+
|value               |
+--------------------+
|00120181120xyz12341 |
|00220180203abc56792 |
|00320181203pqr25483 |
+--------------------+
scala> schema.show(false)
+------+----+---+
|Column|From|To |
+------+----+---+
|id    |1   |3  |
|date  |4   |8  |
|name  |12  |3  |
|salary|15  |5  |
+------+----+---+
scala> :paste
// Entering paste mode (ctrl-D to finish)
val columns = schema
  .withColumn("id",lit(1))
  .groupBy($"id")
  .agg(collect_list(concat(lit("substring(value,"),$"from",lit(","),$"to",lit(") as "),$"column")).as("data"))
  .withColumn("data",explode($"data"))
  .select($"data")
  .map(_.getAs[String](0))
  .collect
// Exiting paste mode, now interpreting.
columns: Array[String] = Array(substring(value,1,3) as id, substring(value,4,8) as date, substring(value,12,3) as name, substring(value,15,5) as salary)
scala> df.selectExpr(columns:_*).show(false)
+---+--------+----+------+
|id |date    |name|salary|
+---+--------+----+------+
|001|20181120|xyz |12341 |
|002|20180203|abc |56792 |
|003|20181203|pqr |25483 |
+---+--------+----+------+
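For comparison with the PySpark original, the dictionary part and the loop part can also be translated more literally: collect the schema rows to the driver (the counterpart of `sfDict`) and map each row to a `substring` column (the counterpart of the list comprehension). The following is a sketch under assumptions, not something tested against the asker's files. It assumes the schema values are strings, as in `schema.show` above, and that the data column is named `value`. The object name `FixedWidthWithSpark`, the helper `fieldColumns` and the local-mode session are made up for illustration.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, substring}

object FixedWidthWithSpark {
  // Hypothetical helper: one substring(...) column per schema row.
  // From is used as the 1-based start position, To as the length.
  def fieldColumns(schema: DataFrame, source: String = "value"): Array[Column] =
    schema.collect().map { row =>
      substring(
        col(source),
        row.getAs[String]("From").toInt,
        row.getAs[String]("To").toInt
      ).as(row.getAs[String]("Column"))
    }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("fixed-width-sketch")
      .getOrCreate()
    import spark.implicits._

    // Stand-ins for the df and schema DataFrames shown in the answer.
    val df = Seq("00120181120xyz12341", "00220180203abc56792", "00320181203pqr25483")
      .toDF("value")
    val schema = Seq(("id", "1", "3"), ("date", "4", "8"), ("name", "12", "3"), ("salary", "15", "5"))
      .toDF("Column", "From", "To")

    df.select(fieldColumns(schema): _*).show(false)
    spark.stop()
  }
}
```

This builds `Column` objects directly instead of SQL strings for `selectExpr`, so no string concatenation is involved. It should yield the same four columns as the `selectExpr` version, provided the schema file is small enough to collect to the driver.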