Best approach to split a single column into multiple columns in a PySpark DataFrame
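I am a beginner with PySpark. I have a CSV file containing about 8 million records, which I read with PySpark into a DataFrame df. It has a single string column (Trajectory_GPS in the code below).
Each value in this column is a concatenated string of the form [longitude latitude timestamp, longitude latitude timestamp, ...]. I want to split it into three columns: longitude, latitude, and timestamp.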
For example, suppose the first record is '[104.07515 30.72649 1540803847, 104.07515 30.72631 1540803850, 104.07514 30.72605 1540803851, 104.07516 30.72573 1540803854, 104.07513 30.72537 1540803857, 104.0751 30.72499 1540803860, 104.0751 30.72455 1540803863, 104.07506 30.7241 1540803866, 104.07501 30.72363 1540803869, 104.07497 30.72316 1540803872 , 104.07489 30.72264 1540803875, 104.07481 30.72211 1540803878, 104.07471 30.72159 1540803881, 104.07461 30.72107 1540803884]'.
The output should look like this:
Longitude column: '[104.07515, 104.07515, 104.07514, 104.07516, 104.07513, ....]'.
Latitude column: '[30.72649, 30.72631, 30.72605, 30.72573, 30.72537, 30.72499,......]'.
Timestamp column: '[1540803847, 1540803850, 1540803851, 1540803854,......]'.
I am looking for the best way to do this across the whole DataFrame.
Can anyone suggest a way to achieve this?
Thanks a lot in advance.
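You can split the string on ', ', then use transform (a Spark SQL higher-order function, available since Spark 2.4) to split each item of the resulting array on ' ' and pick out the longitude, latitude, and timestamp.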
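# Strip the surrounding brackets and split the string into "lon lat ts" items,
# then split each item on the space and keep one field per output column.
# The array elements are still strings at this point.
df2 = df.selectExpr(
    "split(trim('[]', Trajectory_GPS), ', ') as newcol"
).selectExpr(
    "transform(newcol, x -> split(x, ' ')[0]) as longitude",
    "transform(newcol, x -> split(x, ' ')[1]) as latitude",
    "transform(newcol, x -> split(x, ' ')[2]) as timestamp"
)
df2.show(truncate=False)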
+--------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|longitude |latitude |timestamp |
+--------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[104.07515, 104.07515, 104.07514, 104.07516, 104.07513, 104.0751, 104.0751, 104.07506, 104.07501, 104.07497, 104.07489, 104.07481, 104.07471, 104.07461]|[30.72649, 30.72631, 30.72605, 30.72573, 30.72537, 30.72499, 30.72455, 30.7241, 30.72363, 30.72316, 30.72264, 30.72211, 30.72159, 30.72107]|[1540803847, 1540803850, 1540803851, 1540803854, 1540803857, 1540803860, 1540803863, 1540803866, 1540803869, 1540803872, 1540803875, 1540803878, 1540803881, 1540803884]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
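To see what the split and transform steps do to a single record, here is a minimal plain-Python sketch of the same logic. It does not use Spark at all; the helper name split_trajectory is hypothetical and exists only for illustration, and the sample record is a shortened version of the one in the question.

```python
# Plain-Python sketch of the split/transform logic on one record (no Spark needed).
# split_trajectory is a hypothetical helper, not part of PySpark.
def split_trajectory(record):
    """Return (longitudes, latitudes, timestamps) as lists of strings."""
    # Remove the surrounding brackets, then split into "lon lat ts" items,
    # mirroring split(trim('[]', col), ', ').
    points = record.strip('[]').split(', ')
    # Split each item on the space, mirroring split(x, ' ') inside transform.
    fields = [p.split(' ') for p in points]
    longitudes = [f[0] for f in fields]
    latitudes = [f[1] for f in fields]
    timestamps = [f[2] for f in fields]
    return longitudes, latitudes, timestamps


record = ('[104.07515 30.72649 1540803847, '
          '104.07515 30.72631 1540803850, '
          '104.07514 30.72605 1540803851]')
lons, lats, ts = split_trajectory(record)
print(lons)  # ['104.07515', '104.07515', '104.07514']
print(lats)  # ['30.72649', '30.72631', '30.72605']
print(ts)    # ['1540803847', '1540803850', '1540803851']
```

As in the Spark version, every element comes back as a string, so a cast to a numeric type is still needed before doing arithmetic or numeric comparisons.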
To get the maximum and minimum longitude and latitude, you can aggregate the DataFrame:
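from pyspark.sql import functions as F

# The arrays hold strings, so cast them to double first;
# otherwise max/min would compare text rather than numbers.
lon = F.col('longitude').cast('array<double>')
lat = F.col('latitude').cast('array<double>')

result = df2.agg(
    F.max(F.array_max(lon)).alias('max_long'),
    F.min(F.array_min(lon)).alias('min_long'),
    F.max(F.array_max(lat)).alias('max_lat'),
    F.min(F.array_min(lat)).alias('min_lat')
).head().asDict()
print(result)
# {'max_long': 104.07516, 'min_long': 104.07461, 'max_lat': 30.72649, 'min_lat': 30.72107}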
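One thing to watch with min/max here: split produces arrays of strings, and strings are compared character by character, which can disagree with numeric order when the values have different numbers of digits. Casting to a numeric type before aggregating avoids this. A small plain-Python illustration follows, using made-up values that are not taken from the question's data:

```python
# Made-up values chosen to show the difference; not from the question's data.
values = ['9.5', '104.07516', '30.72649']

# Comparing as strings: '9' sorts after '3' and '1', so '9.5' is the "largest".
string_max = max(values)
string_min = min(values)

# Comparing as numbers gives the intended result.
numeric_max = max(float(v) for v in values)
numeric_min = min(float(v) for v in values)

print(string_max, string_min)    # 9.5 104.07516
print(numeric_max, numeric_min)  # 104.07516 9.5
```

For the sample record in the question all longitudes share the same integer part, so both orderings happen to agree, but that is not guaranteed across 8 million records.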