
Best approach to split a single column into multiple columns in a PySpark DataFrame

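I am a beginner in PySpark. I have a CSV file containing about 8 million records, which I read with PySpark as df. It looks like this:

(image: 20 rows of the DataFrame)

The values in this column are concatenated strings of the form [longitude latitude timestamp, longitude latitude timestamp, ...]. I want to split this into three columns: longitude, latitude, and timestamp.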

For example: let's assume the first record as '[104.07515 30.72649 1540803847, 104.07515 30.72631 1540803850, 104.07514 30.72605 1540803851, 104.07516 30.72573 1540803854, 104.07513 30.72537 1540803857, 104.0751 30.72499 1540803860, 104.0751 30.72455 1540803863, 104.07506 30.7241 1540803866, 104.07501 30.72363 1540803869, 104.07497 30.72316 1540803872 , 104.07489 30.72264 1540803875, 104.07481 30.72211 1540803878, 104.07471 30.72159 1540803881, 104.07461 30.72107 1540803884]'.

The output should look like this:

Longitude column: '[104.07515, 104.07515, 104.07514, 104.07516, 104.07513, ...]'.

Latitude column: '[30.72649, 30.72631, 30.72605, 30.72573, 30.72537, 30.72499, ...]'.

Timestamp column: '[1540803847, 1540803850, 1540803851, 1540803854, ...]'.

I am trying to find the best way to apply this to the whole dataframe.

Can anyone suggest a way to achieve this?

Thanks a lot in advance.

You can split the string by ', ', then use transform to split each item of the resulting array by ' ' and take the longitude, latitude, and timestamp from it.

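# Strip the surrounding brackets, then split the string into 'lon lat ts' items.
df2 = df.selectExpr(
    "split(trim('[]', Trajectory_GPS), ', ') as newcol"
).selectExpr(
    # For every item, split on the space and pick one field by position.
    "transform(newcol, x -> split(x, ' ')[0]) as longitude",
    "transform(newcol, x -> split(x, ' ')[1]) as latitude",
    "transform(newcol, x -> split(x, ' ')[2]) as timestamp"
)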

df2.show(truncate=False)
+--------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|longitude                                                                                                                                               |latitude                                                                                                                                   |timestamp                                                                                                                                                               |
+--------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[104.07515, 104.07515, 104.07514, 104.07516, 104.07513, 104.0751, 104.0751, 104.07506, 104.07501, 104.07497, 104.07489, 104.07481, 104.07471, 104.07461]|[30.72649, 30.72631, 30.72605, 30.72573, 30.72537, 30.72499, 30.72455, 30.7241, 30.72363, 30.72316, 30.72264, 30.72211, 30.72159, 30.72107]|[1540803847, 1540803850, 1540803851, 1540803854, 1540803857, 1540803860, 1540803863, 1540803866, 1540803869, 1540803872, 1540803875, 1540803878, 1540803881, 1540803884]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
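
To check the splitting logic on a single record without a Spark session, the same steps can be sketched in plain Python. This is only an illustration: the helper name split_trajectory is hypothetical, and the record is shortened to the first three points of the example in the question.

```python
def split_trajectory(raw):
    """Split one '[lon lat ts, lon lat ts, ...]' string into three lists of strings."""
    # Drop the surrounding brackets, then split into 'lon lat ts' items.
    points = raw.strip("[]").split(", ")
    # Split every item on the space; position 0/1/2 is longitude/latitude/timestamp.
    fields = [p.split(" ") for p in points]
    longitude = [f[0] for f in fields]
    latitude = [f[1] for f in fields]
    timestamp = [f[2] for f in fields]
    return longitude, latitude, timestamp


# Shortened version of the example record from the question.
record = (
    "[104.07515 30.72649 1540803847, "
    "104.07515 30.72631 1540803850, "
    "104.07514 30.72605 1540803851]"
)

lon, lat, ts = split_trajectory(record)
print(lon)  # ['104.07515', '104.07515', '104.07514']
print(lat)  # ['30.72649', '30.72631', '30.72605']
print(ts)   # ['1540803847', '1540803850', '1540803851']
```

As in the Spark version, every element is still a string at this point.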

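To get the max/min of the longitude/latitude, you can aggregate the dataframe:

from pyspark.sql import functions as F

# Note: the arrays hold strings, so max/min here compare text, not numbers.
result = df2.agg(
    F.max(F.array_max('longitude')).alias('max_long'),
    F.min(F.array_min('longitude')).alias('min_long'),
    F.max(F.array_max('latitude')).alias('max_lat'),
    F.min(F.array_min('latitude')).alias('min_lat')
).head().asDict()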

print(result)
# {'max_long': '104.07516', 'min_long': '104.07461', 'max_lat': '30.72649', 'min_lat': '30.72107'}
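
The result above contains strings, which means the max/min were computed by comparing text. For this sample that happens to match the numeric order, but it may not in general. A small plain-Python sketch of the difference, using made-up values that are not from the dataset:

```python
# Hypothetical values chosen to expose the difference; they are not from the dataset.
values = ["9.5", "104.07516", "30.72649"]

# Comparing strings is lexicographic: "9..." sorts after "1..." and "3...".
max_as_text = max(values)

# Converting to float first gives the numeric maximum.
max_as_number = max(float(v) for v in values)

print(max_as_text)    # 9.5
print(max_as_number)  # 104.07516
```

In Spark, one option is to cast the array columns before aggregating, for example F.col('longitude').cast('array<double>'); the results then come back as floats rather than strings.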

