
Best approach to split a single column into multiple columns in a PySpark DataFrame

I'm a beginner with PySpark. I have a CSV file containing approximately 8 million records, which I read with PySpark into a dataframe df that looks like this:

[Image: 20-row dataframe]

Each value in this column is a concatenated string of the form [longitude latitude timestamp, longitude latitude timestamp, ...]. I want to split it into three separate columns: longitude, latitude and timestamp.

For example, suppose the first record is '[104.07515 30.72649 1540803847, 104.07515 30.72631 1540803850, 104.07514 30.72605 1540803851, 104.07516 30.72573 1540803854, 104.07513 30.72537 1540803857, 104.0751 30.72499 1540803860, 104.0751 30.72455 1540803863, 104.07506 30.7241 1540803866, 104.07501 30.72363 1540803869, 104.07497 30.72316 1540803872, 104.07489 30.72264 1540803875, 104.07481 30.72211 1540803878, 104.07471 30.72159 1540803881, 104.07461 30.72107 1540803884]'.

The output should be like:

Longitude column:'[104.07515, 104.07515, 104.07514, 104.07516, 104.07513, .......]'.

Latitude column: '[30.72649, 30.72631, 30.72605, 30.72573, 30.72537, 30.72499,......]'.

Timestamp column:'[1540803847, 1540803850, 1540803851, 1540803854,......]'.

I am trying to find the best way to do this across the whole dataframe.

Can anyone please suggest if there is any way to achieve this?

Thanks a lot in advance.

You can split the string by ', ', then split each item in the resulting array by ' ' using transform, and take the longitude, latitude and timestamp from the result.

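df2 = df.selectExpr(
    # Strip the surrounding brackets, then split into one "lon lat ts" string per point
    "split(trim('[]', Trajectory_GPS), ', ') as newcol"
).selectExpr(
    # For every point, split on the space and keep one field per output column
    "transform(newcol, x -> split(x, ' ')[0]) as longitude",
    "transform(newcol, x -> split(x, ' ')[1]) as latitude",
    "transform(newcol, x -> split(x, ' ')[2]) as timestamp"
)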

df2.show(truncate=False)
+--------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|longitude                                                                                                                                               |latitude                                                                                                                                   |timestamp                                                                                                                                                               |
+--------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[104.07515, 104.07515, 104.07514, 104.07516, 104.07513, 104.0751, 104.0751, 104.07506, 104.07501, 104.07497, 104.07489, 104.07481, 104.07471, 104.07461]|[30.72649, 30.72631, 30.72605, 30.72573, 30.72537, 30.72499, 30.72455, 30.7241, 30.72363, 30.72316, 30.72264, 30.72211, 30.72159, 30.72107]|[1540803847, 1540803850, 1540803851, 1540803854, 1540803857, 1540803860, 1540803863, 1540803866, 1540803869, 1540803872, 1540803875, 1540803878, 1540803881, 1540803884]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
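The two-stage split can be easier to follow on a single record outside Spark. The following is only a plain-Python sketch of the same logic, not part of the answer's Spark code; the helper name `split_trajectory` and the shortened three-point record are made up for illustration.

```python
# Plain-Python illustration of the split logic for one record.
# `split_trajectory` is a hypothetical helper, not a Spark or PySpark API.
record = (
    "[104.07515 30.72649 1540803847, "
    "104.07515 30.72631 1540803850, "
    "104.07514 30.72605 1540803851]"
)


def split_trajectory(raw):
    """Return (longitudes, latitudes, timestamps) as lists of strings."""
    # Stage 1: drop the surrounding brackets and split into "lon lat ts" points.
    points = raw.strip("[]").split(", ")
    # Stage 2: split each point on the space and pick one field per list.
    fields = [point.split(" ") for point in points]
    longitudes = [f[0] for f in fields]
    latitudes = [f[1] for f in fields]
    timestamps = [f[2] for f in fields]
    return longitudes, latitudes, timestamps


longitudes, latitudes, timestamps = split_trajectory(record)
print(longitudes)
print(latitudes)
print(timestamps)
```

As in the Spark version, every element is still a string at this point; nothing has been converted to a number.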

To get the max/min of longitude/latitude, you can aggregate the dataframe:

from pyspark.sql import functions as F

# Take the extreme value within each row's array, then across all rows
result = df2.agg(
    F.max(F.array_max('longitude')).alias('max_long'),
    F.min(F.array_min('longitude')).alias('min_long'),
    F.max(F.array_max('latitude')).alias('max_lat'),
    F.min(F.array_min('latitude')).alias('min_lat')
).head().asDict()

print(result)
# {'max_long': '104.07516', 'min_long': '104.07461', 'max_lat': '30.72649', 'min_lat': '30.72107'}
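Note that the values in the result are strings, so the max/min comparison is lexicographic rather than numeric. That happens to agree with numeric order for the sample record, where all values have the same number of integer digits, but it may not hold in general. The plain-Python sketch below uses made-up longitude values to show how the two orderings can differ. In Spark, casting the arrays to a numeric type before aggregating (for example with `cast('array<double>')`) would be one way to avoid this, assuming every element parses as a number.

```python
# Hypothetical values chosen so that string order and numeric order disagree.
longitudes = ["104.07515", "99.5", "104.07461"]

# Comparing strings goes character by character: "9" sorts after "1".
string_max = max(longitudes)
string_min = min(longitudes)

# Converting to float first gives the numeric answer.
as_floats = [float(v) for v in longitudes]
numeric_max = max(as_floats)
numeric_min = min(as_floats)

print(string_max, string_min)
print(numeric_max, numeric_min)
```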


 