Finding the max of a row in a CSV file while excluding the header in PySpark

Question

I have a CSV file with a header row and a timestamp column, and I want to be able to search by a specific timestamp or by row in PySpark.

textfile = sc.textFile("data.csv")
header = textfile.first()
(textfile
    .filter(lambda line: line != header)
    .map(lambda line: (line.split(',')[1], line.split(',')[2]))
    .distinct()
    .max())

I tried to use Spark SQL but I cannot figure it out.

Example input:

Time [-]    B1-1 EW AC [m/s^2]  B1-2 NS AC [m/s^2]  B2-1 EW AC [m/s^2]  B2-2 NS AC [m/s^2]  B3-1 EW AC [m/s^2]  B3-2 NS AC [m/s^2]  B4-1 EW AC [m/s^2]  B4-2 NS AC [m/s^2]  B5-1 EW AC [m/s^2]  B5-2 NS AC [m/s^2]  B6-1 EW AC [m/s^2]
15:14.1 0.07521612  -0.019558864    -0.004072318    0.057055011 0.033445455 0.10515116  -0.005318701    -0.10593631 -0.06616208 0.067418374 0.007425771
15:14.1 0.012684621 -0.025686748    -0.029669747    -0.015677277    -0.06540639 0.043687206 0.056057423 -0.005557867    -0.026925504    0.1059664   0.031872407
15:14.1 -0.054526106    0.016956611 0.001579062 0.044119116 -0.078679785    -0.1983114  0.096496433 0.02442093  0.020333124 0.025292056 0.022027005
15:14.1 -0.0030546  0.05305237  -0.023935258    0.002741382 0.073090985 -0.16384798 -0.009033349    0.17119914  0.003653608 -0.13548735 0.020024549
15:14.1 -0.034533042    0.077983625 0.018616311 -0.006082441    0.055625994 -0.002599431    -0.084086135    0.021557786 -0.008736889    -0.077502668    -0.076927647
15:14.1 0.056924593 0.037019137 0.044213742 -0.051229578    0.027507361 0.15999076  -0.015196289    -0.1391993  0.06187306  -0.057252757    -0.045555849
15:14.2 0.043737678 0.030471534 -0.038146816    0.024072761 0.003667648 0.27830678  0.040861133 0.010863103 -0.021127386    0.061481655 0.028952161
15:14.2 -0.008159212    -0.050701946    -0.060087472    0.014820596 -0.015980465    -0.034882683    0.09480796  -0.088252187    -0.022715911    0.053105187 0.067666292
15:14.2 -0.046869188    -0.073618554    0.038146816 0.00522576  -0.080775581    -0.13810523 0.05647954  -0.070147015    -0.030420261    0.066605121 0.034709219
15:14.2 -0.043891497    -0.070764467    0.006898009 0.020303361 -0.007422621    -0.049221478    -0.010299707    0.02526303  -0.030102555    -0.1053158  0.019607371
15:14.2 0.030550764 -0.040460825    -0.049532689    -0.031611562    0.068462759 0.030606201 -0.039510351    -0.063578628    0.040110264 -0.049770862    -0.029285904
15:14.2 0.028849226 0.063713208 0.042967115 -0.011136864    -0.015543842    0.038823754 -0.028788526    -0.047915548    0.11072022  0.066605121 -0.047224563
15:14.2 0.062029205 0.096451215 0.051527292 0.042834092 0.007859246 -0.027922917    -0.010721826    -0.049599752    -0.000555984    0.002683723 -0.055734996
15:14.2 -0.003905369    -0.016620837    -0.053605005    0.035295293 -0.012574793    -0.22321562 0.03503589  -0.035620872    -0.087845452    0.033668526 0.075425804
15:14.2 -0.016241515    -0.095359951    -0.080365956    0.045832481 0.00829587  -0.04678975 0.087463088 -0.019536743    -0.032405917    0.10035498  0.10804913
15:14.2 -0.058354565    0.030471534 0.019447397 -0.053799622    -0.050910447    0.18087006  0.098944724 -0.026105132    0.035106409 -0.10767422 0.021693261
15:14.2 0.005027703 0.008730136 0.060835447 0.021074373 0.017726965 -0.015261174    -0.022203466    0.00884206  -0.047496907    -0.010816217    -0.041884683
15:14.2 0.05862613  0.058760535 0.004072318 0.006853455 0.05606262  -0.13558966 -0.07539048 0.080336437 0.005639265 -0.006831295    -0.061825797

Expected output would just be the max value of a row.

It keeps telling me that sqlContext.createDataFrame() cannot accept data in Unicode.

I am new to all this so I would really appreciate any help.

Thank you

Answer 1

Using NumPy:

import re
import numpy as np

(textfile
    .filter(lambda line: line != header)
    # split off the timestamp, parse the remaining fields as floats, take the row max
    .map(lambda line: np.fromstring(re.split(r"\s+", line, maxsplit=1)[1], sep="\t").max())
)
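
To see what the split in the NumPy version does, here is a small sketch that can be run without Spark. The sample line is a shortened copy of the first data row (only four value columns are kept, which is an assumption made for brevity), and it assumes the fields are separated by tabs.

```python
import re

# Shortened copy of the first data row; tab-separated fields are an assumption.
line = "15:14.1\t0.07521612\t-0.019558864\t-0.004072318\t0.057055011"

# maxsplit=1 splits only at the first run of whitespace:
# the first piece is the timestamp, the second is the rest of the row, untouched.
timestamp, values = re.split(r"\s+", line, maxsplit=1)
```

The second piece is the string that np.fromstring then parses into an array of floats.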

Using standard Python:

(textfile
    .filter(lambda line: line != header)
    .map(lambda line: max(float(x) for x in line.split()[1:])))
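
The per-line logic can also be pulled into a named function and checked locally before running it on the RDD. This is only a sketch: row_max is a hypothetical helper name, and the header and lines below are shortened copies of the example input (two value columns only).

```python
def row_max(line):
    """Return the largest numeric value on a line, skipping the first (timestamp) field."""
    fields = line.split()  # splits on any run of whitespace (tabs or spaces)
    return max(float(x) for x in fields[1:])


# Shortened sample data (an assumption for illustration, not the full file).
header = "Time [-]\tB1-1 EW AC [m/s^2]\tB1-2 NS AC [m/s^2]"
lines = [
    header,
    "15:14.1\t0.07521612\t-0.019558864",
    "15:14.1\t0.012684621\t-0.025686748",
]

# Same steps as the RDD pipeline: drop the header, then take each row's max.
maxima = [row_max(l) for l in lines if l != header]
```

On the RDD, the same function should be usable as .map(row_max) in place of the lambda.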
