PySpark RDD filter error when converting string to integer

I have an RDD called bank_rdd which has been imported from a CSV file.

First I split each comma-separated line into a list:

bank_rdd1 = bank_rdd.map(lambda line: line.split(','))

The header titles are:

accountNumber, personFname, personLname, balance

I then removed the header:

header = bank_rdd1.first()
bank_rdd1 = bank_rdd1.filter(lambda row: row != header)

All of the data in the CSV file is stored as strings. The first two records look like this:

[('"1"','"John"','"Smith"','"01100"'),('"2"','"Jane"','"Doe"','"0500"')]

When I run the following code:

bank_rdd1_example = bank_rdd1.filter(lambda x: x[3] == '"01100"')
bank_rdd1_example.count()

I get a value of 1, which is correct because there is only one row in the dataset with a balance of "01100".

When I run the following code, I get an error:

bank_rdd1_example2 = bank_rdd1.filter(lambda x: int(x[3]) == 1100)
bank_rdd1_example2.count()

Basically I want this code to also return 1, but I am having trouble figuring it out.

Any help is appreciated!

You should read up on the usage of map and lambda functions in Python; there is plenty of documentation. The error here comes from the literal double quotes still sitting in every field: int cannot parse the string '"01100"'.
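A quick check in a plain Python shell (no Spark needed) shows the failure and the fix:

int('"01100"')   # ValueError: invalid literal for int() with base 10: '"01100"'
int('01100')     # 1100 -- parses fine once the quotes are gone

So strip the quotes first, then split: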

# rdd holds the raw CSV lines with the header already removed;
# drop the embedded quotes, then split on the commas
rdd2 = rdd.map(lambda l: l.replace('"', '').split(','))
print(rdd2.collect())

# x[3] is now a bare digit string such as '01100', so int() succeeds
rdd3 = rdd2.filter(lambda x: int(x[3]) == 1100)
print(rdd3.collect())

[['1', 'John', 'Smith', '01100'], ['2', 'Jane', 'Doe', '0500']]
[['1', 'John', 'Smith', '01100']]
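
As a side note, replace('"', '') also deletes quotes that are genuinely part of the data, and split(',') breaks any quoted field that contains a comma. If that can happen in the real file, here is a sketch using Python's built-in csv module (assuming bank_rdd still holds the raw lines, header included) that parses the quoting properly:

import csv

def parse_line(line):
    # csv.reader takes an iterable of lines; it strips the surrounding
    # quotes and keeps quoted commas inside their field
    return next(csv.reader([line]))

bank_rdd1 = bank_rdd.map(parse_line)
header = bank_rdd1.first()
bank_rdd1 = bank_rdd1.filter(lambda row: row != header)
print(bank_rdd1.filter(lambda x: int(x[3]) == 1100).count())  # 1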
