PySpark RDD filter error when converting string to integer

I have an RDD called bank_rdd which has been imported from a CSV file.

First I split each comma-separated line into a list:

bank_rdd1 = bank_rdd.map(lambda line: line.split(','))

The header titles are:

accountNumber, personFname, personLname, balance

I then removed the header:

header = bank_rdd1.first()
bank_rdd1 = bank_rdd1.filter(lambda row: row != header)

All of the data in the CSV file is stored as strings. The first two records look like this:

[('"1"','"John"','"Smith"','"01100"'),('"2"','"Jane"','"Doe"','"0500"')]

When I run the following code:

bank_rdd1_example = bank_rdd1.filter(lambda x: x[3] == '"01100"')
bank_rdd1_example.count()

I get a value of 1, which is correct because there is only one row in the dataset with a balance of "01100".

When I run the following code, I get an error:

bank_rdd1_example2 = bank_rdd1.filter(lambda x: int(x[3]) == 1100)
bank_rdd1_example2.count()

Basically I want this code to also return 1, but I am having trouble figuring it out.

Any help is appreciated!

You should read up on the usage of map and lambda functions in Python; there is plenty of documentation. The error here comes from the literal double quotes still sitting in every field: int cannot parse the string '"01100"'.
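A quick check in a plain Python shell (no Spark needed) shows the failure and the fix:

int('"01100"')   # ValueError: invalid literal for int() with base 10: '"01100"'
int('01100')     # 1100 -- parses fine once the quotes are gone

So strip the quotes first, then split: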

# rdd holds the raw CSV lines with the header already removed;
# drop the embedded quotes, then split on the commas
rdd2 = rdd.map(lambda l: l.replace('"', '').split(','))
print(rdd2.collect())

# x[3] is now a bare digit string such as '01100', so int() succeeds
rdd3 = rdd2.filter(lambda x: int(x[3]) == 1100)
print(rdd3.collect())

[['1', 'John', 'Smith', '01100'], ['2', 'Jane', 'Doe', '0500']]
[['1', 'John', 'Smith', '01100']]
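
As a side note, replace('"', '') also deletes quotes that are genuinely part of the data, and split(',') breaks any quoted field that contains a comma. If that can happen in the real file, here is a sketch using Python's built-in csv module (assuming bank_rdd still holds the raw lines, header included) that parses the quoting properly:

import csv

def parse_line(line):
    # csv.reader takes an iterable of lines; it strips the surrounding
    # quotes and keeps quoted commas inside their field
    return next(csv.reader([line]))

bank_rdd1 = bank_rdd.map(parse_line)
header = bank_rdd1.first()
bank_rdd1 = bank_rdd1.filter(lambda row: row != header)
print(bank_rdd1.filter(lambda x: int(x[3]) == 1100).count())  # 1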
