I have an RDD called bank_rdd that was loaded from a CSV file.
First, I split each line on commas into a list:
bank_rdd1 = bank_rdd.map(lambda line: line.split(','))
The header titles are:
accountNumber, personFname, personLname, balance
I then removed the header row:
header = bank_rdd1.first()
bank_rdd1 = bank_rdd1.filter(lambda row: row != header)
All of the data in the CSV file is stored as strings, and the values keep their surrounding double quotes. Sample data for the first two records:
[('"1"','"John"','"Smith"','"01100"'),('"2"','"Jane"','"Doe"','"0500"')]
When I run the following code:
bank_rdd1_example = bank_rdd1.filter(lambda x: x[3] == '"01100"')
bank_rdd1_example.count()
I get a value of 1, which is correct because only one row in the dataset has the value "01100".
However, when I run the following code, I get an error:
bank_rdd1_example2 = bank_rdd1.filter(lambda x: int(x[3]) == 1100)
bank_rdd1_example2.count()
Basically, I want this code to return 1 as well, but I cannot figure out how to make it work.
Any help is appreciated!
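The fields still contain the literal double-quote characters from the CSV file, so int(x[3]) is really int('"01100"'), which raises a ValueError. Remove the quotes before splitting and the integer comparison works (here rdd is the raw RDD of lines, without the header row). It is also worth reading the Python documentation on map and lambda functions.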
rdd2 = rdd.map(lambda l: l.replace('"','').split(','))
print(rdd2.collect())
rdd3 = rdd2.filter(lambda x: int(x[3]) == 1100)
print(rdd3.collect())
[['1', 'John', 'Smith', '01100'], ['2', 'Jane', 'Doe', '0500']]
[['1', 'John', 'Smith', '01100']]
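To see why the original filter fails and why stripping the quotes fixes it, the same parsing logic can be tried in plain Python without Spark. This is only a sketch: the sample lines are reconstructed from the data shown in the question, and the helper names (parse_naive, parse_stripped, parse_csv, balance_is) are hypothetical. The csv-based variant is an alternative not used in the answer above; it may be preferable if a quoted field can itself contain a comma.

```python
import csv

# Sample lines, assumed to match the raw CSV rows described in the question.
lines = [
    '"1","John","Smith","01100"',
    '"2","Jane","Doe","0500"',
]

def parse_naive(line):
    # Split only, as in the question: the double quotes stay inside each field.
    return line.split(',')

def parse_stripped(line):
    # Remove every double quote first, then split, as in the answer.
    return line.replace('"', '').split(',')

def parse_csv(line):
    # Alternative (hypothetical helper): let the csv module interpret the quoting.
    return next(csv.reader([line]))

def balance_is(row, amount):
    # int() accepts leading zeros in a string, so '01100' compares equal to 1100.
    return int(row[3]) == amount

# The naive parse leaves '"01100"' in the field, which int() cannot convert.
try:
    int(parse_naive(lines[0])[3])
    naive_error = None
except ValueError as exc:
    naive_error = exc

matches_stripped = [r for r in map(parse_stripped, lines) if balance_is(r, 1100)]
matches_csv = [r for r in map(parse_csv, lines) if balance_is(r, 1100)]

print(naive_error)
print(matches_stripped)
print(matches_csv)
```

Under the same assumptions, either parsing function could be passed to the RDD's map in place of the lambda, for example rdd.map(parse_stripped), before applying the integer filter.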