
pyspark: Filter one RDD based on certain columns of another RDD

I have two files in a Spark cluster, foo.csv and bar.csv, both with the same four columns: time, user, url, category.

I'd like to filter foo.csv by certain columns of bar.csv. In the end, I want key/value pairs of (user, category): [list, of, urls]. For example:

foo.csv:
11:50:00, 111, www.google.com, search
11:50:00, 222, www.espn.com, news
11:50:00, 333, www.reddit.com, news
11:50:00, 444, www.amazon.com, store
11:50:00, 111, www.bing.com, search
11:50:00, 222, www.cnn.com, news
11:50:00, 333, www.aol.com, news
11:50:00, 444, www.jet.com, store
11:50:00, 111, www.yahoo.com, search
11:50:00, 222, www.bbc.com, news
11:50:00, 333, www.nytimes.com, news
11:50:00, 444, www.macys.com, store

bar.csv:
11:50:00, 222, www.bbc.com, news
11:50:00, 444, www.yahoo.com, store

Should result in:

{
(111, search):[www.google.com, www.bing.com, www.yahoo.com],
(333, news): [www.reddit.com, www.aol.com, www.nytimes.com]
}

In other words, if a (user, category) pair exists in bar.csv, I'd like to remove all lines in foo.csv that have that same exact (user, category) pair. Thus, in the above example, I'd like to remove all lines in foo.csv with (222, news) and (444, store). Ultimately, after removing those lines, I'd like a dictionary with key/value pairs like (user, category): [list, of, urls].

Here's my code:

fooRdd = sc.textFile("file:///foo.csv")
barRdd = sc.textFile("file:///bar.csv")


parseFooRdd = fooRdd.map(lambda line: line.split(", "))
parseBarRdd = barRdd.map(lambda line: line.split(", "))



# (n[1] = user_id, n[3] = category_id) --> [n[2] = url]
fooGroupRdd = parseFooRdd.map(lambda n: ((n[1], n[3]),\
    n[2])).groupByKey().map(lambda x: {x[0]: list(x[1])})
barGroupRdd = parseBarRdd.map(lambda n: ((n[1], n[3]),\
    n[2])).groupByKey().map(lambda x: {x[0]: list(x[1])})

The above code works and gets the datasets in the format I want:

(user_id, category): [all, urls, visited, by, user, in, that, category]

However, there are a couple of issues: 1) I think it returns a list of dictionaries, each with just one key/value pair, and 2) I'm stuck on what to do next. I know what to do in English: get the keys in barGroupRdd (tuples) and remove all lines in fooGroupRdd that have the same key. But I'm new to PySpark, and I feel like there are commands I'm not taking advantage of; I think my code can be optimized. For example, I don't think I need the barGroupRdd line at all, because all I need from bar.csv is the (user_id, category) pairs -- I don't need to build a dictionary. I also think I should filter first and then create the dictionary from the result.
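Roughly, something like this is what I have in mind (untested, building on the variables above):

# collect the (user, category) keys present in bar.csv into a set ...
barKeys = set(parseBarRdd.map(lambda n: (n[1], n[3])).collect())

# ... keep only foo.csv rows whose (user, category) is NOT in that set,
# then group the surviving urls by (user, category)
filtered = parseFooRdd.filter(lambda n: (n[1], n[3]) not in barKeys)
result = dict(filtered.map(lambda n: ((n[1], n[3]), n[2]))
                      .groupByKey()
                      .mapValues(list)
                      .collect())

Any help or advice is appreciated, thanks!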

You're really quite close.

Instead of this for each RDD:

fooGroupRdd = parseFooRdd.map(lambda n: ((n[1], n[3]),\
    n[2])).groupByKey().map(lambda x: {x[0]: list(x[1])})

Do this:

fooGroupRdd = parseFooRdd.map(lambda n: ((n[1], n[3]),\
    n[2])).groupByKey().map(lambda x: [x[0], list(x[1])])

That way you can actually access the keys with the rdd.keys() method and create a bar_keys list.

bar_keys = barGroupRdd.keys().collect()
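As an aside, since bar_keys is probed for membership once per element in the filter below, wrapping the collected keys in a set makes each lookup O(1) instead of a list scan:

# the keys are (user, category) tuples, which are hashable
bar_keys = set(barGroupRdd.keys().collect())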

Then you can do exactly what you said. Filter the rows in fooGroupRdd that have a key in bar_keys.

dict(fooGroupRdd.filter(lambda x: x[0] not in bar_keys)\
    .map(lambda x: [x[0], x[1]]).collect())

The final result looks like this:

{('111', 'search'): ['www.google.com', 'www.bing.com', 'www.yahoo.com'],
 ('333', 'news'): ['www.reddit.com', 'www.aol.com', 'www.nytimes.com']}

Hope that helps.

Per your comment, I too wondered whether this is the most efficient method. Looking into the class methods for RDD, you will find collectAsMap(), which works like collect() but returns a dictionary instead of a list. However, upon investigating the source code, that method simply does exactly what I did (it builds a dict from the collected pairs), so it would seem this is the best option.
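In code, the collectAsMap() variant would look like this (a sketch, reusing fooGroupRdd and bar_keys from above):

# collectAsMap() builds the dictionary on the driver, like dict(collect())
result = fooGroupRdd.filter(lambda x: x[0] not in bar_keys).collectAsMap()

One other built-in worth knowing about is subtractByKey(), which drops every pair whose key also appears in another pair RDD. Applied to the pre-grouped pair RDDs from the question, it avoids collecting bar's keys to the driver at all (again just a sketch):

# pair each row up as ((user, category), url), then let Spark do the anti-join
fooPairs = parseFooRdd.map(lambda n: ((n[1], n[3]), n[2]))
barPairs = parseBarRdd.map(lambda n: ((n[1], n[3]), n[2]))
result = dict(fooPairs.subtractByKey(barPairs).groupByKey().mapValues(list).collect())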
