
pyspark: Filter one RDD based on certain columns of another RDD

I have two files in a Spark cluster, foo.csv and bar.csv, both with the same four columns: time, user, url, category.

I'd like to filter foo.csv by certain columns of bar.csv. In the end, I want key/value pairs of (user, category): [list, of, urls]. For example:

foo.csv:
11:50:00, 111, www.google.com, search
11:50:00, 222, www.espn.com, news
11:50:00, 333, www.reddit.com, news
11:50:00, 444, www.amazon.com, store
11:50:00, 111, www.bing.com, search
11:50:00, 222, www.cnn.com, news
11:50:00, 333, www.aol.com, news
11:50:00, 444, www.jet.com, store
11:50:00, 111, www.yahoo.com, search
11:50:00, 222, www.bbc.com, news
11:50:00, 333, www.nytimes.com, news
11:50:00, 444, www.macys.com, store

bar.csv:
11:50:00, 222, www.bbc.com, news
11:50:00, 444, www.yahoo.com, store

Should result in:

{
(111, search):[www.google.com, www.bing.com, www.yahoo.com],
(333, news): [www.reddit.com, www.aol.com, www.nytimes.com]
}

In other words, if a (user, category) pair exists in bar.csv, I'd like to remove all lines in foo.csv that have that same exact (user, category) pair. Thus, in the above example, I'd like to remove all lines in foo.csv with (222, news) and (444, store). Ultimately, after removing those lines, I'd like a dictionary with key/value pairs like (user, category): [list, of, urls].

Here's my code:

fooRdd = sc.textFile("file:///foo.csv")
barRdd = sc.textFile("file:///bar.csv")


parseFooRdd = fooRdd.map(lambda line: line.split(", "))
parseBarRdd = barRdd.map(lambda line: line.split(", "))



# (n[1] = user_id, n[3] = category_id) --> [n[2] = url]
fooGroupRdd = parseFooRdd.map(lambda n: ((n[1], n[3]),\
    n[2])).groupByKey().map(lambda x: {x[0]: list(x[1])})
barGroupRdd = parseBarRdd.map(lambda n: ((n[1], n[3]),\
    n[2])).groupByKey().map(lambda x: {x[0]: list(x[1])})

The above code works and gets the datasets in the format I want:

(user_id, category): [all, urls, visited, by, user, in, that, category]

However, there are a couple of issues: 1) I think it returns a list of dictionaries, each with just one key/value pair, and 2) I'm stuck on what to do next. I know what to do in English: get the keys in barGroupRdd (tuples) and remove all lines in fooGroupRdd that have the same key. But I'm new to PySpark, and I feel like there are commands I'm not taking advantage of; I think my code can be optimized. For example, I don't think I need the barGroupRdd line at all, because all I need from bar.csv is the (user_id, category) pairs -- I don't need to build a dictionary. I also think I should filter first and then create the dictionary from the result.
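Roughly, something like this is what I have in mind (untested, building on the variables above):

# collect the (user, category) keys present in bar.csv into a set ...
barKeys = set(parseBarRdd.map(lambda n: (n[1], n[3])).collect())

# ... keep only foo.csv rows whose (user, category) is NOT in that set,
# then group the surviving urls by (user, category)
filtered = parseFooRdd.filter(lambda n: (n[1], n[3]) not in barKeys)
result = dict(filtered.map(lambda n: ((n[1], n[3]), n[2]))
                      .groupByKey()
                      .mapValues(list)
                      .collect())

Any help or advice is appreciated, thanks!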

You're really quite close.

Instead of this for each RDD:

fooGroupRdd = parseFooRdd.map(lambda n: ((n[1], n[3]),\
    n[2])).groupByKey().map(lambda x: {x[0]: list(x[1])})

Do this:

fooGroupRdd = parseFooRdd.map(lambda n: ((n[1], n[3]),\
    n[2])).groupByKey().map(lambda x: [x[0], list(x[1])])

That way you can actually access the keys with the rdd.keys() method and create a bar_keys list.

bar_keys = barGroupRdd.keys().collect()
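As an aside, since bar_keys is probed for membership once per element in the filter below, wrapping the collected keys in a set makes each lookup O(1) instead of a list scan:

# the keys are (user, category) tuples, which are hashable
bar_keys = set(barGroupRdd.keys().collect())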

Then you can do exactly what you said. Filter the rows in fooGroupRdd that have a key in bar_keys.

dict(fooGroupRdd.filter(lambda x: x[0] not in bar_keys)\
    .map(lambda x: [x[0], x[1]]).collect())

The final result looks like this:

{('111', 'search'): ['www.google.com', 'www.bing.com', 'www.yahoo.com'],
 ('333', 'news'): ['www.reddit.com', 'www.aol.com', 'www.nytimes.com']}

Hope that helps.

Per your comment, I too wondered whether this is the most efficient method. Looking into the class methods for RDD, you will find collectAsMap(), which works like collect() but returns a dictionary instead of a list. However, upon investigating the source code, that method simply does exactly what I did (it builds a dict from the collected pairs), so it would seem this is the best option.
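In code, the collectAsMap() variant would look like this (a sketch, reusing fooGroupRdd and bar_keys from above):

# collectAsMap() builds the dictionary on the driver, like dict(collect())
result = fooGroupRdd.filter(lambda x: x[0] not in bar_keys).collectAsMap()

One other built-in worth knowing about is subtractByKey(), which drops every pair whose key also appears in another pair RDD. Applied to the pre-grouped pair RDDs from the question, it avoids collecting bar's keys to the driver at all (again just a sketch):

# pair each row up as ((user, category), url), then let Spark do the anti-join
fooPairs = parseFooRdd.map(lambda n: ((n[1], n[3]), n[2]))
barPairs = parseBarRdd.map(lambda n: ((n[1], n[3]), n[2]))
result = dict(fooPairs.subtractByKey(barPairs).groupByKey().mapValues(list).collect())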
