
What is the best way to group URLs in a Java MapReduce job?

I have a huge data set of URLs from various domains. I have to process them with MapReduce so that URLs with a similar pattern are grouped together. For example:

http://www.agricorner.com/price/onion-prices/
http://www.agricorner.com/price/potato-prices/
http://www.agricorner.com/price/ladyfinder-prices/

http://www.agricorner.com/tag/story/story-1.html
http://www.agricorner.com/tag/story/story-11.html
http://www.agricorner.com/tag/story/story-41.html

https://agrihunt.com/author/ramzan/page/3/
https://agrihunt.com/author/shahban/page/5/
https://agrihunt.com/author/Sufer/page/3/

I want to group these URLs by their pattern, i.e. URLs with a similar pattern should end up in the same group (in the reducer phase of MapReduce). The expected output might look like this:

group1, http://www.agricorner.com/price/onion-prices/, http://www.agricorner.com/price/potato-prices/, http://www.agricorner.com/price/ladyfinder-prices/

group2, http://www.agricorner.com/tag/story/story-1.html, http://www.agricorner.com/tag/story/story-11.html, http://www.agricorner.com/tag/story/story-41.html

group3, https://agrihunt.com/author/ramzan/page/3/, https://agrihunt.com/author/shahban/page/5/, https://agrihunt.com/author/Sufer/page/3/

Is it possible? Is there a better approach than the proposed one?

Update on what "similar pattern" means:

In the example above, "/price/ladyfinder-prices", "price/potato-prices/" and "/ladyfinder-prices/" are grouped together because they share the same domain and the same path up to some level. The same goes for the other examples. My scenario is very close to the one discussed on GitHub, but how would it work in MapReduce?

Map each URL to a key obtained by removing everything after the last / (ignoring a trailing /, as in the example URLs).

Done. Straightforward, isn't it?
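A minimal sketch of that key function in plain Java, roughly as it might be called from a mapper's map() method. The class name UrlGroupKey and the method of are made up for illustration. Two details are assumptions rather than part of the answer: a single trailing / is dropped before cutting, and a URL with no path is returned unchanged.

```java
import java.util.Objects;

/** Hypothetical helper: derives a grouping key from a URL. */
final class UrlGroupKey {
    private UrlGroupKey() {}

    /**
     * Returns the URL with everything after the last '/' removed.
     * Assumption: one trailing '/' is ignored, so ".../onion-prices/"
     * is cut back to ".../price/" instead of being left as is.
     */
    static String of(String url) {
        Objects.requireNonNull(url, "url");
        String trimmed = url.endsWith("/")
                ? url.substring(0, url.length() - 1)
                : url;

        // Do not cut inside the "scheme://" prefix.
        int schemeEnd = trimmed.indexOf("://");
        int pathStart = schemeEnd < 0 ? 0 : schemeEnd + 3;

        int lastSlash = trimmed.lastIndexOf('/');
        if (lastSlash < pathStart) {
            // No path segment to remove (bare host): use the URL itself.
            return trimmed;
        }
        // Keep the slash so the key reads as a directory-style prefix.
        return trimmed.substring(0, lastSlash + 1);
    }
}
```

With this rule the three /price/ URLs from the question share the key http://www.agricorner.com/price/ and the three story URLs share http://www.agricorner.com/tag/story/. The agrihunt.com URLs still get different keys, because the author name sits in the middle of the path, so that group would need an extra rule.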

Anything more complicated is likely to fail, and you will need to think carefully about any further rules. For example, you could replace every match of \\d+ with 0 to capture further patterns, or detect common date formats.
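The digit substitution could look something like the sketch below. The class name DigitNormalizer is hypothetical. Applying the replacement to the whole URL is a simplification: it would also rewrite digits in the host name, so a real job might restrict it to the path.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: collapses every run of digits to a single "0". */
final class DigitNormalizer {
    // Compiled once; by default \d matches ASCII digits only.
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    private DigitNormalizer() {}

    /** Replaces each run of digits in the URL with "0". */
    static String normalize(String url) {
        return DIGITS.matcher(url).replaceAll("0");
    }
}
```

Under this rule story-1.html, story-11.html and story-41.html all become story-0.html, so they would receive the same key even without cutting off the last path segment.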

Anyway, write code that assigns the same key to everything that should be in the same group, and different keys to different groups.
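To illustrate why the key is all that matters: in a MapReduce job the mapper emits (key, URL) pairs and the framework hands all values with the same key to one reduce call. The sketch below only imitates that grouping locally with a map, so the key function can be tried without a cluster. LocalGrouping and its group method are hypothetical names, and the key function is passed in as a parameter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

/** Hypothetical local stand-in for the map -> shuffle -> reduce grouping. */
final class LocalGrouping {
    private LocalGrouping() {}

    /**
     * Groups URLs by the key that keyFn assigns to each of them.
     * Each map entry corresponds to what one reduce call would receive.
     */
    static Map<String, List<String>> group(List<String> urls,
                                           Function<String, String> keyFn) {
        // TreeMap keeps the keys sorted, similar to reducer input order.
        Map<String, List<String>> groups = new TreeMap<>();
        for (String url : urls) {
            groups.computeIfAbsent(keyFn.apply(url), k -> new ArrayList<>())
                  .add(url);
        }
        return groups;
    }
}
```

Any function from URL to key can be plugged in as keyFn, for example one that cuts the URL after its last /. If two URLs that belong together land in different entries, the key function needs another rule; the grouping step itself does not change.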


 