I have a huge data set of URLs from various domains. I have to process them via MapReduce so that URLs with a similar pattern are grouped together. For example:
http://www.agricorner.com/price/onion-prices/
http://www.agricorner.com/price/potato-prices/
http://www.agricorner.com/price/ladyfinder-prices/
http://www.agricorner.com/tag/story/story-1.html
http://www.agricorner.com/tag/story/story-11.html
http://www.agricorner.com/tag/story/story-41.html
https://agrihunt.com/author/ramzan/page/3/
https://agrihunt.com/author/shahban/page/5/
https://agrihunt.com/author/Sufer/page/3/
I want to group these URLs by pattern, i.e. URLs with a similar pattern should end up in the same group (in the reducer phase of MapReduce). The expected output might look like:
group1, http://www.agricorner.com/price/onion-prices/, http://www.agricorner.com/price/potato-prices/, http://www.agricorner.com/price/ladyfinder-prices/
group2, http://www.agricorner.com/tag/story/story-1.html, http://www.agricorner.com/tag/story/story-11.html, http://www.agricorner.com/tag/story/story-41.html
group3, https://agrihunt.com/author/ramzan/page/3/, https://agrihunt.com/author/shahban/page/5/, https://agrihunt.com/author/Sufer/page/3/
Is it possible? Is there a better approach than the one proposed?
Update on what "similar pattern" means:
In the example above, "/price/onion-prices/", "/price/potato-prices/" and "/price/ladyfinder-prices/" are grouped together because they share the same domain and the same path up to some level. The same goes for the other examples. My scenario is very close to the one discussed on GitHub, but how would it work with MapReduce?
Map each URL to the key with everything after the last / removed.
Done. Straightforward, isn't it?
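A minimal sketch of such a key function, in Python. The name group_key is hypothetical, and one assumption goes beyond the rule as stated: a trailing slash is ignored, because otherwise a URL like .../price/onion-prices/ would be its own key and never share a group.

```python
from urllib.parse import urlsplit


def group_key(url: str) -> str:
    """Return scheme, host and path with the last path segment removed."""
    parts = urlsplit(url)
    # Assumption (not in the original rule): ignore a trailing slash so that
    # ".../onion-prices/" is treated like ".../onion-prices".
    path = parts.path.rstrip("/")
    # Keep everything up to and including the last remaining slash.
    parent = path[: path.rfind("/") + 1]
    return f"{parts.scheme}://{parts.netloc}{parent}"


if __name__ == "__main__":
    print(group_key("http://www.agricorner.com/price/onion-prices/"))
    print(group_key("http://www.agricorner.com/tag/story/story-11.html"))
```

With this key, the price URLs and the story URLs from the question each collapse to a single key. The agrihunt.com author URLs do not, since the author name sits above the last segment; they need one of the further rules.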
Anything more complicated is likely to fail, and you'll need to carefully consider further rules. For example, you could substitute \d+ by 0 to capture further patterns. Or detect common formats of dates.
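The digit substitution could be sketched like this in Python. The function name normalize_digits is hypothetical, and restricting the substitution to the path (leaving the host name alone) is an assumption, not something the rule above specifies.

```python
import re
from urllib.parse import urlsplit

# One or more consecutive digits, i.e. the \d+ pattern mentioned above.
DIGITS = re.compile(r"\d+")


def normalize_digits(url: str) -> str:
    """Replace every run of digits in the URL path with a single 0."""
    parts = urlsplit(url)
    # Assumption: only the path is normalized, so digits in the
    # host name (e.g. "cdn2.example.com") are left untouched.
    new_path = DIGITS.sub("0", parts.path)
    return parts._replace(path=new_path).geturl()


if __name__ == "__main__":
    print(normalize_digits("http://www.agricorner.com/tag/story/story-11.html"))
    print(normalize_digits("https://agrihunt.com/author/ramzan/page/3/"))
```

After this step, story-1.html, story-11.html and story-41.html all become story-0.html, so they would share a key even without dropping the last segment.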
Anyway, write code to assign the same key to everything that should be the same group, and different keys to different groups.
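To show how the key drives the grouping, here is a small in-memory imitation of the map and reduce phases in plain Python. It is only a sketch: mapper, reducer and group_key are hypothetical names, no Hadoop API is used, and the key function (drop the last path segment, ignoring a trailing slash) is just one possible choice.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_key(url):
    """Hypothetical key: scheme, host and path minus the last segment."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")  # assumption: trailing slash is ignored
    return f"{parts.scheme}://{parts.netloc}{path[: path.rfind('/') + 1]}"


def mapper(urls):
    """Map phase: emit one (key, url) pair per input URL."""
    for url in urls:
        yield group_key(url), url


def reducer(pairs):
    """Shuffle and reduce phase: collect all URLs that share a key."""
    groups = defaultdict(list)
    for key, url in pairs:
        groups[key].append(url)
    return dict(groups)


if __name__ == "__main__":
    sample = [
        "http://www.agricorner.com/price/onion-prices/",
        "http://www.agricorner.com/price/potato-prices/",
        "http://www.agricorner.com/tag/story/story-1.html",
        "http://www.agricorner.com/tag/story/story-11.html",
    ]
    for key, members in reducer(mapper(sample)).items():
        print(key, ", ".join(members))
```

In a real MapReduce job the framework does the shuffle, so the reducer simply receives all URLs for one key and writes them out as one group. Changing the grouping then only means changing the key function.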