
What is the best way to group URLs in a Java MapReduce job?

I have a huge data set of URLs from various domains. I have to process them with MapReduce so that URLs with a similar pattern are grouped together. For example:

http://www.agricorner.com/price/onion-prices/
http://www.agricorner.com/price/potato-prices/
http://www.agricorner.com/price/ladyfinder-prices/

http://www.agricorner.com/tag/story/story-1.html
http://www.agricorner.com/tag/story/story-11.html
http://www.agricorner.com/tag/story/story-41.html

https://agrihunt.com/author/ramzan/page/3/
https://agrihunt.com/author/shahban/page/5/
https://agrihunt.com/author/Sufer/page/3/

I want to group these URLs by their pattern, i.e. URLs with a similar pattern should end up in the same group (in the reducer phase of MapReduce). The expected output might look like:

group1, http://www.agricorner.com/price/onion-prices/, http://www.agricorner.com/price/potato-prices/, http://www.agricorner.com/price/ladyfinder-prices/

group2, http://www.agricorner.com/tag/story/story-1.html, http://www.agricorner.com/tag/story/story-11.html, http://www.agricorner.com/tag/story/story-41.html

group3, https://agrihunt.com/author/ramzan/page/3/, https://agrihunt.com/author/shahban/page/5/, https://agrihunt.com/author/Sufer/page/3/

Is it possible? Is there a better approach than the one I proposed?

Update, on what I mean by a similar pattern:

In the example above, "/price/onion-prices/", "/price/potato-prices/" and "/price/ladyfinder-prices/" are grouped together because they share the same domain and the same path up to some level. The same goes for the other examples. My scenario is very close to the one discussed on GitHub, but how would it work in MapReduce?

Map each URL to a key formed by removing everything after the last /.

Done. Straightforward, isn't it?
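As a rough illustration of that key rule, here is a minimal sketch in plain Java. The class and method names (UrlGroupKey, groupKey) are made up for this example. One assumption goes beyond the rule as stated: a single trailing slash is dropped first, because the example URLs end in / and would otherwise map to themselves.

```java
class UrlGroupKey {
    // Hypothetical helper: compute the grouping key for one URL.
    static String groupKey(String url) {
        // Assumption: ignore one trailing slash so ".../onion-prices/" loses its last segment.
        String u = url.endsWith("/") ? url.substring(0, url.length() - 1) : url;
        int lastSlash = u.lastIndexOf('/');
        int schemeEnd = u.indexOf("://");
        // If the last slash belongs to "scheme://", there is no path segment to remove.
        if (lastSlash < 0 || (schemeEnd >= 0 && lastSlash < schemeEnd + 3)) {
            return u + "/";
        }
        // Keep everything up to and including the last slash.
        return u.substring(0, lastSlash + 1);
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.agricorner.com/price/onion-prices/",
            "http://www.agricorner.com/tag/story/story-1.html",
            "https://agrihunt.com/author/ramzan/page/3/"
        };
        for (String url : urls) {
            System.out.println(groupKey(url) + " <- " + url);
        }
    }
}
```

With this sketch the three price URLs share the key http://www.agricorner.com/price/ and the three story URLs share http://www.agricorner.com/tag/story/. The agrihunt.com URLs still get different keys, because the author name sits in the middle of the path, so that group would need an extra rule.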

Anything more complicated is likely to fail, and you'll need to carefully consider further rules. For example, you could substitute \\d+ with 0 to capture further patterns. Or detect common date formats.
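A possible sketch of those two extra rules, again in plain Java. The names (UrlNormalizer, normalize), the DATE placeholder, and the single date shape handled here (yyyy-mm-dd or yyyy/mm/dd) are assumptions for illustration only. Dates are replaced before digits so that they are not reduced to 0/0/0 first.

```java
import java.util.regex.Pattern;

class UrlNormalizer {
    // Assumed date shape: four digits, then two and two, separated by '-' or '/'.
    private static final Pattern DATE = Pattern.compile("\\d{4}[-/]\\d{2}[-/]\\d{2}");
    // Any run of digits.
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Hypothetical helper: collapse dates and numbers so that URLs differing
    // only in those parts become identical strings.
    static String normalize(String url) {
        String withoutDates = DATE.matcher(url).replaceAll("DATE");
        return DIGITS.matcher(withoutDates).replaceAll("0");
    }

    public static void main(String[] args) {
        System.out.println(normalize("http://www.agricorner.com/tag/story/story-11.html"));
        System.out.println(normalize("https://agrihunt.com/author/ramzan/page/3/"));
    }
}
```

Note that this sketch rewrites digits anywhere in the string, including the host name, so a real job would probably restrict it to the path.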

Anyway, write code that assigns the same key to everything that should be in the same group, and different keys to different groups.
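To show how the key drives the grouping, here is a small simulation in plain Java with no Hadoop dependency. It is only a sketch: keyOf stands in for whatever key rule you settle on, and group imitates what the framework does between the map and reduce phases, collecting all values that share a key. In a real job the map step would emit (key, URL) pairs and the reduce step would receive the URLs for one key and write them out as one group. All names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class UrlGrouping {
    // Stand-in key rule (assumption): drop a trailing slash, then cut after the last '/'.
    static String keyOf(String url) {
        String u = url.endsWith("/") ? url.substring(0, url.length() - 1) : url;
        return u.substring(0, u.lastIndexOf('/') + 1);
    }

    // Simulates map + shuffle: every URL is filed under its key.
    static Map<String, List<String>> group(List<String> urls) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String url : urls) {
            groups.computeIfAbsent(keyOf(url), k -> new ArrayList<>()).add(url);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
            "http://www.agricorner.com/price/onion-prices/",
            "http://www.agricorner.com/tag/story/story-1.html",
            "http://www.agricorner.com/price/potato-prices/",
            "http://www.agricorner.com/tag/story/story-11.html");
        // Simulates the reduce step: one output line per key.
        int n = 1;
        for (Map.Entry<String, List<String>> e : group(urls).entrySet()) {
            System.out.println("group" + n++ + ", " + String.join(", ", e.getValue()));
        }
    }
}
```

Because the grouping depends only on the key, refining the rules means changing keyOf and nothing else.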
