LucidWorks: Java Regular Expressions & GNU Regular Expressions
I am trying to create regular expressions so that I can crawl and index certain URLs on my website with LucidWorks.
Example URL: http://www.example.com/reviews/assassins-creed-revelations/24475/reviews/
Example URL: http://www.example.com/reviews/super-mario-3d-land/64303/reviews/
Basically, I want LucidWorks to search my entire site and index only URLs that have /reviews/ at the end of the URL.
Could anyone help me construct an expression to do that, please? :)
Updated:
URL: http://www.example.com/
Include paths: / /*/reviews/*
That kind of worked, but it only crawls the first page; it won't go on to the next pages with more reviews (1, 2, 3, etc.).
If I also add: / / /reviews/.*
I get a load of pages indexed that I don't want, such as http://www.example.com/?page=2
You can check each URL with a function like this:
import java.util.regex.Pattern;

// Returns true if the URL ends with the given suffix (default: "/reviews/").
public boolean canAcceptURL(String url, String endsWith) {
    // Fall back to the default suffix when none is passed.
    if (endsWith == null || endsWith.isEmpty()) {
        endsWith = "/reviews/";
    }
    if (url == null) {
        return false;
    }
    // Any run of printable ASCII characters followed by the literal suffix.
    // Pattern.quote keeps regex metacharacters in the suffix from being interpreted.
    String regex = "[\\x20-\\x7E]*" + Pattern.quote(endsWith) + "$";
    boolean canAccept = url.matches(regex);
    System.out.println("String matches : " + canAccept);
    return canAccept;
}
Sample output:
Calling function: canAcceptURL("http://www.example.com/reviews/super-mario-3d-land/64303/reviews/", "/reviews/");
String matches : true
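When the check runs for every crawled URL, it may be worth compiling the pattern once instead of on each call. The following is only a sketch of that idea, not part of LucidWorks: the class name ReviewUrlFilter and both method names are made up for illustration, and the input list reuses URLs from the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper class; the names here are illustrative only.
class ReviewUrlFilter {
    // Printable ASCII characters followed by a literal "/reviews/" suffix.
    // Matcher.matches() must consume the whole input, so no anchors are needed.
    private static final Pattern REVIEWS_SUFFIX =
            Pattern.compile("[\\x20-\\x7E]*/reviews/");

    static boolean endsWithReviews(String url) {
        return url != null && REVIEWS_SUFFIX.matcher(url).matches();
    }

    // Keeps only the URLs that end with "/reviews/".
    static List<String> keepReviewUrls(List<String> urls) {
        return urls.stream()
                .filter(ReviewUrlFilter::endsWithReviews)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> crawled = Arrays.asList(
                "http://www.example.com/reviews/super-mario-3d-land/64303/reviews/",
                "http://www.example.com/?page=2");
        // Prints the list containing only the first URL.
        System.out.println(keepReviewUrls(crawled));
    }
}
```

With this input, the ?page=2 URL from the question is dropped because it does not end with /reviews/.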
If you want to accept any URL that contains '/reviews/' anywhere, just change the regex string to:
String regex = "[\\x20-\\x7E]*/reviews/[\\x20-\\x7E]*"; // accepts any printable ASCII, including spaces and special characters, on either side
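An alternative for the "contains" case is Matcher.find(), which looks for the pattern anywhere in the input rather than requiring the whole string to match. This is a rough sketch with a made-up class name (ReviewsSegmentCheck); unlike the regex above, it does not restrict the rest of the URL to printable ASCII.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the name is illustrative only.
class ReviewsSegmentCheck {
    private static final Pattern REVIEWS_SEGMENT = Pattern.compile("/reviews/");

    // find() succeeds if "/reviews/" occurs anywhere in the URL.
    static boolean containsReviews(String url) {
        return url != null && REVIEWS_SEGMENT.matcher(url).find();
    }

    public static void main(String[] args) {
        // "/reviews/" appears in the middle of the path, so this prints true.
        System.out.println(containsReviews(
                "http://www.example.com/reviews/super-mario-3d-land/64303"));
        // No "/reviews/" segment at all, so this prints false.
        System.out.println(containsReviews("http://www.example.com/?page=2"));
    }
}
```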