Get link text of links when crawling a website using crawler4j
I am using crawler4j to crawl a website. When I visit a page, I would like to get the link text of all the links, not just the full URLs. Is this possible?
Thanks in advance.
In the class you derive from WebCrawler, get the contents of the page and then apply a regular expression:
// Inside your WebCrawler subclass, e.g. in visit(Page page):
Map<String, String> urlLinkText = new HashMap<String, String>();
String content = new String(page.getContentData(), page.getContentCharset());
// Group 1 captures the href value, group 2 the visible link text
Pattern pattern = Pattern.compile("<a[^>]*href=\"([^\"]*)\"[^>]*>([^<]*)</a[^>]*>", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(content);
while (matcher.find()) {
    urlLinkText.put(matcher.group(1), matcher.group(2));
}
Then store urlLinkText somewhere you can get to it once your crawl is complete. For example, you could make it a private member of your crawler class and add a getter.
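To illustrate the extraction step in isolation, here is a minimal, self-contained sketch that runs the same kind of regex against a hard-coded HTML snippet rather than a live crawl. The `LinkTextExtractor` class name and `extractLinks` method are illustrative inventions, not part of crawler4j; in the real crawler the input string would come from `page.getContentData()` as shown above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class: demonstrates mapping each href to its
// visible link text with the regex approach from the answer.
public class LinkTextExtractor {

    // Group 1 = href value, group 2 = link text between the anchor tags.
    private static final Pattern ANCHOR = Pattern.compile(
            "<a[^>]*href=\"([^\"]*)\"[^>]*>([^<]*)</a[^>]*>",
            Pattern.CASE_INSENSITIVE);

    public static Map<String, String> extractLinks(String html) {
        Map<String, String> urlLinkText = new LinkedHashMap<>();
        Matcher matcher = ANCHOR.matcher(html);
        while (matcher.find()) {
            urlLinkText.put(matcher.group(1), matcher.group(2));
        }
        return urlLinkText;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"/about\">About us</a> and "
                + "<a class=\"nav\" href=\"https://example.com\">Example</a></p>";
        LinkTextExtractor.extractLinks(html)
                .forEach((url, text) -> System.out.println(url + " -> " + text));
    }
}
```

Note that a regex like this only handles simple anchors (plain text content, double-quoted `href`); nested tags inside the anchor or single-quoted attributes will not match, so for messy real-world HTML an actual parser would be more robust.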