Get link text of links when crawling a website using crawler4j
I am using crawler4j to crawl a website. When I visit a page, I would like to get the link text of all the links, not just the full URLs. Is this possible?
Thanks in advance.
In the class you derive from WebCrawler, get the contents of the page and then apply a regular expression:
// Inside your WebCrawler subclass, e.g. in visit(Page page):
Map<String, String> urlLinkText = new HashMap<String, String>();
String content = new String(page.getContentData(), page.getContentCharset());
// Group 1 captures the href value, group 2 the visible link text
Pattern pattern = Pattern.compile("<a[^>]*href=\"([^\"]*)\"[^>]*>([^<]*)</a[^>]*>", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(content);
while (matcher.find()) {
    urlLinkText.put(matcher.group(1), matcher.group(2));
}
Then store urlLinkText somewhere you can get to it once your crawl is complete. For example, you could make it a private member of your crawler class and add a getter.
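To illustrate the extraction step in isolation, here is a minimal, self-contained sketch that runs the same kind of regex against a hard-coded HTML snippet rather than a live crawl. The `LinkTextExtractor` class name and `extractLinks` method are illustrative inventions, not part of crawler4j; in the real crawler the input string would come from `page.getContentData()` as shown above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class: demonstrates mapping each href to its
// visible link text with the regex approach from the answer.
public class LinkTextExtractor {

    // Group 1 = href value, group 2 = link text between the anchor tags.
    private static final Pattern ANCHOR = Pattern.compile(
            "<a[^>]*href=\"([^\"]*)\"[^>]*>([^<]*)</a[^>]*>",
            Pattern.CASE_INSENSITIVE);

    public static Map<String, String> extractLinks(String html) {
        Map<String, String> urlLinkText = new LinkedHashMap<>();
        Matcher matcher = ANCHOR.matcher(html);
        while (matcher.find()) {
            urlLinkText.put(matcher.group(1), matcher.group(2));
        }
        return urlLinkText;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"/about\">About us</a> and "
                + "<a class=\"nav\" href=\"https://example.com\">Example</a></p>";
        LinkTextExtractor.extractLinks(html)
                .forEach((url, text) -> System.out.println(url + " -> " + text));
    }
}
```

Note that a regex like this only handles simple anchors (plain text content, double-quoted `href`); nested tags inside the anchor or single-quoted attributes will not match, so for messy real-world HTML an actual parser would be more robust.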