
Get link text of links when crawling a website using crawler4j

I am using crawler4j to crawl a website. When I visit a page, I would like to get the link text of every link, not only the full URLs. Is this possible?

Thanks in advance.

In your class that derives from WebCrawler (for example, in its visit(Page) method), get the contents of the page and then apply a regular expression.

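// Maps each href value to its link text; intended to run inside visit(Page page).
Map<String, String> urlLinkText = new HashMap<String, String>();
// Fall back to UTF-8 when no charset is reported (assumption: getContentCharset() may return null).
String charset = page.getContentCharset() != null ? page.getContentCharset() : "UTF-8";
// new String(byte[], String) throws the checked UnsupportedEncodingException.
String content = new String(page.getContentData(), charset);
// Matches only double-quoted href values and link text that contains no nested tags.
Pattern pattern = Pattern.compile("<a[^>]*href=\"([^\"]*)\"[^>]*>([^<]*)</a[^>]*>", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(content);
while (matcher.find()) {
    urlLinkText.put(matcher.group(1), matcher.group(2));
}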
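To see what the pattern does and does not pick up, here is a minimal, self-contained sketch that runs the same regular expression against a hard-coded HTML string instead of a crawled page. The class name LinkTextExtractor, the extract method, and the sample HTML are made up for illustration; they are not part of crawler4j.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of crawler4j.
public class LinkTextExtractor {
    // Same pattern as in the answer: double-quoted href, link text without nested tags.
    private static final Pattern LINK = Pattern.compile(
            "<a[^>]*href=\"([^\"]*)\"[^>]*>([^<]*)</a[^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Returns href -> link text, in the order the links appear in the HTML.
    public static Map<String, String> extract(String html) {
        Map<String, String> urlLinkText = new LinkedHashMap<>();
        Matcher matcher = LINK.matcher(html);
        while (matcher.find()) {
            urlLinkText.put(matcher.group(1), matcher.group(2).trim());
        }
        return urlLinkText;
    }

    public static void main(String[] args) {
        // Sample markup: two plain text links and one image-only link.
        String html = "<p><a href=\"/docs\">Documentation</a> and "
                + "<A HREF=\"https://example.com/\" class=\"ext\">Example site</A> and "
                + "<a href=\"/logo\"><img src=\"logo.png\"></a></p>";
        extract(html).forEach((url, text) -> System.out.println(url + " -> " + text));
    }
}
```

With this sample input, the two text links are captured and the image-only link is skipped, because the pattern does not allow nested tags inside the link text. If your pages use single-quoted hrefs or markup inside links, an HTML parser is likely a more robust choice than a regular expression.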

Then store urlLinkText somewhere you can reach it once your crawl is complete. For example, you could make it a private member of your crawler class and add a getter.
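The "private member plus getter" idea can be sketched roughly as follows. To keep the example runnable without crawler4j on the classpath, it uses a standalone, hypothetical LinkTextStore class; in a real crawl the field and getter would sit in your WebCrawler subclass and record would be called from visit(Page). The ConcurrentHashMap is an assumption on my part, chosen in case the map ends up shared between crawler threads.

```java
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical holder; in a real crawl these members would live in the WebCrawler subclass.
public class LinkTextStore {
    // Private member holding href -> link text; thread-safe in case it is shared.
    private final Map<String, String> urlLinkText = new ConcurrentHashMap<>();

    // Would be called for each regex match found while visiting a page.
    public void record(String url, String linkText) {
        urlLinkText.put(url, linkText);
    }

    // Getter exposing a read-only view for use after the crawl has finished.
    public Map<String, String> getUrlLinkText() {
        return Collections.unmodifiableMap(urlLinkText);
    }

    public static void main(String[] args) {
        LinkTextStore store = new LinkTextStore();
        store.record("/docs", "Documentation");
        System.out.println(store.getUrlLinkText());
    }
}
```

Returning an unmodifiable view keeps callers from changing the collected data by accident while still letting them read everything the crawl gathered.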
