Retrieving useful information from a web page using JSOUP

Question

How can I retrieve the "Contact us" link from the footer of any web page on the World Wide Web, using Java?

E.g. find a footer element, or an element with id="footer" or with a footer class?

I tried retrieving all the links from the web page using JSOUP and then running the regex .*contact.* over them. But I cannot be 100% sure that the link fetched this way is the website's "Contact us" page.

Q2

Is there a more robust approach? Or could I combine the footer link with the approach I have already implemented to conclude that a page is certainly a "Contact us" page?

Answer 1

"But I cannot be 100% sure on that the fetched link..."

SHORT ANSWER

You will NEVER be sure.

LONG ANSWER

For a given random HTML page, you want to find the "Contact Us" link. This kind of work is trivial for a human, but it is a big challenge for a computer.

I can see some options in your case:

Option 1: Crowdsourcing

Collect the URLs of all the websites whose "Contact Us" information you want
Send them to a crowdsourcing platform and ask real people to find the information for you (Rapidworkers.com, Crowdsource.com, Clickworker.com, Amazon Mechanical Turk, microworkers.com)

Check whether the platform offers an API.

+ Work done by humans
+ Adapts dynamically to unknown patterns
- Costs money
- We humans are bad at repetitive tasks

Option 2: AI (pattern searching)

Train an AI to extract the information
Then throw your websites at it

Have a look at Weka or Java-ML, for instance.

+ Automated task
+ Can perform a repetitive task for a long time
- May take time to build a robust solution
- Risk of false positives or complete misses
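
As a rough illustration of the kind of input such a model could be trained on, the sketch below turns one link into a small numeric feature vector. The class name LinkFeatures, the choice of features, and the keyword pattern are all assumptions made for this example; none of them come from Weka or Java-ML, and a real model would likely need more features plus a set of labeled links.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.regex.Pattern;

/** Hypothetical helper: turns one link into numeric features for a classifier. */
public class LinkFeatures {

    // Assumed keyword pattern; extend it for other wordings or languages.
    private static final Pattern CONTACT =
            Pattern.compile("contact|get in touch|reach us");

    /**
     * @param text     visible link text, e.g. "Contact Us"
     * @param href     link target, e.g. "/contact-us.html"
     * @param inFooter whether the link was found inside the page footer
     * @return [text mentions contact, href mentions contact, in footer, text length]
     */
    public static double[] extract(String text, String href, boolean inFooter) {
        String t = text.toLowerCase(Locale.ROOT).trim();
        String h = href.toLowerCase(Locale.ROOT);
        return new double[] {
            CONTACT.matcher(t).find() ? 1.0 : 0.0,
            CONTACT.matcher(h).find() ? 1.0 : 0.0,
            inFooter ? 1.0 : 0.0,
            t.length()
        };
    }

    public static void main(String[] args) {
        double[] features = extract("Contact Us", "/contact-us.html", true);
        System.out.println(Arrays.toString(features));
    }
}
```

Each vector, together with a human-supplied label (contact link or not), would form one training row; how it is fed to a particular library depends on that library's own data format.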

Option 3: Use Jsoup

Carefully study the patterns of the websites you target
Tell Jsoup to find the patterns you have detected

This option is a never-ending task. You will always have to feed Jsoup new patterns. I suggest having a monitoring system that tells you when a website escapes every known pattern.
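
A minimal sketch of this option, assuming the only patterns known so far are the three from the question (a footer element, id="footer", or a footer class). The class name FooterContactFinder, the selector list, the keyword "contact", and the example.com URL are illustrative assumptions; the Jsoup calls used are parse, select, text, attr and absUrl.

```java
import java.util.Optional;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FooterContactFinder {

    // Assumed patterns: a <footer> tag, id="footer", or a class named "footer".
    private static final String FOOTER_SELECTOR = "footer, #footer, .footer";

    private static final Pattern CONTACT =
            Pattern.compile("contact", Pattern.CASE_INSENSITIVE);

    /** Returns the absolute URL of the first footer link whose text or href mentions "contact". */
    public static Optional<String> find(Document doc) {
        for (Element footer : doc.select(FOOTER_SELECTOR)) {
            for (Element link : footer.select("a[href]")) {
                if (CONTACT.matcher(link.text()).find()
                        || CONTACT.matcher(link.attr("href")).find()) {
                    return Optional.of(link.absUrl("href"));
                }
            }
        }
        // No known pattern matched: the caller can log the site for review.
        return Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<html><body><a href='/contact-lenses'>Contact lenses</a>"
                + "<div id='footer'><a href='/about'>About</a>"
                + "<a href='/contact-us'>Contact Us</a></div></body></html>";
        // The base URI lets Jsoup resolve relative hrefs to absolute URLs.
        Document doc = Jsoup.parse(html, "https://example.com/");
        System.out.println(find(doc).orElse("no match: add a new pattern"));
    }
}
```

Restricting the search to the footer skips the unrelated "Contact lenses" link in the page body, which a plain .*contact.* regex over all links would have matched. For a live page, the document could be fetched with Jsoup.connect(url).get() instead of Jsoup.parse. An empty result is the signal the monitoring system would watch for.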

+ Automated task
+ Can perform a repetitive task for a long time
- Takes time to study, discover, and add new patterns
- Risk of false positives or complete misses

Option 4: A mix of the three options above

You can have all three options working on the websites you target.

+ Reduces the chances of false positives or complete misses
+ More confidence in the final result
- Takes time to study, discover, and add new patterns
- Costs money
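
One possible way to combine the three sources is a simple vote: accept a URL only when enough of them agree, and otherwise hand the site back to a human. The sketch below assumes each source proposes at most one URL per website and that the URLs are already normalized so that equal pages compare equal; the class name ContactLinkVote and the threshold are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical combiner: accepts a URL only when enough sources agree on it. */
public class ContactLinkVote {

    /**
     * @param candidates one proposed URL per source (crowd worker, classifier,
     *                   Jsoup pattern); a source that found nothing contributes null
     * @param minVotes   how many sources must agree before the URL is accepted
     */
    public static Optional<String> decide(List<String> candidates, int minVotes) {
        Map<String, Integer> votes = new HashMap<>();
        for (String url : candidates) {
            if (url != null) {
                votes.merge(url, 1, Integer::sum); // count one vote per source
            }
        }
        return votes.entrySet().stream()
                .filter(e -> e.getValue() >= minVotes)
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList(
                "https://example.com/contact",   // crowd worker
                "https://example.com/contact",   // classifier
                "https://example.com/support");  // Jsoup pattern
        System.out.println(decide(candidates, 2).orElse("undecided: send to a human"));
    }
}
```

Agreement between sources raises confidence but still does not make the result certain, which is consistent with the short answer above.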

Answer 1: score 2, accepted, 2016-06-27 09:39:13
