Retrieve useful info from a webpage using Jsoup
How can I retrieve the "Contact Us" link from the footer of any web page on the World Wide Web, in Java?
E.g. find a footer element, or an element with id="footer" or with a footer class?
I tried retrieving all the links from the webpage using Jsoup and then running the regex .*contact.* on them. But I cannot be 100% sure that the link fetched with this approach is the website's contact us page.
Q2
Is there any other robust approach? Or could I combine the footer link with my already completed approach to conclude that a page is certainly a contact us page?
But I cannot be 100% sure on that the fetched link...
You will NEVER be sure.
For a given random HTML page, you want to find the "Contact Us" link. This kind of work is trivial for a human. It represents a big challenge for a computer.
I can see some options in your case:
Option 1: Crowdsourcing
Check whether the crowdsourcing platform you pick offers an API.
+ Work is done by humans
+ Adapts dynamically to unknown patterns
- Costs money
- We suck at repetitive tasks
Option 2: AI (pattern searching)
Have a look at Weka, for instance, or Java-ML.
+ Automated task
+ Can perform a repetitive task for a long time
- May take time to build a robust solution
- Risk of false positives or complete misses
Option 3: Use Jsoup
This option is a never-ending task. You'll always have to feed Jsoup with new patterns. I suggest setting up a monitoring system that tells you when a website escapes every known pattern.
+ Automated task
+ Can perform a repetitive task for a long time
- Takes time to study, discover, and add new patterns
- Risk of false positives or complete misses
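As a rough illustration of the Jsoup option, here is a minimal sketch that looks only at links inside a footer-like element and keeps the first one whose text or href mentions "contact". The class name FooterContactFinder, the selector list, and the pattern are assumptions made for this example, not a complete set of patterns; real sites will need more. The empty result is the spot where a monitoring hook could record a page that escaped every known pattern.

```java
import java.util.Optional;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class FooterContactFinder {
    // Assumed footer patterns: a <footer> element, id="footer", or class="footer".
    private static final String FOOTER_LINKS =
            "footer a[href], #footer a[href], .footer a[href]";

    // Assumed contact pattern; extend it as new sites escape it.
    private static final Pattern CONTACT =
            Pattern.compile("contact", Pattern.CASE_INSENSITIVE);

    /** Returns the first footer link whose text or href mentions "contact". */
    static Optional<String> findContactLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        for (Element link : doc.select(FOOTER_LINKS)) {
            String href = link.attr("href");
            if (CONTACT.matcher(link.text()).find() || CONTACT.matcher(href).find()) {
                // absUrl resolves a relative href against baseUri.
                return Optional.of(link.absUrl("href"));
            }
        }
        // No known pattern matched: a monitoring hook could log this page here.
        return Optional.empty();
    }
}
```

For a live page, the HTML string could be replaced by a document fetched with Jsoup.connect(url).get(). A match is still only a candidate, not proof that the target is the contact page.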
Option 4: A mix of the three options above
You can have all three options working on the websites you target.
+ Reduces the chance of false positives or complete misses
+ More confident final result
- Takes time to study, discover, and add new patterns
- Costs money
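One way to picture the mix, and to combine the footer check with the regex check from the question, is to treat each signal as evidence and send ambiguous cases to a human. The sketch below is only an illustration: the class ContactLinkScorer, the weights, and the thresholds are all made-up assumptions that would need tuning on real data.

```java
import java.util.regex.Pattern;

class ContactLinkScorer {
    enum Verdict { ACCEPT, NEEDS_HUMAN_REVIEW, REJECT }

    private static final Pattern CONTACT =
            Pattern.compile("contact", Pattern.CASE_INSENSITIVE);

    /** Adds up evidence from the href, the link text, and the footer location. */
    static int score(String href, String linkText, boolean inFooter) {
        int score = 0;
        if (CONTACT.matcher(href).find()) score += 2;      // assumed weight
        if (CONTACT.matcher(linkText).find()) score += 2;  // assumed weight
        if (inFooter) score += 1;                          // assumed weight
        return score;
    }

    /** Maps a score to a decision; the thresholds are illustrative only. */
    static Verdict classify(int score) {
        if (score >= 4) return Verdict.ACCEPT;
        if (score >= 2) return Verdict.NEEDS_HUMAN_REVIEW; // e.g. crowdsourcing
        return Verdict.REJECT;
    }
}
```

With these assumed weights, a footer link "/contact-us" labelled "Contact Us" is accepted, while "/contact-lenses" labelled "Shop lenses" matches only on the href and is handed to a human instead of being trusted.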