
Retrieve useful info from a webpage using Jsoup

How can I retrieve the "Contact Us" link from the footer of any web page on the World Wide Web, in Java?

E.g. find a footer element, or an element with id="footer", or one that has a footer class?

I tried retrieving all the links from the web page using Jsoup and then running the regex .*contact.* against them. But I cannot be 100% sure that the fetched link from this approach is the website's Contact Us page.

Q2

Is there a more robust approach, or could I combine the footer link with the approach I already have to conclude that a page is definitely a Contact Us page?

But I cannot be 100% sure that the fetched link...

SHORT ANSWER

You will NEVER be sure.


LONG ANSWER

For a given random HTML page, you want to find the "Contact Us" link. This kind of work is trivial for a human. It represents a big challenge for a computer.

I can see some options in your case:

Option 1: Crowdsourcing

  • Collect the URLs of all the websites whose "Contact Us" information you want
  • Send them to a crowdsourcing platform and ask real people to find the information for you (Rapidworkers.com, Crowdsource.com, Clickworker.com, Amazon Mechanical Turk, microworkers.com)

Check whether the platform offers an API.

+ Work is done by humans
+ Adapts dynamically to unknown patterns
- Costs money
- Humans are bad at repetitive tasks

Option 2: AI (pattern searching)

  • Train an AI to extract the information
  • Then throw your websites at it

Have a look at Weka or Java-ML, for instance.

+ Automated task
+ Can perform a repetitive task for a long time
- May take time to build a robust solution
- Risk of false positives or complete misses

Option 3: Use Jsoup

  • Carefully study the patterns used by the websites you target
  • Tell Jsoup to look for the patterns you have detected

This option is a never-ending task. You will always have to feed Jsoup new patterns. I suggest setting up a monitoring system that tells you when a website does not match any known pattern.

+ Automated task
+ Can perform a repetitive task for a long time
- Takes time to study, discover, and add new patterns
- Risk of false positives or complete misses
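
A minimal sketch of Option 3, assuming the footer follows one of the three patterns mentioned in the question (a footer element, id="footer", or a class containing "footer"). The class name FooterContactLinkFinder, the selector string, the "contact" keyword, and the example.com URL are illustrative assumptions, not a list of patterns that holds for every site.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.Optional;
import java.util.regex.Pattern;

public class FooterContactLinkFinder {

    // Assumed footer patterns: <footer>, id="footer", or a class attribute containing "footer".
    private static final String FOOTER_LINKS =
            "footer a[href], #footer a[href], [class*=footer] a[href]";

    // Assumed keyword; real sites may use "Get in touch", "Support", or another language.
    private static final Pattern CONTACT = Pattern.compile("contact", Pattern.CASE_INSENSITIVE);

    /** Returns the first footer link whose text or href mentions "contact", if any. */
    public static Optional<String> findContactLink(String html, String baseUrl) {
        Document doc = Jsoup.parse(html, baseUrl);
        for (Element link : doc.select(FOOTER_LINKS)) {
            boolean textMatches = CONTACT.matcher(link.text()).find();
            boolean hrefMatches = CONTACT.matcher(link.attr("href")).find();
            if (textMatches || hrefMatches) {
                // absUrl resolves a relative href against baseUrl.
                return Optional.of(link.absUrl("href"));
            }
        }
        // No known pattern matched: a candidate for the monitoring list.
        return Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<html><body>"
                + "<a href='/blog/contact-lenses'>Contact lenses</a>"
                + "<div id='footer'><a href='/about'>About</a>"
                + "<a href='/contact-us'>Contact Us</a></div>"
                + "</body></html>";
        // The body link is ignored because it is outside the footer.
        System.out.println(findContactLink(html, "https://example.com/"));
    }
}
```

An empty result means the page escaped every pattern the selector knows about, which is the case the monitoring system should log so that a new pattern can be added. For a live site, Jsoup.connect(url).get() could replace Jsoup.parse(html, baseUrl).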

Option 4: A mix of the three options above

You can run all three options against the websites you target.

+ Reduces the chance of false positives or complete misses
+ More confidence in the final result
- Takes time to study, discover, and add new patterns
- Costs money
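
One possible way to mix the options, sketched under assumptions: score each candidate link from several weak signals (anchor text, URL shape, footer position), accept only high scores automatically, and route ambiguous ones to human review as in Option 1. The class ContactLinkScorer, the weights, and the thresholds are hypothetical and would need tuning on real data.

```java
import java.util.Locale;

public class ContactLinkScorer {

    /** Combines weak signals into a score; all weights are illustrative assumptions. */
    public static int score(String href, String linkText, boolean inFooter) {
        String text = linkText.toLowerCase(Locale.ROOT).trim();
        String url = href.toLowerCase(Locale.ROOT);
        int score = 0;

        // Anchor text: an exact "contact" / "contact us" is a stronger hint than a mention.
        if (text.matches("contact( us)?")) {
            score += 3;
        } else if (text.contains("contact")) {
            score += 1;
        }

        // URL: a path ending in /contact, /contact-us, /contact.html, ... is a stronger hint.
        if (url.matches(".*/contact(-us|_us|us)?(\\.\\w+)?/?$")) {
            score += 2;
        } else if (url.contains("contact")) {
            score += 1;
        }

        // Footer position only counts when another signal is already present.
        if (inFooter && score > 0) {
            score += 2;
        }
        return score;
    }

    /** Maps a score to an action; thresholds are assumptions. */
    public static String decide(int score) {
        if (score >= 5) {
            return "ACCEPT";
        }
        if (score >= 2) {
            return "REVIEW"; // ambiguous: hand over to a human (Option 1)
        }
        return "REJECT";
    }

    public static void main(String[] args) {
        System.out.println(decide(score("https://example.com/contact-us", "Contact Us", true)));
        System.out.println(decide(score("https://example.com/blog/contact-lenses", "Contact lenses guide", false)));
        System.out.println(decide(score("https://example.com/about", "About", true)));
    }
}
```

Even a high score is only a stronger guess, not a guarantee, which is why the middle band goes to a person instead of being decided automatically.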


 