Retrieve useful info from a webpage using Jsoup

Question

How can I retrieve the "Contact Us" link from the footer of any webpage on the World Wide Web, in Java?

E.g. find the footer element, or an element with id="footer" or with a footer class?

I tried retrieving all the links from the webpage using Jsoup and then running the regex .*contact.* on them. But I cannot be 100% sure that the fetched link from this approach is the "Contact Us" page of the website.

Q2

Is there a more robust approach, or could I combine the footer link with the approach I have already implemented to conclude that a page is certainly a "Contact Us" page?

Answer 1

But I cannot be 100% sure that the fetched link...

SHORT ANSWER

You will NEVER be sure.

LONG ANSWER

For a given random HTML page, you want to find the "Contact Us" link. This kind of work is trivial for a human. It represents a big challenge for a computer.

I can see some options in your case:

Option 1: Crowdsourcing

Collect the URLs of all the websites whose "Contact Us" information you want
Send them to a crowdsourcing platform and ask real people to find the information for you (Rapidworkers.com, Crowdsource.com, Clickworker.com, Amazon Mechanical Turk, microworkers.com)

Check whether the platform offers an API.

+ Work done by humans
+ Adapts dynamically to unknown patterns
- Costs money
- People are poor at repetitive tasks

Option 2: AI (pattern searching)

Train an AI to extract the information
Then throw your websites at it

Have a look at Weka or Java-ML, for instance.

+ Automated task
+ Can perform a repetitive task for a long time
- May take time to build a robust solution
- Risk of false positives or complete misses

Option 3: Use Jsoup

Carefully study the patterns of the websites you target
Tell Jsoup to find the patterns you have detected

This option is a never-ending task: you will always have to feed Jsoup new patterns. I suggest setting up a monitoring system that tells you when a website escapes every known pattern.

+ Automated task
+ Can perform a repetitive task for a long time
- Takes time to study, discover, and add new patterns
- Risk of false positives or complete misses
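
As a minimal sketch of Option 3, the code below encodes the three footer patterns mentioned in the question (a footer element, id="footer", a footer class) as a Jsoup CSS selector and keeps the first link whose text or href mentions "contact". The class name ContactLinkFinder, the method findContactLink, and the sample HTML are assumptions made for illustration; real sites will need more patterns than these.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.Optional;
import java.util.regex.Pattern;

public class ContactLinkFinder {

    // Footer patterns known so far; extend this list as new layouts are discovered.
    private static final String FOOTER_LINKS =
            "footer a[href], #footer a[href], .footer a[href]";

    private static final Pattern CONTACT =
            Pattern.compile("contact", Pattern.CASE_INSENSITIVE);

    /** Returns the first footer link whose text or href mentions "contact". */
    public static Optional<String> findContactLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        for (Element link : doc.select(FOOTER_LINKS)) {
            String href = link.attr("href");
            if (CONTACT.matcher(link.text()).find() || CONTACT.matcher(href).find()) {
                // absUrl resolves a relative href against baseUri.
                return Optional.of(link.absUrl("href"));
            }
        }
        // No known pattern matched: a monitoring system would record this page.
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical page used only to exercise the method.
        String html = "<html><body><a href='/home'>Home</a>"
                + "<div id='footer'><a href='/about'>About</a>"
                + "<a href='/contact-us'>Contact Us</a></div></body></html>";
        System.out.println(findContactLink(html, "https://example.com/"));
    }
}
```

An empty result is the useful signal here: it marks a site that escapes every known pattern and should be studied by hand.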

Option 4: A mix of the three options above

You can run all three options against the websites you target and cross-check their results.

+ Reduces the chance of false positives or complete misses
+ More confident final result
- Takes time to study, discover, and add new patterns
- Costs money
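
One way to mix the options, sketched here under stated assumptions, is to score each candidate link on several independent signals (link text, href, position in the footer) and send only the low-scoring candidates to human reviewers. The class ContactLinkScorer, the point values, and the threshold are all hypothetical; they would have to be tuned on sites you have checked by hand.

```java
import java.util.regex.Pattern;

public class ContactLinkScorer {

    private static final Pattern CONTACT =
            Pattern.compile("contact", Pattern.CASE_INSENSITIVE);

    // Hypothetical point values and threshold, chosen only for illustration.
    public static final int TEXT_POINTS = 5;
    public static final int HREF_POINTS = 3;
    public static final int FOOTER_POINTS = 2;
    public static final int REVIEW_THRESHOLD = 7;

    /** Adds up the points of every signal that fires (0 to 10). */
    public static int score(String linkText, String href, boolean inFooter) {
        int points = 0;
        if (CONTACT.matcher(linkText).find()) points += TEXT_POINTS;
        if (CONTACT.matcher(href).find()) points += HREF_POINTS;
        if (inFooter) points += FOOTER_POINTS;
        return points;
    }

    /** Candidates below the threshold are the ones worth a human check. */
    public static boolean needsHumanReview(int score) {
        return score < REVIEW_THRESHOLD;
    }

    public static void main(String[] args) {
        // All three signals agree.
        int strong = score("Contact Us", "/contact-us", true);
        // Only the link text matches: a likely false positive.
        int weak = score("Contact lenses", "/products/lenses", false);
        System.out.println(strong + " needs review: " + needsHumanReview(strong));
        System.out.println(weak + " needs review: " + needsHumanReview(weak));
    }
}
```

Even a top score only means that the signals agree, not that the page is certainly the contact page, which is why the uncertain cases are routed to people rather than discarded.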

solution1
2 ACCEPTED 2016-06-27 09:39:13
