Extract text from a webpage using either Python or AppleScript

Question

So I've constructed an AppleScript that enters information into a website. What I am now trying to figure out is a way to extract a "redirected URL" from the page's contents and store it in a Python shell string [Automator, OS X].

Basically, I know how to scan HTML for a body of text in Python if I know the URL. In this case I do not know the URL, but the URL is on the webpage.

I've thought of 2 different approaches:

1) Is there a way to extract text from an open browser document in AppleScript? If it were Python, I would just use a regex to search for what I need, but I don't know how to do this in AppleScript.

If not, then

2) Is there a way to obtain the URL of an open browser document through Python? If so, I would be able to use urllib to get the information I need.

I'm looking to extract the URL following:

"As soon as calculations are done, you can access your results here: "

Note: the URL in the browser becomes the same as this URL, but only after the data has been processed. The processing time varies for each analysis, which is why I don't want to take the URL straight from the address bar. This link, however, appears instantly.


The address for the webpage is:

http://rna.tbi.univie.ac.at/cgi-bin/RNAfold.cgi?P

Updated part of the question

3) If using Safari.app, is there a way to click the "proceed" submit button using AppleScript?

Answer 1

Using Safari.

If the link always has the same index among the page's links, e.g. index 4 (document.links is zero-based, so that is the fifth link), you could try:

tell application "Safari"
    set thelink to do JavaScript "document.links[4].href " in document 1
end tell

This returns the link's URL.

----------UPDATE

A second way is to return the first link whose URL contains "RNAfold/":

tell application "Safari" to set thelinkCount to do JavaScript "document.links.length " in document 1
set theUrl to ""
-- document.links is zero-based, so iterate from 0 to count - 1
repeat with i from 0 to (thelinkCount - 1)
    tell application "Safari" to set this_link to (do JavaScript "document.links[" & i & "].href" in document 1) as string
    if this_link contains "RNAfold/" then
        set theUrl to this_link
        exit repeat
    end if
end repeat

log theUrl
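
For the Python side of the question, the same "first link containing RNAfold/" filter can be sketched against a page's HTML source using only the standard library. This is a minimal sketch, not part of the original answer: `LinkCollector` and `first_link_containing` are hypothetical names, and the HTML string is assumed to have been obtained already (for example from Safari's `source of document 1`). Unlike `document.links[i].href` in the browser, this returns the raw `href` attribute, so relative links are not resolved.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def first_link_containing(html, needle="RNAfold/"):
    """Return the first href containing `needle`, or "" if none matches."""
    collector = LinkCollector()
    collector.feed(html)
    for href in collector.hrefs:
        if needle in href:
            return href
    return ""
```

Returning an empty string when nothing matches mirrors the `set theUrl to ""` default in the AppleScript loop.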

UPDATE 2

This goes directly to the link without iteration and returns its innerHTML, which on this page is the URL string:

tell application "Safari"
    tell document 1 to set theUrl to (do JavaScript "document.getElementsByTagName('BODY')[0].getElementsByTagName('b')[0].getElementsByTagName('a').item(0).innerHTML; ")
end tell

UPDATE 3

Added after the new part of the question.

To click the "proceed" submit button, get its class name and use some more JavaScript to click it:

tell application "Safari" to do JavaScript "document.getElementsByClassName('proceed')[0].click()" in document 1

Full example

set theUrl to ""

tell application "Safari"

    tell document 1

        do JavaScript "document.getElementsByClassName('proceed')[0].click()"
        delay 1
        set timeoutCounter to 0
        repeat until (do JavaScript "document.readyState") is "complete"
            set timeoutCounter to timeoutCounter + 1

            delay 0.5
            if timeoutCounter is greater than 50 then
                exit repeat
            end if
        end repeat
        set theUrl to (do JavaScript "document.getElementsByTagName('BODY')[0].getElementsByTagName('b')[0].getElementsByTagName('a').item(0).innerHTML; ")

    end tell
end tell
log theUrl
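
Since the question asks for the result in a Python string, one possible bridge is to have Python call the macOS `osascript` command and capture its output. The following is a sketch under stated assumptions, not something from the original answer: `applescript_for_safari_js` and `run_safari_js` are hypothetical helper names, it assumes macOS with a Safari document already open, and Safari may need to be set to allow JavaScript from Apple Events.

```python
import subprocess


def applescript_for_safari_js(js):
    """Build a one-line AppleScript that runs `js` in Safari's front document."""
    # Escape backslashes first, then double quotes, so the JavaScript
    # survives being embedded in an AppleScript string literal.
    escaped = js.replace("\\", "\\\\").replace('"', '\\"')
    return f'tell application "Safari" to do JavaScript "{escaped}" in document 1'


def run_safari_js(js):
    """Run `js` in Safari via osascript and return its result as a string.

    Assumption: only works on macOS with Safari running.
    """
    result = subprocess.run(
        ["osascript", "-e", applescript_for_safari_js(js)],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if osascript reports an error
    )
    return result.stdout.strip()
```

A call such as `the_url = run_safari_js("document.links[4].href")` would then leave the link in an ordinary Python string.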

Answer 2

No error handling at all here, but with Safari you could try something like:

tell application "Safari" to set s to source of document 1

set o1 to offset of "results here: <a href" in s
set o2 to offset of "</a></b><br><br>" in s

text (o1 + 23) thru (o2 - 1) of s

I saw the URL, went to the site, submitted a sample RNA sequence through the CGI form, reached the results page and ran this script, and it extracted the URL. But (as I'm sure you know) that page automatically redirects to another page within several seconds.
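
The question mentions using a regex in Python for this step. Once the page source is available as a string, the same marker-based extraction might look like the sketch below. `RESULT_LINK` and `extract_result_url` are hypothetical names, and the pattern assumes the link directly follows the text "results here:", as the marker strings in the script above suggest; the exact markup of the live page is not verified here.

```python
import re

# Match the href of the first <a> that follows "results here:".
# The quotes around the attribute value are treated as optional.
RESULT_LINK = re.compile(
    r'results here:\s*<a\s+href\s*=\s*["\']?([^"\'\s>]+)',
    re.IGNORECASE,
)


def extract_result_url(source):
    """Return the URL following "results here:", or None if it is absent."""
    match = RESULT_LINK.search(source)
    return match.group(1) if match else None
```

Matching on the `href` attribute rather than on fixed character offsets keeps the extraction working if the surrounding markup shifts by a few characters.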

[edit:] or, get the URL from the refresh meta tag at the top of the page:

tell application "Safari" to set s to source of document 1

set topRefreshMetaTagPar to paragraph 6 of s

text 45 thru -3 of topRefreshMetaTagPar
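
The fixed paragraph and character positions above depend on the page layout staying exactly the same. As a hedged alternative in Python, the refresh target can be read from the tag's attributes instead. `MetaRefreshFinder` and `refresh_target` are hypothetical names, and the sketch assumes the common `content="<seconds>; URL=<target>"` form of the refresh meta tag, which has not been checked against the live page.

```python
from html.parser import HTMLParser


class MetaRefreshFinder(HTMLParser):
    """Remember the target URL of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("http-equiv", "").lower() != "refresh":
            return
        # content looks like "10; URL=http://..."; drop the delay part,
        # then split "URL" from the target at the first "=".
        _, _, rest = attributes.get("content", "").partition(";")
        key, _, url = rest.strip().partition("=")
        if key.strip().lower() == "url" and url:
            self.target = url.strip().strip("'\"")


def refresh_target(source):
    """Return the meta-refresh target URL found in `source`, or None."""
    finder = MetaRefreshFinder()
    finder.feed(source)
    return finder.target
```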

2 answers

solution1: score 3, accepted, 2014-01-19 13:48:21

solution2: score 1, 2014-01-18 21:02:55