
Extract text from webpage using either Python or Applescript

So I've constructed an AppleScript that enters information into a website. What I am now trying to figure out is how to extract a "redirected URL" from the page's contents and store it in a string in a Python shell script [Automator, OS X].

Basically, I know how to scan HTML for a piece of text in Python if I know the URL. In this case I do not know the URL, but it appears on the webpage.

I've thought of 2 different approaches:

1) Is there a way to extract text from an open browser document in AppleScript? In Python I would just use a regex to search for what I need, but I don't know how to do this in AppleScript.

If not, then

2) Is there a way to obtain the URL of an open browser document through Python? If so, I would be able to use urllib to get the information I need.

I'm looking to extract the URL following:

"As soon as calculations are done, you can access your results here: "

Note: the URL in the browser's address bar becomes the same as this URL, but only after the data has been processed. The processing time varies for each analysis, which is why I don't want to read the URL straight from the address bar. This link, however, appears immediately.


The address for the webpage is:

http://rna.tbi.univie.ac.at/cgi-bin/RNAfold.cgi?P

Updated part of the question

3) If using Safari.app, is there a way to click the "proceed" submit button using AppleScript?

Using Safari, and if the link always has the same index among the page's links (e.g. link number 4), you could try:

tell application "Safari"
    set thelink to do JavaScript "document.links[4].href " in document 1
end tell

This returns the link's URL.

----------UPDATE

A second way is to return the first link whose URL contains "RNAfold/":

tell application "Safari" to set thelinkCount to (do JavaScript "document.links.length" in document 1) as integer
set theUrl to ""
-- document.links is zero-based, so iterate from 0 to length - 1
repeat with i from 0 to (thelinkCount - 1)
    tell application "Safari" to set this_link to (do JavaScript "document.links[" & i & "].href" in document 1) as string
    if this_link contains "RNAfold/" then
        set theUrl to this_link
        exit repeat
    end if
end repeat

log theUrl

UPDATE 2

This goes directly to the innerHTML of the link without iterating and returns the URL string (on this page the link text is the URL itself):

tell application "Safari"
    tell document 1 to set theUrl to (do JavaScript "document.getElementsByTagName('BODY')[0].getElementsByTagName('b')[0].getElementsByTagName('a').item(0).innerHTML; ")
end tell

UPDATE 3

Added after the new part of the question.

To click the "proceed" submit button, get its class name and use a bit more JavaScript to click it:

tell application "Safari" to do JavaScript "document.getElementsByClassName('proceed')[0].click()" in document 1

Full example

set theUrl to ""

tell application "Safari"

    tell document 1

        do JavaScript "document.getElementsByClassName('proceed')[0].click()"
        delay 1
        set timeoutCounter to 0
        repeat until (do JavaScript "document.readyState") is "complete"
            set timeoutCounter to timeoutCounter + 1

            delay 0.5
            if timeoutCounter is greater than 50 then
                exit repeat
            end if
        end repeat
        set theUrl to (do JavaScript "document.getElementsByTagName('BODY')[0].getElementsByTagName('b')[0].getElementsByTagName('a').item(0).innerHTML; ")

    end tell
end tell
log theUrl
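Since the question asks for the result as a Python string, here is a minimal sketch of calling the same Safari JavaScript from Python through the osascript command-line tool. The helper names safari_js_applescript and run_applescript are hypothetical, and the sketch assumes macOS with Safari open and permitted to run JavaScript from Apple Events.

```python
import subprocess


def safari_js_applescript(js: str) -> str:
    """Build AppleScript source that runs `js` in Safari's front document."""
    # Escape backslashes and double quotes so the JavaScript survives
    # being embedded in an AppleScript string literal.
    escaped = js.replace("\\", "\\\\").replace('"', '\\"')
    return f'tell application "Safari" to do JavaScript "{escaped}" in document 1'


def run_applescript(script: str) -> str:
    """Run AppleScript source through osascript and return its text result."""
    completed = subprocess.run(
        ["osascript", "-e", script],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the script fails
    )
    return completed.stdout.strip()
```

With the results page open in Safari, something like the_url = run_applescript(safari_js_applescript("document.links[4].href")) should leave the link in a Python string; the index 4 is the one assumed in the answer above.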

No error checking at all here, but with Safari you could try something like:

tell application "Safari" to set s to source of document 1

set o1 to offset of "results here: <a href" in s
set o2 to offset of "</a></b><br><br>" in s

text (o1 + 23) thru (o2 - 1) of s

I saw the URL, went to the site, submitted a sample RNA sequence through the CGI, got to the page and ran this script, and it extracted the URL. But (as I'm sure you know) that page automatically redirects to another page within several seconds.
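If the page source is already available as a Python string, the same extraction can be done with a regex instead of fixed offsets. This is only a sketch: the marker text comes from the question, while the exact markup around the link is an assumption, and extract_result_url is a hypothetical name.

```python
import re

# Marker text quoted in the question; the markup that follows it is assumed.
MARKER = "you can access your results here:"


def extract_result_url(html: str):
    """Return the href of the first link after the marker text, or None."""
    pattern = re.compile(
        re.escape(MARKER) + r"""\s*<a\s+href\s*=\s*["']?([^"'\s>]+)""",
        re.IGNORECASE,
    )
    match = pattern.search(html)
    return match.group(1) if match else None
```

Unlike the offset arithmetic, this returns None when the marker or link is missing rather than raising an error, and it does not depend on the exact number of characters between the marker and the URL.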

[edit:] Or get the URL from the refresh meta tag at the top of the page:

tell application "Safari" to set s to source of document 1

set topRefreshMetaTagPar to paragraph 6 of s

text 45 thru -3 of topRefreshMetaTagPar
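The paragraph number and character offsets above depend on the exact layout of this page's source. As a hedged alternative, the refresh target could be read in Python with the standard library's HTML parser. The sketch assumes a meta tag of the usual form, with content like "5; URL=http://...", and the names MetaRefreshParser and find_refresh_url are hypothetical.

```python
from html.parser import HTMLParser


class MetaRefreshParser(HTMLParser):
    """Collect the target URL of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.refresh_url = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.refresh_url is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() != "refresh":
            return
        content = attributes.get("content") or ""
        # Assumed content format: "<seconds>; url=<target>"
        _, _, target = content.partition(";")
        target = target.strip()
        if target.lower().startswith("url="):
            self.refresh_url = target[4:].strip("'\" ")


def find_refresh_url(html):
    """Return the meta refresh target found in `html`, or None."""
    parser = MetaRefreshParser()
    parser.feed(html)
    return parser.refresh_url
```

The html argument could be the text returned by source of document 1, passed to Python in whatever way the Automator workflow allows.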

