Extracting text from a web page using Python or AppleScript

Question

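So I built an AppleScript that enters information into a website. What I am now trying to work out is a way to extract the "redirection URL" from the page content and store it in a string for a Python shell script [Automator, OS X].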

Basically, I know how to scan HTML for a piece of text in Python when I know the URL. In this case I don't know the URL, but it appears on the web page.

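I have thought of two different approaches:

1) Is there a way to extract the text of the open browser document from AppleScript? In Python I would just run a regular expression over the content, but I don't know how to do that in AppleScript.

If not, then

2) Is it possible to get the URL of the open browser document from Python? If so, I could use urllib to fetch the information I need.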
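Both approaches can be reached from Python by shelling out to `osascript`, the command-line AppleScript runner that ships with OS X. The following is only a minimal sketch under that assumption: the names `run_applescript`, `URL_SCRIPT` and `SOURCE_SCRIPT` are made up for illustration, Python 3.7 or later is assumed, and the `runner` parameter exists only so the function can be exercised without a Mac.

```python
import subprocess

# AppleScript one-liners for Safari's front document.
URL_SCRIPT = 'tell application "Safari" to return URL of document 1'
SOURCE_SCRIPT = 'tell application "Safari" to return source of document 1'


def run_applescript(script, runner=subprocess.run):
    """Run an AppleScript snippet via osascript and return its text output.

    `runner` defaults to subprocess.run; it is injectable (an assumption of
    this sketch) so the function can be tested away from macOS.
    """
    result = runner(
        ["/usr/bin/osascript", "-e", script],
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError if osascript fails
    )
    return result.stdout.strip()
```

On a Mac with Safari open, `run_applescript(URL_SCRIPT)` should give the address-bar URL and `run_applescript(SOURCE_SCRIPT)` the page HTML, which can then be searched in Python without a second download through urllib.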

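The URL I am trying to extract is the one that follows this text:

"Once the calculation is finished, you can reach your results here:"

***Note that the URL in the browser becomes the same as this URL, but only after the data has been processed. Processing time varies from one analysis to the next, which is why I don't want to read the URL from the address bar. This link, however, appears immediately.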
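Once the page HTML is available as a Python string, the regex search mentioned in the question could look roughly like this. It is a sketch, not a tested scraper for the real site: the marker text "results here:" is taken from the second answer, while the function name, the `SAMPLE` fragment and its example.org URL are placeholders invented for illustration.

```python
import re

# Captures the href of the first link that follows the text "results here:".
RESULT_LINK = re.compile(r'results here:\s*<a\s+href="([^"]+)"', re.IGNORECASE)


def extract_result_url(html):
    """Return the URL that follows the marker text, or None if it is absent."""
    match = RESULT_LINK.search(html)
    return match.group(1) if match else None


# Hypothetical page fragment; the real markup and job URL will differ.
SAMPLE = (
    "<b>You can reach your results here: "
    '<a href="http://example.org/RNAfold/JOB_ID.html">'
    "http://example.org/RNAfold/JOB_ID.html</a></b><br><br>"
)
```

Returning `None` when the marker is missing lets the caller tell "page not ready" apart from a real URL instead of failing on a bad slice.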


The address of the web page is:

http://rna.tbi.univie.ac.at/cgi-bin/RNAfold.cgi?P

Updated part of the question

3) If I use Safari.app, is there a way to click the "Proceed" submit button with AppleScript?

Answer 1

Using Safari.

If the link always has the same index when the links are counted,

i.e. link number 4,

you can try:

tell application "Safari"
    set thelink to do JavaScript "document.links[4].href " in document 1
end tell

It will return the URL of the link.

----------Update

A second approach is to return the first link that contains "RNAfold/":

tell application "Safari" to set thelinkCount to (do JavaScript "document.links.length" in document 1) as integer
set theUrl to ""
-- document.links is zero-based, so valid indexes run from 0 to length - 1
repeat with i from 0 to (thelinkCount - 1)
    tell application "Safari" to set this_link to (do JavaScript "document.links[" & i & "].href" in document 1) as string
    if this_link contains "RNAfold/" then
        set theUrl to this_link
        exit repeat
    end if
end repeat

log theUrl
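For readers taking the Python route from the question, the same "first link containing RNAfold/" filter can be sketched with the standard library's `html.parser`. This assumes the page source is already held in a string; `LinkCollector` and `first_link_containing` are illustrative names, not part of the original answer.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def first_link_containing(html, needle="RNAfold/"):
    """Return the first href that contains `needle`, or None if none does."""
    collector = LinkCollector()
    collector.feed(html)
    return next((href for href in collector.hrefs if needle in href), None)
```

Like the AppleScript loop, this stops at the first match, so it relies on the result link being the first one whose address contains the substring.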

Update 2

This goes straight to the innerHTML of the link, without iterating, and returns the URL string:

tell application "Safari"
    tell document 1 to set theUrl to (do JavaScript "document.getElementsByTagName('BODY')[0].getElementsByTagName('b')[0].getElementsByTagName('a').item(0).innerHTML; ")
end tell

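Update 3

Added after the new part of the question was posted.

Clicking the "Proceed" submit button: you take its class name and click it with a little more JavaScript.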

do JavaScript "document.getElementsByClassName('proceed')[0].click()" in document 1

Full example

set theUrl to ""

tell application "Safari"

    tell document 1

        do JavaScript "document.getElementsByClassName('proceed')[0].click()"
        delay 1
        set timeoutCounter to 0
        repeat until (do JavaScript "document.readyState") is "complete"
            set timeoutCounter to timeoutCounter + 1

            delay 0.5
            if timeoutCounter is greater than 50 then
                exit repeat
            end if
        end repeat
        set theUrl to (do JavaScript "document.getElementsByTagName('BODY')[0].getElementsByTagName('b')[0].getElementsByTagName('a').item(0).innerHTML; ")

    end tell
end tell
log theUrl
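The wait loop in the full example is a poll-with-timeout pattern: check a condition, sleep, and give up after a fixed number of tries. If the orchestration were moved to Python, a rough equivalent might look like the sketch below. `wait_until` is a hypothetical helper, and the defaults only approximate the 0.5 second delay and 50-try limit used above.

```python
import time


def wait_until(predicate, interval=0.5, max_tries=50, sleep=time.sleep):
    """Poll `predicate` until it returns a true value or `max_tries` runs out.

    Returns True on success and False on timeout. `sleep` is injectable
    (an assumption of this sketch) so the helper can be tested without waiting.
    """
    for _ in range(max_tries):
        if predicate():
            return True
        sleep(interval)
    return False
```

Unlike the AppleScript loop, which carries on to the extraction step even after giving up, this version reports the timeout through its return value so the caller can decide whether reading the link still makes sense.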

Answer 2

There is no error checking at all here, but you could try this with Safari, for example:

tell application "Safari" to set s to source of document 1

set o1 to offset of "results here: <a href" in s
set o2 to offset of "</a></b><br><br>" in s

text (o1 + 23) thru (o2 - 1) of s

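I looked at the URL, went to the site, used the sample RNA sequence, submitted it through the CGI, landed on the page and ran this script, which extracted the URL. But (as I'm sure you know) the page automatically redirects to another page within a few seconds.

[edit:] Alternatively, grab the refresh meta tag from the top of the page: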

tell application "Safari" to set s to source of document 1

set topRefreshMetaTagPar to paragraph 6 of s

text 45 thru -3 of topRefreshMetaTagPar
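Slicing by paragraph number and character position depends on the exact layout of the page. As a hedged alternative in Python, the refresh target can be located by attribute instead. The sketch assumes the tag has the common form `<meta http-equiv="refresh" content="N; URL=...">`, which was not checked against the real page; `MetaRefreshFinder` and `refresh_target` are illustrative names.

```python
from html.parser import HTMLParser


class MetaRefreshFinder(HTMLParser):
    """Remembers the URL of the first <meta http-equiv="refresh"> tag seen."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attr = dict(attrs)
        if (attr.get("http-equiv") or "").lower() != "refresh":
            return
        # content looks like "10; URL=http://..."; keep what follows "URL=".
        content = attr.get("content") or ""
        _, _, rest = content.partition(";")
        key, _, value = rest.strip().partition("=")
        if key.strip().lower() == "url" and value:
            self.target = value.strip().strip("'\"")


def refresh_target(html):
    """Return the meta-refresh URL in `html`, or None if there is none."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    return finder.target
```

Because only the first `=` is used as the separator, a target URL that carries its own query string is returned intact.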


2 answers

Answer 1
3, accepted, 2014-01-19 13:48:21

Answer 2
1, 2014-01-18 21:02:55
