Need to scrape a URL from a web page

Question

I need to scrape a URL from a website. The URL is embedded in some JavaScript code on the page:

<script type="text/javascript">
    (function() {
        // somewhere..
        $.get("http://someurl.com?q=34343&b=343434&c=343434")...
    });
</script>

I know that the URL starts with http://someurl.com?q= and that it must contain at least a second query parameter ( &b= ), but the rest of its content is unknown.

I initially tried jsoup, but it is not really suited to this task. Manually fetching the page and then applying a regex pattern to it is also not a preferred option, since the page is huge. What could I do to get the URL quickly and safely?

Answer 1

You can use this regex:

/\$\.get\("(http:\/\/someurl\.com\?q=[\w.\-%#\/]*&b=[\w.\-%&=\/]*)"\)/g

This regex searches directly for this literal string:

$.get("http://someurl.com?q=

It then allows any number of URL-valid characters (letters, digits, underscore, and . - % # /) to occur as the value of q.

It then requires a literal

&b=

followed again by any number of valid characters (this time also including & and =, so further parameters are allowed), and finally the closing quotation mark and parenthesis. I tested it with:

MATCH - $.get("http://someurl.com?q=34343&b=343434&c=343434")
MATCH - $.get("http://someurl.com?q=34343&b=13a43&k=343434&c2=something")
FAIL  - $.get("http://someurl.com?q=34343&c=343434&b=343434")
FAIL  - $.get("http://someurl.com?a=34343&b=343434=343434")
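
The four cases above can be checked with a short script. This is a minimal sketch, assuming a JavaScript runtime such as Node.js; the names `pattern`, `samples`, and `results` are made up for illustration. The g flag is left off here because `test()` on a global regex keeps state between calls.

```javascript
// Pattern from the answer, without the g flag so repeated test() calls
// do not depend on lastIndex from a previous call.
const pattern = /\$\.get\("(http:\/\/someurl\.com\?q=[\w.\-%#\/]*&b=[\w.\-%&=\/]*)"\)/;

// The four inputs listed above, in the same order.
const samples = [
  '$.get("http://someurl.com?q=34343&b=343434&c=343434")',
  '$.get("http://someurl.com?q=34343&b=13a43&k=343434&c2=something")',
  '$.get("http://someurl.com?q=34343&c=343434&b=343434")',
  '$.get("http://someurl.com?a=34343&b=343434=343434")',
];

// true means MATCH, false means FAIL.
const results = samples.map((s) => pattern.test(s));
console.log(results); // [ true, true, false, false ]
```

The third input fails because the character class for the value of q contains neither & nor =, so &b= has to follow the q value directly.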

If you only want to return the first result, you can remove the global flag (g) from the end:

/\$\.get\("(http:\/\/someurl\.com\?q=[\w.\-%#\/]*&b=[\w.\-%&=\/]*)"\)/
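
To show how the two variants differ when pulling the URL out of page text, here is a rough sketch, again assuming a JavaScript runtime such as Node.js. The `html` string is a hypothetical stand-in for the fetched page (with a second, made-up `$.get` call added), and all variable names are illustrative only. The URL itself is capture group 1.

```javascript
// Hypothetical page source; in practice this would be the fetched HTML.
const html = `
<script type="text/javascript">
    (function() {
        $.get("http://someurl.com?q=34343&b=343434&c=343434");
        $.get("http://someurl.com?q=1&b=2");
    });
</script>`;

// Without the g flag: match() returns the first match and its groups.
const firstOnly = /\$\.get\("(http:\/\/someurl\.com\?q=[\w.\-%#\/]*&b=[\w.\-%&=\/]*)"\)/;
const firstMatch = html.match(firstOnly);
const firstUrl = firstMatch ? firstMatch[1] : null;

// With the g flag: matchAll() iterates over every match in the text.
const allMatches = new RegExp(firstOnly.source, "g");
const allUrls = Array.from(html.matchAll(allMatches), (m) => m[1]);

console.log(firstUrl);
console.log(allUrls);
```

Reading group 1 rather than the whole match drops the surrounding `$.get("` and `")` so only the URL remains.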

Answer 1: 0, accepted, 2015-04-10 03:27:59
