Need to scrape a URL from a web page

I need to scrape a URL from a web page. The URL is located inside some JavaScript code.

<script type="text/javascript">
    (function() {
        // somewhere..
        $.get("http://someurl.com?q=34343&b=343434&c=343434")...
    });
</script>

I know that the URL starts with http://someurl.com?q= and that it must contain at least a second query parameter (&b=), but the rest of the content is unknown.

I initially tried jsoup, but it isn't really suited to this task. Manually fetching the page and then applying a regex to it is not my preferred option either, since the page is huge. What can I do to get the URL quickly and safely?

You can use this regex:

/\$\.get\("(http:\/\/someurl\.com\?q=[\w.\-%#\/]*&b=[\w.\-%&=\/]*)"\)/g

This regex first looks for this literal string:

$.get("http://someurl.com?q=

It then allows any number of valid URL characters as the value of q.

After that it looks for

&b=

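and then again any number of valid characters, followed by the closing quotation mark and parenthesis. Because this second character class also includes & and =, any further parameters after b are matched as well. I tested it with: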

MATCH - $.get("http://someurl.com?q=34343&b=343434&c=343434")
MATCH - $.get("http://someurl.com?q=34343&b=13a43&k=343434&c2=something")
FAIL  - $.get("http://someurl.com?q=34343&c=343434&b=343434")
FAIL  - $.get("http://someurl.com?a=34343&b=343434=343434")
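
The four cases above can be re-checked with a short script. This is only a sketch: it drops the g flag because `test` on a global regex keeps state between calls, and the names `checkPattern`, `samples` and `results` are made up for illustration.

```javascript
// Pattern copied from the answer, without the `g` flag: with `g`,
// RegExp.prototype.test remembers lastIndex between calls, which can
// give misleading results when the same regex is reused on several strings.
const checkPattern = /\$\.get\("(http:\/\/someurl\.com\?q=[\w.\-%#\/]*&b=[\w.\-%&=\/]*)"\)/;

// The four sample inputs listed above, in the same order.
const samples = [
  '$.get("http://someurl.com?q=34343&b=343434&c=343434")',
  '$.get("http://someurl.com?q=34343&b=13a43&k=343434&c2=something")',
  '$.get("http://someurl.com?q=34343&c=343434&b=343434")',
  '$.get("http://someurl.com?a=34343&b=343434=343434")',
];

// true means MATCH, false means FAIL.
const results = samples.map((s) => checkPattern.test(s));
console.log(results); // [ true, true, false, false ]
```

The third case fails because the character class for the value of q does not include &, so &b= has to follow q directly.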

If you only want the first result, you can remove the global flag (g) from the end:

/\$\.get\("(http:\/\/someurl\.com\?q=[\w.\-%#\/]*&b=[\w.\-%&=\/]*)"\)/
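
As a rough sketch of how the two variants might be used, the snippet below pulls the URL out of the capture group. The `pageSource` string is a made-up stand-in for the fetched page, and the variable names are assumptions, not part of the original answer.

```javascript
// Hypothetical page content standing in for the real (much larger) page.
const pageSource = `
<script type="text/javascript">
    (function() {
        $.get("http://someurl.com?q=34343&b=343434&c=343434");
        $.get("http://someurl.com?q=1&b=2");
    });
</script>`;

// Without `g`: exec returns the first match only (or null if none).
const firstPattern = /\$\.get\("(http:\/\/someurl\.com\?q=[\w.\-%#\/]*&b=[\w.\-%&=\/]*)"\)/;
const firstMatch = firstPattern.exec(pageSource);
// Capture group 1 holds the URL without the surrounding $.get("...").
const firstUrl = firstMatch ? firstMatch[1] : null;
console.log(firstUrl); // http://someurl.com?q=34343&b=343434&c=343434

// With `g`: matchAll iterates over every match (it requires the g flag).
const allPattern = /\$\.get\("(http:\/\/someurl\.com\?q=[\w.\-%#\/]*&b=[\w.\-%&=\/]*)"\)/g;
const allUrls = Array.from(pageSource.matchAll(allPattern), (m) => m[1]);
console.log(allUrls.length); // 2
```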
