
Extract all URLs inside a string in Ruby

I have some text content with a list of URLs contained in it.

I am trying to grab all the URLs out and put them in an array.

I have this code:

content = "Here is the list of URLs: http://www.google.com http://www.google.com/index.html"

urls = content.scan(/^(http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(([0-9]{1,5})?\/.*)?$/ix)

I am trying to get the end result to be:

['http://www.google.com', 'http://www.google.com/index.html']

The above code does not seem to be working correctly. Does anyone know what I am doing wrong?

Thanks

Easy:

ruby-1.9.2-p136 :006 > require 'uri'
ruby-1.9.2-p136 :006 > URI.extract(content, ['http', 'https'])
  => ["http://www.google.com", "http://www.google.com/index.html"] 

I haven't checked the syntax of your regex, but String#scan, when the regex contains capturing groups, will produce an array, each of whose members is an array of the groups matched by your regex. So I'd expect the result to be:

[['http', '.google.com'], ...]

You'll need non-capturing groups /(?:stuff)/ if you want the format you've given.
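To illustrate the difference, here is a small sketch. The sample string and patterns are made up for the illustration and are not taken from the question:

```ruby
text = "a1 b2"

# With capturing groups, scan returns one array of group captures per match.
with_groups = text.scan(/([a-z])(\d)/)
# => [["a", "1"], ["b", "2"]]

# With only non-capturing groups, scan returns the full matched strings.
without_groups = text.scan(/(?:[a-z])(?:\d)/)
# => ["a1", "b2"]
```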

Edit (looking at the regex): Also, your regex does look a bit wrong. You don't want the start and end anchors (^ and $), since you don't expect the matches to be at the start and end of content. Secondly, if your ([0-9]{1,5})? is trying to capture a port number, I think you're missing a colon to separate the domain from the port.
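A quick way to see the effect of the anchors, using a deliberately simplified pattern (https?:\/\/\S+ is my stand-in here, not the regex from the question):

```ruby
content = "Here is the list of URLs: http://www.google.com http://www.google.com/index.html"

# ^ and $ match only at line boundaries, and the line starts with "Here",
# so the anchored pattern finds nothing.
anchored = content.scan(/^https?:\/\/\S+$/)
# => []

# Without the anchors, every URL in the line is found.
unanchored = content.scan(/https?:\/\/\S+/)
# => ["http://www.google.com", "http://www.google.com/index.html"]
```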

Further edit, after playing: I think you want something like this:

content = "Here is the list of URLs: http://www.google.com http://www.google.com/index.html http://example.com:3000/foo"
urls = content.scan(/(?:http|https):\/\/[a-z0-9]+(?:[\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(?:(?::[0-9]{1,5})?\/[^\s]*)?/ix)
# => ["http://www.google.com", "http://www.google.com/index.html", "http://example.com:3000/foo"]

... but note that it won't match pure IP-address URLs (like http://127.0.0.1), because of the [a-z]{2,5} for the TLD.
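If IP-address URLs matter to you, one possible variation is to accept a dotted IPv4 address as an alternative to the domain name. This is only a sketch: HOST and URL are names I made up, the IPv4 part does not check that each number is in the 0-255 range, and unlike the regex above the port is allowed even when no path follows.

```ruby
# Hypothetical host pattern: a dotted IPv4 address OR a domain name with a TLD.
HOST = /(?:\d{1,3}(?:\.\d{1,3}){3}|[a-z0-9]+(?:[-.][a-z0-9]+)*\.[a-z]{2,5})/i

# Scheme, host, optional port, optional path running up to the next whitespace.
URL = /https?:\/\/#{HOST}(?::\d{1,5})?(?:\/\S*)?/i

content = "see http://127.0.0.1:8080/status and http://www.google.com"
urls = content.scan(URL)
# => ["http://127.0.0.1:8080/status", "http://www.google.com"]
```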

A different approach, from the perfect-is-the-enemy-of-the-good school of thought:

urls = content.split(/\s+/).find_all { |u| u =~ /^https?:/ }
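If you want to be slightly stricter while keeping the same split-and-filter idea, you could ask the standard library to parse each token instead of checking the prefix. A minimal sketch, where http_url? is a helper name I invented:

```ruby
require "uri"

# Hypothetical helper: true only for tokens that parse as http or https URIs.
# URI::HTTPS is a subclass of URI::HTTP, so one check covers both schemes.
def http_url?(token)
  URI.parse(token).is_a?(URI::HTTP)
rescue URI::InvalidURIError
  false
end

content = "Here is the list of URLs: http://www.google.com http://www.google.com/index.html"
urls = content.split(/\s+/).select { |token| http_url?(token) }
# => ["http://www.google.com", "http://www.google.com/index.html"]
```

Tokens with trailing punctuation (a URL followed by a comma, say) would still need trimming first; this sketch does not handle that.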

Just for your interest:

Ruby has a URI module, which has a regex implemented to do such things:

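require "uri"

# Schemes to look for
uris_you_want_to_grab = ['ftp', 'http', 'https', 'mailto', 'see']

# html_string is assumed to already hold the text to search
urls = []
html_string.scan(URI.regexp(uris_you_want_to_grab)) do |*matches|
  urls << $&  # $& is the full text of the current match
end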

For more information visit the Ruby Ref: URI
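As far as I know, URI.regexp and URI.extract have been marked obsolete since Ruby 2.2 and print a warning when Ruby runs with -w. A sketch of the same idea that calls the RFC 2396 parser object directly, assuming your Ruby ships URI::RFC2396_Parser (2.2 and later should):

```ruby
require "uri"

content = "Here is the list of URLs: http://www.google.com http://www.google.com/index.html"

# The parser object exposes the same extract/make_regexp logic
# that the obsolete module-level helpers delegate to.
parser = URI::RFC2396_Parser.new
urls = parser.extract(content, ["http", "https"])
```

Restricting the schemes matters here: without the second argument, anything that looks like scheme:something can be picked up, including words followed by a colon.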

The most upvoted answer was causing issues with Markdown URLs for me, so I had to figure out a regex to extract URLs. Below is what I use:

URL_REGEX = /(https?:\/\/\S+?)(?:[\s)]|$)/i
content.scan(URL_REGEX).flatten

The last part here, (?:[\s)]|$), is used to identify the end of the URL, and you can add characters there as per your need and content. Right now it looks for any whitespace character, a closing parenthesis, or the end of a line.

content = "link in text [link1](http://www.example.com/test) and [link2](http://www.example.com/test2)

http://www.example.com/test3

http://www.example.com/test4"

returns ["http://www.example.com/test", "http://www.example.com/test2", "http://www.example.com/test3", "http://www.example.com/test4"].
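As an example of adding terminator characters, here is a sketch that also stops at a closing square bracket, a closing angle bracket, or a quote. The name URL_REGEX_WIDE and the sample text are mine, chosen only for the illustration:

```ruby
# Same shape as the regex above, with ], >, " and ' added as URL terminators.
URL_REGEX_WIDE = /(https?:\/\/\S+?)(?:[\s)\]>"']|$)/i

content = '<a href="http://example.com/a">x</a> and <http://example.com/b>'
urls = content.scan(URL_REGEX_WIDE).flatten
# => ["http://example.com/a", "http://example.com/b"]
```

The trade-off is that any URL which legitimately contains one of these characters, such as a closing parenthesis in the path, gets cut short at that character.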
