Extract all URLs from a string in Ruby

Question

I have some text content with a list of URLs contained in it.

I am trying to grab all the URLs out and put them in an array.

I have this code:

content = "Here is the list of URLs: http://www.google.com http://www.google.com/index.html"

urls = content.scan(/^(http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(([0-9]{1,5})?\/.*)?$/ix)

I am trying to get the end result to be:

['http://www.google.com', 'http://www.google.com/index.html']

The above code does not seem to be working correctly. Does anyone know what I am doing wrong?

Thanks

Answer 1

Easy:

ruby-1.9.2-p136 :006 > require 'uri'
ruby-1.9.2-p136 :006 > URI.extract(content, ['http', 'https'])
  => ["http://www.google.com", "http://www.google.com/index.html"]

Answer 2

I haven't checked the syntax of your regex, but String#scan will produce an array, each of whose members is an array of the groups matched by your regex. So I'd expect the result to be something like:

[['http', '.google.com'], ...]

You'll need non-capturing groups, /(?:stuff)/, if you want the format you've given.
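To illustrate the difference, here is a minimal sketch using a deliberately simplified pattern (the sample text and the variable names are made up for this example, not taken from the question):

```ruby
text = "see http://a.example and https://b.example"

# With capturing groups, scan yields one sub-array of captures per match.
with_groups = text.scan(/(https?):\/\/(\S+)/)
# => [["http", "a.example"], ["https", "b.example"]]

# With only non-capturing groups, scan yields the full matched strings.
without_groups = text.scan(/(?:https?):\/\/\S+/)
# => ["http://a.example", "https://b.example"]
```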

Edit (looking at the regex): Also, your regex does look a bit wrong. You don't want the start and end anchors (^ and $), since you don't expect the matches to be at the start and end of content. Secondly, if your ([0-9]{1,5})? is trying to capture a port number, I think you're missing a colon to separate the domain from the port.
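The effect of the anchors can be seen with a much simpler pattern than the one in the question. This is only a sketch; /https?:\/\/\S+/ is a stand-in chosen for brevity:

```ruby
content = "Here is the list of URLs: http://www.google.com http://www.google.com/index.html"

# In Ruby, ^ and $ match at line boundaries. The only line here starts
# with "Here", so an anchored pattern finds nothing.
anchored = content.scan(/^https?:\/\/\S+$/)
# => []

# Without anchors, a match may start anywhere in the string.
unanchored = content.scan(/https?:\/\/\S+/)
# => ["http://www.google.com", "http://www.google.com/index.html"]
```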

Further edit, after playing: I think you want something like this:

content = "Here is the list of URLs: http://www.google.com http://www.google.com/index.html http://example.com:3000/foo"
urls = content.scan(/(?:http|https):\/\/[a-z0-9]+(?:[\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(?:(?::[0-9]{1,5})?\/[^\s]*)?/ix)
# => ["http://www.google.com", "http://www.google.com/index.html", "http://example.com:3000/foo"]

... but note that it won't match pure IP-address URLs (like http://127.0.0.1), because of the [a-z]{2,5} for the TLD.
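A quick check of that limitation, using the same pattern as in the snippet above (the constant name URL_PATTERN and the sample URLs are just labels chosen for this sketch):

```ruby
URL_PATTERN = /(?:http|https):\/\/[a-z0-9]+(?:[\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(?:(?::[0-9]{1,5})?\/[^\s]*)?/ix

# A host name ending in a 2-5 letter TLD matches.
named = "http://www.google.com/index.html".scan(URL_PATTERN)
# => ["http://www.google.com/index.html"]

# A numeric host has no alphabetic TLD, so nothing matches.
ip_only = "http://127.0.0.1/status".scan(URL_PATTERN)
# => []

# The same applies to a bare host name without a dot.
no_tld = "http://localhost:3000/".scan(URL_PATTERN)
# => []
```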

Answer 3

A different approach, from the perfect-is-the-enemy-of-the-good school of thought:

urls = content.split(/\s+/).find_all { |u| u =~ /^https?:/ }
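One trade-off of splitting on whitespace is that punctuation attached to a URL stays attached. The sketch below shows this and one possible cleanup; the sample sentence and the set of trailing characters to strip are assumptions, so adjust them to your content:

```ruby
text = "Visit http://example.com, or https://example.org/page."

# Same split-and-filter idea as above.
raw = text.split(/\s+/).find_all { |u| u =~ /^https?:/ }
# => ["http://example.com,", "https://example.org/page."]

# Optionally strip trailing punctuation that is unlikely to belong to the URL.
cleaned = raw.map { |u| u.sub(/[.,;:!?)]+\z/, "") }
# => ["http://example.com", "https://example.org/page"]
```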

Answer 4

Just for your interest:

Ruby has a URI module, which has a regex implemented to do such things:

require "uri"

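# Schemes to look for
uris_you_want_to_grap = ['ftp', 'http', 'https', 'mailto', 'see']

# html_string is the text to search; urls collects the results
urls = []
html_string.scan(URI.regexp(uris_you_want_to_grap)) do |*matches|
  urls << $&  # $& holds the full text of the most recent match
end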

For more information, see the Ruby reference for URI.
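The URI module can also be used to validate candidates rather than to find them. The following is a sketch of that idea, not part of the answer above: http_url? is a hypothetical helper name, and splitting on whitespace to get candidates is an assumption that suits the question's sample text.

```ruby
require "uri"

# Hypothetical helper: true when the candidate parses as an http(s) URI with a host.
def http_url?(candidate)
  uri = URI.parse(candidate)
  uri.is_a?(URI::HTTP) && !uri.host.nil?  # URI::HTTPS is a subclass of URI::HTTP
rescue URI::InvalidURIError
  false
end

content = "Here is the list of URLs: http://www.google.com http://www.google.com/index.html"

# Keep only the whitespace-separated tokens that parse as http(s) URLs.
urls = content.split(/\s+/).select { |token| http_url?(token) }
# => ["http://www.google.com", "http://www.google.com/index.html"]
```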

Answer 5

The most upvoted answer was causing issues with Markdown URLs for me, so I had to figure out a regex to extract URLs. Below is what I use:

URL_REGEX = /(https?:\/\/\S+?)(?:[\s)]|$)/i
content.scan(URL_REGEX).flatten

The last part here, (?:[\s)]|$), is used to identify the end of the URL, and you can add characters there as per your needs and content. Right now it looks for any whitespace character, a closing parenthesis, or the end of a line or of the string.

content = "link in text [link1](http://www.example.com/test) and [link2](http://www.example.com/test2)

http://www.example.com/test3

http://www.example.com/test4"

returns ["http://www.example.com/test", "http://www.example.com/test2", "http://www.example.com/test3", "http://www.example.com/test4"].
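As an illustration of adding terminator characters, the sketch below extends the character class with ], >, and quote characters. URL_REGEX_EXT and the sample text are made up for this example; which characters to add depends on your content.

```ruby
# Same idea as URL_REGEX, with extra terminators: ] > " '
URL_REGEX_EXT = /(https?:\/\/\S+?)(?:[\s)\]>"']|$)/i

sample = 'See <http://www.example.com/a> and "http://www.example.com/b" or [http://www.example.com/c]'

found = sample.scan(URL_REGEX_EXT).flatten
# => ["http://www.example.com/a", "http://www.example.com/b", "http://www.example.com/c"]
```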

5 answers

Answer 1 46 2011-05-09 16:42:33
Answer 2 5 2010-02-19 15:45:27
Answer 3 5 accepted 2010-02-19 16:22:10
Answer 4 4 2012-07-23 17:22:27
Answer 5 0 2022-03-23 13:39:54
