Extract URLs from String (Ruby) (Regex and link shortened)
I heard that URI::extract() only returns links that contain a ":". Since I am grabbing a tweet, and the link in it does not contain a ":", I believe I have to use a regular expression. I need to check for a "swoo.sh/whatever" link and store it in a variable. How can I find the first "swoo.sh/whatever" link (apparently the first match is returned automatically) while keeping everything after the "/"? For example, if the tweet says
Lorem ipsum lorem ipsum swoo.sh/12xfsW lorem ipsum
How would I grab the swoo.sh link, along with everything that comes directly after the "/"?
Here is one approach using match:
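# Match the first "host.tld/path" token; capture group 1 holds the whole link.
match = /(\w+\.\w+\/\w+)/.match("Lorem ipsum lorem ipsum swoo.sh/12xfsW lorem ipsum")
if match
  puts match[1]  # prints "swoo.sh/12xfsW"
else
  puts "no match"
end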
This only answers your immediate question. If you also need to capture full URLs at the same time, my answer would have to be updated.
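As a rough sketch of what such an update might look like: the pattern below makes an http:// or https:// prefix optional, so it picks up both bare shortened links and full URLs. The names SHORT_OR_FULL_URL and first_link are illustrative only, and the pattern is deliberately loose (for example, it will include trailing punctuation that is glued to the link).

```ruby
# Hypothetical pattern: optional scheme, a dotted host, a slash, then any
# run of non-whitespace characters as the path.
SHORT_OR_FULL_URL = %r{(?:https?://)?\w+(?:\.\w+)+/\S+}

# Returns the first link-like substring of +text+, or nil if there is none.
def first_link(text)
  text[SHORT_OR_FULL_URL] # String#[] with a Regexp returns the first match
end

puts first_link("Lorem ipsum lorem ipsum swoo.sh/12xfsW lorem ipsum")
puts first_link("see http://google.com/bar now")
```

String#scan with the same pattern would return every match instead of only the first one.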
We can use the fact that URIs can't contain spaces, and that Ruby has URI::Generic, which will parse almost anything that looks URI-ish. Then we just need to filter out non-web URIs, which I do by assuming that every web URI has to start with something like foo.bar:
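require 'uri'
require 'pathname'

# Split the tweet on whitespace and parse each token as a URI (tokens that
# fail to parse become nil). Keep the URIs that have a hostname or whose
# first path segment looks like "foo.bar".
tweet.
  split.
  map { |s| URI.parse(s) rescue nil }.
  select { |u| u && (u.hostname || Pathname(u.path).each_filename.first =~ /\w\.\w/) }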
Example output:
tweet = 'foo . < google.com bar swoosh.sh/blah?q=bar http://google.com/bar'
# the above returns
# [#<URI::Generic google.com>, #<URI::Generic swoosh.sh/blah?q=bar>, #<URI::HTTP http://google.com/bar>]
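For something that can be pasted and run as is, here is a minimal sketch of the same idea wrapped in a hypothetical helper, web_links. It makes two substitutions that are assumptions of this sketch rather than part of the answer: it splits the path on "/" instead of using Pathname, and it rescues URI::Error explicitly instead of using the rescue modifier. It also returns the original tokens as strings rather than URI objects.

```ruby
require 'uri'

# Hypothetical helper: returns the whitespace-separated tokens of +text+
# that look like web links.
def web_links(text)
  text.split.select do |token|
    uri = begin
      URI.parse(token)
    rescue URI::Error
      nil # token is not parseable as a URI at all
    end
    next false unless uri

    # Keep tokens with a real host, or whose first path segment is "foo.bar"-like.
    first_segment = uri.path.to_s.split('/').first.to_s
    uri.hostname || first_segment.match?(/\w\.\w/)
  end
end

tweet = 'foo . < google.com bar swoosh.sh/blah?q=bar http://google.com/bar'
p web_links(tweet)
```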
This can't really work in general because of ambiguity. "car.net" looks like a shortened link, but in context it could be "my neighbor threw a baseball through my window so i yanked the hubcabs off his car.net gain!!!", where it's clearly just a missing space.
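One way to sidestep that ambiguity, if only swoo.sh links matter as in the question, is to anchor the pattern on that host instead of accepting anything shaped like foo.bar. The sketch below assumes exactly that; SWOOSH_LINK and swoosh_link are made-up names for illustration.

```ruby
# Assumption: only links on the swoo.sh host are of interest.
# Group 1 captures everything after the slash, up to the next whitespace.
SWOOSH_LINK = %r{\bswoo\.sh/(\S+)}

# Returns the first swoo.sh link and its path part, or nil if none is found.
def swoosh_link(tweet)
  m = SWOOSH_LINK.match(tweet)
  m && { link: m[0], path: m[1] }
end

p swoosh_link("Lorem ipsum lorem ipsum swoo.sh/12xfsW lorem ipsum")
p swoosh_link("yanked the hubcabs off his car.net gain!!!")
```

With a fixed host, text such as "car.net gain" is simply ignored, at the cost of not recognizing any other shortener.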