
Extract URLs from String (Ruby) (Regex and link shortened)

I heard that URI::extract() only returns links that contain a : , but the tweet I am grabbing has a link without one, so I believe I have to use a regular expression. I need to check for a "swoo.sh/whatever" link and store it in a variable. How can I find the first "swoo.sh/whatever" link (apparently the first match is what gets returned automatically) while keeping everything after the / ? For example, if the tweet says

Lorem ipsum lorem ipsum swoo.sh/12xfsW lorem ipsum

How would I grab the swoo.sh link, along with everything that comes directly after the / ?

Here is one approach using match:

# Capture "<word>.<word>/<word>", e.g. swoo.sh/12xfsW
match = /(\w+\.\w+\/\w+)/.match("Lorem ipsum lorem ipsum swoo.sh/12xfsW lorem ipsum")
if match
  puts match[1]  # the first capture group, i.e. the link
else
  puts "no match"
end


If you also need to capture full URLs at the same time, this answer would have to be updated. It only addresses your immediate question.
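Since the question asks to keep the part after the / as well, here is a minimal sketch that uses named captures to return both the whole link and the trailing part. The method name `first_swoosh_link` and the group names `link` and `slug` are made up for illustration, and the pattern assumes the part after the slash consists of word characters only.

```ruby
# Hypothetical helper: returns [link, slug] for the first swoo.sh link, or nil.
def first_swoosh_link(text)
  # \b keeps us from matching in the middle of a longer word.
  m = %r{\b(?<link>swoo\.sh/(?<slug>\w+))}.match(text)
  m && [m[:link], m[:slug]]
end

tweet = "Lorem ipsum lorem ipsum swoo.sh/12xfsW lorem ipsum"
link, slug = first_swoosh_link(tweet)
puts link
puts slug
```

For the sample tweet this should give `swoo.sh/12xfsW` as the link and `12xfsW` as the part after the slash. If there is no such link, both variables end up `nil`.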

We can use the fact that URIs can't contain spaces, and that Ruby has URI::Generic, which will parse almost anything that looks URI-ish. Then we just need to filter out non-web URIs, which I do by assuming that every web URI has to start with something like foo.bar:

require 'uri'
require 'pathname'

tweet.
  split.
  map { |s| URI.parse(s) rescue nil }.
  select { |u| u && (u.hostname || Pathname(u.path).each_filename.first =~ /\w\.\w/) }

Example output

tweet = 'foo . < google.com bar swoosh.sh/blah?q=bar http://google.com/bar'
# the above returns
# [#<URI::Generic google.com>, #<URI::Generic swoosh.sh/blah?q=bar>, #<URI::HTTP http://google.com/bar>]
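For reuse, the same pipeline could be wrapped in a method. This is only a sketch: the name `web_uris` is hypothetical, it returns the original tokens as strings rather than URI objects, and it rescues only `URI::InvalidURIError` instead of using a blanket `rescue nil`.

```ruby
require 'uri'
require 'pathname'

# Hypothetical wrapper around the split/parse/select pipeline.
def web_uris(text)
  text.split.filter_map do |token|
    uri = begin
      URI.parse(token)
    rescue URI::InvalidURIError
      nil # tokens such as "<" cannot be parsed at all
    end
    next unless uri

    # Keep URIs with a host, or whose first path segment looks like "foo.bar".
    first_segment = Pathname(uri.path.to_s).each_filename.first
    token if uri.hostname || first_segment =~ /\w\.\w/
  end
end

tweet = 'foo . < google.com bar swoosh.sh/blah?q=bar http://google.com/bar'
p web_uris(tweet)
```

With the example tweet this should print the same three entries as above, as plain strings: `google.com`, `swoosh.sh/blah?q=bar` and `http://google.com/bar`.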

This can't really work in general because of ambiguity. "car.net" looks like a shortened link, but in context it could be "my neighbor threw a baseball through my window so i yanked the hubcaps off his car.net gain!!!", where it's clearly just a missing space.
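One way to sidestep the ambiguity, if the set of shortener hosts is known in advance, is to match only those hosts and to require a path after the slash. The sketch below assumes exactly that; the method name `shortened_links` and the default host list are hypothetical, and only swoo.sh comes from the question.

```ruby
# Hypothetical allowlist approach: only links on known shortener hosts count.
def shortened_links(text, hosts = ["swoo.sh"])
  host_pattern = Regexp.union(hosts) # escapes the dots in each host name
  text.scan(%r{\b(?:#{host_pattern})/\w+})
end

p shortened_links("i yanked the hubcaps off his car.net gain!!!")
p shortened_links("Lorem ipsum swoo.sh/12xfsW lorem ipsum")
```

The first call should return an empty array, since car.net is not on the list, and the second should return the swoo.sh link. The trade-off is that links on hosts missing from the list are silently ignored.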
