

How can I extract URLs from HTML content with a Ruby regexp?

Here is an example, since it is not easy to explain:

<li id="l_f6a1ok3n4d4p" class="online"> <div class="link"> <a href="javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com');%20" onclick="visited('f6a1ok3n4d4p');" style="float:left;">random strings - 4</a> <a style="float:left; display:block; padding-top:3px;" href="http://www.webtrackerplus.com/?page=flowplayerregister&amp;a_aid=&amp;a_bid=&amp;chan=flow"><img border="0" src="/resources/img/fdf.gif"></a> <!-- a class="none" href="#">random strings - 4  site2.com - # - </a --> </div> <div class="params"> <span>Submited: </span>7 June 2015  | <span>Host: </span>site2.com </div> <div class="report"> <a title="" href="javascript:report(3191274,%203,%202164691,%201)" class="alert"></a> <a title="" href="javascript:report(3191274,%203,%202164691,%200)" class="work"></a> <b>100% said work</b> </div> <div class="clear"></div> </li> <li id="l_zsgn82c4b96d" class="online"> <div class="link"> <a href="javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com');%20" onclick="visited('zsgn82c4b96d');" style

From the content above, I want to take

javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')

extract the strings "f6a1ok3n4d4p" and "site2.com", and combine them into

http://site2.com/f6a1ok3n4d4p

and likewise for

javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com')

which should become

http://site1.com/zsgn82c4b96d

I need this to be done with a Ruby regex.

You can proceed like this:

require 'uri'
str = "javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')"

# regex scan to get values within javascript:show
vals = str.scan(/javascript:show\((.*)\)/)[0][0].split(',')
# => ["'f6a1ok3n4d4p'", "'random%20strings%204'", "%20'site2.com'"]

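# join the first and last Array elements to generate the url
# (URI.decode was removed in Ruby 3.0; URI.decode_www_form_component is the stdlib replacement)
url = "http://" + URI.decode_www_form_component(vals.last).tr("'", '').strip + "/" + vals.first.tr("'", '')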
# => "http://site2.com/f6a1ok3n4d4p"

Obviously, my answer is not foolproof. You can improve it by checking for the case where scan returns [], since [0][0] would then raise a NoMethodError on nil.
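One way to add that check is sketched below. The helper name `show_link_to_url` is hypothetical (it is not part of the answer above), and the sketch assumes the first argument of `show` is the ID and the last one is the host, as in the question. It returns nil instead of raising when the string contains no `javascript:show(...)` call.

```ruby
require 'uri'

# Hypothetical helper: builds "http://host/id" from a javascript:show(...) link,
# or returns nil when the link is not present in the string.
def show_link_to_url(str)
  # String#[] with a regex and a group index returns nil when nothing matches.
  args = str[/javascript:show\((.*?)\)/, 1]
  return nil if args.nil?

  # Decode the percent-escapes, then drop the quotes and surrounding whitespace.
  # Note: decode_www_form_component raises ArgumentError on a malformed escape.
  vals = args.split(',').map do |v|
    URI.decode_www_form_component(v).delete("'").strip
  end
  return nil if vals.size < 2

  "http://#{vals.last}/#{vals.first}"
end

puts show_link_to_url("javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')")
p show_link_to_url("javascript:report(3191274,%203)")
```

The non-greedy `(.*?)` stops at the first closing parenthesis, which matters once the input is a whole page rather than a single link.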

This should do the trick, though the regexp isn't particularly flexible: it expects exactly the quoting and %20 spacing shown in your sample.

js_link_regex = /href=\"javascript:show\('([^']+)','[^']+',%20'([^']+)'\)/
link = <<eos
  <li id="l_f6a1ok3n4d4p" class="online"> <div class="link"> <a href="javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com');%20" onclick="visited('f6a1ok3n4d4p');" style="float:left;">random strings - 4</a> <a style="float:left; display:block; padding-top:3px;" href="http://www.webtrackerplus.com/?page=flowplayerregister&amp;a_aid=&amp;a_bid=&amp;chan=flow"><img border="0" src="/resources/img/fdf.gif"></a> <!-- a class="none" href="#">random strings - 4  site2.com - # - </a --> </div> <div class="params"> <span>Submited: </span>7 June 2015  | <span>Host: </span>site2.com </div> <div class="report"> <a title="" href="javascript:report(3191274,%203,%202164691,%201)" class="alert"></a> <a title="" href="javascript:report(3191274,%203,%202164691,%200)" class="work"></a> <b>100% said work</b> </div> <div class="clear"></div> </li> <li id="l_zsgn82c4b96d" class="online"> <div class="link"> <a href="javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com');%20" onclick="visited('zsgn82c4b96d');" style
eos

matches = link.scan(js_link_regex)
matches.each do |match|
  puts "http://#{match[1]}/#{match[0]}"
end
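If the spacing between the arguments varies, the pattern can be loosened a little. The following is only a sketch under an assumption that is not confirmed by the question: that arguments may be separated by any mix of whitespace and literal "%20". The names `SEP`, `FLEXIBLE_SHOW_REGEX` and `extract_show_urls` are made up for illustration.

```ruby
# Assumption: any run of whitespace or literal "%20" may appear around the commas.
SEP = /(?:\s|%20)*/

# Captures the first argument (the ID) and the third argument (the host).
FLEXIBLE_SHOW_REGEX = /javascript:show\(#{SEP}'([^']+)'#{SEP},#{SEP}'[^']*'#{SEP},#{SEP}'([^']+)'#{SEP}\)/

# Returns one "http://host/id" string per javascript:show(...) link in the HTML.
def extract_show_urls(html)
  html.scan(FLEXIBLE_SHOW_REGEX).map { |id, host| "http://#{host}/#{id}" }
end

html = <<~HTML
  <a href="javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com');%20" onclick="visited('f6a1ok3n4d4p');">random strings - 4</a>
  <a href="javascript:report(3191274,%203,%202164691,%201)" class="alert"></a>
  <a href="javascript:show('zsgn82c4b96d', 'random%20strings%204', 'site1.com');">random strings - 4</a>
HTML

puts extract_show_urls(html)
```

Dropping the leading `href="` also lets the pattern match links whose attribute uses single quotes or extra whitespace, at the cost of matching the same text anywhere else in the page.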

To match just your case:

str = "javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')"

# inside a character class, "|" is a literal pipe and "." needs no escape
parts = str.scan(/'([\w.]+)'/).flatten # => ["f6a1ok3n4d4p", "site2.com"]

puts "http://#{parts[1]}/#{parts[0]}" # => http://site2.com/f6a1ok3n4d4p
