
Parse image URL with Nokogiri

I need to parse the image URL out of HTML much like the following:

<p><a href="http://blog.website.com/wp-content/uploads/2012/02/image_name.jpg" ><img class="aligncenter size-full wp-image-12313" alt="Example image Name" src="http://blog.website.com/wp-content/uploads/2012/02/image_name.jpg" width="630" height="119" /></a></p>

So far I am using Nokogiri to parse out <h2> tags with:

require 'nokogiri'
require 'open-uri'

# URI.open is required on Ruby 3.0+, where plain open no longer fetches URLs.
page = Nokogiri::HTML(URI.open("http://blog.website.com/"))
headers = page.css('h2')

puts headers.text

I have two questions:

  1. How can I parse out the image URL?
  2. Ideally I'd print to the console in this format:

1.
Header 1
image_url 1
image_url 2 (if any)
2.
Header 2
2image_url 1
2image_url 2 (if any)

So far I haven't been able to print my headers in this format. How can I do so?

<h2><a href="http://blog.website.com/2013/02/15/images/" rel="bookmark" title="Permanent Link to Blog Post">Blog Post</a></h2>
          <p class="post_author"><em>by</em> author</p>
          <div class="format_text">
    <p style="text-align: left;">Blog Content </p>
<p style="text-align: left;"> Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. </p>
<p style="text-align: center;"><a href="http://blog.website.com/wp-content/uploads/2012/02/image21.jpg" ><img class="alignnone size-full wp-image-23382" alt="image2" src="http://blog.website.com/wp-content/uploads/2012/02/image21.jpg" width="630" height="210" /></a></p>
<p style="text-align: left;">Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. </p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">Items: <a href="http://www.website.com/threads?src=login#/show/thread/A_abvaf812e3"  target="_blank">Items for Spring</a></b></p>
<p style="text-align: center;">Lorem Ipsum.</p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">More Items: <a href="http://www.website.com/threads#/show/thread/A_abv2a6822e2"  target="_blank">Lorem Ipsum</a></b></p>
<p style="text-align: center;">Lorem Ipsum.</p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">Still more items: <a href="http://www.website.com/threads#/show/thread/A_abv7af882e3"  target="_blank">Items:</a></b></p>
<p style="text-align: center;">Lorem Ipsum.</p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">Lorem ipsum: <a href="http://www.website.com/threads?src=login#/show/thread/A_abvea6832e8"  target="_blank">Items</a></b></p>
<p style="text-align: center;">Lorem Ipusm</p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">
        </div>  
          <p class="to_comments"><span class="date">February 15, 2013</span> &nbsp; <span class="num_comments"><a href="http://blog.website.com/2013/02/15/Blog-post/#respond" title="Comment on Blog Post">No Comments</a></span></p>

I think it makes more sense to group by h2 first:

doc.search('h2').each_with_index do |h2, i|
  puts "#{i+1}."
  puts h2.text
  h2.search('+ p + div > p[3] img').each do |img|
    puts img['src']
  end
end

To get images, simply look for the img tags with a src attribute.

If you want the h2 associated with each image, you can do this:

doc.xpath('//img').each do |img|
  puts "Header: #{img.xpath('preceding::h2[1]').text}"
  puts "  Image: #{img['src']}"
end

Note that the switch to XPath is needed for the preceding:: axis, which CSS selectors cannot express.

EDIT

To group by header, you can put them in a hash:

headers = Hash.new{|h,k| h[k] = []}
doc.xpath('//img').each do |img|
  header = img.xpath('preceding::h2[1]').text
  image = img['src']
  headers[header] << image
end

To get the output you've prescribed:

headers.each_with_index do |(h, urls), i|
  puts "#{i + 1}."
  puts h
  urls.each { |url| puts url }
end

Code that I ended up using. Feel free to critique (I'll probably learn from it):

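require 'nokogiri'
require 'open-uri'

# URI.open is required on Ruby 3.0+, where plain open no longer fetches URLs.
doc = Nokogiri::HTML(URI.open("http://blog.website.com/"))

doc.xpath('//h2/a[@rel = "bookmark"]').each_with_index do |header, i|
  puts i + 1
  puts " Title: #{header.text}"
  # at_xpath returns nil when no such image exists, so guard with &.
  # Note that following:: is not limited to the current post.
  puts "  Image 1: #{header.at_xpath('following::img[1]')&.[]('src')}"
  puts "  Image 2: #{header.at_xpath('following::img[2]')&.[]('src')}"
end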

I did something similar once (I wanted the exact same output, actually). This solution is pretty easy to follow:

Depending on how the DOM is structured, you could do something like:

body = page.css('div.format_text')
headers = page.css('div#content_inner h2 a')

# Assumes headers and bodies appear in the same order and in equal numbers.
body.each_with_index do |post_body, index|
  header = headers[index]
  # header is a Nokogiri node, so call .text before interpolating it.
  puts "#{index + 1}. #{header.text}"
  post_body.css('p a img, div > img').each do |img|
    puts img['src'] if img['src'].to_s.start_with?('http')
  end
end

So basically, for each post body you print its header followed by the src of every image inside it. The page I was parsing had the headers outside of the image divs, which is why I used two different variables to find them (body / headers). Also, I targeted two selectors when looking for images, as this is the way this particular DOM was structured.

This should give you a nice clean output like you wanted.

Hope this helps!
