
Nokogiri Exclude HTML Class

I'm trying to scrape the names of all the people who commented on a post in our Facebook group. I downloaded the file locally and am able to scrape the names of the people who commented plus the people who replied to those comments. I only want the original comments, not the replies. It seems like I have to exclude the UFIReplyList class, but my code is still pulling all the names. Any help would be greatly appreciated. Thanks!

require 'nokogiri'
require 'pry'

class Scraper
  @@all = []

  def get_page
    file = File.read('/Users/mark/Desktop/raffle.html')
    doc = Nokogiri::HTML(file)
    # binding.pry

    doc.css(".UFICommentContent").each do |post|
      # binding.pry
      author = post.css(".UFICommentActorName").css(":not(.UFIReplyList)").text

      @@all << author
    end

    puts @@all
  end
end

Scraper.new.get_page

Traverse the ancestors of every .UFICommentActorName element and reject those contained within a .UFIReplyList element.

@authors_nodes = doc.css(".UFICommentActorName").reject do |node|

  # extract all ancestor class names;
  # `rescue nil` covers ancestors that have no class attribute
  class_names = node.ancestors.map{ |a| a.attributes['class'].value rescue nil }
  # a node can carry several whitespace-separated class names;
  # String#split with no argument also discards stray whitespace
  class_names = class_names.compact.flat_map(&:split)

  # reject if .UFIReplyList found
  class_names.include?('UFIReplyList')

end

@authors_nodes.map(&:text)
