
Scraping in Ruby with the regex \w

I want to scrape the website https://www.bananatic.com/es/forum/games/

and extract the contents of the "name", "views" and "replies" tags. My main problem is getting only the non-empty content of the "name" tag. Can you help me? I need to save only the elements that actually contain text.

This is my code. I have three variables:

  • per saves what is inside the replies.
  • pir saves what is inside the views.
  • res saves what is inside the names.

Each array should contain only the elements that actually have content, but for the names, entries like [" "] are also saved, and I do not want them in my array.

    require 'nokogiri'
    require 'open-uri'
    require 'pp'
    require 'csv'


    unless File.readable?('data.html')
      url = 'https://www.bananatic.com/de/forum/games/'
      data = URI.open(url).read
      File.open('data.html', 'wb') { |f| f << data }
    end
    data = File.read('data.html')
    document = Nokogiri::HTML(data)


    per = document.xpath('//div[@class="replies"]/text()[string-length(normalize-space(.)) > 0]')
                  .map { |node| node.to_s[/\d+/] }

    p per

    pir = document.xpath('//div[@class="views"]/text()[string-length(normalize-space(.)) > 0]')
                  .map { |node| node.to_s[/\w+/] }

    p pir

    links2 = document.css('.topics ul li div')
    res = links2.map do |lk|
      name = lk.css('.name  p a').inner_text
      [name]
    end
    p res

To fix it, I tried a regular expression, but the attempt failed. I just replaced .inner_text with .to_s[/\w+/], and it does not give the result I expect.

Now I have an array with nil values and some "a" strings, and I don't know where they come from.


These references on XPath and CSS might help.

For your CSS, check this out: https://kittygiraudel.github.io/selectors-explained/

The following will get you what you are looking for:

    document.xpath('//div[@class="topics"]/ul/li//div[@class="name"]/a[@class="js-link avatar"]/text()').map { |node| node.to_s.strip }

If you want to understand where your array is coming from, take one step back and just print out lk.css('.name p a').to_s. The real issue, though, is that your selectors are off.

All that being said, looking at the structure of the page, you would be better off with something like this:
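As a minimal sketch of why the stray "a" and nil values are likely to appear: calling .to_s on a matched link returns its HTML, not its text, so the first run of word characters that \w+ finds is the tag name. The markup string below is hypothetical and only stands in for what the selector might return; the real page's HTML may differ.

```ruby
# Hypothetical markup, standing in for what `lk.css('.name p a').to_s`
# could return when the selector matches a link.
matched_html = '<a class="js-link avatar" href="/profile">MrCasual2502</a>'
# An empty NodeSet converts to an empty string.
unmatched_html = ''

first_word = matched_html[/\w+/]   # first run of word characters is the tag name
no_match   = unmatched_html[/\w+/] # nothing to match in an empty string

p first_word # => "a"
p no_match   # => nil
```

This is why selecting the text node (or using inner_text) and then calling strip is a safer route than applying a regex to the serialized HTML.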

    require 'nokogiri'
    require 'open-uri'

    url = "https://www.bananatic.com/de/forum/games/"

    doc = Nokogiri::HTML(URI.open(url))
    # Set a root node set to start from
    topics = doc.xpath('//div[@class="topics"]/ul/li')

    # Loop over the set, keeping only the items that have the needed info
    details = topics.filter_map do |topic|
      next unless topic.at_xpath('.//div[@class="name"]') # skip ones without the needed info
      # Map details into a Hash
      {
        name: topic.at_xpath('.//div[@class="name"]/a[@class="js-link avatar"]/text()').to_s.strip,
        post_year: topic.at_xpath('.//div[@class="name"]/text()[string-length(normalize-space(.)) > 0]').to_s[/\d{4}/],
        replies: topic.at_xpath('.//div[@class="replies"]/text()').to_s.strip,
        views: topic.at_xpath('.//div[@class="views"]/text()').to_s.strip
      }
    end

The result of details would be:

    [{:name=>"MrCasual2502", :post_year=>"2016", :replies=>"0", :views=>"236"},
     {:name=>"MrCasual2502", :post_year=>"2016", :replies=>"0", :views=>"164"},
     {:name=>"EdgarAllen", :post_year=>"2022", :replies=>"0", :views=>"1"},
     {:name=>"RAMONVC", :post_year=>"2022", :replies=>"0", :views=>"0"},
     {:name=>"RAMONVC", :post_year=>"2022", :replies=>"0", :views=>"1"},
     {:name=>"tokyobreez", :post_year=>"2021", :replies=>"2", :views=>"18"},
     {:name=>"matrix12334", :post_year=>"2022", :replies=>"0", :views=>"2"},
     {:name=>"juggalohomie420", :post_year=>"2017", :replies=>"3", :views=>"89"},
     {:name=>"Imas86", :post_year=>"2022", :replies=>"2", :views=>"2"},
     {:name=>"SmilesImposterr", :post_year=>"2021", :replies=>"1", :views=>"17"},
     {:name=>"bebb", :post_year=>"2019", :replies=>"7", :views=>"22"},
     {:name=>"IMBANANAZ", :post_year=>"2016", :replies=>"1", :views=>"4"},
     {:name=>"IWantSteamKeys", :post_year=>"2021", :replies=>"1", :views=>"4"},
     {:name=>"gamormoment", :post_year=>"2021", :replies=>"1", :views=>"47"},
     {:name=>"Lovestruck", :post_year=>"2021", :replies=>"3", :views=>"46"},
     {:name=>"KillerBotAldwin1", :post_year=>"2021", :replies=>"1", :views=>"95"},
     {:name=>"purplevestynstr", :post_year=>"2020", :replies=>"1", :views=>"13"},
     {:name=>"Janabanana", :post_year=>"2021", :replies=>"3", :views=>"3"},
     {:name=>"apache724", :post_year=>"2017", :replies=>"3", :views=>"33"},
     {:name=>"MrsSue66", :post_year=>"2021", :replies=>"1", :views=>"38"}]
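
The next unless inside filter_map is what keeps rows without the needed information out of the result: next makes the block return nil, and filter_map discards nil and false results. Here is a minimal sketch of the same pattern on plain strings, assuming Ruby 2.7 or later (where filter_map is available). The raw_names values are made up for illustration and are not taken from the page.

```ruby
# Made-up input: the kind of blank and whitespace-only strings that
# end up in an array when a selector also matches empty nodes.
raw_names = ['', '  MrCasual2502 ', ' ', "EdgarAllen\n"]

names = raw_names.filter_map do |text|
  cleaned = text.strip
  next if cleaned.empty? # block returns nil, so filter_map drops this entry
  cleaned
end

p names # => ["MrCasual2502", "EdgarAllen"]
```

The same idea would work on the res array from the question, as long as each entry is stripped before it is checked for emptiness.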

