SCRAPING RUBY with REGEX /w

Question

I want to scrape the website https://www.bananatic.com/es/forum/games/

and extract the content of the "name", "views" and "replies" elements. My main problem is getting only the non-empty content of the "name" elements. Can you help me? I need to keep only the elements that actually contain text.

This is my code. I have three variables:

per stores what is inside the replies.
pir stores what is inside the views.
res stores what is inside the names.

Each array should contain only the elements that actually have content, but in the names blank entries such as [" "] are saved, and I don't want them in my array.

    require 'nokogiri'
    require 'open-uri'
    require 'pp'
    require 'csv'


    unless File.readable?('data.html')
      url = 'https://www.bananatic.com/de/forum/games/'
      data = URI.open(url).read
      File.open('data.html', 'wb') { |f| f << data }
    end
    data = File.read('data.html')
    document = Nokogiri::HTML(data)


    per = document.xpath('//div[@class="replies"]/text()[string-length(normalize-space(.)) > 0]')
                  .map { |node| node.to_s[/\d+/] }

    p per

    pir = document.xpath('//div[@class="views"]/text()[string-length(normalize-space(.)) > 0]')
                  .map { |node| node.to_s[/\w+/] }

    p pir

    links2 = document.css('.topics ul li div')
    res = links2.map do |lk|
      name = lk.css('.name  p a').inner_text
      [name]
    end
    p res

To fix it I tried a regular expression, but the attempt failed: I just replaced .inner_text with .to_s[/\w+/], and it did not give me what I expected.

Now I have an array with nil values and some "a" strings, and I don't know where they come from.

Answer 1

This might help: XPath and CSS.

For your CSS check this out: https://kittygiraudel.github.io/selectors-explained/

The following will get you what you are looking for:

    document.xpath('//div[@class="topics"]/ul/li//div[@class="name"]/a[@class="js-link avatar"]/text()').map { |node| node.to_s.strip }
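If blank entries still slip through for some reason, they can also be dropped after the fact. This is only a minimal sketch: `RAW_NAMES` and `clean_names` are hypothetical names, and the sample values are made up to resemble the blank entries described in the question.

```ruby
# Hypothetical scraped values, including blank entries like those in the question.
RAW_NAMES = ['MrCasual2502', ' ', '', "\n  EdgarAllen ", nil].freeze

# Normalise every entry to a stripped string, then drop the empty ones.
def clean_names(names)
  names.map { |name| name.to_s.strip }.reject(&:empty?)
end

p clean_names(RAW_NAMES) # => ["MrCasual2502", "EdgarAllen"]
```

Fixing the selector is still the better option, since filtering afterwards hides the fact that the wrong nodes were matched.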

If you want to understand where the entries in your array come from, take one step back and print lk.css('.name p a').to_s. The real issue, though, is that your selectors are off.
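As a rough illustration of where the "a" and nil values probably come from: calling .to_s on a node set returns the HTML markup, not just the text, so the first \w+ match is the tag name. The markup in `SAMPLE_LINK` is an assumption about what the page returns, and `first_word` is a hypothetical helper.

```ruby
# Hypothetical markup, shaped like what lk.css('.name p a').to_s might return.
SAMPLE_LINK = '<a class="js-link avatar" href="#">MrCasual2502</a>'

# String#[] with a regex returns the first match, or nil when nothing matches.
def first_word(text)
  text[/\w+/]
end

p first_word(SAMPLE_LINK)    # => "a"   (the tag name, matched before the text)
p first_word('')             # => nil   (the selector matched nothing)
p first_word('MrCasual2502') # => "MrCasual2502"
```

So the regex itself works; it is just being applied to markup (or to an empty string) rather than to the link text.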

That said, given how the page is structured, you would be better off with something like this:

    require 'nokogiri'
    require 'open-uri'

    url = "https://www.bananatic.com/de/forum/games/"

    doc = Nokogiri::HTML(URI.open(url))
    # Set a root node set to start from
    topics = doc.xpath('//div[@class="topics"]/ul/li')

    # Loop over the set
    details = topics.filter_map do |topic|
      next unless topic.at_xpath('.//div[@class="name"]') # skip ones without the needed info

      # Map details into a Hash
      {
        name: topic.at_xpath('.//div[@class="name"]/a[@class="js-link avatar"]/text()').to_s.strip,
        post_year: topic.at_xpath('.//div[@class="name"]/text()[string-length(normalize-space(.)) > 0]').to_s[/\d{4}/],
        replies: topic.at_xpath('.//div[@class="replies"]/text()').to_s.strip,
        views: topic.at_xpath('.//div[@class="views"]/text()').to_s.strip
      }
    end

The result of details would be:

    [{:name=>"MrCasual2502", :post_year=>"2016", :replies=>"0", :views=>"236"},
     {:name=>"MrCasual2502", :post_year=>"2016", :replies=>"0", :views=>"164"},
     {:name=>"EdgarAllen", :post_year=>"2022", :replies=>"0", :views=>"1"},
     {:name=>"RAMONVC", :post_year=>"2022", :replies=>"0", :views=>"0"},
     {:name=>"RAMONVC", :post_year=>"2022", :replies=>"0", :views=>"1"},
     {:name=>"tokyobreez", :post_year=>"2021", :replies=>"2", :views=>"18"},
     {:name=>"matrix12334", :post_year=>"2022", :replies=>"0", :views=>"2"},
     {:name=>"juggalohomie420", :post_year=>"2017", :replies=>"3", :views=>"89"},
     {:name=>"Imas86", :post_year=>"2022", :replies=>"2", :views=>"2"},
     {:name=>"SmilesImposterr", :post_year=>"2021", :replies=>"1", :views=>"17"},
     {:name=>"bebb", :post_year=>"2019", :replies=>"7", :views=>"22"},
     {:name=>"IMBANANAZ", :post_year=>"2016", :replies=>"1", :views=>"4"},
     {:name=>"IWantSteamKeys", :post_year=>"2021", :replies=>"1", :views=>"4"},
     {:name=>"gamormoment", :post_year=>"2021", :replies=>"1", :views=>"47"},
     {:name=>"Lovestruck", :post_year=>"2021", :replies=>"3", :views=>"46"},
     {:name=>"KillerBotAldwin1", :post_year=>"2021", :replies=>"1", :views=>"95"},
     {:name=>"purplevestynstr", :post_year=>"2020", :replies=>"1", :views=>"13"},
     {:name=>"Janabanana", :post_year=>"2021", :replies=>"3", :views=>"3"},
     {:name=>"apache724", :post_year=>"2017", :replies=>"3", :views=>"33"},
     {:name=>"MrsSue66", :post_year=>"2021", :replies=>"1", :views=>"38"}]
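Note that every value in the result is a String. If you want to sort or compare by replies or views, the counts may need converting first. A minimal sketch, using three rows copied from the output above; `SAMPLE_DETAILS` and `with_integer_counts` are hypothetical names, not part of the original code.

```ruby
# Three rows copied from the output above, with counts still stored as strings.
SAMPLE_DETAILS = [
  { name: 'MrCasual2502', post_year: '2016', replies: '0', views: '236' },
  { name: 'tokyobreez',   post_year: '2021', replies: '2', views: '18' },
  { name: 'bebb',         post_year: '2019', replies: '7', views: '22' }
].freeze

# Return new hashes with the numeric fields converted to Integer.
# String#to_i is lenient: an empty or non-numeric string becomes 0.
def with_integer_counts(rows)
  rows.map do |row|
    row.merge(
      post_year: row[:post_year].to_i,
      replies: row[:replies].to_i,
      views: row[:views].to_i
    )
  end
end

rows = with_integer_counts(SAMPLE_DETAILS)
most_viewed = rows.max_by { |row| row[:views] }
p most_viewed[:name] # => "MrCasual2502"
```

Without the conversion, comparing "236" and "18" as strings would order them alphabetically rather than numerically.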

solution1
1 ACCEPTED 2023-01-20 21:59:36
