I am trying to scrape the website https://www.bananatic.com/es/forum/games/
and extract the "name", "views" and "replies" tags. My main problem is getting only the non-empty content of the "name" tag. Can you help me? I need to save only the elements that actually contain text.
This is my code. I have three variables:
Each array should contain only the elements that have some content, but for the names, blank entries such as [" "] are saved as well, and I do not want them in my array.
require 'nokogiri'
require 'open-uri'
require 'pp'
require 'csv'
unless File.readable?('data.html')
  url = 'https://www.bananatic.com/de/forum/games/'
  data = URI.open(url).read
  File.open('data.html', 'wb') { |f| f << data }
end
data = File.read('data.html')
document = Nokogiri::HTML(data)
per = document.xpath('//div[@class="replies"]/text()[string-length(normalize-space(.)) > 0]')
              .map { |node| node.to_s[/\d+/] }
p per
pir = document.xpath('//div[@class="views"]/text()[string-length(normalize-space(.)) > 0]')
              .map { |node| node.to_s[/\w+/] }
p pir
links2 = document.css('.topics ul li div')
res = links2.map do |lk|
  name = lk.css('.name p a').inner_text
  [name]
end
p res
To fix it I tried adding a regular expression, but the attempt failed. I simply replaced .inner_text with .to_s[/\w+/], and it does not do what I expected.
Now I have an array with nil values and some "a" strings, and I don't know where they come from.
This might help: XPath and CSS.
For your CSS selectors, check this out: https://kittygiraudel.github.io/selectors-explained/
The following will get you what you are looking for:
document.xpath('//div[@class="topics"]/ul/li//div[@class="name"]/a[@class="js-link avatar"]/text()').map { |node| node.to_s.strip }
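If you would rather keep the original CSS-based loop, another option is to drop the blank entries after the fact. Here is a minimal sketch in plain Ruby; the res array below is a hypothetical stand-in for what the question's loop produces, not real output from the page:

```ruby
# Hypothetical stand-in for the `res` array built in the question:
# one single-element array per matched div, many of them blank.
res = [[" "], ["MrCasual2502"], [""], ["EdgarAllen"], ["\n  "]]

# Flatten the one-element arrays, trim whitespace, and drop empty strings.
names = res.flatten.map(&:strip).reject(&:empty?)

p names
```

This only hides the symptom, though. Selecting the right nodes in the first place avoids producing the blanks at all.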
If you want to understand where the values in your array come from, take one step back and just print out lk.css('.name p a').to_s.
The real issue, though, is that your selectors are off.
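To illustrate where the "a" and nil values most likely come from, here is a small sketch using plain strings. Both strings are assumptions standing in for what lk.css('.name p a').to_s returns: serialized markup when the selector matches, and (as far as I know) an empty string when the node set is empty. The href value is made up.

```ruby
# Hypothetical markup for a matched link; the href is invented for illustration.
matched = '<a class="js-link avatar" href="/some/topic">MrCasual2502</a>'
# Assumed serialization of an empty node set.
unmatched = ''

# /\w+/ returns the first run of word characters, which is the tag name.
first_word = matched[/\w+/]
# With nothing to match, String#[] returns nil.
no_match = unmatched[/\w+/]

p first_word
p no_match
```

So the regular expression is applied to the markup rather than to the link text, which is why it picks up the tag name instead of the user name.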
All that being said, looking at the structure of the page, you would be better off with something like this:
require 'nokogiri'
require 'open-uri'
url = "https://www.bananatic.com/de/forum/games/"
doc = Nokogiri::HTML(URI.open(url))
# Set a root node set to start from
topics = doc.xpath('//div[@class="topics"]/ul/li')
# loop the set
details = topics.filter_map do |topic|
  next unless topic.at_xpath('.//div[@class="name"]') # skip ones without the needed info
  # Map details into a Hash
  {
    name: topic.at_xpath('.//div[@class="name"]/a[@class="js-link avatar"]/text()').to_s.strip,
    post_year: topic.at_xpath('.//div[@class="name"]/text()[string-length(normalize-space(.)) > 0]').to_s[/\d{4}/],
    replies: topic.at_xpath('.//div[@class="replies"]/text()').to_s.strip,
    views: topic.at_xpath('.//div[@class="views"]/text()').to_s.strip
  }
end
The result of details would be:
[{:name=>"MrCasual2502", :post_year=>"2016", :replies=>"0", :views=>"236"},
{:name=>"MrCasual2502", :post_year=>"2016", :replies=>"0", :views=>"164"},
{:name=>"EdgarAllen", :post_year=>"2022", :replies=>"0", :views=>"1"},
{:name=>"RAMONVC", :post_year=>"2022", :replies=>"0", :views=>"0"},
{:name=>"RAMONVC", :post_year=>"2022", :replies=>"0", :views=>"1"},
{:name=>"tokyobreez", :post_year=>"2021", :replies=>"2", :views=>"18"},
{:name=>"matrix12334", :post_year=>"2022", :replies=>"0", :views=>"2"},
{:name=>"juggalohomie420", :post_year=>"2017", :replies=>"3", :views=>"89"},
{:name=>"Imas86", :post_year=>"2022", :replies=>"2", :views=>"2"},
{:name=>"SmilesImposterr", :post_year=>"2021", :replies=>"1", :views=>"17"},
{:name=>"bebb", :post_year=>"2019", :replies=>"7", :views=>"22"},
{:name=>"IMBANANAZ", :post_year=>"2016", :replies=>"1", :views=>"4"},
{:name=>"IWantSteamKeys", :post_year=>"2021", :replies=>"1", :views=>"4"},
{:name=>"gamormoment", :post_year=>"2021", :replies=>"1", :views=>"47"},
{:name=>"Lovestruck", :post_year=>"2021", :replies=>"3", :views=>"46"},
{:name=>"KillerBotAldwin1", :post_year=>"2021", :replies=>"1", :views=>"95"},
{:name=>"purplevestynstr", :post_year=>"2020", :replies=>"1", :views=>"13"},
{:name=>"Janabanana", :post_year=>"2021", :replies=>"3", :views=>"3"},
{:name=>"apache724", :post_year=>"2017", :replies=>"3", :views=>"33"},
{:name=>"MrsSue66", :post_year=>"2021", :replies=>"1", :views=>"38"}]
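Note that every value in the result is a String. If you need numbers, a follow-up step along these lines could convert them. This is only a sketch: it assumes the counts are always plain digit strings, and the typed variable name is made up for the example. The two rows are copied from the result above.

```ruby
# Two rows copied from the result above; in practice this would be the full `details` array.
details = [
  { name: "MrCasual2502", post_year: "2016", replies: "0", views: "236" },
  { name: "tokyobreez",   post_year: "2021", replies: "2", views: "18" }
]

# Convert the numeric strings to integers, leaving the name untouched.
# Caveat: nil.to_i is 0, so a missing post_year would silently become 0.
typed = details.map do |row|
  row.merge(
    post_year: row[:post_year].to_i,
    replies:   row[:replies].to_i,
    views:     row[:views].to_i
  )
end

p typed
```

Hash#merge returns a new Hash, so the original details array keeps its string values.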