I'm using a Rake task that runs multiple scraping scripts and exports category data for 35 different cities of a site to 35 different CSV files.
The problem I'm having is that when I run the master Rake task from the root directory of the folder, it creates a new file in the parent directory "resultsForCity.csv" instead of seeing the current CSV file within that given subfolder and adding the data to it. To get around it, I thought I should make my master Rake task (within the parent directory) run slave Rake tasks that then run the scraping scripts, but that didn't work either.
However, if I cd into one of the city folders and run the scraper or Rake task from there, it adds the data to the corresponding CSV file located within that subfolder. Am I not clearly defining dependencies or something else?
Things I've tried:
Here's my Rake task code:
require "rake"
task default: %w[getData]
task :getData do
  Rake::FileList.new("**/*.rb*").each do |file|
    ruby file
  end
end
And here's my scraper code:
require "nokogiri"
require "open-uri"
require "csv"
url = "http://example.com/atlanta"
doc = Nokogiri::HTML(URI.open(url))
CSV.open("resultsForAtlanta.csv", "wb") do |csv|
  doc.css(".tile-title").each do |item|
    csv << [item.text.tr("[()]+0-9", ""), item.text.tr("^0-9$", "")]
  end
  doc.css(".tile-subcategory").each do |tile|
    csv << [tile.text.tr("[()]+0-9", ""), tile.text.tr("^0-9$", "")]
  end
end
Any help would be more than greatly appreciated.
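What if you let your scraper script take an output filename and use the directory structure to help you build the output filenames? A relative path like "resultsForAtlanta.csv" is resolved against the directory the process was started from, not the directory the script lives in, which is why running from the parent directory creates a new file there.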
Assuming you have a directory tree something like
Atlanta/scraper.rb
LosAngeles/scraper.rb
...
where scraper.rb is your scraping script, you should be able to write the task somewhat like this:
task :getData do
  Rake::FileList.new("**/scraper.rb").each do |scraper_script|
    dir = File.dirname(scraper_script)
    city = File.basename(dir)
    csv_file = File.join(dir, "resultsFor#{city}.csv")
    ruby [scraper_script, csv_file].join(" ")
  end
end
and then your Ruby script could just grab the filename off the command line like this:
CSV.open(ARGV[0], "wb") do |csv|
  ...
end
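If it helps, here is a rough sketch of how the scraper side could pick its output path. Ruby's ARGV does not include the script name, so the first argument is ARGV[0]. The helper names output_path and write_rows are made up for illustration, and the fallback to a file next to the script (via __dir__) is my own assumption, not something the task above requires.

```ruby
require "csv"

# Hypothetical helper: use the path given on the command line if there is one,
# otherwise fall back to a CSV file in the same directory as the script.
def output_path(argv, script_dir, city)
  argv.first || File.join(script_dir, "resultsFor#{city}.csv")
end

# Hypothetical helper: write each row (an array of fields) to the CSV at path.
def write_rows(path, rows)
  CSV.open(path, "wb") do |csv|
    rows.each { |row| csv << row }
  end
end
```

Inside the scraper you would then call something like write_rows(output_path(ARGV, __dir__, "Atlanta"), rows), so the same file should be targeted whether the script is started from the parent directory or from inside the city folder.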