Using regex to parse line breaks and dollar signs in scraped image src using Nokogiri

Question

I am using Nokogiri in my Rails 4 app to scrape images from websites, and some of them give me "unexpected '$' after ''" errors.

For instance, here is one sample image tag from the scraped output:

  <img src="http://x.example.com/images/detail/ln9502/1_ln-9502---

  grh_375.jpg" alt="" style="display: block;">

I suspect it is the line break that is giving me trouble?

Here is another:

  <img class="abc" src="http://xxx.example.com/is/image/Sample/503508739_1?$sample_size$">

I suspect it is the dollar signs giving me issues here.

Here is what I have in one of my controllers that is saving the image:

  item_imageurl = page.search(library.image_selector).first.attribute('src').value(/(.|\n|\r)*/).to_s

Items belong to a library, and I set the CSS selector on each library. Any ideas on what regex I could use to ignore line breaks and dollar signs, unless there's a simpler solution?

Answer 1

You can remove newlines and spaces from a string with .gsub:

  item_imageurl = page.search(library.image_selector).first.attribute('src').value().to_s.gsub(/[\n ]/, "")

I'm assuming ...attribute('src').value() returns the contents of the src attribute.
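
As a rough sketch of the same idea, the cleanup can be pulled into a small helper and tried on the sample values from the question. The name clean_src is hypothetical (it is not part of Nokogiri or the original code), and the helper assumes you want every whitespace character removed, including \r and tabs, not just \n and spaces:

```ruby
# Hypothetical helper (not from the original post): removes every whitespace
# character (\n, \r, tabs, spaces) from a scraped src value.
# nil.to_s is "", so a missing attribute value yields an empty string.
def clean_src(raw)
  raw.to_s.gsub(/\s+/, "")
end

# Sample values taken from the question; single quotes keep "$" literal.
broken = "http://x.example.com/images/detail/ln9502/1_ln-9502---\n\n  grh_375.jpg"
dollar = 'http://xxx.example.com/is/image/Sample/503508739_1?$sample_size$'

puts clean_src(broken)
puts clean_src(dollar)
```

The second URL passes through unchanged, since this helper only touches whitespace and leaves the dollar signs in the query string alone.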

For the record, your regex (.|\n|\r)* matches the entire string, and its capture group ends up holding only the last character, so it does not filter anything out. You might want to check out http://regex101.com/ for testing your regular expressions.
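
To see what the pattern from the question does, here is a quick check in plain Ruby. The input string and variable names are made up for illustration:

```ruby
# The pattern from the question, applied to a made-up two-line string.
pattern = /(.|\n|\r)*/
m = pattern.match("ab\ncd")

whole_match = m[0] # everything the pattern consumed
last_group  = m[1] # what the capture group held on its final repetition

p whole_match
p last_group
```

The whole match spans the full input, line break included, while the group retains only the final character it matched.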

solution1
0 2014-05-23 14:50:42
