
Using regex to parse line breaks and dollar signs in scraped image src using Nokogiri

I am using Nokogiri in my Rails 4 app to scrape images from websites, and some of them give me unexpected '$' after '' errors.

For instance, here is one sample image url output:

  <img src="http://x.example.com/images/detail/ln9502/1_ln-9502---

  grh_375.jpg" alt="" style="display: block;">

I suspect it is the line break that is giving me trouble?

Here is another:

  <img class="abc" src="http://xxx.example.com/is/image/Sample/503508739_1?$sample_size$">

I suspect it is the dollar signs giving me issues here.

Here is what I have in one of my controllers that is saving the image:

  item_imageurl = page.search(library.image_selector).first.attribute('src').value(/(.|\n|\r)*/).to_s

Items belong to a library, and I set the CSS selector in each library. Any ideas on what regex I could use to ignore line breaks and dollar signs, or is there a simpler solution?

You can remove newlines and spaces from a string with .gsub:

  item_imageurl = page.search(library.image_selector).first.attribute('src').value().to_s.gsub(/[\n ]/, "")

I'm assuming ...attribute('src').value() returns the contents of the src attribute.
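If the src can also contain carriage returns or tabs, a slightly broader pattern may be safer. Here is a minimal sketch in plain Ruby, with no Nokogiri needed to try it. The helper name clean_src is made up for illustration, and the sample strings are the two src values from the question:

```ruby
# Hypothetical helper (not part of Nokogiri): removes every whitespace
# character (spaces, tabs, \r, \n) from a raw src value.
def clean_src(raw)
  raw.to_s.gsub(/\s+/, "")
end

# The first src from the question, with the line break inside the URL.
raw_src = "http://x.example.com/images/detail/ln9502/1_ln-9502---\n\n  grh_375.jpg"
puts clean_src(raw_src)

# The second src from the question; it has no whitespace, so gsub leaves
# it alone, dollar signs included.
dollar_src = 'http://xxx.example.com/is/image/Sample/503508739_1?$sample_size$'
puts clean_src(dollar_src)
```

In the controller this would presumably be called on the string returned by ...attribute('src').value. Note that it only deals with whitespace; it leaves the dollar signs in place, so if they are the actual cause of the error, that needs a separate fix.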

For the record, your regex matches the entire string, and its capture group keeps only the last character. You might want to check out http://regex101.com/ for testing your regular expressions.
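You can also check what the pattern from the question does directly in irb. This is a small sketch; the sample string is made up:

```ruby
# Made-up multi-line input to run the question's pattern against.
sample = "abc\ndef"
m = sample.match(/(.|\n|\r)*/)

puts m[0].inspect  # the overall match
puts m[1].inspect  # what the repeated capture group holds afterwards
```

Here m[0] is the full input and m[1] is just its final character, because a repeated group only remembers its last iteration. Either way, the pattern does not remove anything from the string.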
