Ruby/Rails: how to get parsed content of HTML file?

I have a Rails 4 app. I am adding a function so that the user can provide a document and within that document, search for certain words. I would like this to work on text as well as HTML. For the HTML to work correctly, I was wondering if there is a Ruby or Rails function that provides the parsed output of an HTML string.

For example, given the string <strong>Here</strong> is some <em>HTML</em>, I need a function that returns Here is some HTML. The reason is that a search for the string "some HTML" will not find it in <strong>Here</strong> is some <em>HTML</em> because of the <em> tags. However, when the HTML is viewed in a browser, the words "some HTML" are there (albeit with some formatting; I don't care about the formatting).

Just stripping out tags in angle brackets won't work, because what if the input looks like here are &nbsp;&nbsp;&nbsp; lots of spaces? I need the function to return here are lots of spaces, with HTML entities parsed as well.

You want an HTML/XML parser. The Nokogiri gem is excellent.

If you don't want to depend on Nokogiri (which takes forever to install), I think you can get a long way with regular expressions.

What you essentially want is the content between the tags, but not the tags themselves. There are exceptions to this, though. For instance, you'll want to eliminate the content of style tags and script tags. Finally, you might actually want to keep some of the attributes from the meta tags.

Here's a regular expression that will eliminate all your tags.

html_string = "<html><p>Hello <strong>world</strong></p></html>"
html_string.gsub(/<[^>]*>/, '')
=> "Hello world"

This regex looks for any < character followed by zero or more characters that are not > and then by a closing >, and replaces the whole match with an empty string.
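
To also cover the style and script exception mentioned earlier, the same idea can be extended into a small pipeline. This is only a sketch: strip_tags_roughly is a hypothetical name, and regular expressions will still get malformed or unusual markup wrong (for example a > inside an attribute value).

```ruby
# Hypothetical helper; a regex approximation, not a real HTML parser.
def strip_tags_roughly(html)
  html
    .gsub(/<!--.*?-->/m, '')                             # drop HTML comments
    .gsub(%r{<(script|style)\b[^>]*>.*?</\1\s*>}mi, '')  # drop script/style together with their content
    .gsub(/<[^>]*>/, '')                                 # drop every remaining tag, keep the text
end

puts strip_tags_roughly('<html><p>Hello <strong>world</strong></p></html>')
```

Note that the tags are replaced with an empty string, as in the answer, so text from adjacent block elements such as two consecutive paragraphs ends up joined without a space.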

To refine this, you might also want to replace HTML entities, e.g. &oslash;, with the actual Unicode characters so that the text is searchable.
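
Entity replacement can be sketched without any gem as well. The table below is a deliberately tiny, hand-picked subset (an assumption for illustration; the real list of named entities is far longer), and decode_entities is a hypothetical helper name. Numeric references such as &#248; and &#xF8; are handled generically.

```ruby
# Tiny illustrative subset of named entities; extend as needed.
NAMED_ENTITIES = {
  'amp' => '&', 'lt' => '<', 'gt' => '>', 'quot' => '"', 'apos' => "'",
  'nbsp' => "\u00A0", 'oslash' => "\u00F8"
}.freeze

# Hypothetical helper: decodes each entity reference exactly once.
def decode_entities(text)
  text.gsub(/&(#x\h+|#\d+|\w+);/i) do |match|
    ref = Regexp.last_match(1)
    case ref
    when /\A#x(\h+)\z/i then [$1.hex].pack('U')   # hexadecimal reference
    when /\A#(\d+)\z/   then [$1.to_i].pack('U')  # decimal reference
    else NAMED_ENTITIES.fetch(ref, match)         # unknown names are left untouched
    end
  end
end

decoded = decode_entities('here are &nbsp;&nbsp;&nbsp; lots of spaces')
# [[:space:]] matches the non-breaking space as well, unlike \s.
puts decoded.gsub(/[[:space:]]+/, ' ')
```

Run this after the tags have been removed, not before, so that an escaped &lt;em&gt; in the text is not mistaken for a real tag.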
