Parsing an HTML string with Nokogiri

Question

I'm trying to write a Ruby script that parses an HTML string and gets some values from specific nodes.

Currently I'm struggling with just reading the string into a Nokogiri document.

This code:

#!/usr/bin/ruby

html_doc = Nokogiri::HTML("<html>  <meta content="text/html; charset=UTF-8"/>  <body style='margin:20px'>    <p>The following user has registered a device, click on the link below to review the user and make any changes if necessary.</p>    <ul style='list-style-type:none; margin:25px 15px;'>      <li><b>User name:</b> Test User</li>      <li><b>User email:</b> test@abc.com</li>      <li><b>Identifier:</b> abc123def132afd1213afas</li>      <li><b>Description:</b> Tom's iPad</li>      <li><b>Model:</b> iPad 3</li>      <li><b>Platform:</b> </li>      <li><b>App:</b> Test app name</li>      <li><b>UserID:</b> </li>     </ul>    <p>Review user: https://cirrus.app47.com/users?search=test@abc.com</p>            <hr style='height=2px; color:#aaa'/>        <p>We hope you enjoy the app store experience!</p>        <p style='font-size:18px; color:#999'>Powered by App47</p>      <img src='https://cirrus.app47.com/notifications/562506219ac25b1033000904/img' alt=''/></body></html>")

Produces this error:

$ ruby emailParser.rb 
emailParser.rb:3: syntax error, unexpected tIDENTIFIER, expecting ')'
...ML("<html>  <meta content="text/html; charset=UTF-8"/>  <bod...
...                               ^
emailParser.rb:3: syntax error, unexpected tSTRING_BEG, expecting end-of-input
...tent="text/html; charset=UTF-8"/>  <body style='margin:20px'...
...                               ^

Note that I have tried the solution here, with the same result:

"syntax error, unexpected tIDENTIFIER, expecting $end"

Answer 1

You have to change the quotes around the HTML string from " to ', and change the quotes inside the HTML to ". A literal apostrophe inside the string, as in Tom's, then has to be escaped as \'. Something like this should work:

#!/usr/bin/ruby

require 'nokogiri'

html_doc = Nokogiri::HTML('<html>  <meta content="text/html; charset=UTF-8"/>  <body style="margin:20px">    <p>The following user has registered a device, click on the link below to review the user and make any changes if necessary.</p>    <ul style="list-style-type:none; margin:25px 15px;">      <li><b>User name:</b> Test User</li>      <li><b>User email:</b> test@abc.com</li>      <li><b>Identifier:</b> abc123def132afd1213afas</li>      <li><b>Description:</b> Tom\'s iPad</li>      <li><b>Model:</b> iPad 3</li>      <li><b>Platform:</b> </li>      <li><b>App:</b> Test app name</li>      <li><b>UserID:</b> </li>     </ul>    <p>Review user: https://cirrus.app47.com/users?search=test@abc.com</p>            <hr style="height=2px; color:#aaa"/>        <p>We hope you enjoy the app store experience!</p>        <p style="font-size:18px; color:#999">Powered by App47</p>      <img src="https://cirrus.app47.com/notifications/562506219ac25b1033000904/img" alt=""/></body></html>')

Answer 2

The problem is that you have double quotes within your string, which confuse the parser because you're also using double quotes to surround the string. To illustrate:

puts "foo"bar"
# => SyntaxError: unexpected tIDENTIFIER, expecting end-of-input
#    puts "foo"bar"
#                 ^

You might intend for this to print foo"bar, but when the parser gets to the second " (after foo) it thinks the string is over, and so the stuff after it causes a syntax error. (Stack Overflow's syntax highlighting even gives you a hint: see how on the first line "foo" is colored differently from bar"? A good syntax-highlighting text editor will do the same thing.)
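The parser's behavior can be checked from a script without crashing it. The following is only a sketch: parses? is a made-up helper name, and it relies on eval raising SyntaxError for source that does not parse.

```ruby
# Returns true if +source+ is syntactically valid Ruby. The source is wrapped
# in a lambda literal so that eval only compiles it and never runs it.
def parses?(source)
  eval("-> {\n#{source}\n}")
  true
rescue SyntaxError
  false
end

broken  = 'puts "foo"bar"'    # the inner quote ends the string early
escaped = 'puts "foo\"bar"'   # the inner quote is escaped with a backslash

puts parses?(broken)   # => false
puts parses?(escaped)  # => true
```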

One solution is to use single quotes instead:

puts 'bar"baz'
# => bar"baz

That fixes the problem in this case, but won't actually help you, because your string also has single quotes inside it!
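A small illustration of that limitation, using two fragments modeled on the HTML in the question (the variable names are made up for this sketch):

```ruby
# Double quotes pass through a single-quoted literal untouched...
attr = '<meta content="text/html; charset=UTF-8"/>'

# ...but an apostrophe would end the literal, so it has to be written as \'
item = '<li><b>Description:</b> Tom\'s iPad</li>'

puts attr
puts item   # the backslash is not part of the resulting string
```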

Another solution is to escape your quotation marks by preceding them with a backslash (\), like so:

puts "foo\"bar"
# => foo"bar

...but that gets a little tedious (and sometimes tricky) for long strings like yours. A better solution is to use a special kind of string called a "heredoc" (short for "here document," for what it's worth):

str = <<-END_OF_HTML
  <html>  <meta content="text/html; charset=UTF-8"/>  <body style='margin:20px'>    <p>The following user has registered a device, click on the link below to review the user and make any changes if necessary.</p>    <ul style='list-style-type:none; margin:25px 15px;'>      <li><b>User name:</b> Test User</li>      <li><b>User email:</b> test@abc.com</li>      <li><b>Identifier:</b> abc123def132afd1213afas</li>      <li><b>Description:</b> Tom's iPad</li>      <li><b>Model:</b> iPad 3</li>      <li><b>Platform:</b> </li>      <li><b>App:</b> Test app name</li>      <li><b>UserID:</b> </li>     </ul>    <p>Review user: https://cirrus.app47.com/users?search=test@abc.com</p>            <hr style='height=2px; color:#aaa'/>        <p>We hope you enjoy the app store experience!</p>        <p style='font-size:18px; color:#999'>Powered by App47</p>      <img src='https://cirrus.app47.com/notifications/562506219ac25b1033000904/img' alt=''/></body></html>
END_OF_HTML

html_doc = Nokogiri::HTML(str)

The delimiter END_OF_HTML is arbitrary. You could use EOF or XYZZY or whatever suits your fancy instead, although it's a good idea to use something meaningful. (You'll notice that Stack Overflow's syntax highlighting has a little trouble with heredocs; most code editors do fine with them, though.)
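For reference, here is a short sketch of how the common heredoc forms differ. The sample markup and the variable names are made up for illustration, and the <<~ form assumes Ruby 2.3 or later.

```ruby
name = "Test User"

# <<-ID keeps the body's indentation and interpolates #{...},
# like a double-quoted string. The closing delimiter may be indented.
plain = <<-END_OF_HTML
  <li class="user" title='x'>#{name}</li>
  END_OF_HTML

# <<~ID (Ruby 2.3+) also strips the common leading whitespace.
squiggly = <<~END_OF_HTML
  <li class="user" title='x'>#{name}</li>
END_OF_HTML

# Quoting the delimiter with single quotes turns interpolation off.
raw = <<-'END_OF_HTML'
  <li>#{name}</li>
END_OF_HTML

puts plain
puts squiggly
puts raw
```

Both kinds of quote appear unescaped in the body, which is what makes heredocs convenient for HTML.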

You can make this a little more compact like this:

Nokogiri::HTML <<-END_OF_HTML
  <html>  <meta content="text/html; charset=UTF-8"/>  <body style='margin:20px'>    <p>The following user has registered a device, click on the link below to review the user and make any changes if necessary.</p>    <ul style='list-style-type:none; margin:25px 15px;'>      <li><b>User name:</b> Test User</li>      <li><b>User email:</b> test@abc.com</li>      <li><b>Identifier:</b> abc123def132afd1213afas</li>      <li><b>Description:</b> Tom's iPad</li>      <li><b>Model:</b> iPad 3</li>      <li><b>Platform:</b> </li>      <li><b>App:</b> Test app name</li>      <li><b>UserID:</b> </li>     </ul>    <p>Review user: https://cirrus.app47.com/users?search=test@abc.com</p>            <hr style='height=2px; color:#aaa'/>        <p>We hope you enjoy the app store experience!</p>        <p style='font-size:18px; color:#999'>Powered by App47</p>      <img src='https://cirrus.app47.com/notifications/562506219ac25b1033000904/img' alt=''/></body></html>
END_OF_HTML

Or with parentheses (it looks a little odd, but it works, and is sometimes necessary):

Nokogiri::HTML(<<-END_OF_HTML)
  <html>  <meta content="text/html; charset=UTF-8"/>  <body style='margin:20px'>    <p>The following user has registered a device, click on the link below to review the user and make any changes if necessary.</p>    <ul style='list-style-type:none; margin:25px 15px;'>      <li><b>User name:</b> Test User</li>      <li><b>User email:</b> test@abc.com</li>      <li><b>Identifier:</b> abc123def132afd1213afas</li>      <li><b>Description:</b> Tom's iPad</li>      <li><b>Model:</b> iPad 3</li>      <li><b>Platform:</b> </li>      <li><b>App:</b> Test app name</li>      <li><b>UserID:</b> </li>     </ul>    <p>Review user: https://cirrus.app47.com/users?search=test@abc.com</p>            <hr style='height=2px; color:#aaa'/>        <p>We hope you enjoy the app store experience!</p>        <p style='font-size:18px; color:#999'>Powered by App47</p>      <img src='https://cirrus.app47.com/notifications/562506219ac25b1033000904/img' alt=''/></body></html>
END_OF_HTML
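One situation where the parentheses matter is chaining a method call onto the result. A method written directly after the heredoc identifier applies to the heredoc string itself, not to the return value of the method receiving it. The sketch below uses a hypothetical list_items helper in place of Nokogiri::HTML so that it runs without the gem:

```ruby
# Hypothetical stand-in for a parser: returns the text of every <li> element.
def list_items(html)
  html.scan(/<li>(.*?)<\/li>/).flatten
end

# With parentheses, .first is called on the array returned by list_items.
first = list_items(<<-END_OF_HTML).first
  <ul><li>iPad 3</li><li>Test app name</li></ul>
END_OF_HTML

# Without them, methods after the identifier apply to the heredoc string.
shouted = <<-END_OF_HTML.strip.upcase
  <p>hello</p>
END_OF_HTML

puts first    # => iPad 3
puts shouted  # => <P>HELLO</P>
```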

You can read more about heredocs, and other ways to represent strings, in the Literals section of the Ruby documentation.

Answer 1
1 2015-10-19 17:39:39

Answer 2
1 Accepted 2015-10-19 17:41:33
