Parse HTML string with Nokogiri
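I'm trying to write a Ruby script that parses an HTML string and gets some values from specific nodes.
At the moment I'm struggling just to read the string into a Nokogiri document.
This code: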
#!/usr/bin/ruby
html_doc = Nokogiri::HTML("<html> <meta content="text/html; charset=UTF-8"/> <body style='margin:20px'> <p>The following user has registered a device, click on the link below to review the user and make any changes if necessary.</p> <ul style='list-style-type:none; margin:25px 15px;'> <li><b>User name:</b> Test User</li> <li><b>User email:</b> test@abc.com</li> <li><b>Identifier:</b> abc123def132afd1213afas</li> <li><b>Description:</b> Tom's iPad</li> <li><b>Model:</b> iPad 3</li> <li><b>Platform:</b> </li> <li><b>App:</b> Test app name</li> <li><b>UserID:</b> </li> </ul> <p>Review user: https://cirrus.app47.com/users?search=test@abc.com</p> <hr style='height=2px; color:#aaa'/> <p>We hope you enjoy the app store experience!</p> <p style='font-size:18px; color:#999'>Powered by App47</p> <img src='https://cirrus.app47.com/notifications/562506219ac25b1033000904/img' alt=''/></body></html>")
produces this error:
$ ruby emailParser.rb
emailParser.rb:3: syntax error, unexpected tIDENTIFIER, expecting ')'
...ML("<html> <meta content="text/html; charset=UTF-8"/> <bod...
... ^
emailParser.rb:3: syntax error, unexpected tSTRING_BEG, expecting end-of-input
...tent="text/html; charset=UTF-8"/> <body style='margin:20px'...
... ^
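Note that I tried the solution here, with the same result:
You have to change the quotes that delimit the HTML string from " to ', or else change the quotes inside the HTML. Something like this should work: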
#!/usr/bin/ruby
require 'nokogiri'
html_doc = Nokogiri::HTML('<html> <meta content="text/html; charset=UTF-8"/> <body style="margin:20px"> <p>The following user has registered a device, click on the link below to review the user and make any changes if necessary.</p> <ul style="list-style-type:none; margin:25px 15px;"> <li><b>User name:</b> Test User</li> <li><b>User email:</b> test@abc.com</li> <li><b>Identifier:</b> abc123def132afd1213afas</li> <li><b>Description:</b> Tom\'s iPad</li> <li><b>Model:</b> iPad 3</li> <li><b>Platform:</b> </li> <li><b>App:</b> Test app name</li> <li><b>UserID:</b> </li> </ul> <p>Review user: https://cirrus.app47.com/users?search=test@abc.com</p> <hr style="height=2px; color:#aaa"/> <p>We hope you enjoy the app store experience!</p> <p style="font-size:18px; color:#999">Powered by App47</p> <img src="https://cirrus.app47.com/notifications/562506219ac25b1033000904/img" alt=""/></body></html>')
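The problem is that your string contains double quotes, which confuses the parser because you are also using double quotes to delimit the string. To demonstrate: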
puts "foo"bar"
# => SyntaxError: unexpected tIDENTIFIER, expecting end-of-input
# puts "foo"bar"
# ^
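You probably meant to print foo"bar, but when the parser reaches the second " (the one after foo), it thinks the string has ended, so whatever comes after it causes a syntax error. (Stack Overflow's syntax highlighting even gives you a hint: see how "foo" on the first line is colored differently from bar"? A text editor with good syntax highlighting will do the same thing.)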
One solution is to use single quotes instead:
puts 'bar"baz'
# => bar"baz
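That fixes this particular example, but it won't actually help you, because your string contains single quotes too!
Another solution is to escape the quotes by putting a backslash (\) in front of them, like this: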
puts "foo\"bar"
# => foo"bar
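Applied to a small fragment of the HTML from the question, escaping looks roughly like the sketch below. The variable names double and single are only for illustration; the point is that both spellings produce exactly the same string.

```ruby
# Inside a double-quoted string, escape the double quotes.
double = "<p style=\"color:#999\">Tom's iPad</p>"

# Inside a single-quoted string, escape the single quote instead.
single = '<p style="color:#999">Tom\'s iPad</p>'

puts double
# Both literals describe the same characters.
puts double == single
```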
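For a long string like yours, though, escaping every quote gets tedious (and sometimes tricky). A better solution is to use a special kind of string called a "heredoc" (short for "here document", for what it's worth):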
str = <<-END_OF_HTML
<html> <meta content="text/html; charset=UTF-8"/> <body style='margin:20px'> <p>The following user has registered a device, click on the link below to review the user and make any changes if necessary.</p> <ul style='list-style-type:none; margin:25px 15px;'> <li><b>User name:</b> Test User</li> <li><b>User email:</b> test@abc.com</li> <li><b>Identifier:</b> abc123def132afd1213afas</li> <li><b>Description:</b> Tom's iPad</li> <li><b>Model:</b> iPad 3</li> <li><b>Platform:</b> </li> <li><b>App:</b> Test app name</li> <li><b>UserID:</b> </li> </ul> <p>Review user: https://cirrus.app47.com/users?search=test@abc.com</p> <hr style='height=2px; color:#aaa'/> <p>We hope you enjoy the app store experience!</p> <p style='font-size:18px; color:#999'>Powered by App47</p> <img src='https://cirrus.app47.com/notifications/562506219ac25b1033000904/img' alt=''/></body></html>
END_OF_HTML
html_doc = Nokogiri::HTML(str)
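The delimiter END_OF_HTML is arbitrary. You could use EOF or XYZZY or whatever suits your fancy, though it's a good idea to use something meaningful. (You'll notice that Stack Overflow's syntax highlighting has some trouble with heredocs; most code editors handle them just fine, though.)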
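To see that a heredoc really does leave both kinds of quotes alone, here is a minimal sketch using two lines borrowed from the HTML in the question. The variable name snippet is just for illustration.

```ruby
# A heredoc body may mix single and double quotes freely;
# only the terminator line (END_OF_HTML) ends the string.
snippet = <<-END_OF_HTML
<p style='font-size:18px'>Tom's iPad</p>
<meta content="text/html; charset=UTF-8"/>
END_OF_HTML

puts snippet.count("'")        # number of single quotes kept in the string
puts snippet.count('"')        # number of double quotes kept in the string
puts snippet.end_with?("\n")   # the body keeps its final newline
```

Note that the string ends with a newline, which is harmless when the text is handed to an HTML parser.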
You can make the Nokogiri call a bit more compact by passing the heredoc directly:
Nokogiri::HTML <<-END_OF_HTML
<html> <meta content="text/html; charset=UTF-8"/> <body style='margin:20px'> <p>The following user has registered a device, click on the link below to review the user and make any changes if necessary.</p> <ul style='list-style-type:none; margin:25px 15px;'> <li><b>User name:</b> Test User</li> <li><b>User email:</b> test@abc.com</li> <li><b>Identifier:</b> abc123def132afd1213afas</li> <li><b>Description:</b> Tom's iPad</li> <li><b>Model:</b> iPad 3</li> <li><b>Platform:</b> </li> <li><b>App:</b> Test app name</li> <li><b>UserID:</b> </li> </ul> <p>Review user: https://cirrus.app47.com/users?search=test@abc.com</p> <hr style='height=2px; color:#aaa'/> <p>We hope you enjoy the app store experience!</p> <p style='font-size:18px; color:#999'>Powered by App47</p> <img src='https://cirrus.app47.com/notifications/562506219ac25b1033000904/img' alt=''/></body></html>
END_OF_HTML
Or with parentheses (it looks a bit odd, but it works, and it is sometimes necessary):
Nokogiri::HTML(<<-END_OF_HTML)
<html> <meta content="text/html; charset=UTF-8"/> <body style='margin:20px'> <p>The following user has registered a device, click on the link below to review the user and make any changes if necessary.</p> <ul style='list-style-type:none; margin:25px 15px;'> <li><b>User name:</b> Test User</li> <li><b>User email:</b> test@abc.com</li> <li><b>Identifier:</b> abc123def132afd1213afas</li> <li><b>Description:</b> Tom's iPad</li> <li><b>Model:</b> iPad 3</li> <li><b>Platform:</b> </li> <li><b>App:</b> Test app name</li> <li><b>UserID:</b> </li> </ul> <p>Review user: https://cirrus.app47.com/users?search=test@abc.com</p> <hr style='height=2px; color:#aaa'/> <p>We hope you enjoy the app store experience!</p> <p style='font-size:18px; color:#999'>Powered by App47</p> <img src='https://cirrus.app47.com/notifications/562506219ac25b1033000904/img' alt=''/></body></html>
END_OF_HTML
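One situation where the parentheses matter is when you want to call a method on the result. Without them, the method call would bind to the heredoc string itself rather than to the return value. The sketch below uses a hypothetical helper, extract_items, as a plain-Ruby stand-in for the Nokogiri call so that it runs without any gems.

```ruby
# Hypothetical helper (not part of Nokogiri): returns the text of each <li>.
def extract_items(html)
  html.scan(%r{<li>(.*?)</li>}).flatten
end

# The heredoc body starts on the line after the statement,
# so .first is applied to the array that extract_items returns.
first = extract_items(<<-END_OF_HTML).first
<ul> <li>iPad 3</li> <li>Test app name</li> </ul>
END_OF_HTML

puts first
```

Written as extract_items <<-END_OF_HTML.first, Ruby would try to call first on the String and raise NoMethodError.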
You can read about heredocs, and the other ways of representing strings, in the "Literals" section of the Ruby documentation.
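As one example of those other ways, a %q literal lets you pick your own delimiters, which can be handy for a short fragment that contains both kinds of quotes. This is only a sketch, and the variable name fragment is made up for illustration.

```ruby
# %q(...) behaves like a single-quoted string, but is delimited by parentheses,
# so neither ' nor " needs escaping inside it.
fragment = %q(<li style="margin:0"><b>Description:</b> Tom's iPad</li>)

puts fragment
```

For something as long as the email in the question, a heredoc is probably still the more readable choice.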