
Scraping HTML headers in R using the XML package

I'm trying to extract the first-level header (h1) from HTML code like this:

<div class="cuerpo-not"><div mod="2323">
<h1>Jamón 5 Jotas, champagne Bollinger y King Alexander III</h1>

I'm using the function xpathSApply() but it returns nothing:

xpathSApply(webpage, "//div[contains(@class, 'cuerpo-not')]/h1", xmlValue)
# list()

But when I use the same function without specifying the h1 in the XPath, it returns all of the text below that div in this format:

xpathSApply(webpage, "//div[contains(@class, 'cuerpo-not')]", xmlValue)

# ;\n\t\t}\n\t}\n\t\n\t\n\tenviarNoticiaLeida_Site( 6916437,16 ) ;\n//]]>Jamón 5 Jotas, champagne Bollinger y King Alexander III\n\n\n\tPor J.M. 

How can I extract the header as a string? The same code has worked on other web pages.

I think you just need one more / in your query down to h1, as in //h1 instead of /h1. With /h1 you only match an h1 that is a direct child of the div, but in your HTML the h1 sits inside a nested div, so you need the descendant axis //h1.

library(XML)

x <- '<div class="cuerpo-not"><div mod="2323">
<h1>Jamón 5 Jotas, champagne Bollinger y King Alexander III</h1>'

xpathSApply(htmlParse(x), "//div[contains(@class, 'cuerpo-not')]//h1", xmlValue)
# [1] "Jamón 5 Jotas, champagne Bollinger y King Alexander III"

