
With XML query using R and XPath 1.0, unable to extract specific text

I would greatly appreciate guidance on how to extract the names of the four cities where this firm has offices. Firebug shows each name greyed out inside a cufontext element, such as <cufontext>MEMPHIS</cufontext>, with MEMPHIS in grey. BTW, I don't mind getting some extraneous text back, such as the state or address. Three of my failed attempts are shown below.

library(XML)

doc <- htmlTreeParse('http://www.lewisthomason.com/locations/', useInternal = TRUE, asText = TRUE)               
xpathSApply(doc, "//div[@id = 'the_content']", xmlValue, trim = TRUE)  # returns list()
xpathSApply(doc, "//div[@id = 'the_content']/div/h3//cufon", xmlValue, trim = TRUE) # returns NULL
xpathSApply(doc, "//div[@id = 'the_content']//cufon[@class = 'cufon cufon-canvas']", xmlValue, trim = TRUE)  # returns NULL

Thank you very much.

It turned out that the HTML source actually looks roughly like this (formatted and simplified):

<div id="the_content">
    <div class="one_fourth">
        <h3>KNOXVILLE</h3>
        <p>One Centre Square, Fifth Floor<br>
        .....
    </div>
    ....
</div>

but browsers (I tried Chrome and Firefox) somehow transform it into a slightly different structure, while the parser does not perform that transformation. This simpler XPath worked fine for me:

//div[@id = 'the_content']/div/h3
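
For reference, here is a minimal sketch of that XPath in R, run against an inline copy of the simplified markup rather than the live page. The second city block and its address are placeholders invented for illustration, and htmlParse is used as the internal-nodes variant of htmlTreeParse. As far as I can tell from the XML package documentation, asText = TRUE tells the parser to treat its first argument as the markup itself, so it is appropriate here for a string but probably should be dropped when passing a URL as in the question.

```r
library(XML)

# Simplified stand-in for the page source. The MEMPHIS block and its
# address are hypothetical placeholders, not the real page content.
html <- '<div id="the_content">
  <div class="one_fourth">
    <h3>KNOXVILLE</h3>
    <p>One Centre Square, Fifth Floor<br></p>
  </div>
  <div class="one_fourth">
    <h3>MEMPHIS</h3>
    <p>Placeholder address<br></p>
  </div>
</div>'

# asText = TRUE is correct here because the first argument is markup, not a path or URL
doc <- htmlParse(html, asText = TRUE)

# Select the h3 headings directly; the raw source contains no cufon elements
cities <- xpathSApply(doc, "//div[@id = 'the_content']/div/h3", xmlValue, trim = TRUE)
print(cities)
```

Against the real site, the same xpathSApply call should apply once the document is parsed from the URL itself, assuming the page still has the structure shown above.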
