How to ignore the <br> tag in XPath

Question

myList = tree.xpath('//div[@id="RM1127"]/div[@class="moreInfo"]/text()')

I'm scraping a website for the text inside this div. It works fine, but in this one div there's a <br> tag, so myList returns the text of that div as two separate elements.

<div class="moreInfo" style="display:none;font-weight:normal; font-size:14px; margin-top:6px; padding:0px 0 0 30px;">

    Over ½ lb. of jumbo shrimp fried golden crisp in a…

    <br></br>

    coleslaw, cocktail & Tartar sauce. …

</div>

The HTML looks like this. Instead of getting 'Over ½ lb. of jumbo shrimp fried golden crisp in a' and 'coleslaw, cocktail & Tartar sauce' together as one element, I'm getting them as two separate elements in a list.
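A minimal reproduction of the behaviour described above, assuming lxml is installed. The markup is a simplified, hypothetical stand-in for the real page, and my_list is just an illustrative name. It suggests why two elements come back: /text() selects each text-node child separately, and the <br> splits the div's text into two nodes.

```python
import lxml.html

# Simplified stand-in for the page's markup (hypothetical, for illustration only)
html = """
<div id="RM1127">
  <div class="moreInfo">Over 1/2 lb. of jumbo shrimp<br>coleslaw, cocktail &amp; Tartar sauce</div>
</div>
"""

tree = lxml.html.fromstring(html)

# /text() returns one entry per text-node child; the <br> sits between two of them
my_list = tree.xpath('//div[@id="RM1127"]/div[@class="moreInfo"]/text()')

print(len(my_list))
for piece in my_list:
    print(repr(piece))
```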

Answer 1

Using Python with lxml, just call HtmlElement.text_content(). Take a look at this full example:

import lxml.html

html  ="""<!DOCTYPE html>
<html>
<body>
    <div id="RM1127">
        <div class="moreInfo" style="">

            Over 1/2 lb. of jumbo shrimp fried golden crisp in a...

            <br>

            coleslaw, cocktail & Tartar sauce

        </div>
    </div>
</body>
</html>"""

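# Parse the document and select the target div (without a trailing /text() step)
dom = lxml.html.fromstring(html)
tags = dom.xpath('//div[@id="RM1127"]/div[@class="moreInfo"]')

# text_content() joins the text on both sides of the <br> into a single string
for e in tags:
    print(e.text_content())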

From the documentation:

lxml.html.HtmlElement.text_content():
Returns the text content of the element, including the text content of its children, with no markup.

Answer 2

Try the following XPath expression:

string(//div[@id="RM1127"]/div[@class="moreInfo"])

When applied to a node-set, the XPath string() function returns the string-value of the node that is first in document order. The string-value of an element node is the concatenation of the string-values of all its text node descendants.
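As a rough sketch of how this expression might be run from Python, assuming lxml is installed and using a simplified, hypothetical copy of the markup: the string() call follows the answer above, while normalize-space() is an addition here, not part of the answer. It is a standard XPath 1.0 function that also trims and collapses the whitespace left around the <br>.

```python
import lxml.html

# Simplified stand-in for the page's markup (hypothetical, for illustration only)
html = """
<div id="RM1127">
  <div class="moreInfo">
    Over 1/2 lb. of jumbo shrimp
    <br>
    coleslaw, cocktail &amp; Tartar sauce
  </div>
</div>
"""

tree = lxml.html.fromstring(html)

# string(...) yields a single Python string instead of a list of text nodes;
# the original line breaks and indentation are still inside it
raw = tree.xpath('string(//div[@id="RM1127"]/div[@class="moreInfo"])')

# normalize-space(...) (an addition, not from the answer) strips the ends
# and collapses each run of whitespace to one space
clean = tree.xpath('normalize-space(//div[@id="RM1127"]/div[@class="moreInfo"])')

print(repr(raw))
print(clean)
```

If several divs match, string() only considers the first one in document order, so looping over the matched elements may be the safer choice when more than one result is expected.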

2 answers

Answer 1
0 2015-10-06 17:28:03

Answer 2
0 2015-10-06 20:30:22
