
I want to extract the text inside each h4, the text related to that h4, and the links related to them (with XPath)

I want to extract some titles, texts, and links from the given string. The Python script looks like this:

from lxml.html import fromstring
import requests
import html.parser

url='''
<div class="topLinks">
<div class="hd left">
</div><div class="hd-middle middle">

        <h4>TTTTTTTTTTTTT</h4></div><div class="hd right"></div><div class="boxMiddle"><ul><li><a href="FullStory.aspx?gid=4&id=6516" title="1399/03/18" target="_blank">PPPPPPPPPPPPPPP<img class="new" src="images/new.png"></a></li><li><a href="http://register1.sanjesh.org/fanni99up" title="1399/03/11" target="_blank">CCCCCCCCCCCCC</a></li><li><a href="http://www6.sanjesh.org/download/fani99/FaniNote99.pdf" title="1399/03/11" target="_blank"> ZZZZZZZZ </a></li><li><a href="FullStory.aspx?gid=4&id=6509" title="1399/03/11" target="_blank">FFFFFF</a></li><li><a href="FullStory.aspx?gid=4&id=6498" title="1399/02/21" target="_blank">XXXXXXXXXXXXXX </a></li></ul></div><div class="boxBottom"></div></div>


<div class="topLinks"><div class="hd left_alter"></div><div class="hd-middle middle_alter">

<h4>CCCCCCCCCCCC</h4></div><div class="hd right_alter"></div><div class="boxMiddle_alter"><ul><li><a href="http://register1.sanjesh.org/rgempiactax99/" title="1399/03/18" target="_blank">GGGGGGGGGGGGGGGG <img class="new" src="images/new.png"></a></li><li><a href="FullStory.aspx?gid=11&id=6515" title="1399/03/18" target="_blank">FFFFFFFFF<img class="new" src="images/new.png"></a></li><li><a href="http://register2.sanjesh.org/RGKhanevadehConsult/" title="1399/03/12" target="_blank">HHHHHHHHH</a></li><li><a href="FullStory.aspx?gid=11&id=6512" title="1399/03/12" target="_blank">FFFFFFFF</a></li><li><a href="FullStory.aspx?gid=11&id=6505" title="1399/02/24" target="_blank">NNNNNNNNNNNNNNNNNNNNNNNNNN</a></li><li><a href="http://dl.sanjesh.org/NOETDownload/DownloadHandler.ashx?id=1271" title="1398/12/12" target="_blank">OOOOOOOOOOOO</a></li><li><a href="FullStory.aspx?gid=11&id=6480" title="1399/01/26" target="_blank">JJJJJJJ</a></li></ul></div><div class="boxBottom_alter"></div></div>

'''  

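tree = fromstring(url)  # `url` holds the HTML markup itself, not a URL

# Prints each matching <div> element object, not its heading text
titrs = tree.xpath("//div[@class='topLinks']")
for titr in titrs:
    print(titr)

# Text of every link in all topLinks blocks
texts = tree.xpath("//div[@class='topLinks']//a/text()")
for text in texts:
    print(text)

# href of every link in all topLinks blocks
links = tree.xpath("//div[@class='topLinks']//a/@href")
for link in links:
    print(link)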

The sample output is:

Strictly speaking, you need the following XPath expressions. Example for h4="TTTTTTTTTTTTT":

To retrieve the text:

//h4[.="TTTTTTTTTTTTT"]/following::div[@class="boxMiddle"]//text()

To retrieve the links:

//h4[.="TTTTTTTTTTTTT"]/following::div[@class="boxMiddle"]//@href
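
As a rough illustration, the two expressions above can be run through lxml as follows. This is only a sketch: the HTML variable is a shortened, hypothetical stand-in for the markup in the question, and the names box, texts, and links are chosen here for the example.

```python
from lxml.html import fromstring

# Shortened, hypothetical stand-in for the markup in the question.
HTML = """
<div>
<div class="topLinks">
  <div class="hd-middle middle"><h4>TTTTTTTTTTTTT</h4></div>
  <div class="boxMiddle"><ul>
    <li><a href="FullStory.aspx?gid=4&amp;id=6516">PPPPPPPPPPPPPPP</a></li>
    <li><a href="http://register1.sanjesh.org/fanni99up">CCCCCCCCCCCCC</a></li>
  </ul></div>
</div>
<div class="topLinks">
  <div class="hd-middle middle_alter"><h4>CCCCCCCCCCCC</h4></div>
  <div class="boxMiddle_alter"><ul>
    <li><a href="FullStory.aspx?gid=11&amp;id=6515">FFFFFFFFF</a></li>
  </ul></div>
</div>
</div>
"""

tree = fromstring(HTML)

# Common prefix of both expressions: the boxMiddle div that follows the h4.
box = '//h4[.="TTTTTTTTTTTTT"]/following::div[@class="boxMiddle"]'

# //text() also returns whitespace-only nodes, so strip and drop the empties.
texts = [t.strip() for t in tree.xpath(box + "//text()") if t.strip()]
links = tree.xpath(box + "//@href")

print(texts)
print(links)
```

Note that in the question's markup the second block uses class="boxMiddle_alter", so the class test would have to be adapted for the second h4.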

As a one-liner:

(//text()[normalize-space()]|//@href)[preceding::h4[1][.="TTTTTTTTTTTTT"]]
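
The expressions above target one hard-coded h4 value. If the goal is to pair every heading with its own link texts and URLs, one possible alternative (not part of the answer above) is to loop over the topLinks blocks and use relative XPath inside each one. The sketch below assumes a shortened, hypothetical version of the question's markup; the function name links_by_heading is made up for this example.

```python
from lxml.html import fromstring

# Shortened, hypothetical stand-in for the markup in the question.
HTML = """
<div>
<div class="topLinks">
  <div class="hd-middle middle"><h4>TTTTTTTTTTTTT</h4></div>
  <div class="boxMiddle"><ul>
    <li><a href="FullStory.aspx?gid=4&amp;id=6516">PPPPPPPPPPPPPPP</a></li>
    <li><a href="http://register1.sanjesh.org/fanni99up">CCCCCCCCCCCCC</a></li>
  </ul></div>
</div>
<div class="topLinks">
  <div class="hd-middle middle_alter"><h4>CCCCCCCCCCCC</h4></div>
  <div class="boxMiddle_alter"><ul>
    <li><a href="FullStory.aspx?gid=11&amp;id=6515">FFFFFFFFF</a></li>
  </ul></div>
</div>
</div>
"""


def links_by_heading(html):
    """Map each h4 title to a list of (link text, href) pairs from its block."""
    tree = fromstring(html)
    result = {}
    for block in tree.xpath('//div[@class="topLinks"]'):
        # A leading "." keeps the search inside the current block.
        title = block.xpath("string(.//h4)").strip()
        pairs = [
            (a.text_content().strip(), a.get("href"))
            for a in block.xpath(".//a[@href]")
        ]
        result[title] = pairs
    return result


for title, pairs in links_by_heading(HTML).items():
    print(title)
    for text, href in pairs:
        print("   ", text, "->", href)
```

Because each lookup is relative to its own block, this does not depend on whether the inner div is named boxMiddle or boxMiddle_alter.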

