XML: Searching elements for specific text using C#

I'm trying to get a list of PDF links from different websites. First I use the WebClient class to download the page source, then I use SgmlReader to convert the HTML to XML. For one particular site, I get a tag that looks like this:

<p><a href="pub/1985_to_1997_Board_Action_Summary.pdf">1985 to 1997 Board Action Summary</a></p>

I need to grab all the links that contain ".pdf". Obviously not all websites are laid out the same, so just searching for a <p> tag won't be flexible enough. I'd rather not use LINQ, but I will if I have to. Thanks in advance.

LINQ makes this easy...

// doc is an XDocument; needs System.Linq and System.Xml.Linq
var hrefs = doc.Root.Descendants("a")
    .Where(a => a.Attribute("href").Value.ToUpper().EndsWith(".PDF"))
    .Select(a => a.Attribute("href"));

Away you go! (Note: I wrote this from memory, so you might have to fix it somewhat.)

This will break on <a/> tags that don't have an href (named anchors), because there is no attribute to read a value from, but you can surely fix that...
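One way to handle the missing-href case is sketched below. This is only an illustration, not the answerer's code: the class name PdfLinkFinder, the method FindPdfLinks, and the sample markup are made up for the example. It assumes the converted page is loaded into an XDocument. Casting an XAttribute to string gives null when the attribute is absent, so anchors without an href are skipped instead of throwing.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, named for this example only.
public static class PdfLinkFinder
{
    // Returns the href value of every <a> element whose href ends with ".pdf".
    public static List<string> FindPdfLinks(XDocument doc)
    {
        return doc.Descendants()
            // Compare the local name only, ignoring case, so a namespaced
            // or upper-cased document still matches.
            .Where(e => string.Equals(e.Name.LocalName, "a", StringComparison.OrdinalIgnoreCase))
            // The explicit cast yields null when there is no href attribute.
            .Select(e => (string)e.Attribute("href"))
            .Where(h => h != null && h.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase))
            .ToList();
    }

    public static void Main()
    {
        // Sample input (assumed): one PDF link, one named anchor, one HTML link.
        var doc = XDocument.Parse(
            "<html><body>" +
            "<p><a href=\"pub/1985_to_1997_Board_Action_Summary.pdf\">1985 to 1997 Board Action Summary</a></p>" +
            "<div><a name=\"top\">no href</a><a href=\"index.html\">Home</a></div>" +
            "</body></html>");

        foreach (var link in FindPdfLinks(doc))
            Console.WriteLine(link);
    }
}
```

Because it searches all descendants rather than a fixed parent such as <p>, it does not depend on how a particular site lays out its links.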

I think you have two options here. If you need only the links, you can use regular expressions to find strings ending with .pdf. If you need to manipulate the XML structure or get other values from the XML, it would be better to use XmlDocument with an XPath query to find the nodes that contain a link to a PDF file. Using LINQ to XML just reduces the number of lines of code you have to write.
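Both options might look roughly like the sketch below. The names PdfLinkAlternatives, WithXPath, and WithRegex are invented for this example, and neither the XPath expression nor the regex pattern comes from the answer. The XPath version assumes the converted XML has no default namespace; XPath 1.0 has no ends-with() or lower-case(), so it emulates them with substring() and translate(). The regex version assumes href values are quoted.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Xml;

// Hypothetical helper, named for this example only.
public static class PdfLinkAlternatives
{
    // Selects the href attribute of every <a> whose href ends in ".pdf"
    // (any letter case). Assumes elements are not in a namespace.
    private const string PdfHrefXPath =
        "//a[substring(translate(@href, 'PDF', 'pdf'), string-length(@href) - 3) = '.pdf']/@href";

    public static List<string> WithXPath(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        var links = new List<string>();
        foreach (XmlNode href in doc.SelectNodes(PdfHrefXPath))
            links.Add(href.Value);
        return links;
    }

    // Captures quoted href values that end in ".pdf"; works on raw markup.
    private static readonly Regex PdfHref = new Regex(
        "href\\s*=\\s*[\"']([^\"']+\\.pdf)[\"']", RegexOptions.IgnoreCase);

    public static List<string> WithRegex(string markup)
    {
        var links = new List<string>();
        foreach (Match m in PdfHref.Matches(markup))
            links.Add(m.Groups[1].Value);
        return links;
    }

    public static void Main()
    {
        const string xml =
            "<html><body>" +
            "<p><a href=\"pub/1985_to_1997_Board_Action_Summary.pdf\">Summary</a></p>" +
            "<a name=\"top\">no href</a><a href=\"index.html\">Home</a>" +
            "</body></html>";

        foreach (var link in WithXPath(xml)) Console.WriteLine("XPath: " + link);
        foreach (var link in WithRegex(xml)) Console.WriteLine("Regex: " + link);
    }
}
```

The regex route can run on the downloaded page source directly, before any conversion to XML. It will also match href text inside comments or scripts, which the XPath route will not.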


 