How to parse HTML to modify all words

This seems to be a recurring question, but here goes.

I have HTML which is well-formatted (it comes from a controlled source, so this can be taken as a given). I need to iterate through the contents of the body of the HTML, look for all the words in the document, perform some editing on those words, and save the results.

For example, I have the file sample.html, and I want to run it through my application and produce output.html, which is exactly the same as the original, plus my edits.

I found the following using HTMLAgilityPack, but all the examples I've found look at the attributes of the specified tags. Is there an easy modification that will look at the contents and perform my edits?

HtmlDocument HD = new HtmlDocument();
HD.Load (@"e:\test.htm");
var NoAltElements = HD.DocumentNode.SelectNodes("//img[not(@alt)]");
if (NoAltElements != null)
{
    foreach (HtmlNode HN in NoAltElements)
    {
       HN.Attributes.Append("alt", "no alt image");
    }
}

HD.Save(@"e:\test.htm");

The above looks for image tags with no alt attribute. I want to look at all tags in the <body> of the file and do something with their contents (which may involve creating new tags in the process).

A very simple sample of what I might do is take the following input:

<html>
    <head><title>Some Title</title></head>
    <body>
        <h1>This is my page</h1>
        <p>This is a paragraph of text.</p>
    </body>
</html>

and produce the following output, which takes every word and alternates between making it uppercase and putting it in italics:

<html>
    <head><title>Some Title</title></head>
    <body>
        <h1>THIS <em>is</em> MY <em>page</em></h1>
        <p>THIS <em>is</em> A <em>paragraph</em> OF <em>text</em>.</p>
    </body>
</html>

Ideas, suggestions?

Personally, given this setup, I'd work with the InnerText property of HtmlNode to find the words (probably with a Regex, so that punctuation is excluded rather than simply relying on spaces), and then use the InnerHtml property to make the changes through iterative calls to Regex.Replace (the instance method has an overload that lets you specify both the maximum number of replacements and the start position).

Processing code:

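// Placeholder filter: replace the predicate with whatever selects the nodes you want to edit.
// ToList() (System.Linq) materializes the matches so that editing InnerHtml does not disturb the enumeration.
List<HtmlNode> nodes = doc.DocumentNode.DescendantNodes().Where(n => n.InnerText == "something").ToList();
foreach (HtmlNode node in nodes)
{
    string[] words = getWords(node.InnerText);

    node.InnerHtml = processHtml(node.InnerHtml, words);
}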

Identify words (there's probably some slicker way to do this, but here's an initial stab):

private string[] getWords(string text)
{
    // \w+ matches runs of word characters; the verbatim string keeps the backslash intact.
    Regex reg = new Regex(@"\w+");
    MatchCollection matches = reg.Matches(text);
    List<string> words = new List<string>();
    foreach (Match match in matches)
    {
        words.Add(match.Value);
    }
    return words.ToArray();
}

Process the HTML:

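private string processHtml(string html, string[] words)
{
    int startPosition = 0;
    foreach (string word in words)
    {
        // Find the next occurrence of the word at or after the current position.
        int index = html.IndexOf(word, startPosition, StringComparison.Ordinal);
        if (index < 0)
        {
            continue; // InnerText and InnerHtml can differ (e.g. entities), so skip misses.
        }

        string replacement = alterWord(word);
        // Escape the word so it is matched literally; the evaluator inserts the replacement verbatim.
        Regex reg = new Regex(Regex.Escape(word));
        html = reg.Replace(html, m => replacement, 1, index);

        // Continue after the inserted text so that new markup is never matched again.
        startPosition = index + replacement.Length;
    }

    return html;
}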

I'll leave the details of alterWord() to you. :)
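For the uppercase/italics example in the question, one possible shape for that helper is sketched below. The class name WordAlternator and its counter field are hypothetical (they do not appear in the answer above); the only assumption is that words are passed in document order, one call per word.

```csharp
using System;

// Hypothetical helper: alternates between uppercasing a word and wrapping it in <em>.
public class WordAlternator
{
    // Number of words processed so far; decides which treatment the next word gets.
    private int wordIndex = 0;

    // Words 0, 2, 4, ... become uppercase; words 1, 3, 5, ... are wrapped in <em>.
    public string AlterWord(string word)
    {
        bool makeUpper = wordIndex % 2 == 0;
        wordIndex++;
        return makeUpper ? word.ToUpperInvariant() : "<em>" + word + "</em>";
    }
}
```

With a single instance shared across the loop, processHtml could call alternator.AlterWord(word) where it currently calls alterWord(word). Whether the counter should reset per element or run across the whole document is a design choice the question leaves open.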

Try .SelectNodes("//body//*"). That will get you all elements within any body element, at any depth.
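A variation on the same idea is to select the text nodes themselves with "//body//text()" and rewrite each one, so that tag names and attributes are never mistaken for words. The following is only a sketch under stated assumptions: the class TextNodeWordRewriter and its Rewrite method are hypothetical names, the uppercase/italics rule is taken from the question's example, and the word counter runs across the whole document.

```csharp
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // third-party NuGet package, as used in the question

// Hypothetical helper: rewrites every word found in text nodes under <body>.
public static class TextNodeWordRewriter
{
    // Matches either an HTML entity (left untouched) or a run of word characters.
    private static readonly Regex Token = new Regex(@"&#?\w+;|\w+");

    public static string Rewrite(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches.
        var textNodes = doc.DocumentNode.SelectNodes("//body//text()");
        if (textNodes == null)
        {
            return html;
        }

        int wordIndex = 0;
        // Copy the collection first, because the loop replaces nodes in the tree.
        foreach (HtmlNode textNode in textNodes.ToList())
        {
            HtmlNode parent = textNode.ParentNode;
            if (parent.Name == "script" || parent.Name == "style")
            {
                continue; // Do not touch code or stylesheets.
            }

            string original = textNode.InnerHtml; // Raw text of the node.
            string rewritten = Token.Replace(original, m =>
            {
                if (m.Value[0] == '&')
                {
                    return m.Value; // Keep entities such as &amp; as they are.
                }
                return wordIndex++ % 2 == 0
                    ? m.Value.ToUpperInvariant()
                    : "<em>" + m.Value + "</em>";
            });
            if (rewritten == original)
            {
                continue; // Whitespace or punctuation only.
            }

            // Parse the rewritten markup and splice its nodes in place of the text node.
            var fragment = new HtmlDocument();
            fragment.LoadHtml(rewritten);
            foreach (HtmlNode child in fragment.DocumentNode.ChildNodes.ToList())
            {
                parent.InsertBefore(child, textNode);
            }
            parent.RemoveChild(textNode);
        }

        return doc.DocumentNode.WriteTo();
    }
}
```

Calling TextNodeWordRewriter.Rewrite on the sample input from the question is intended to leave the head untouched and change only the text inside the h1 and p elements.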
