Sanitizing an unknown number of descendants with Html Agility Pack doesn't work

The purpose of the code below is to accept strings from clients that might contain HTML, then remove styling, scripting and certain tags, and replace H tags with B tags.

    private IDictionary<string, string[]> Whitelist;
    public vacatures PostPutVacancy(vacancy vacancy)
    {
        //List of allowed tags
        Whitelist = new Dictionary<string, string[]> {
            { "p", null },
            { "ul", null },
            { "li", null },
            { "br", null },
            { "b", null },
            { "table", null },
            { "tr", null },
            { "th", null },
            { "td", null },
            { "strong", null }
        };

        foreach (var item in vacancy.GetType().GetProperties())
        {
            if (vacancy.GetType().GetProperty(item.Name).PropertyType.FullName.Contains("String"))
            {
                var value = item.GetValue(vacancy, null);
                if (value != null)
                {
                    item.SetValue(vacancy, CallSanitizers(item.GetValue(vacancy, null)));
                    var test1 = item.GetValue(vacancy);
                }
            }
        }

        return vacancy;
    }

    private List<string> hList = new List<string>
    {
        { "h1"},
        { "h2"},
        { "h3"},
        { "h4"},
        { "h5"},
        { "h6"}
    };

    private string CallSanitizers(object obj)//==Sanitize()
    {
        string str = obj.ToString();

        if (str != HttpUtility.HtmlEncode(str))
        {
            doc.LoadHtml(str);
            SanitizeNode(doc.DocumentNode);
            string test = doc.DocumentNode.WriteTo().Trim();
            return doc.DocumentNode.WriteTo().Trim();
        }
        else
        {
            return str;
        }
    }

    private void SanitizeChildren(HtmlNode parentNode)
    {
        for (int i = parentNode.ChildNodes.Count - 1; i >= 0; i--)
        {
            SanitizeNode(parentNode.ChildNodes[i]);
        }
    }

    private void SanitizeNode(HtmlNode node)
    {
        if (node.NodeType == HtmlNodeType.Element)
        {
            if (!Whitelist.ContainsKey(node.Name))
            {
                if (hList.Contains(node.Name))
                {
                    HtmlNode b = doc.CreateElement("b");
                    b.InnerHtml = node.InnerHtml;
                    node.ParentNode.ReplaceChild(b, node);
                }
                else
                {
                    node.ParentNode.RemoveChild(node, true);
                }
            }

            if (node.HasAttributes)
            {
                for (int i = node.Attributes.Count - 1; i >= 0; i--)
                {
                    HtmlAttribute currentAttribute = node.Attributes[i];
                    node.Attributes.Remove(currentAttribute);
                }
            }
        }

        if (node.HasChildNodes)
        {
            SanitizeChildren(node);
        }
    }

It works, but there is one problem: child nodes of child nodes don't get sanitized, see the example.

Input:

"Lorem ipsum<h1 style='font-size:38px;'><p style='font-size:38px;'>dolor sit</p></h1> amet <h1 style='font-size:38px;'><strong style='font-size:38px;'>consectetur adipiscing</strong></h1>"

Result:

"Lorem ipsum<b><p style='font-size:38px;'>dolor sit</p></b> amet <b style='font-size:38px;'><strong style='font-size:38px;'>consectetur adipiscing</strong></b>"

The problem must be due to not being able to place a child back into a changed parent, since the parent is no longer recognized after its tag type changes.

Does anybody know how to fix this?

Please post a comment if the question is unclear or not well formulated.

Thanks in advance.

This fixes it:

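    private string CallSanitizers(string str)
    {
        // Only parse strings that HTML encoding would change, i.e. strings that may contain markup.
        if (str != HttpUtility.HtmlEncode(str))
        {
            doc.LoadHtml(str);
            return Sanitizers();
        }
        else
        {
            return str;
        }
    }

    private string Sanitizers()
    {
        // Each pass snapshots the descendants with ToList() because the tree is modified while iterating.
        // 1. Remove <script> and <style> elements together with their content.
        doc.DocumentNode.Descendants().Where(l => l.Name == "script" || l.Name == "style").ToList().ForEach(l => l.Remove());
        // 2. Rename heading elements to <b> in place, so they stay attached to the tree.
        doc.DocumentNode.Descendants().Where(l => hList.Contains(l.Name)).ToList().ForEach(l => l.Name = "b");
        // 3. Strip all attributes from every node that has any.
        doc.DocumentNode.Descendants().Where(l => l.HasAttributes).ToList().ForEach(l => l.Attributes.RemoveAll());
        // 4. Unwrap elements that are not whitelisted, keeping their children.
        doc.DocumentNode.Descendants().Where(l => !Whitelist.Contains(l.Name) && l.NodeType == HtmlNodeType.Element).ToList().ForEach(l => l.ParentNode.RemoveChild(l, true));
        return doc.DocumentNode.WriteTo().Trim();
    }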

    // List of tags that are replaced by <b></b>
    private List<string> hList = new List<string>
    {
        { "h1"},
        { "h2"},
        { "h3"},
        { "h4"},
        { "h5"},
        { "h6"}
    };

    List<string> Whitelist = new List<string>
    {
        { "p"},
        { "ul"},
        { "li"},
        { "br"},
        { "b"},
        { "table"},
        { "tr"},
        { "th"},
        { "td"},
        { "strong"}
    };

The input is

"<head><script>alert('Hello!');</script></head><div><div><h1>Lorem ipsum </h1></div></div> <h1 style='font-size:38px;'><p style='font-size:38px;'>dolor </p></h1> sit <h1 style='font-size:38px;'><strong style='font-size:38px;'>amet</strong></h1>"

And the output is

"<b>Lorem ipsum</b> <b><p>dolor</p></b> sit <b><strong>amet</strong></b>"

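For comparison, the recursive approach from the question can probably be kept as well. A likely cause of the original behaviour is that after `ReplaceChild` the old heading node is detached, so the attribute stripping and the recursion that follow run on the detached node instead of the new `<b>`. The sketch below is an assumption-laden variant, not the code from the post: the class name `RecursiveSanitizerSketch` and the `Dropped` set are hypothetical, and it assumes the HtmlAgilityPack NuGet package is referenced. It visits children first and renames headings in place, so every node is still attached when it is processed.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

// Hypothetical helper class; the name is illustrative and not from the original post.
public static class RecursiveSanitizerSketch
{
    private static readonly HashSet<string> Headings =
        new HashSet<string> { "h1", "h2", "h3", "h4", "h5", "h6" };

    private static readonly HashSet<string> Whitelist =
        new HashSet<string> { "p", "ul", "li", "br", "b", "table", "tr", "th", "td", "strong" };

    // Elements that are removed together with their content (assumed set).
    private static readonly HashSet<string> Dropped =
        new HashSet<string> { "script", "style" };

    public static string Sanitize(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        SanitizeNode(doc.DocumentNode);
        return doc.DocumentNode.WriteTo().Trim();
    }

    private static void SanitizeNode(HtmlNode node)
    {
        // Visit children first, over a snapshot, because the tree changes during the walk.
        foreach (HtmlNode child in node.ChildNodes.ToList())
        {
            SanitizeNode(child);
        }

        // Text, comment and document nodes are left untouched.
        if (node.NodeType != HtmlNodeType.Element)
        {
            return;
        }

        // Drop script/style elements including their content.
        if (Dropped.Contains(node.Name))
        {
            node.Remove();
            return;
        }

        // Rename headings in place so the node stays attached to its parent.
        if (Headings.Contains(node.Name))
        {
            node.Name = "b";
        }

        node.Attributes.RemoveAll();

        // Unwrap anything not whitelisted; its already sanitized children move up.
        if (!Whitelist.Contains(node.Name))
        {
            node.ParentNode.RemoveChild(node, true);
        }
    }
}
```

Calling `RecursiveSanitizerSketch.Sanitize(input)` with the input string above should give a result without `style` attributes, `<div>` wrappers or the script content, though the exact whitespace may differ from the output shown.
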
