这可以改善吗？(Can this be improved? Scrubbing of dangerous html tags)

我发现，对于我认为非常重要的东西，关于如何处理这个问题的信息或库很少。

我在搜索时找到了这个。我真的不知道黑客可以尝试插入危险标签的百万种方式。

我有一个丰富的HTML编辑器，所以我需要保留非危险的标签，但删除不好的标签。

这个脚本错过了什么吗？

它使用html敏捷包。

public string ScrubHTML(string html) { HtmlDocument doc = new HtmlDocument(); doc.LoadHtml(html); //Remove potentially harmful elements HtmlNodeCollection nc = doc.DocumentNode.SelectNodes("//script|//link|//iframe|//frameset|//frame|//applet|//object|//embed"); if (nc != null) { foreach (HtmlNode node in nc) { node.ParentNode.RemoveChild(node, false); } } //remove hrefs to java/j/vbscript URLs nc = doc.DocumentNode.SelectNodes("//a[starts-with(translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'javascript')]|//a[starts-with(translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'jscript')]|//a[starts-with(translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'vbscript')]"); if (nc != null) { foreach (HtmlNode node in nc) { node.SetAttributeValue("href", "#"); } } //remove img with refs to java/j/vbscript URLs nc = doc.DocumentNode.SelectNodes("//img[starts-with(translate(@src, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'javascript')]|//img[starts-with(translate(@src, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'jscript')]|//img[starts-with(translate(@src, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'vbscript')]"); if (nc != null) { foreach (HtmlNode node in nc) { node.SetAttributeValue("src", "#"); } } //remove on<Event> handlers from all tags nc = doc.DocumentNode.SelectNodes("//*[@onclick or @onmouseover or @onfocus or @onblur or @onmouseout or @ondoubleclick or @onload or @onunload]"); if (nc != null) { foreach (HtmlNode node in nc) { node.Attributes.Remove("onFocus"); node.Attributes.Remove("onBlur"); node.Attributes.Remove("onClick"); node.Attributes.Remove("onMouseOver"); node.Attributes.Remove("onMouseOut"); node.Attributes.Remove("onDoubleClick"); node.Attributes.Remove("onLoad"); node.Attributes.Remove("onUnload"); } } // remove any style attributes that contain the word expression (IE evaluates this as script) nc = doc.DocumentNode.SelectNodes("//*[contains(translate(@style, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'expression')]"); if (nc != null) { foreach (HtmlNode node in nc) { node.Attributes.Remove("stYle"); } } return doc.DocumentNode.WriteTo(); }

编辑

2人建议使用白名单。我实际上喜欢白名单的想法，但从来没有真正做到这一点，因为没有人能真正告诉我如何在C＃中做到这一点我甚至无法找到如何在c＃中做到的教程（我最后一次看。我会再看一遍）。

你怎么做白名单？它只是一个列表集合吗？

你如何实际解析出所有的html标签，脚本标签和其他所有标签？

一旦你有标签，你如何确定允许哪些？将它们与列表集合进行比较？但是如果内容进来并且有100个标签并且你有50个允许，会发生什么。您需要将这100个标记中的每一个与50个允许的标记进行比较。这是相当多的经历，可能会很慢。

找到无效标签后如何删除它？如果发现一个标签无效，我真的不想拒绝整套文本。我宁愿删除并插入其余部分。

我应该使用html敏捷包吗？

I been finding that for something that I consider pretty import there is very little information or libraries on how to deal with this problem.

I found this while searching. I really don't know all the million ways that a hacker could try to insert the dangerous tags.

I have a rich html editor so I need to keep non dangerous tags but strip out bad ones.

So is this script missing anything?

It uses html agility pack.

public string ScrubHTML(string html) { HtmlDocument doc = new HtmlDocument(); doc.LoadHtml(html); //Remove potentially harmful elements HtmlNodeCollection nc = doc.DocumentNode.SelectNodes("//script|//link|//iframe|//frameset|//frame|//applet|//object|//embed"); if (nc != null) { foreach (HtmlNode node in nc) { node.ParentNode.RemoveChild(node, false); } } //remove hrefs to java/j/vbscript URLs nc = doc.DocumentNode.SelectNodes("//a[starts-with(translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'javascript')]|//a[starts-with(translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'jscript')]|//a[starts-with(translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'vbscript')]"); if (nc != null) { foreach (HtmlNode node in nc) { node.SetAttributeValue("href", "#"); } } //remove img with refs to java/j/vbscript URLs nc = doc.DocumentNode.SelectNodes("//img[starts-with(translate(@src, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'javascript')]|//img[starts-with(translate(@src, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'jscript')]|//img[starts-with(translate(@src, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'vbscript')]"); if (nc != null) { foreach (HtmlNode node in nc) { node.SetAttributeValue("src", "#"); } } //remove on<Event> handlers from all tags nc = doc.DocumentNode.SelectNodes("//*[@onclick or @onmouseover or @onfocus or @onblur or @onmouseout or @ondoubleclick or @onload or @onunload]"); if (nc != null) { foreach (HtmlNode node in nc) { node.Attributes.Remove("onFocus"); node.Attributes.Remove("onBlur"); node.Attributes.Remove("onClick"); node.Attributes.Remove("onMouseOver"); node.Attributes.Remove("onMouseOut"); node.Attributes.Remove("onDoubleClick"); node.Attributes.Remove("onLoad"); node.Attributes.Remove("onUnload"); } } // remove any style attributes that contain the word expression (IE evaluates this as script) nc = doc.DocumentNode.SelectNodes("//*[contains(translate(@style, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'expression')]"); if (nc != null) { foreach (HtmlNode node in nc) { node.Attributes.Remove("stYle"); } } return doc.DocumentNode.WriteTo(); }

Edit

2 people have suggested whitelisting. I actually like the idea of whitelisting but never actually did it because no one can actually tell me how to do it in C# and I can't even really find tutorials for how to do it in c#(the last time I looked. I will check it out again).

How do you make a white list? Is it just a list collection?

How do you actual parse out all html tags, script tags and every other tag?

Once you have the tags how do you determine which ones are allowed? Compare them to you list collection? But what happens if the content is coming in and has like 100 tags and you have 50 allowed. You got to compare each of those 100 tag by 50 allowed tags. Thats quite a bit to go through and could be slow.

Once you found a invalid tag how do you remove it? I don't really want to reject a whole set of text if one tag was found to be invalid. I rather remove and insert the rest.

Should I be using html agility pack?

最满意答案

是的，我已经看到你错过了mousedown，onmouseup，onchange，onsubmit等。这是为什么应该为标签和属性使用白名单的一部分。即使你现在有一个完美的黑名单（非常不可能），标签和属性也经常被添加。

请参阅为何使用白名单进行HTML清理？。

Yes, I already see you're missing onmousedown, onmouseup, onchange, onsubmit, etc. This is part of why should use whitelisting for both tags and attributes. Even if you had a perfect blacklist now (very unlikely), tags and attributes are added fairly often.

See Why use a whitelist for HTML sanitizing?.

更多推荐