Getting the HTML source through the WebBrowser control in C#

I tried to get the HTML source in the following way:

webBrowser1.Document.Body.OuterHtml;

but it does not work. For example, if the original HTML source is:

<html>
<body>
    <div>
        <ul>
            <li>
                <h3>
                    Manufacturer</h3>
            </li>
            <li><a href="/4566-6501_7-0.html?filter=1000036_3808675_100021_10194772_">Sony </a>(44)</li>
            <li><a href="/4566-6501_7-0.html?filter=1000036_108496_100021_10194772_">Nikon </a>(19)</li>
            <li><a href="/4566-6501_7-0.html?filter=1000036_3808726_100021_10194772_">Panasonic </a>(37)</li>
            <li><a href="/4566-6501_7-0.html?filter=1000036_3808769_100021_10194772_">Canon </a>(29)</li>
            <li><a href="/4566-6501_7-0.html?filter=1000036_2913388_100021_10194772_">Olympus </a>(21)</li>
            <li class="seeAll"><a href="/4566-6501_7-0.html?sa=1000036&filter=100021_10194772_" class="readMore">See all manufacturers </a></li>
        </ul>
    </div>
</body>
</html>

then the output of webBrowser1.Document.Body.OuterHtml is:

<body>
    <div>
        <ul>
            <li>
                <h3>
                    Manufacturer</h3>
                <li><a href="/4566-6501_7-0.html?filter=1000036_3808675_100021_10194772_">Sony </a>(44)
                    <li><a href="/4566-6501_7-0.html?filter=1000036_108496_100021_10194772_">Nikon </a>(19)
                        <li><a href="/4566-6501_7-0.html?filter=1000036_3808726_100021_10194772_">Panasonic
                        </a>(37)
                            <li><a href="/4566-6501_7-0.html?filter=1000036_3808769_100021_10194772_">Canon </a>
                                (29)
                                <li><a href="/4566-6501_7-0.html?filter=1000036_2913388_100021_10194772_">Olympus </a>
                                    (21)
                                    <li class="seeAll"><a class="readMore" href="/4566-6501_7-0.html?sa=1000036&amp;filter=100021_10194772_">
                                        See all manufacturers </a></li>
        </ul>
    </div>
</body>

As you can see, many </li> end tags are lost.

Is there a way to get the HTML source from the WebBrowser control correctly? Note that in my application I use WebBrowser to add coordinate information to every node, and I then want to output the HTML source with that coordinate information included as attributes of the nodes.

Can anybody help?

Try using the DocumentText or DocumentStream property.
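A minimal sketch of reading both properties, assuming a Windows Forms project. The form class name and the URL are placeholders that do not come from the thread, and UTF-8 is an assumed encoding for the stream. As another answer in this thread implies, these properties give the text that was loaded rather than later DOM changes, so attributes added through the DOM would probably not show up here.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Text;
using System.Windows.Forms;

// Hypothetical form used only for illustration.
public class SourceForm : Form
{
    private readonly WebBrowser webBrowser1 = new WebBrowser { Dock = DockStyle.Fill };

    public SourceForm()
    {
        Controls.Add(webBrowser1);
        // Read the document only after loading has finished.
        webBrowser1.DocumentCompleted += OnDocumentCompleted;
        webBrowser1.Navigate("http://example.com/"); // placeholder URL
    }

    private void OnDocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        // The page source as a string.
        string viaText = webBrowser1.DocumentText;

        // The same content as a stream; UTF-8 is an assumption about the page.
        string viaStream;
        using (var reader = new StreamReader(webBrowser1.DocumentStream, Encoding.UTF8))
        {
            viaStream = reader.ReadToEnd();
        }

        Debug.WriteLine(viaText);
        Debug.WriteLine(viaStream);
    }

    [STAThread]
    private static void Main()
    {
        Application.Run(new SourceForm());
    }
}
```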

Thank you all. My final solution has two steps. First, use Body.OuterHtml to get the HTML source. Because Body.OuterHtml may omit the end tags for <li> and <td>, the second step is to run the result through tidy to repair it. After that, the HTML source comes out without errors.
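The answer does not say how tidy was invoked. One possible way, sketched here under the assumption that the HTML Tidy command-line executable is installed and reachable as "tidy" on the PATH, is to pipe the markup through it. The class and method names are hypothetical, and character encoding is not handled beyond the process defaults.

```csharp
using System.Diagnostics;

// Hypothetical helper; not taken from the original answer.
public static class HtmlRepair
{
    // Sends markup to an external "tidy" process and returns its output.
    public static string RunTidy(string html)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = "tidy",                      // assumed to be on the PATH
            Arguments = "-q --show-warnings no",    // quiet mode, suppress warnings
            RedirectStandardInput = true,
            RedirectStandardOutput = true,
            UseShellExecute = false,                // required for redirection
            CreateNoWindow = true
        };

        using (Process process = Process.Start(startInfo))
        {
            // Write the whole document, then close stdin so tidy sees end of input.
            process.StandardInput.Write(html);
            process.StandardInput.Close();

            string repaired = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return repaired;
        }
    }
}
```

A call would then look like HtmlRepair.RunTidy(webBrowser1.Document.Body.OuterHtml). Note that tidy normally emits a complete document, so the result will contain more than the original body element.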

Have you tried WebBrowser1.DocumentText?

If you want to grab the entire HTML source of the WebBrowser control, use WebBrowser1.Document.GetElementsByTagName("HTML").Item(0).OuterHtml. This of course assumes the HTML is properly formed and the HTML tag exists. To narrow it down to just the body, change the HTML tag to the BODY tag. This way you capture any and all changes made after DocumentText was set. Sorry, I'm a VB guy, convert as needed ;)

In C#:

var document = webBrowser1.Document.GetElementsByTagName("HTML");
string html = document[0].OuterHtml;

This way you get all of the markup currently held by webBrowser1.
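To connect this to the question's goal of writing coordinates into the markup, here is a rough sketch that sets attributes on every element and then serializes the HTML element. The class name, method name, and the data-x, data-y, data-w, and data-h attribute names are made up for illustration. OffsetRectangle is relative to each element's offset parent, so absolute page coordinates would need extra work, and whether custom attributes appear in OuterHtml may depend on the rendering mode. The end-tag issue described in the question can still apply to the result.

```csharp
using System.Drawing;
using System.Windows.Forms;

// Hypothetical helper; attribute names are illustrative only.
public static class CoordinateTagger
{
    // Call this after the browser's DocumentCompleted event has fired.
    public static string TagAndSerialize(WebBrowser browser)
    {
        HtmlDocument document = browser.Document;
        if (document == null)
        {
            return string.Empty;
        }

        foreach (HtmlElement element in document.All)
        {
            // Position and size relative to the element's offset parent.
            Rectangle r = element.OffsetRectangle;
            element.SetAttribute("data-x", r.X.ToString());
            element.SetAttribute("data-y", r.Y.ToString());
            element.SetAttribute("data-w", r.Width.ToString());
            element.SetAttribute("data-h", r.Height.ToString());
        }

        // Serialize the live DOM, including the attributes set above.
        HtmlElementCollection roots = document.GetElementsByTagName("HTML");
        return roots.Count > 0 ? roots[0].OuterHtml : string.Empty;
    }
}
```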

Have a look at this: WebBrowser on MSDN.

Alternatively, you could use WebClient.DownloadString from System.Net (it also has WebClient.DownloadStringAsync ...). Here is the description: WebClient on MSDN.
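A short sketch of the WebClient route, with a placeholder URL and hypothetical class and method names. This downloads the server's response independently of the WebBrowser control, so no scripts run and nothing changed through the DOM (such as added coordinate attributes) is included.

```csharp
using System;
using System.Net;

// Hypothetical helper for illustration.
public static class RawSourceFetcher
{
    // Downloads the raw markup at the given address as a string.
    public static string Fetch(string url)
    {
        using (var client = new WebClient())
        {
            return client.DownloadString(url);
        }
    }

    public static void Main()
    {
        // Placeholder URL; replace with the page you actually need.
        Console.WriteLine(Fetch("http://example.com/"));
    }
}
```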
