How to get the text of a webpage?

Question

Is there any way to get only the text (source) of a webpage? I tried two approaches:

Using WebRequest
        WebRequest myWebRequest = WebRequest.Create("http://www.website.com/");
        WebResponse myWebResponse = myWebRequest.GetResponse();
        Stream ReceiveStream = myWebResponse.GetResponseStream();
        Encoding encode = System.Text.Encoding.GetEncoding("utf-8");
        StreamReader readStream = new StreamReader(ReceiveStream, encode);
        string html = readStream.ReadToEnd();
        readStream.Close();
        myWebResponse.Close();

This approach works fine if the requested webpage is static. However, if the content of the requested webpage is generated only when the page loads, I do not get the proper source content.
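The WebRequest snippet in this approach hard-codes UTF-8, which can garble pages served in another charset. A minimal sketch of one way to choose the encoding from the response's Content-Type header follows. The class and method names (`ResponseEncoding`, `FromContentType`) are hypothetical, and falling back to UTF-8 is an assumption rather than anything stated in the question.

```csharp
using System;
using System.Net.Mime;
using System.Text;

// Hypothetical helper, not part of the original question.
static class ResponseEncoding
{
    // Returns the encoding named by the charset parameter of a Content-Type
    // header value, or UTF-8 when the header is missing, malformed, or names
    // a charset this runtime does not know (assumed fallback).
    public static Encoding FromContentType(string contentType)
    {
        if (string.IsNullOrWhiteSpace(contentType))
        {
            return Encoding.UTF8;
        }

        try
        {
            string charset = new ContentType(contentType).CharSet;
            return string.IsNullOrEmpty(charset)
                ? Encoding.UTF8
                : Encoding.GetEncoding(charset);
        }
        catch (FormatException)
        {
            // The header value could not be parsed.
            return Encoding.UTF8;
        }
        catch (ArgumentException)
        {
            // The charset name is not supported.
            return Encoding.UTF8;
        }
    }
}
```

With this in place, the reader could be created as `new StreamReader(ReceiveStream, ResponseEncoding.FromContentType(myWebResponse.ContentType))`. This only fixes decoding; it does not help with content that is generated by scripts after the page loads.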

Using Web Browser
            WebBrowser browser = new WebBrowser();
            browser.ScrollBarsEnabled = false;
            browser.ScriptErrorsSuppressed = true;
            browser.Navigate(new Uri("http://www.website.com/"));

This approach gives the proper source content every time, but it takes a lot of time and also shows popups. In addition, some websites show a browser-version popup, and some even open in IE (which I don't want).

My final objective is to get the source content of the webpage as fast as possible, without opening the browser or getting any popups. Please let me know of any approach I can use to achieve this. Thanks.

Answer 1

You appear to want some sort of browser functionality without the actual browser.

Many tools exist for this, the most prominent being Selenium. Coupled with PhantomJS, it lets you launch a fully functional browser without the overhead of a visible browser window.

You'd then be able to do something like this (Selenium example):

IWebDriver driver = new PhantomJSDriver();
driver.Navigate().GoToUrl("http://www.website.com");
string fullSource = driver.PageSource;

With plain HttpWebRequest calls or the WebBrowser control, you soon hit problems when pages load slowly or are so JavaScript-heavy that you don't get the expected result.

Answer 2

Decided to post my code. This works for my ASP and PHP dynamic pages. You can modify the code to suit your needs; it was written to crawl a full ASP or PHP website, and these methods were called to fetch the content.

 class WebReader
    {
        private string onlineText = "";
        public string getOnlineText()
        {
            return onlineText;
        }
        public WebReader(String strLocation,String strFile){
            Stream strm = null;
            StreamReader MyReader = null;

            try
            {
                // Download the web page.
                strm = GetURLStream("http://" + strLocation +"/" + strFile);
                if (strm != null)
                {
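                    // We have a stream, so attach a character reader.
                    char[] strBuffer = new char[3000];

                    MyReader = new StreamReader(strm);

                    // Read up to 3,000 characters at a time until we have the whole page.
                    System.Text.StringBuilder sb = new System.Text.StringBuilder();
                    int charsRead;
                    while ((charsRead = MyReader.Read(strBuffer, 0, strBuffer.Length)) > 0)
                    {
                        // Append only the characters actually read in this pass;
                        // appending the whole buffer would add stale or empty characters.
                        sb.Append(strBuffer, 0, charsRead);
                    }
                    onlineText = sb.ToString();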
                }
            }
            catch (Exception excep)
            {
                Console.WriteLine("Error: " + excep.Message);
            }
            finally
            {
                // Clean up and close the stream.
                if (MyReader != null)
                {
                    MyReader.Close();
                }

                if (strm != null)
                {
                    strm.Close();
                }
            }
        }
        public Stream GetURLStream(string strURL)
        {
            System.Net.WebRequest objRequest;
            System.Net.WebResponse objResponse = null;
            Stream objStreamReceive;

            try
            {
                objRequest = System.Net.WebRequest.Create(strURL);
                objRequest.Timeout = 5000;


                objResponse = objRequest.GetResponse();
                objStreamReceive = objResponse.GetResponseStream();

                return objStreamReceive;
            }
            catch (Exception excep)
            {
                Console.WriteLine(excep.Message);
                // The response is still null if the request itself failed.
                if (objResponse != null)
                {
                    objResponse.Close();
                }

                return null;
            }
        }

        public void ReadWriteStream(Stream readStream, Stream writeStream, frmUpdater _MyParent, int CurrentVersion, long BytesCompleted)
        {
            int Length = 2048;
            Byte[] buffer = new Byte[Length];
            int bytesRead = readStream.Read(buffer, 0, Length);
            // write the required bytes
            while (bytesRead > 0)
            {
                writeStream.Write(buffer, 0, bytesRead);
                bytesRead = readStream.Read(buffer, 0, Length);
                _MyParent.RefreshDownloadLabels(CurrentVersion,BytesCompleted + writeStream.Position);
                Application.DoEvents();
            }
            readStream.Close();
            writeStream.Close();
        }
    }
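The chunked read in `WebReader` can be tried without a network connection. The following is a minimal sketch that reads any stream as text in fixed-size chunks and appends only the characters each `Read` call actually returned, since appending the whole buffer on every pass would pad the result with leftover characters. The names `ChunkedReader` and `ReadAll` and the explicit `encoding` parameter are assumptions made for illustration; they are not part of the answer's code.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper illustrating the chunked read loop.
static class ChunkedReader
{
    // Reads the entire stream as text, at most chunkSize characters per pass.
    // The stream is closed when the reader is disposed.
    public static string ReadAll(Stream stream, Encoding encoding, int chunkSize = 3000)
    {
        var text = new StringBuilder();
        var buffer = new char[chunkSize];

        using (var reader = new StreamReader(stream, encoding))
        {
            int charsRead;
            while ((charsRead = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Only the first charsRead entries of the buffer are valid.
                text.Append(buffer, 0, charsRead);
            }
        }

        return text.ToString();
    }
}
```

For example, wrapping 7,000 ASCII characters in a `MemoryStream` and passing it to `ChunkedReader.ReadAll(stream, Encoding.UTF8)` takes three passes and returns a string of the same 7,000 characters. The stream returned by `GetURLStream` could be passed in the same way.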

Answer 1
1 2013-11-05 12:19:52

Answer 2
0 2013-11-05 12:22:44