
How to get text of a webpage?

Is there any way to get only the text (source) of a webpage? I tried two approaches:

Using WebRequest
        WebRequest myWebRequest = WebRequest.Create("http://www.website.com/");
        WebResponse myWebResponse = myWebRequest.GetResponse();
        Stream ReceiveStream = myWebResponse.GetResponseStream();
        Encoding encode = System.Text.Encoding.GetEncoding("utf-8");
        StreamReader readStream = new StreamReader(ReceiveStream, encode);
        string html = readStream.ReadToEnd();
        readStream.Close();
        myWebResponse.Close();

This approach works fine if the requested webpage is static. However, if the content of the requested page is generated dynamically (for example, by JavaScript after the page loads), I do not get the proper source content.
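For static pages, a sketch using the newer `HttpClient` API (available from .NET 4.5) does the same job with less ceremony; the URL below is just a placeholder:

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

class StaticFetch
{
    static async Task Main()
    {
        // HttpClient handles encoding detection and connection reuse for us.
        using (var client = new HttpClient())
        {
            client.Timeout = TimeSpan.FromSeconds(10);
            // Placeholder URL; like WebRequest, this only sees the raw HTML,
            // not anything generated later by JavaScript.
            string html = await client.GetStringAsync("http://www.website.com/");
            Console.WriteLine(html.Length);
        }
    }
}
```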

Using Web Browser
            WebBrowser browser = new WebBrowser();
            browser.ScrollBarsEnabled = false;
            browser.ScriptErrorsSuppressed = true;
            browser.Navigate(new Uri("http://www.website.com/"));

This approach gives the proper source content every time, but it is slow and shows popups. Some websites also show a browser-version popup, and some even open in IE (which I don't want).
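Note that `Navigate` returns before the page finishes loading, so the source must be read from the `DocumentCompleted` event. A minimal sketch, assuming a console app referencing `System.Windows.Forms` and using a placeholder URL:

```csharp
using System;
using System.Windows.Forms;

class HiddenBrowserFetch
{
    [STAThread]
    static void Main()
    {
        // The WebBrowser control needs an STA thread and a message pump.
        var browser = new WebBrowser
        {
            ScrollBarsEnabled = false,
            ScriptErrorsSuppressed = true
        };
        browser.DocumentCompleted += (s, e) =>
        {
            // Fires once the page (including its load-time scripts) has run,
            // so DocumentText reflects the rendered DOM.
            string source = browser.DocumentText;
            Console.WriteLine(source.Length);
            Application.ExitThread();   // stop the message loop
        };
        browser.Navigate(new Uri("http://www.website.com/"));
        Application.Run();              // pump messages until ExitThread
    }
}
```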

My final objective is to get the source content of the webpage as fast as possible, without opening a browser or triggering any popups. Please let me know of any way to achieve this. Thanks.

You appear to want some sort of browser functionality without the actual browser.

Many tools exist for this, the most prominent being Selenium. Coupled with PhantomJS, it gives you a fully functional browser without the overhead of a visible one.

You'd then be able to do something like (Selenium example):

IWebDriver driver = new PhantomJSDriver();
driver.Navigate().GoToUrl("http://www.website.com");
string fullSource = driver.PageSource;

When using a basic HttpWebRequest or the WebBrowser control, you soon hit issues when pages load slowly or are so JS-heavy that you don't get the expected result.
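With Selenium you can handle slow-loading, JS-heavy pages by waiting explicitly for the script-generated content before reading `PageSource`. A sketch assuming the `Selenium.Support` package; the URL and the `"content"` element id are placeholders for whatever marker the target page's JavaScript creates:

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.PhantomJS;
using OpenQA.Selenium.Support.UI;

class HeadlessFetch
{
    static void Main()
    {
        IWebDriver driver = new PhantomJSDriver();
        try
        {
            driver.Navigate().GoToUrl("http://www.website.com");

            // Wait up to 10 s for an element the page's JavaScript is
            // expected to create (the id here is an assumption).
            var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
            wait.Until(d => d.FindElements(By.Id("content")).Count > 0);

            // Now the source includes the dynamically generated markup.
            string fullSource = driver.PageSource;
            Console.WriteLine(fullSource.Length);
        }
        finally
        {
            driver.Quit();   // shut down the headless browser process
        }
    }
}
```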

I decided to post my code. This works for my dynamic ASP and PHP pages. You can modify it for your needs; this version was used to crawl through a full ASP or PHP website, with these methods called to get the content.

 class WebReader
    {
        private string onlineText = "";
        public string getOnlineText()
        {
            return onlineText;
        }
        public WebReader(String strLocation,String strFile){
            Stream strm = null;
            StreamReader MyReader = null;

            try
            {
                // Download the web page.
                strm = GetURLStream("http://" + strLocation +"/" + strFile);
                if (strm != null)
                {
                    // We have a stream; attach a reader.
                    char[] strBuffer = new char[3000];

                    MyReader = new StreamReader(strm);

                    // Read up to 3,000 characters at a time until we get the whole file.
                    string strLine = "";
                    int charsRead;
                    while ((charsRead = MyReader.Read(strBuffer, 0, 3000)) > 0)
                    {
                        // Append only the characters actually read; appending the
                        // whole buffer would repeat stale data on the final,
                        // partially filled read.
                        strLine += new string(strBuffer, 0, charsRead);
                    }
                    onlineText = strLine;
                }
            }
            catch (Exception excep)
            {
                Console.WriteLine("Error: " + excep.Message);
            }
            finally
            {
                // Clean up and close the stream.
                if (MyReader != null)
                {
                    MyReader.Close();
                }

                if (strm != null)
                {
                    strm.Close();
                }
            }
        }
        public Stream GetURLStream(string strURL)
        {
            System.Net.WebRequest objRequest;
            System.Net.WebResponse objResponse = null;
            Stream objStreamReceive;

            try
            {
                objRequest = System.Net.WebRequest.Create(strURL);
                objRequest.Timeout = 5000;


                objResponse = objRequest.GetResponse();
                objStreamReceive = objResponse.GetResponseStream();

                return objStreamReceive;
            }
            catch (Exception excep)
            {
                Console.WriteLine(excep.Message);
                // objResponse is still null if the request itself failed,
                // so guard against a NullReferenceException here.
                if (objResponse != null)
                {
                    objResponse.Close();
                }

                return null;
            }
        }

        public void ReadWriteStream(Stream readStream, Stream writeStream, frmUpdater _MyParent, int CurrentVersion, long BytesCompleted)
        {
            int Length = 2048;
            Byte[] buffer = new Byte[Length];
            int bytesRead = readStream.Read(buffer, 0, Length);
            // write the required bytes
            while (bytesRead > 0)
            {
                writeStream.Write(buffer, 0, bytesRead);
                bytesRead = readStream.Read(buffer, 0, Length);
                _MyParent.RefreshDownloadLabels(CurrentVersion,BytesCompleted + writeStream.Position);
                Application.DoEvents();
            }
            readStream.Close();
            writeStream.Close();
        }
    }
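A usage sketch for the class above; the host and file names are placeholders (the constructor builds the URL as `"http://" + strLocation + "/" + strFile`):

```csharp
// Downloads http://www.website.com/index.php and returns its text,
// or an empty string if the request failed.
var reader = new WebReader("www.website.com", "index.php");
string pageText = reader.getOnlineText();
Console.WriteLine(pageText.Length);
```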
