How to get text of a webpage?

Is there any way to get only the text (source) of a webpage? I tried two approaches:

Using WebRequest
        WebRequest myWebRequest = WebRequest.Create("http://www.website.com/");
        WebResponse myWebResponse = myWebRequest.GetResponse();
        Stream ReceiveStream = myWebResponse.GetResponseStream();
        Encoding encode = System.Text.Encoding.GetEncoding("utf-8");
        StreamReader readStream = new StreamReader(ReceiveStream, encode);
        string html = readStream.ReadToEnd();
        readStream.Close();
        myWebResponse.Close();

This approach works fine if the requested webpage is static. However, if the content of the requested webpage is generated only when a page load occurs, I do not get the proper source content.

Using Web Browser
            WebBrowser browser = new WebBrowser();
            browser.ScrollBarsEnabled = false;
            browser.ScriptErrorsSuppressed = true;
            browser.Navigate(new Uri("http://www.website.com/"));

This approach gives the proper source content every time, but it takes a lot of time and also shows popups. Also, some websites show a browser version popup, and some even open in IE (which I don't want).

My final objective is to get the source content of the webpage as fast as possible, without opening the browser or getting any popups. Please let me know of any way I can achieve this. Thanks.

You appear to want some sort of browser functionality without the actual browser.

Many tools exist for this, the most prominent being Selenium. Coupled with PhantomJS, it lets you launch a fully functional browser without the overhead of a physical browser window.

You'd then be able to do something like this (Selenium example):

IWebDriver driver = new PhantomJSDriver();
driver.Navigate().GoToUrl("http://www.website.com");
string fullSource = driver.PageSource;

When using a basic HttpWebRequest or the WebBrowser control, you soon hit issues when pages load slowly, or are so JS-heavy that you don't get the expected result.
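For slow-loading or JS-heavy pages, an explicit wait before reading the source can help. The following is only a sketch under some assumptions: it swaps in headless Chrome for PhantomJS (PhantomJS is no longer maintained), it assumes the Selenium.WebDriver and Selenium.Support packages plus a matching chromedriver are installed, and the names RenderedPage, BuildHeadlessOptions and GetSource are made up for illustration.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

// Hypothetical helper, not part of Selenium itself.
static class RenderedPage
{
    // Assumption: headless Chrome replaces PhantomJS, so no window is shown.
    public static ChromeOptions BuildHeadlessOptions()
    {
        var options = new ChromeOptions();
        options.AddArgument("--headless");
        return options;
    }

    // Loads the page, waits until the document reports it has finished
    // loading (or the timeout expires), then returns the rendered source.
    public static string GetSource(string url, TimeSpan timeout)
    {
        using (IWebDriver driver = new ChromeDriver(BuildHeadlessOptions()))
        {
            driver.Navigate().GoToUrl(url);

            var wait = new WebDriverWait(driver, timeout);
            wait.Until(d =>
                (string)((IJavaScriptExecutor)d)
                    .ExecuteScript("return document.readyState") == "complete");

            return driver.PageSource;
        }
    }
}
```

It could be called as `string html = RenderedPage.GetSource("http://www.website.com", TimeSpan.FromSeconds(15));`. Note that `document.readyState` being "complete" does not guarantee that content fetched later by scripts has arrived; for such pages, waiting for a specific element is likely to be more reliable.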

Decided to post my code. This works for my ASP and PHP dynamic pages. You can modify the code for your needs; it was written to crawl through a full ASP or PHP website, and these methods were called to get the content.

 class WebReader
    {
        private string onlineText = "";
        public string getOnlineText()
        {
            return onlineText;
        }
        public WebReader(String strLocation,String strFile){
            Stream strm = null;
            StreamReader MyReader = null;

            try
            {
                // Download the web page.
                strm = GetURLStream("http://" + strLocation +"/" + strFile);
                if (strm != null)
                {
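                    // We have a stream, so attach a character reader.
                    char[] strBuffer = new char[3000];

                    MyReader = new StreamReader(strm);

                    // Read up to 3,000 characters at a time until we have the whole file.
                    string strLine = "";
                    int charsRead;
                    while ((charsRead = MyReader.Read(strBuffer, 0, strBuffer.Length)) > 0)
                    {
                        // Append only the characters actually read in this pass;
                        // the rest of the buffer may hold stale data.
                        strLine += new string(strBuffer, 0, charsRead);
                    }
                    onlineText = strLine;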
                }
            }
            catch (Exception excep)
            {
                Console.WriteLine("Error: " + excep.Message);
            }
            finally
            {
                // Clean up and close the stream.
                if (MyReader != null)
                {
                    MyReader.Close();
                }

                if (strm != null)
                {
                strm.Close();
                }
            }
        }
        public Stream GetURLStream(string strURL)
        {
            System.Net.WebRequest objRequest;
            System.Net.WebResponse objResponse = null;
            Stream objStreamReceive;

            try
            {
                objRequest = System.Net.WebRequest.Create(strURL);
                objRequest.Timeout = 5000;


                objResponse = objRequest.GetResponse();
                objStreamReceive = objResponse.GetResponseStream();

                return objStreamReceive;
            }
            catch (Exception excep)
            {
                Console.WriteLine(excep.Message);

                // GetResponse may have thrown before a response was assigned.
                if (objResponse != null)
                {
                    objResponse.Close();
                }

                return null;
            }
        }

        public void ReadWriteStream(Stream readStream, Stream writeStream, frmUpdater _MyParent, int CurrentVersion, long BytesCompleted)
        {
            int Length = 2048;
            Byte[] buffer = new Byte[Length];
            int bytesRead = readStream.Read(buffer, 0, Length);
            // write the required bytes
            while (bytesRead > 0)
            {
                writeStream.Write(buffer, 0, bytesRead);
                bytesRead = readStream.Read(buffer, 0, Length);
                _MyParent.RefreshDownloadLabels(CurrentVersion,BytesCompleted + writeStream.Position);
                Application.DoEvents();
            }
            readStream.Close();
            writeStream.Close();
        }
    }
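The chunked read in the constructor depends on one detail: StreamReader.Read returns how many characters it actually placed in the buffer, and only that many should be appended. A minimal standalone sketch of the pattern follows; the names StreamText and ReadAll are hypothetical, and a StringBuilder is used here instead of string concatenation, which is a substitution rather than the poster's original approach.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper illustrating the chunked-read pattern.
static class StreamText
{
    // Reads the whole stream as text, chunkSize characters at a time.
    // StreamReader defaults to UTF-8 and detects a byte order mark.
    public static string ReadAll(Stream stream, int chunkSize = 3000)
    {
        var buffer = new char[chunkSize];
        var text = new StringBuilder();

        using (var reader = new StreamReader(stream))
        {
            int charsRead;
            while ((charsRead = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Append only the characters filled in by this call; the
                // tail of the buffer may still hold data from a prior pass.
                text.Append(buffer, 0, charsRead);
            }
        }

        return text.ToString();
    }
}
```

With this helper, the body of the constructor could reduce to something like `onlineText = StreamText.ReadAll(strm);`, since disposing the reader also closes the underlying stream.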

