
How to get all contents of a website, not only a webpage, in C#

How can I extract all contents of a website, not only a single webpage? If we consider a website named www.abc.com , how can we get the contents of every page of this site? I have tested the code below, but it only gets the contents of a single page, using C#.

        // Fetch a single page and print its HTML to the console.
        string urlAddress = "https://www.motionflix.xyz/";

        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(urlAddress);
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {
            if (response.StatusCode == HttpStatusCode.OK)
            {
                Stream receiveStream = response.GetResponseStream();
                StreamReader readStream;

                // Honor the character set declared by the server, if any.
                if (String.IsNullOrWhiteSpace(response.CharacterSet))
                    readStream = new StreamReader(receiveStream);
                else
                    readStream = new StreamReader(receiveStream, Encoding.GetEncoding(response.CharacterSet));

                string data = readStream.ReadToEnd();
                Console.WriteLine(data);
                readStream.Close();
            }
        }

When you load that page in a browser, it only gets (server-side browser switching aside) what you get with your request. What the browser then does, and what you need to do in your code, is parse this content: it contains references (e.g. via <script> , <img> , <link> , <iframe> and other tags) that give the URLs of the other resources to load.
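As a minimal sketch of that parsing step, the snippet below pulls `href` and `src` attribute values out of raw HTML with a regular expression. A regex is fragile on real-world markup, and a dedicated HTML parser (such as the HtmlAgilityPack library) is more robust, but it illustrates the idea; the sample HTML string is invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class LinkExtractor
{
    // Extract the values of href/src attributes from an HTML string.
    public static List<string> ExtractUrls(string html)
    {
        var urls = new List<string>();
        var pattern = new Regex("(?:href|src)\\s*=\\s*[\"']([^\"']+)[\"']",
                                RegexOptions.IgnoreCase);
        foreach (Match m in pattern.Matches(html))
            urls.Add(m.Groups[1].Value);
        return urls;
    }

    static void Main()
    {
        string html = "<a href=\"/about\">About</a><img src=\"logo.png\">";
        foreach (string u in ExtractUrls(html))
            Console.WriteLine(u);   // prints "/about" then "logo.png"
    }
}
```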

It might be easier to use a prebuilt tool such as wget, if it does what you need, or to use browser automation.

If you want to download a complete website including all of its contents, you can use a tool called HTTrack. HTTrack allows users to download World Wide Web sites from the Internet to a local computer. Here is the link you can follow: https://www.httrack.com/page/2/en/index.html

  1. Create a list containing all the URLs that have already been scraped.
  2. Create a loop that starts with a given URL: add it to the list, scrape the content of that page, and search it for href attributes (= new URLs). If a new URL is not in the list already, repeat step 2 with it. Continue as long as there are new URLs that have not been scraped yet.

Note that you may want to check whether a URL is still on the same domain, otherwise you might accidentally scan the whole internet.
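The two steps above, including the same-domain check, can be sketched as the loop below. The page-fetching function is injected so the traversal logic is shown on its own; in a real crawler it would wrap an HTTP call (e.g. `HttpClient.GetStringAsync`), and the example URLs and in-memory "site" are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class Crawler
{
    // Breadth-first crawl sketch: visits each URL once, collects links
    // from its HTML, and skips URLs that leave the starting domain.
    public static List<string> Crawl(string start, string domain,
                                     Func<string, string> fetch)
    {
        var visited = new List<string>();   // step 1: URLs already scraped
        var queue = new Queue<string>();
        queue.Enqueue(start);

        var linkPattern = new Regex("href\\s*=\\s*[\"']([^\"']+)[\"']",
                                    RegexOptions.IgnoreCase);

        while (queue.Count > 0)             // step 2: process new URLs
        {
            string url = queue.Dequeue();
            if (visited.Contains(url)) continue;
            if (!url.StartsWith(domain)) continue;   // stay on the same domain
            visited.Add(url);

            string html = fetch(url);
            foreach (Match m in linkPattern.Matches(html))
                queue.Enqueue(m.Groups[1].Value);
        }
        return visited;
    }

    static void Main()
    {
        // Tiny in-memory "site" standing in for real HTTP responses.
        var pages = new Dictionary<string, string>
        {
            ["https://example.com/"]  = "<a href=\"https://example.com/a\">a</a>",
            ["https://example.com/a"] = "<a href=\"https://example.com/\">home</a>" +
                                        "<a href=\"https://other.com/\">off-site</a>"
        };
        var order = Crawl("https://example.com/", "https://example.com",
                          u => pages.ContainsKey(u) ? pages[u] : "");
        foreach (string u in order)
            Console.WriteLine(u);   // the off-site link is never visited
    }
}
```

For a large site, the `List` of visited URLs should be a `HashSet<string>` so the membership check stays fast, and relative links would need to be resolved against the page's base URL (e.g. with the `Uri` class) before queueing.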

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address. Any question please contact: yoyou2525@163.com.

粤ICP备18138465号  © 2020-2024 STACKOOM.COM