How can I extract all the contents of a website, not just a single webpage? For a website such as www.abc.com, how can I get the contents of all of its pages? I have tried the following C# code, but it only fetches the contents of a single page.
using System;
using System.IO;
using System.Net;
using System.Text;

string urlAddress = "https://www.motionflix.xyz/";
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(urlAddress);

// Dispose the response and the reader even if reading throws.
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
    if (response.StatusCode == HttpStatusCode.OK)
    {
        // Use the charset declared by the server; fall back to UTF-8 if none is given.
        Encoding encoding = String.IsNullOrWhiteSpace(response.CharacterSet)
            ? Encoding.UTF8
            : Encoding.GetEncoding(response.CharacterSet);

        using (Stream receiveStream = response.GetResponseStream())
        using (StreamReader readStream = new StreamReader(receiveStream, encoding))
        {
            string data = readStream.ReadToEnd();
            Console.WriteLine(data);
        }
    }
}
When you load that page in a browser, the browser initially receives only what your request receives (server-side browser detection aside). What the browser does next, and what your code needs to do, is parse that content: it contains references (e.g. via <script>, <img>, <link>, <iframe> and other tags) that give the URLs of the other resources to load.
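As a rough illustration of that parsing step, the sketch below pulls the href and src attribute values out of an HTML string and resolves them against the page URL. It makes some assumptions that are not in the answer above: the class name LinkExtractor is made up for this example, href is included so that <a> links to other pages are picked up too, and a regular expression stands in for a real HTML parser, so unquoted attributes and other edge cases will be missed.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class LinkExtractor
{
    // Simplification: only matches href="..." or src="..." with quoted values.
    private static readonly Regex UrlAttribute = new Regex(
        @"(?:href|src)\s*=\s*[""']([^""']+)[""']",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns the absolute http(s) URLs referenced by the page, resolved against pageUrl.
    public static List<Uri> Extract(string html, Uri pageUrl)
    {
        var results = new List<Uri>();
        foreach (Match match in UrlAttribute.Matches(html))
        {
            // Uri.TryCreate(base, relative, out) resolves relative references such as "/about".
            if (Uri.TryCreate(pageUrl, match.Groups[1].Value, out var absolute)
                && (absolute.Scheme == Uri.UriSchemeHttp || absolute.Scheme == Uri.UriSchemeHttps))
            {
                results.Add(absolute);
            }
        }
        return results;
    }
}

static class Program
{
    static void Main()
    {
        // Hard-coded sample markup so the example runs without network access.
        string html = "<a href=\"/about\">About</a><img src='logo.png'>"
                    + "<script src=\"https://cdn.example.com/app.js\"></script>";

        foreach (Uri url in LinkExtractor.Extract(html, new Uri("https://www.abc.com/index.html")))
        {
            Console.WriteLine(url);
        }
    }
}
```

In a real crawler the html argument would be the string read from the response, and each returned URL would be requested in turn. For anything beyond a quick experiment, a dedicated HTML parsing library is likely to be more reliable than a regular expression.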
It might be easier to use a prebuilt application such as wget, if it does what you need, or to use browser automation.
If you want to download a complete website including all of its contents, you can use HTTrack. HTTrack allows users to download World Wide Web sites from the Internet to a local computer. Here is the link: https://www.httrack.com/page/2/en/index.html
Note that you may want to check whether a URL is still on the same domain; otherwise you might accidentally crawl the whole internet.
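One possible way to apply that check is sketched below: a breadth-first crawl that only follows links whose host matches the start URL and that remembers which pages it has already queued. The names CrawlScope, IsSameHost, Crawl, getLinks and maxPages are hypothetical, chosen for this example. The getLinks callback stands in for "download the page and extract its links", and here it is backed by a small in-memory dictionary instead of real HTTP requests. Comparing Uri.Host exactly is also an assumption: it treats www.abc.com and abc.com as different sites.

```csharp
using System;
using System.Collections.Generic;

static class CrawlScope
{
    // True when candidate has the same host as the start URL (case-insensitive).
    public static bool IsSameHost(Uri start, Uri candidate)
    {
        return string.Equals(start.Host, candidate.Host, StringComparison.OrdinalIgnoreCase);
    }

    // Breadth-first traversal limited to the start host and to maxPages pages.
    public static List<Uri> Crawl(Uri start, Func<Uri, IEnumerable<Uri>> getLinks, int maxPages)
    {
        var seen = new HashSet<string>();
        var visited = new List<Uri>();
        var queue = new Queue<Uri>();

        queue.Enqueue(start);
        seen.Add(start.GetLeftPart(UriPartial.Query));

        while (queue.Count > 0 && visited.Count < maxPages)
        {
            Uri current = queue.Dequeue();
            visited.Add(current);

            foreach (Uri link in getLinks(current))
            {
                // Drop the #fragment so the same page is not queued twice.
                string key = link.GetLeftPart(UriPartial.Query);
                if (IsSameHost(start, link) && seen.Add(key))
                {
                    queue.Enqueue(link);
                }
            }
        }
        return visited;
    }
}

static class Program
{
    static void Main()
    {
        var start = new Uri("https://www.abc.com/");

        // Stand-in for real downloads: a tiny in-memory "site".
        var fakeSite = new Dictionary<string, string[]>
        {
            ["https://www.abc.com/"] = new[] { "https://www.abc.com/about", "https://other.example.org/" },
            ["https://www.abc.com/about"] = new[] { "https://www.abc.com/", "https://www.abc.com/about#team" },
        };

        Func<Uri, IEnumerable<Uri>> getLinks = page =>
        {
            var links = new List<Uri>();
            if (fakeSite.TryGetValue(page.AbsoluteUri, out var targets))
            {
                foreach (string target in targets) links.Add(new Uri(target));
            }
            return links;
        };

        foreach (Uri page in CrawlScope.Crawl(start, getLinks, maxPages: 100))
        {
            Console.WriteLine(page);
        }
    }
}
```

With the sample data, the crawl visits the two www.abc.com pages and never requests other.example.org. The maxPages limit is an extra safety net for sites that generate an unbounded number of URLs.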