
Download and encode HTML page into file

I would like to download some web pages that use charset="UTF-8".
This page is a sample: http://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_2003
I always end up with garbled special characters like this: BeyoncÃ© instead of Beyoncé
I tried the following code:

WebClient webClient = new WebClient();
webClient.Encoding = System.Text.Encoding.UTF8;
webClient.DownloadFile(url, fileName);

or this one:

WebClient client = new WebClient();
Byte[] pageData = client.DownloadData(url);
string pageHtml = Encoding.UTF8.GetString(pageData);
System.IO.File.WriteAllText(fileName, pageHtml);

What am I doing wrong?
I just want an easy way to download web pages and write them to files. After that is done I will extract data from these files, and obviously I want "normal" characters like the ones I see on the original web page, not garbled ones.

The problem is that WriteAllText, when called without an encoding, does not mark the file as UTF-8: it writes UTF-8 without a byte order mark, so whatever opens the file afterwards may guess the wrong encoding. You should pass the encoding explicitly:

System.IO.File.WriteAllText(fileName, pageHtml, Encoding.UTF8);
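
To make the effect easier to see, here is a minimal sketch that needs no network access. It assumes the page bytes have already been fetched (for example with client.DownloadData(url) as in the question). The class name PageSaver and the methods SaveAsUtf8 and MisdecodeAsLatin1 are hypothetical names made up for this illustration, not part of any library.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper for illustration only.
public static class PageSaver
{
    // Decode the raw page bytes as UTF-8, then write the text back out
    // with an explicit encoding. Encoding.UTF8 emits a byte order mark,
    // so readers that sniff the encoding can recognise the file as UTF-8.
    public static void SaveAsUtf8(byte[] pageData, string fileName)
    {
        string pageHtml = Encoding.UTF8.GetString(pageData);
        File.WriteAllText(fileName, pageHtml, Encoding.UTF8);
    }

    // Reproduces the garbling from the question: UTF-8 bytes that are
    // interpreted as ISO-8859-1 turn one "é" into the two characters "Ã©".
    public static string MisdecodeAsLatin1(byte[] utf8Bytes)
    {
        return Encoding.GetEncoding("ISO-8859-1").GetString(utf8Bytes);
    }
}

// Example usage (the bytes stand in for a downloaded page):
//   byte[] data = Encoding.UTF8.GetBytes("Beyonc\u00e9");
//   PageSaver.SaveAsUtf8(data, "page.html");
//   Console.WriteLine(PageSaver.MisdecodeAsLatin1(data));
```

If the garbling persists after this change, it may be worth checking how the file is read back or displayed, since the same mis-decoding can also happen on the reading side.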
