Getting correct strings from a windows-1250 encoded web page with node.js

Question

I am trying to scrape some data from a web page with Node.js, but I am having problems with character encoding. The web page states that its encoding is <meta http-equiv="Content-Type" content="text/html; charset=windows-1250">, and when I browse it with Chrome, it sets the encoding to windows-1250 and everything looks fine.

As there is no windows-1250 encoding/decoding for streams in Node (and utf8 did not work), I found the iconv-lite package, which should be able to convert easily between different encodings. But I still get wrong characters after I save the response to a file (or print it to the console). I also tried different encodings, the native Node buffer encodings, and setting the headers to match what I see in Chrome ('Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'), but nothing seems to work correctly.

You can see the whole code here: https://gist.github.com/4110999

I suppose I am missing something fundamental about how encoding works, so any help on how to get the data with the correct characters would be appreciated.

EDIT:
I also tried the node-iconv package in case the problem lies in the package itself. I changed line 51 to:

var decoder = new Iconv_native('WINDOWS-1250', 'UTF-8');  
var decoded = decoder.convert(body).toString();

but I am still getting the same results.

Answer 1

I'm not familiar with the iconv-lite package, but looking through its code, it looks like you'll need to use win1250 instead of windows1250.

The encoding names are looked up as keys in a hash.

Also, the readme uses this code instead of 'windows1251':

str = iconv.decode(buf, 'win1251');

Answer 2

I think you are converting a string, but you must convert the raw bytes! If you are reading something from the web, you must read it as binary.

Example of reading a win-1250 encoded file from disk:

var fs = require('fs');
var Iconv = require('iconv').Iconv;

// Without an encoding option, fs returns the raw bytes as a Buffer.
var bytes = fs.readFileSync('myFile.txt');
// This is bad: var myBadString = fs.readFileSync('myFile.txt', { encoding: "UTF-8" });

// Convert the raw CP1250 bytes to UTF-8 bytes, then turn them into a string.
var translated = new Iconv('CP1250', 'UTF-8').convert(bytes).toString();
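
The same idea applies to a web response: keep every chunk as a Buffer (do not call setEncoding on the response and do not concatenate chunks as strings), then decode the joined bytes once. Below is a minimal sketch of that, not the asker's actual code. The helper name decodeBody and the simulated chunks are made up for illustration, and it uses Node's built-in TextDecoder in place of iconv, on the assumption that the Node build ships full ICU data so that the 'windows-1250' label is available.

```javascript
// Hypothetical helper: join raw response chunks, then decode them in one step.
function decodeBody(chunks, charset) {
  const bytes = Buffer.concat(chunks);            // still raw bytes, nothing decoded yet
  return new TextDecoder(charset).decode(bytes);  // bytes -> JavaScript string
}

// Simulated 'data' events: the word "Čeština" encoded as windows-1250,
// split across two chunks the way a network stream might deliver it.
const chunks = [
  Buffer.from([0xc8, 0x65, 0x9a]),
  Buffer.from([0x74, 0x69, 0x6e, 0x61]),
];

// Correct: decode the bytes with the charset the page declares.
const good = decodeBody(chunks, 'windows-1250');

// Wrong: treating the same bytes as UTF-8 produces U+FFFD replacement characters.
const bad = Buffer.concat(chunks).toString('utf8');

console.log(good);
console.log(bad.includes('\ufffd'));
```

With a real request, the chunks array would be filled from the response's 'data' events, and decodeBody would be called in the 'end' handler.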

Answer 1
1 2012-11-19 15:19:26

Answer 2
0 2013-11-09 15:24:57