Correct encoding for body from Request NodeJS
I'm trying to scrape a web page for some data. I managed to post a request and got the right data back. The problem is that I get something like:
"Kannst du bitte noch einmal ... erzýhlen , wie du wýhrend der Safari einen Lýwen verjagt hast?"
It should read erzählen and während, so characters such as Ä, Ö, ß and Ü are not showing correctly.
Here is my code:
var querystring = require('querystring');
var iconv = require('iconv-lite');
var request = require('request');
var fs = require('fs');
var writer = fs.createWriteStream('outputBodyutf8String.html');
var form = {
    id: '2974',
    opt1: '',
    opt2: '30',
    ref: 'A1',
    tid: '157',
    tid2: '',
    fnum: '2'
};
var formData = querystring.stringify(form);
var contentLength = formData.length;
request({
    headers: {
        'Content-Length': contentLength,
        'Content-Type': 'application/x-www-form-urlencoded'
    },
    uri: 'xxxxxx.php',
    body: formData,
    method: 'POST'
}, function (err, res, body) {
    var utf8String = iconv.decode(body, "ISO-8859-1");
    console.log(utf8String);
    writer.write(utf8String);
});
How do I get the HTML body with the correct letters?
I went to the website you are attempting to scrape, and found this:
And another character encoding declaration here:
This website defines two different character encodings! Which one do I use?
Well, this doesn't apply to you. When an HTML file is read from a local machine, the charset or content-type defined in the meta tags is used for the encoding.
Since you are retrieving this document over HTTP, the file will be decoded according to the response header.
Here's the response header I received after visiting the website.
As you can see, they don't have a defined character set. It should be located in the Content-Type property. Like this:
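To make the point about the Content-Type property concrete, here is a minimal sketch of reading a charset parameter out of a header value. The helper name charsetFromContentType and the sample header strings are made up for illustration; they are not the headers this site actually returned.

```javascript
// Hypothetical helper (not part of any library): pull the charset
// parameter out of a Content-Type header value, or return null if
// the header does not declare one.
function charsetFromContentType(contentType) {
  if (!contentType) return null;
  // Matches charset=utf-8 as well as charset="utf-8"
  const match = /charset\s*=\s*"?([^;"\s]+)"?/i.exec(contentType);
  return match ? match[1].toLowerCase() : null;
}

// Sample values, assumed for illustration only
console.log(charsetFromContentType('text/html; charset=ISO-8859-1'));
console.log(charsetFromContentType('text/html'));
```

Inside a request callback you could pass res.headers['content-type'] to a helper like this; a null result corresponds to the situation described here, where the server declares no charset at all.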
Since they don't indicate any charset in the response header, then, according to this post, it should use the meta declaration.
But wait, there were two meta charset declarations.
Since the parser reads the file top to bottom, the second declared charset should be used.
Conclusion: they use UTF-8.
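If you want to check the meta declarations yourself rather than by eye, a rough sketch like the following lists every charset declared in meta tags, in document order. The function name metaCharsets and the sample markup are assumptions for illustration; the regular expressions are deliberately simple and are not a full HTML parser.

```javascript
// Hypothetical helper: collect the charset named by each <meta> tag,
// in the order the tags appear in the markup.
function metaCharsets(html) {
  const found = [];
  const metaTag = /<meta\b[^>]*>/gi;
  let tag;
  while ((tag = metaTag.exec(html)) !== null) {
    // Covers both <meta charset="..."> and
    // <meta http-equiv="Content-Type" content="...; charset=...">
    const m = /charset\s*=\s*["']?\s*([^"'\s;>\/]+)/i.exec(tag[0]);
    if (m) found.push(m[1].toLowerCase());
  }
  return found;
}

// Made-up markup with two conflicting declarations
const sample =
  '<head>' +
  '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">' +
  '<meta charset="utf-8">' +
  '</head>';
console.log(metaCharsets(sample));
```

Which of several declarations wins depends on the parser, so printing all of them at least makes the conflict visible before you pick an encoding.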
Also, I don't think you need the conversion. I may be wrong, but you should just be able to access the response.
request({
    headers: {
        'Content-Length': contentLength,
        'Content-Type': 'application/x-www-form-urlencoded'
    },
    uri: 'xxxxxx.php',
    body: formData,
    method: 'POST'
}, function (err, res, body) {
    console.log(body);
    writer.write(body);
});
Edit: I don't believe the error is on their side. I believe it's on your side. Give this a try:
Remove the writer:
var writer = fs.createWriteStream('outputBodyutf8String.html');
And in the request callback, replace everything with this:
function (err, res, body) {
    console.log(body);
    // Write the body to disk as UTF-8
    fs.writeFile('outputBodyutf8String.html', body, 'utf8', function (error) {
        if (error)
            console.log('Error Occurred', error);
    });
}
All the code should look like this:
var querystring = require('querystring');
var iconv = require('iconv-lite');
var request = require('request');
var fs = require('fs');
var form = {
    id: '2974',
    opt1: '',
    opt2: '30',
    ref: 'A1',
    tid: '157',
    tid2: '',
    fnum: '2'
};
var formData = querystring.stringify(form);
var contentLength = formData.length;
request({
    headers: {
        'Content-Length': contentLength,
        'Content-Type': 'application/x-www-form-urlencoded'
    },
    uri: 'xxxxxxx.php',
    body: formData,
    method: 'POST'
}, function (err, res, body) {
    console.log(body);
    // Write the body to disk as UTF-8
    fs.writeFile('outputBodyutf8String.html', body, 'utf8', function (error) {
        if (error)
            console.log('Error Occurred', error);
    });
});
You can make the request with encoding: binary and then decode the result:
const util = require('util');
// b is the response body received with encoding: binary
let decoder = new util.TextDecoder('windows-1251');
console.log(decoder.decode(Buffer.from(b.toString(), 'binary'), { stream: true }));
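The idea behind keeping the body as raw bytes can be shown without any network call. The sketch below builds the bytes a Latin-1 (ISO-8859-1) server would send for "erzählen" and compares decoding them as UTF-8 with decoding them as Latin-1. The sample word comes from the question; that the site really sends ISO-8859-1 is an assumption here, not something confirmed in this thread.

```javascript
// Bytes for "erzählen" as a Latin-1 server would send them:
// ä is the single byte 0xE4.
const raw = Buffer.from('erz\u00e4hlen', 'latin1');

// Decoding those bytes as UTF-8 is wrong: 0xE4 is not a complete
// UTF-8 sequence, so it becomes the replacement character U+FFFD.
const wrong = raw.toString('utf8');

// Decoding with the charset the bytes were written in recovers the text.
const right = raw.toString('latin1');

// A 'binary' string keeps one character per byte, so the original
// bytes can be rebuilt from it and decoded afterwards.
const binaryString = raw.toString('binary');
const restored = Buffer.from(binaryString, 'binary');

console.log(wrong);
console.log(right);
console.log(restored.equals(raw));
```

With the request library, passing encoding: null in the options hands the callback a Buffer instead of a string, which can then be decoded with iconv.decode(body, 'ISO-8859-1') as in the question's code. Once the body has already been turned into a UTF-8 string, the damaged characters cannot be recovered by decoding again.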