
Convert mangled characters back to UTF-8

Here is what I did:

  1. I dumped a SQLite database containing UTF-8 data ( sqlite3 example.db .dump > dump.sql ). Since this was done in PowerShell, I assume the redirection converted the output to Windows-1252.
  2. I loaded that dumped data into a new database, again using PowerShell ( Get-Content dump.sql | sqlite3 example2.db ).
  3. I dumped that new database and am left with a new .sql file. This time the dump did not go through PowerShell, so I assume it was not modified.

The UTF-8 characters in this new .sql file are seriously mangled, and I was wondering whether there is a way to convert it back into correct UTF-8.

As a few examples, here is what some sequences look like in the new file, and what they should be (all viewed as UTF-8):

  1. ÒüéÒü¬ÒüƒÒü½ should be あなたに
  2. ´╝ü should be a full-width exclamation mark (!)
  3. Òé¡Òé╗Òé¡ should be キセキ
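
One way to see what kind of damage this is: compare the number of characters in each mangled sequence with the number of UTF-8 bytes in the text it should be. The sketch below uses only the three examples above (the variable names are made up for illustration). If the counts match, each UTF-8 byte was probably shown as one character of some single-byte code page, which means the original bytes may still be recoverable.

```python
# Pairs of (mangled text as seen in the new file, text it should have been).
pairs = [
    ("ÒüéÒü¬ÒüƒÒü½", "あなたに"),
    ("´╝ü", "\uff01"),  # U+FF01 is the full-width exclamation mark
    ("Òé¡Òé╗Òé¡", "キセキ"),
]

for mangled, expected in pairs:
    utf8 = expected.encode("utf-8")
    # One mangled character per UTF-8 byte suggests a single-byte code page.
    print(f"{len(mangled)} mangled chars, {len(utf8)} UTF-8 bytes: {utf8.hex(' ')}")
```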

Does anyone have any idea how I might undo this mangling? Any method would be very helpful!

This is in PowerShell 7.0.1.


On further inspection, you can reproduce my predicament by redirecting any such data to a file in PowerShell (note that the data cannot itself be typed into PowerShell). Hence, setting up a script like this gives the same outcome:


echo "キ"

And then running wsl ./test.sh > test.txt gives an output of Òé¡ rather than キ.

Edit 2:

It seems as if the code page the UTF-8 text was converted to is almost 437: some characters are restored using this assumption (eg), but others are not. If it's close to 437, but isn't, what could it be?
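
A rough way to narrow this down is to try several candidate code pages against one known pair: re-encode the mangled text with each candidate and check whether the resulting bytes decode as the expected UTF-8 text. The sketch below is only an illustration. The function name `try_codepages` and the candidate list are assumptions, and the list may need extending for other locales.

```python
def try_codepages(mangled, expected, candidates):
    """Return the candidate code pages that turn `mangled` back into `expected`."""
    matches = []
    for name in candidates:
        try:
            # Undo the wrong decoding: recover the raw bytes, then read them as UTF-8.
            repaired = mangled.encode(name).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this code page cannot explain the mangled text
        if repaired == expected:
            matches.append(name)
    return matches


# Candidate list is an assumption: two common OEM code pages plus Windows-1252.
candidates = ["cp437", "cp850", "cp1252"]
print(try_codepages("Òé¡Òé╗Òé¡", "キセキ", candidates))
```

A candidate that raises an error is skipped, since it contains no byte for one of the mangled characters or yields bytes that are not valid UTF-8.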

It turns out that, because I'm in the UK, the code page I wanted was 850. Saving the file as 850 and then reloading it as UTF-8 solved my problem!
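
The same "save as 850, reload as UTF-8" step can be scripted. The following is a minimal sketch rather than the exact procedure used above: it assumes the mangled file is itself valid UTF-8 and that code page 850 was the one involved. The function name `repair_file` and the file names are hypothetical.

```python
import os
import tempfile


def repair_file(src, dst, codepage="cp850"):
    """Write the repaired bytes of `src` to `dst` and return the repaired text."""
    # newline="" keeps line endings as they are; utf-8-sig tolerates a BOM.
    with open(src, encoding="utf-8-sig", newline="") as f:
        mangled = f.read()
    raw = mangled.encode(codepage)  # "save as 850": recover the original bytes
    text = raw.decode("utf-8")      # "reload as UTF-8": raises if the guess was wrong
    with open(dst, "wb") as f:
        f.write(raw)
    return text


# Demo on a throwaway file (hypothetical names and contents).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "mangled.sql")
    dst = os.path.join(tmp, "fixed.sql")
    with open(src, "w", encoding="utf-8") as f:
        f.write("INSERT INTO t VALUES('Òé¡Òé╗Òé¡');\n")
    print(repair_file(src, dst))
```

The bytes are decoded before anything is written, so a wrong code page guess raises an error instead of producing a second damaged file.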
