
Encoding conversion in PHP (ISO-8859-1, UTF-8, CP1250)

I want to work with data from a CSV file, but I realized the letters are not showing correctly. I have tried a million ways to convert the encoding, but nothing works. I am working on macOS with PHP 7.4.4.

After executing fgets() or fgetcsv() on the file handle, I get this (two lines in this example):

Kód ADM;Kód obce;Název obce;Kód MOMC;Název MOMC;Kód MOP;Název MOP;Kód èásti obce;Název èásti obce;Kód ulice;Název ulice;Typ SO;Èíslo domovní;Èíslo orientaèní;Znak èísla orientaèního;PSÈ;Souøadnice Y;Souøadnice X;Platí Od

1234;1234;HorniDolni;;;;;1234;HorniDolni;;;è.p.;2;;;748790401;4799.98;15893971.21;2013-12-01T00:00:00

It is more or less correct Czech, but the letter č is replaced by è and ř is replaced by ø, neither of which is part of the Czech alphabet. I am confident there will be more misplaced letters in the file.

Executing file -I path/to/file, I receive file: text/plain; charset=iso-8859-1, which is sad because, as far as the wiki is concerned, this charset doesn't include the Czech alphabet.

None of the following calls converted the misplaced letters:

mb_convert_encoding($line, 'UTF-8', 'ISO8859-1')
iconv('ISO-8859-1', 'UTF-8', $line)
iconv('ISO8859-1', 'UTF-8', $line)

I have noticed that in ISO-8859-1 the letter ø has code 00F8. Windows-1250 (which includes the Czech alphabet) has the correct letter ř with code 0159, but both of them are preceded by 00F8. It is the same with the letters č and è, which are both preceded by code 00E7. I do not understand encoding very deeply, but it seems that the file is encoded in Windows-1250 while the interpreter thinks the encoding is ISO-8859-1 and takes the letter that sits at the position of the original one.

But none of the conversions (ISO-8859-1 => Windows-1250, ISO-8859-1 => UTF-8, or the other way around) works.

Does anyone have any idea how to solve this? Thanks!

The problem with 8-bit character encodings is that it mostly takes human intelligence to work out the correct codepage.

When you run file on a file, it can work out that the file is mostly made up of printable characters, but as it's only looking at the bytes, it can't easily tell the difference between iso-8859-1 and iso-8859-2. To file, a byte such as 0x80 is just 0x80, whichever codepage it was written in.

file can only tell that the file is text and likely iso-8859-* or windows-*, because of the use of 0x80-0xFF, i.e. not just ASCII.
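To make the ambiguity concrete, here is a small sketch that decodes the same two raw bytes under three candidate codepages. The byte values 0xE8 and 0xF8 are chosen for illustration (they are the positions of è/ø in iso-8859-1 and of č/ř in iso-8859-2 and CP1250); it assumes the iconv extension is available and knows these encoding names.

```php
<?php
// The same two raw bytes, as they might sit in the file on disk.
$bytes = "\xE8\xF8";

// Candidate source encodings to interpret those bytes as.
$candidates = ['ISO-8859-1', 'ISO-8859-2', 'CP1250'];

$decoded = [];
foreach ($candidates as $encoding) {
    // iconv() returns the converted string, or false on failure.
    $decoded[$encoding] = iconv($encoding, 'UTF-8', $bytes);
    printf("%-10s => %s\n", $encoding, $decoded[$encoding]);
}
```

The bytes never change; only the interpretation does, which is why a tool looking at bytes alone cannot settle the question.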

(Unicode encodings, like UTF-8 and UTF-16, are easier to detect by their byte sequences or by a Byte Order Mark at the top of the file.)
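As a rough illustration of why UTF-8 is easier to spot, the sketch below checks for a UTF-8 Byte Order Mark and for well-formed UTF-8 byte sequences. The helper names hasUtf8Bom and looksLikeUtf8 are made up for this example, and it assumes the mbstring extension is loaded. Note that pure ASCII also passes the UTF-8 check, so a positive result only means "not contradicted".

```php
<?php
// Hypothetical helper: true if the data starts with the UTF-8 BOM (EF BB BF).
function hasUtf8Bom(string $bytes): bool
{
    return strncmp($bytes, "\xEF\xBB\xBF", 3) === 0;
}

// Hypothetical helper: true if the bytes form well-formed UTF-8.
function looksLikeUtf8(string $bytes): bool
{
    return mb_check_encoding($bytes, 'UTF-8');
}

$legacy = "N\xE1zev \xE8\xE1sti obce"; // 8-bit codepage bytes
$utf8   = 'Název části obce';          // assumes this source file is saved as UTF-8

var_dump(looksLikeUtf8($legacy));        // 0xE1 is not followed by continuation bytes
var_dump(looksLikeUtf8($utf8));
var_dump(hasUtf8Bom("\xEF\xBB\xBF" . $utf8));
```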

There are some intelligent character codepage detectors that, with the help of dictionaries from different languages, can estimate the codepage based on character/byte sequences.
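A real detector is far more sophisticated, but the idea can be sketched with a much cruder heuristic: decode the bytes under each candidate codepage and count how many letters of the expected language come out. The function name scoreEncodings, the letter list, and the sample sentence are all assumptions made for this illustration, not part of any library.

```php
<?php
// Hypothetical helper: score each candidate encoding by how many
// expected letters appear after decoding. Highest score comes first.
function scoreEncodings(string $bytes, array $candidates, string $expectedLetters): array
{
    $scores = [];
    foreach ($candidates as $encoding) {
        // @ hides the notice iconv() raises for bytes that are invalid in $encoding.
        $text = @iconv($encoding, 'UTF-8', $bytes);
        $scores[$encoding] = ($text === false)
            ? 0
            : (int) preg_match_all('/[' . $expectedLetters . ']/u', $text);
    }
    arsort($scores); // sort by score, descending, keeping the encoding names as keys
    return $scores;
}

// Sample bytes: a Czech sentence encoded as CP1250 (assumes this file is UTF-8).
$sample = iconv('UTF-8', 'CP1250', 'Příliš žluťoučký kůň');

$scores = scoreEncodings($sample, ['ISO-8859-1', 'ISO-8859-2', 'CP1250'], 'ěščřžýáíéúůďťň');
print_r($scores);
```

Such a count can only separate codepages on letters they encode differently. The header line in the question uses only letters that sit at the same byte positions in iso-8859-2 and CP1250, so those two would tie on it.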

The likely conversion you need is simply iso-8859-2 -> UTF-8.
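A minimal sketch of that conversion applied to a semicolon-separated file follows. It assumes the source really is iso-8859-2 (swap in 'CP1250' if that turns out to be the actual encoding) and uses a temporary file with a shortened version of the question's data as a stand-in for the real CSV.

```php
<?php
// Assumption: the file's real encoding. Change to 'CP1250' if needed.
$sourceEncoding = 'ISO-8859-2';

// Stand-in for the real CSV: two lines of raw 8-bit bytes in a temp file.
$path = tempnam(sys_get_temp_dir(), 'csv');
file_put_contents(
    $path,
    "K\xF3d \xE8\xE1sti obce;N\xE1zev \xE8\xE1sti obce;Typ SO\n" .
    "1234;HorniDolni;\xE8.p.\n"
);

$rows = [];
$handle = fopen($path, 'rb');
while (($line = fgets($handle)) !== false) {
    // Convert the raw line to UTF-8 first, then split it into fields.
    // This is safe because ';' is plain ASCII in all of these encodings.
    $utf8 = iconv($sourceEncoding, 'UTF-8', rtrim($line, "\r\n"));
    $rows[] = str_getcsv($utf8, ';', '"', '\\');
}
fclose($handle);
unlink($path);

print_r($rows);
```

The key point is that the first argument to iconv() names what the bytes actually are, not what a tool guessed them to be.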

What is important is that you know the original encoding (interpretation), and then, when you validate the result, that you know exactly which encoding you're viewing it in.

For example, your PHP setup may send the HTTP charset as iso-8859-1 (the default_charset setting controls this, and it has defaulted to UTF-8 since PHP 5.6). That means it's quite possible for you to be converting correctly to iso-8859-2, but your browser will then "interpret" the output as iso-8859-1.
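One way to rule out that mismatch when checking output in a browser is to convert to UTF-8 and declare the charset explicitly. This is only a sketch; the sample bytes are illustrative, and it assumes nothing has been sent to the client before header() is called.

```php
<?php
// Illustrative 8-bit bytes, assumed to be iso-8859-2.
$utf8 = iconv('ISO-8859-2', 'UTF-8', "N\xE1zev \xE8\xE1sti obce");

// Tell the client which encoding the body is in, so it does not have to guess.
if (!headers_sent()) {
    header('Content-Type: text/plain; charset=UTF-8');
}

echo $utf8, "\n";
```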

The best way to validate is to save the file to disk, then open it in a text editor like VS Code with the required encoding set beforehand.
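If you would rather not trust any viewer at all, another option is to look at the bytes themselves. The sketch below hex-dumps the result of a correct and a mistaken conversion of the same two illustrative bytes; in UTF-8, č is C4 8D and ř is C5 99, whereas è is C3 A8 and ø is C3 B8.

```php
<?php
$raw = "\xE8\xF8"; // illustrative bytes: č and ř in iso-8859-2

// Decoded with the assumed-correct source encoding.
$right = bin2hex(iconv('ISO-8859-2', 'UTF-8', $raw));

// Decoded with the wrong source encoding, for comparison.
$wrong = bin2hex(iconv('ISO-8859-1', 'UTF-8', $raw));

echo "iso-8859-2 -> UTF-8: $right\n";
echo "iso-8859-1 -> UTF-8: $wrong\n";
```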

If you need further help, you will need to edit your question to include the exact code you're using.
