PHP cannot parse CSV correctly (file is in UTF-16LE)

I am trying to parse a CSV file using PHP.
The file uses commas as the delimiter and double quotes around fields that contain commas, for example:

foo,"bar, baz",foo2

The issue I am facing is that fields containing commas get split. I get:

  • "2
  • rue du ..."

Instead of: 2, rue du ...


Encoding:
The file doesn't seem to be in UTF-8. It has weird characters at the beginning (apparently not a BOM; they look like this when converted from ASCII to UTF-8: ÿþ) and it doesn't display accents.

  • My code editor (Atom) says the encoding is UTF-16 LE
  • mb_detect_encoding() on the CSV lines returns ASCII

But the conversion fails:

  • mb_convert_encoding() converts from ASCII, but returns Asian characters when converting from UTF-16LE
  • iconv() returns Notice: iconv(): Wrong charset, conversion from UTF-16LE / ASCII to UTF8 is not allowed.

Parsing:
I tried to parse with this one-liner (see those 2 comments) using str_getcsv():

$csv = array_map('str_getcsv', file($file['tmp_name']));

I then tried with fgetcsv():

$f = fopen($file['tmp_name'], 'r');
while (($l = fgetcsv($f)) !== false) {
    $arr[] = $l;
}
fclose($f);

Either way, my address field comes out in 2 parts. But when I try this code sample, the fields are parsed correctly:

$str = 'foo,"bar, baz",foo2,azerty,"ban, bal",doe';
$data = str_getcsv($str);
echo '<pre>' . print_r($data, true) . '</pre>';
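One way to see why the literal string behaves differently from the uploaded file is to run the same parser over the same text encoded as UTF-16LE. This is only a sketch: the variable names are illustrative, and the delimiter, enclosure and escape arguments are spelled out explicitly (an assumption made to keep the call unambiguous across PHP versions). In UTF-16LE every ASCII character is followed by a 0x00 byte, so a quoted field no longer starts with the quote byte and the quote is not treated as an enclosure.

```php
<?php
$line = 'foo,"bar, baz",foo2';

// Same text, but as the raw UTF-16LE bytes that a byte-oriented read would return.
$utf16 = mb_convert_encoding($line, 'UTF-16LE', 'UTF-8');

$cleanFields = str_getcsv($line, ',', '"', '\\');
$rawFields   = str_getcsv($utf16, ',', '"', '\\');

// The raw bytes produce more fields because the quoted field is cut at its inner comma.
echo count($cleanFields), ' field(s) from the UTF-8 line', "\n";
echo count($rawFields), ' field(s) from the UTF-16LE bytes', "\n";
```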

To sum up, my questions are:

  • What are the characters at the beginning of the file?
  • How can I be sure about the encoding? (Atom reads the file as UTF-16 LE and doesn't display weird characters at the beginning)
  • What makes the CSV parsing functions fail?
  • If I should rely on something else to parse the lines of the CSV, what could I use?

I finally solved it myself:

I submitted the file to online encoding detection websites, which returned UTF16LE. Reading up on UTF16LE, I found that it has a BOM (Byte Order Mark).
My previous attempts used file(), which returns an array of the lines of a file, and fopen(), which returns a resource, but in both cases the parsing still happens line by line.
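The BOM can also be checked locally instead of through a website. The UTF-16LE byte order mark is the two bytes FF FE, which display as ÿþ when read as ISO-8859-1 or Windows-1252. The sketch below compares the start of a byte string against the common BOM signatures; detectBom() is a hypothetical helper name, not a built-in, and the sample bytes are made up for illustration.

```php
<?php
// Hypothetical helper: name the encoding whose BOM starts $bytes, or null if none matches.
function detectBom(string $bytes): ?string
{
    // Longer signatures come first because the UTF-32LE BOM begins with the UTF-16LE one.
    $signatures = [
        'UTF-32LE' => "\xFF\xFE\x00\x00",
        'UTF-32BE' => "\x00\x00\xFE\xFF",
        'UTF-8'    => "\xEF\xBB\xBF",
        'UTF-16LE' => "\xFF\xFE",
        'UTF-16BE' => "\xFE\xFF",
    ];
    foreach ($signatures as $encoding => $bom) {
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            return $encoding;
        }
    }
    return null;
}

// Illustrative input: a BOM followed by "foo" encoded as UTF-16LE.
$sample = "\xFF\xFE" . mb_convert_encoding('foo', 'UTF-16LE', 'UTF-8');
echo bin2hex(substr($sample, 0, 2)), "\n"; // fffe
echo detectBom($sample), "\n";             // UTF-16LE
```

A BOM check is only a hint: a file without a BOM returns null here and still needs another way to determine its encoding.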

It then occurred to me to convert the whole file at once (every line together) instead of converting each line separately. Here is a working solution:
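A small byte-level sketch of why converting line by line goes wrong, using explode() on the "\n" byte as a stand-in for a byte-oriented line reader such as file() (that equivalence is an assumption of this example). In UTF-16LE a newline is the two bytes 0A 00, so splitting on 0A leaves the 00 at the start of the next piece and every following byte pair is shifted by one.

```php
<?php
// Two short lines encoded as UTF-16LE, standing in for the uploaded file.
$bytes = mb_convert_encoding("ab\ncd", 'UTF-16LE', 'UTF-8');

// Splitting on the single byte "\n" strands the newline's 0x00 on the next piece.
$pieces = explode("\n", $bytes);
echo bin2hex($pieces[0]), "\n"; // 61006200
echo bin2hex($pieces[1]), "\n"; // 0063006400

// Decoding the second piece on its own pairs the bytes off by one,
// which yields code points such as U+6300 instead of "c" and "d".
$broken = mb_convert_encoding($pieces[1], 'UTF-8', 'UTF-16LE');

// Converting the whole buffer first keeps the byte pairs aligned.
$lines = explode("\n", mb_convert_encoding($bytes, 'UTF-8', 'UTF-16LE'));
echo implode('|', $lines), "\n"; // ab|cd
```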

$f = file_get_contents($file['tmp_name']);          // Get the whole file as a string
$f = mb_convert_encoding($f, 'UTF-8', 'UTF-16LE');  // Convert the file to UTF-8
$f = preg_split('/\R/u', $f);                       // Split on line breaks (u: treat input as UTF-8, so \R cannot match inside a multibyte character)
$f = array_map('str_getcsv', $f);                   // Parse each line as CSV data
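Two details may still need handling after the conversion: the BOM can survive as the character U+FEFF at the start of the first field, and a trailing line break produces an empty last line. The sketch below wraps the same steps in a hypothetical helper, parseUtf16LeCsv(), with made-up sample bytes; it assumes no field contains a line break, since the text is still split line by line.

```php
<?php
// Hypothetical helper: turn the raw bytes of a UTF-16LE CSV file into an array of rows.
function parseUtf16LeCsv(string $bytes): array
{
    $text = mb_convert_encoding($bytes, 'UTF-8', 'UTF-16LE');

    // Drop a leading BOM (U+FEFF, i.e. EF BB BF in UTF-8) if it is still there.
    if (strncmp($text, "\xEF\xBB\xBF", 3) === 0) {
        $text = substr($text, 3);
    }

    $rows = [];
    // preg_split() returns false on failure, so fall back to an empty list.
    foreach (preg_split('/\R/u', $text) ?: [] as $line) {
        if ($line === '') {
            continue; // skip blank lines, such as the one after a trailing line break
        }
        $rows[] = str_getcsv($line, ',', '"', '\\');
    }
    return $rows;
}

// Illustrative input: BOM + one CSV line + CRLF, all encoded as UTF-16LE.
$bytes = "\xFF\xFE" . mb_convert_encoding("foo,\"bar, baz\",foo2\r\n", 'UTF-16LE', 'UTF-8');
print_r(parseUtf16LeCsv($bytes));
```

With a real upload, the input would come from file_get_contents($file['tmp_name']) as in the solution above.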

The address fields are no longer split at their internal commas.
