Encoding issue with special characters when reading a CSV file in PHP

So I've got this file ( http://mountainmarathon.ch/components/com_chronoconnectivity6/chronoconnectivity/uploads/20190814194827_classifica-cat-standard-3.csv ) which "should" be encoded in UTF-8. When I try to read the contents via fgetcsv or file_get_contents, I get those black diamonds with question marks for each ä, ö, ü character.

I already know that this is an encoding issue, but as far as I can see everything is / should be UTF-8, and UTF-8 should be able to display ä, ö, ü, right?

I have already checked a lot of possible solutions here but did not find one that works. When I open the file with Notepad++ I get the same strange problem with the diamonds (even when I try to change the encoding; then they change to rectangles). So is it the file?

Nope: when I open the CSV file on my iPhone (inside the Mail app), the special chars ä, ö, ü are displayed correctly.

What I have tried so far are different mb_convert_encoding approaches from various Stack Overflow answers, but none of them worked.

I really think something is not correct with this file, but then why is the iPhone able to render the content correctly?

Can someone with more know-how please check the file and tell me what I can do to import / use its content with PHP and get rid of this encoding issue?

Header is set to UTF-8 via header('Content-Type: text/html; charset=utf-8');

In the terminal, "file -I file" returns UTF-8.

I've tried two servers (my MAMP with PHP 7.3.1 and a web server with PHP 7.x).

I'm sorry, but I am not going to post every link to every question I've checked here and on other platforms over the past three hours. And yes, of course I have already checked plenty of info and comments in the PHP manual (fgetcsv, mb_encode / check, utf8_encode / decode... and so on) but did not find the needle that solves my issue.

Lastly, I've checked my string (from file_get_contents) against this function: https://www.php.net/manual/de/function.mb-check-encoding.php#95289 which returns FALSE.

And now nothing makes sense anymore.

The code to reproduce is very simple:

$content = file_get_contents($url);
var_dump($content);

How can we display the special chars as ä, ö, ü and not as black diamonds with question marks?

Update

Based on your analysis I have checked what exactly happens when the file is saved.

First: I receive the CSV by email and, as far as I can see, it is in ISO-8859-1.

The iOS scenario looks like this: I open the mail in the Mail app and display the CSV directly inside the Mail app --> all fine. Next I exported the file from the Mail app to my OneDrive and opened the file on the phone --> all fine. Now I am able to check the charset on my Mac via file -I, and it is iso-8859-1.

When I now use this file with PHP's utf8_encode --> all is good.
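As a rough illustration of that working path, here is a minimal sketch of converting ISO-8859-1 CSV text to UTF-8 before parsing it. It uses mb_convert_encoding rather than utf8_encode, because utf8_encode is deprecated as of PHP 8.2; both treat the input as ISO-8859-1. The helper names latin1ToUtf8 and parseLatin1Csv are hypothetical, the sample row assumes the damaged name in the file was originally "Brändli", and the mbstring extension is assumed to be available.

```php
<?php
// Hypothetical helper: convert ISO-8859-1 (Latin-1) bytes to UTF-8.
function latin1ToUtf8(string $bytes): string
{
    return mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
}

// Hypothetical helper: parse semicolon-separated CSV text that is encoded
// as ISO-8859-1 and return its rows as arrays of UTF-8 strings.
// Splitting on line breaks is a simplification: it does not support
// quoted fields that contain newlines.
function parseLatin1Csv(string $csv): array
{
    $rows = [];
    foreach (preg_split('/\r\n|\r|\n/', latin1ToUtf8($csv)) as $line) {
        if ($line === '') {
            continue; // skip blank lines
        }
        $rows[] = str_getcsv($line, ';', '"', '\\');
    }
    return $rows;
}

// Assumed sample row; "\xE4" is the single ISO-8859-1 byte for ä.
$sample = "11;1;102;Claudio;Br\xE4ndli;198\n";
$rows = parseLatin1Csv($sample);

echo $rows[0][4], "\n";
```

This only helps when the bytes really are ISO-8859-1, as in the copy exported from the iPhone. A copy in which the umlauts have already been replaced cannot be repaired this way.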

So now I had to understand what went wrong before. Here is the macOS scenario:

I open the (same) mail and save the same source file to my hard drive; a quick check with file -I now gives me UTF-8 as the charset.

On a Windows machine with Outlook: save the file, open it in Notepad, and the characters are replaced: ä=>d, ü=>|, ...

Right now I think that the person who sends us this CSV has to export the file as UTF-8. To me it looks like it is ISO-8859-1 and the computers do some weird stuff while saving the file. Is that possible?

This response may be a bit meandering, but I hope it provides useful info. I'm running these commands on an Ubuntu workstation in a terminal window.

I downloaded the file using Firefox. The response headers did not specify any charset:

$ curl -sSL -D - http://mountainmarathon.ch/components/com_chronoconnectivity6/chronoconnectivity/uploads/20190814194827_classifica-cat-standard-3.csv -o /dev/null
HTTP/1.1 301 Moved Permanently
Server: nginx
Date: Wed, 14 Aug 2019 21:24:00 GMT
Content-Type: text/html
Content-Length: 162
Connection: keep-alive
Keep-Alive: timeout=60
Location: http://www.mountainmarathon.ch/components/com_chronoconnectivity6/chronoconnectivity/uploads/20190814194827_classifica-cat-standard-3.csv
Strict-Transport-Security: max-age=63072000

HTTP/1.1 200 OK
Server: nginx
Date: Wed, 14 Aug 2019 21:24:00 GMT
Content-Type: text/csv
Content-Length: 39626
Connection: keep-alive
Keep-Alive: timeout=60
X-Content-Type-Options: nosniff
Last-Modified: Wed, 14 Aug 2019 19:48:27 GMT
ETag: "9aca-590190a7aa557"
Accept-Ranges: bytes
Strict-Transport-Security: max-age=63072000

If I inspect the beginning of the file, I do indeed see the weird characters you are talking about:

$ head -c 30 20190814194827_classifica-cat-standard-3.csv
11;1;102;Claudio;Br�ndli;198

That first weird character is represented by 3 bytes, ef bf bd:

$ head -c 30 20190814194827_classifica-cat-standard-3.csv | xxd
00000000: 3131 3b31 3b31 3032 3b43 6c61 7564 696f  11;1;102;Claudio
00000010: 3b42 72ef bfbd 6e64 6c69 3b31 3938       ;Br...ndli;198

That byte sequence is the UTF-8 encoding of the Unicode replacement character (U+FFFD, �), i.e., the character used to replace problematic byte sequences. This strongly suggests that the original file itself does not have the chars with umlauts that you want, but rather it contains the replacement character instead.
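The same check can be done from PHP. The following is a minimal sketch, assuming the mbstring extension is available; the helper name inspectUtf8 and the three sample strings are made up for illustration. It reports whether a string is valid UTF-8 and how many replacement characters it contains. A string in which the umlauts were already swapped for U+FFFD is still perfectly valid UTF-8, which would be consistent with file -I reporting UTF-8 for the damaged copy.

```php
<?php
// UTF-8 encoding of U+FFFD REPLACEMENT CHARACTER (the bytes ef bf bd).
const REPLACEMENT_CHAR = "\xEF\xBF\xBD";

// Hypothetical helper: report whether $data is valid UTF-8 and how many
// replacement characters it contains.
function inspectUtf8(string $data): array
{
    return [
        'valid_utf8'        => mb_check_encoding($data, 'UTF-8'),
        'replacement_count' => substr_count($data, REPLACEMENT_CHAR),
    ];
}

$damaged = "Br" . REPLACEMENT_CHAR . "ndli"; // bytes as seen in the xxd dump
$latin1  = "Br\xE4ndli";                     // assumed original, ISO-8859-1
$utf8    = "Br\xC3\xA4ndli";                 // assumed original, UTF-8

foreach (['damaged' => $damaged, 'latin1' => $latin1, 'utf8' => $utf8] as $label => $value) {
    $info = inspectUtf8($value);
    printf(
        "%-8s valid UTF-8: %s, replacement chars: %d\n",
        $label,
        $info['valid_utf8'] ? 'yes' : 'no',
        $info['replacement_count']
    );
}
```

A non-zero replacement count on data fetched with file_get_contents means the information was lost before PHP ever saw it, so no amount of re-encoding on the PHP side can bring the umlauts back.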

I've tried opening this file in a text editor (gedit) and in LibreOffice calc using numerous different encodings and the characters do not appear correctly in any combination of app and encoding that I've tried.

I put those 3 umlaut characters in a string, and none of their encodings matches the 3-byte sequence that is in your file:

$ echo "äöü" | xxd
00000000: c3a4 c3b6 c3bc 0a                        .......

To clarify, I believe a UTF-8 encoding of these characters maps as follows:

ä = c3a4
ö = c3b6
ü = c3bc
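For comparison, here is a small PHP sketch (assuming the mbstring extension is available) that prints the byte sequences of the same three characters in both UTF-8 and ISO-8859-1. The variable names are only illustrative. Neither encoding produces ef bf bd.

```php
<?php
// The three umlauts, written as Unicode escapes so the sketch does not
// depend on the encoding of this source file.
$chars = ["\u{00E4}", "\u{00F6}", "\u{00FC}"]; // ä, ö, ü

$table = [];
foreach ($chars as $char) {
    // Re-encode the UTF-8 character as a single ISO-8859-1 byte.
    $latin1 = mb_convert_encoding($char, 'ISO-8859-1', 'UTF-8');
    $table[] = [bin2hex($char), bin2hex($latin1)];
    printf("%s  UTF-8: %s  ISO-8859-1: %s\n", $char, bin2hex($char), bin2hex($latin1));
}
```

In ISO-8859-1 each umlaut is a single byte (e4, f6, fc). Such a byte on its own is not a valid UTF-8 sequence, which is the kind of input a UTF-8 decoder replaces with U+FFFD.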

I could be wrong here, but I think that remote website might actually contain the UTF-8 replacement character inside it? I wonder if the nginx server that's coughing up the file might be attempting to interpret this file's contents and failing? I tried setting up a PHP script to send accept-charset headers and it still gets the broken chars.

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,"http://www.mountainmarathon.ch/components/com_chronoconnectivity6/chronoconnectivity/uploads/20190814194827_classifica-cat-standard-3.csv");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

$headers = [
    'Accept-Charset: utf-8',
    'Accept-Encoding: gzip, deflate',
    'Accept-Language: en-US,en;q=0.5',
    'Cache-Control: no-cache',
//  'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0',
    'User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 12_4 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.2 Mobile/15E148 Safari/604.1',
];
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);

$server_output = curl_exec ($ch);
file_put_contents("server-output.csv", $server_output);

curl_close ($ch);
echo "DONE\n";

To summarize, I think your original source file has already replaced the chars you want (ä, ö, ü, etc.) with the generic UTF-8 character used to signify a misunderstood byte sequence (�). Either that, or the CSV file is getting munged by the server that is coughing it up for some reason. Can you tell me more about viewing this file on your iPhone? Are you requesting it from that exact URL with your iPhone?
