
How to properly escape double quotes in a malformed json input before php json_encode

I have a large JSON file that I import before processing. I do not control the content of this file, and the JSON is improperly escaped, so some double-quoted strings contain unescaped double quotes. For example:

/...../
"productdiploma" : [],
"productcursus" : ["Méthodes"],
"productpublishedonsite" : false,
"productpublishedonkne" : false,
"productypologyparentality" : [ {
    "productmediaformat" : "01- Livre",
    "producttechformat" : "50- Grand format - "Autre"",
    "productparent" : ""
    }],
"productparentwork" : [ {
    "productparentworkid" : "1000248048",
    "productparentworktitle" : "TRAVAILLER EN FRANCAIS "EN ENTREPRISE" 2007"
/...../

In my import, the file is read as one very large string with file_get_contents(). I probably need a preg_replace() or a preg_filter() there, but I can't quite figure out what pattern I'd need to find those double quotes and escape them as \". Any help or ideas out there?

(See my comment below, in answer to shibon, for why this is not a duplicate.)

I'd suggest you need a different approach here. Loading the file as one large string makes it nearly impossible to know which double quotes are wanted and which ones are not. I'd go with an approach that lets you read the file line by line instead:

<?php
$file = fopen("test.txt", "r");

// fgets() returns false at end of file, which ends the loop.
while (($line = fgets($file)) !== false) {
    // process $line here
}

fclose($file);
?>

This will allow you to test just the right-hand side of each colon, like this:

# explode() takes the separator first; the limit of 2 keeps any colons inside the value
$partToTest = explode(':', $line, 2)[1] ?? '';

For values that have quotes at all (i.e. strings, not arrays or booleans), the quotes should be the first and last characters of the value, ignoring any trailing comma. So you could do something along these lines:

If the part has quotes, remove the first and last (setting aside a trailing comma first):

$partToTest = trim($partToTest);
$comma = substr($partToTest, -1) === ',' ? ',' : ''; # remember a trailing comma
$partToTest = rtrim($partToTest, ',');
if (strlen($partToTest) >= 2 && $partToTest[0] === "\"" && substr($partToTest, -1) === "\"") {
    $partToTest = substr($partToTest, 1, -1); # removes the first and last quotes

Replace any remaining quotes with \"

    $partToTest = str_replace("\"", "\\\"", $partToTest); # escape remaining quotes

Append and prepend new quotes to replace the ones we removed

    $partToTest = "\"" . $partToTest . "\"";

Put the line back together.

    $line = explode(':', $line, 2)[0] . ': ' . $partToTest . $comma;
}

Admittedly, my PHP skills aren't the best, so there may be a much simpler way of doing this, but the principle should work.
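The steps above can be collected into a single function. This is only a minimal sketch: the name escapeInnerQuotes() is hypothetical, and it assumes one key/value pair per line, keys that contain no colon, and values in which none of the inner quotes are already escaped (an existing \" would be escaped a second time). String values nested inside arrays on the same line are left untouched.

```php
<?php
// Hypothetical helper: repairs one line of the pretty-printed export.
function escapeInnerQuotes(string $line): string
{
    $line = rtrim($line, "\r\n");

    // Split only at the first colon, so colons inside the value are kept.
    $parts = explode(':', $line, 2);
    if (count($parts) < 2) {
        return $line; // structural line such as "}],": nothing to repair
    }
    [$key, $rest] = $parts;

    $value = trim($rest);
    // Remember a trailing comma so it can be restored afterwards.
    $comma = substr($value, -1) === ',' ? ',' : '';
    $value = rtrim($value, ',');

    // Only quoted values need repairing; arrays, booleans and numbers pass through.
    if (strlen($value) < 2 || $value[0] !== '"' || substr($value, -1) !== '"') {
        return $line;
    }

    $inner = substr($value, 1, -1);           // drop the outer quotes
    $inner = str_replace('"', '\\"', $inner); // escape the remaining quotes

    return $key . ': "' . $inner . '"' . $comma;
}

$sample = '"productparentworktitle" : "TRAVAILLER EN FRANCAIS "EN ENTREPRISE" 2007"';
echo escapeInnerQuotes($sample), "\n";
// prints: "productparentworktitle" : "TRAVAILLER EN FRANCAIS \"EN ENTREPRISE\" 2007"
```

The function returns each line without its line ending, so the caller is expected to append "\n" when rebuilding the file contents.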

I managed to work out this code, based on @lewis' idea :

// Assumes $infile is a file handle already opened with fopen().
$content = '';
while (($line = fgets($infile)) !== false) {
    $line = trim($line);

    // Test whether the line can be split; otherwise write it as is ({, [, etc.).
    if (strpos($line, '" : "') !== false) {

        // Split the line at quote + space + ':' + space + quote to avoid splitting
        // strings that contain just ' : ' (not foolproof, as a string might still
        // contain '" : "' and be split, but it works in my case).
        $test = explode('" : "', $line);
        // Add back to the key the closing quote that the split just stripped.
        $key = $test[0] . '"';

        // Test whether the line ends with a comma to decide where to cut off the last quote.
        // Checking the last two characters avoids a false hit on an earlier '",' in the value.
        if (substr($test[1], -2) === '",') {
            $val = substr($test[1], 0, -2);
            $comma = ','; // store a comma for later use
        } else {
            $val = substr($test[1], 0, -1);
            $comma = '';
        }
        // No need to remove the first quote; the split already took care of it.

        // Escape the double quotes inside the trimmed string.
        $val = str_replace('"', '\"', trim($val));
        // Reassemble the corrected line.
        $final_line = $key . ' : "' . $val . '"' . $comma . "\n";
    } else {
        $final_line = $line . "\n";
    }

    // Store the line for later treatment.
    // utf8_encode() assumes ISO-8859-1 input and is deprecated as of PHP 8.2.
    $content .= utf8_encode($final_line);
}

That does the job, though it's significantly slower, and there is still room for error in the splitting if the value I want to test itself contains the '" : "' string, but it's a fix anyway :)
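For comparison, here is a sketch of a different technique, closer to the preg_replace() idea in the question: one preg_replace_callback() call in multiline mode over the whole string. It is not the code above, and the function name escapeInnerQuotesRegex() is hypothetical. It assumes one key/value pair per line, keys that contain no double quote, and no quotes that are already escaped. Because the key is matched first and the value runs up to the last quote on the line, a '" : "' inside the value should not cause a wrong split. I have not measured whether it is faster on a large file. The JSON_THROW_ON_ERROR flag needs PHP 7.3 or later.

```php
<?php
// Hypothetical helper: escapes inner quotes of string values, line by line.
function escapeInnerQuotesRegex(string $json): string
{
    // Per line: (indent, key, colon) "value" (optional comma and trailing blanks)
    $pattern = '/^([ \t]*"[^"]*"[ \t]*:[ \t]*)"(.*)"(,?[ \t\r]*)$/m';

    $fixed = preg_replace_callback(
        $pattern,
        function (array $m): string {
            // $m[2] is everything between the outermost quotes of the value.
            return $m[1] . '"' . str_replace('"', '\\"', $m[2]) . '"' . $m[3];
        },
        $json
    );

    if ($fixed === null) {
        throw new RuntimeException('preg_replace_callback failed: ' . preg_last_error());
    }
    return $fixed;
}

// Sample lines taken from the question, wrapped in an object for the demo.
$broken = <<<'JSON'
{
    "producttechformat" : "50- Grand format - "Autre"",
    "productparent" : "",
    "productpublishedonsite" : false,
    "productparentworktitle" : "TRAVAILLER EN FRANCAIS "EN ENTREPRISE" 2007"
}
JSON;

// Decoding the repaired text is the real check that the fix worked.
$data = json_decode(escapeInnerQuotesRegex($broken), true, 512, JSON_THROW_ON_ERROR);
echo $data['producttechformat'], "\n";
// prints: 50- Grand format - "Autre"
```

Whichever repair is used, running the result through json_decode() and inspecting the error is a cheap way to find lines that still need attention.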
