
How to escape only delimiter and not the newline character in CSV

I receive ordinary comma-delimited CSV files whose data contains newline characters.

Input data

I want to convert the input data to:

  1. Pipe (|) delimited
  2. Without any quotes to escape (" or ')
  3. Pipe (|) within data escaped with a caret (^) character

A single record in my file may also span multiple lines (that is, a field may contain embedded newlines).

Expected output data

Output file I was able to generate.

Output data

As you can see in the image, the caret (^) correctly escapes every pipe (|) in the data, but it also escapes the newline characters on the 5th and 6th lines, which I don't want.

NOTE: All carriage return (\r, CR) and newline (\n, LF) characters should stay exactly as they are, as shown in the images.

import csv
import sys

inputPath = sys.argv[1]
outputPath = sys.argv[2]
with open(inputPath, encoding="utf-8") as inputFile:
    with open(outputPath, 'w', newline='', encoding="utf-8") as outputFile:
        reader = csv.DictReader(inputFile, delimiter=',')
        writer = csv.DictWriter(
            outputFile, reader.fieldnames, delimiter='|', quoting=csv.QUOTE_NONE, escapechar='^', doublequote=False, quotechar="")
        writer.writeheader()
        writer.writerows(reader)

print("Formatting complete.")

The code above is written in Python, so help in Python would be ideal. Answers in other programming languages are also welcome.

There are more than 8 million records.
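One possible direction, offered only as a sketch: as far as I can tell, with quoting=csv.QUOTE_NONE the csv writer escapes any character that appears in its lineterminator (the default is "\r\n"), which would explain the carets in front of the embedded newlines. The sketch below keeps csv.reader for parsing but writes each row by hand, so only the pipe is escaped. The function names (escape_field, convert, convert_file) are hypothetical, and two things are assumptions rather than requirements stated above: every output record ends with CRLF, and a caret that already exists in the data is left unescaped.

```python
import csv
import io


def escape_field(value, delimiter="|", escapechar="^"):
    """Prefix every delimiter inside a field with the escape character.

    Newlines and carriage returns are deliberately left untouched.
    Assumption: a caret already present in the data is NOT escaped.
    """
    return value.replace(delimiter, escapechar + delimiter)


def convert(in_stream, out_stream, delimiter="|", escapechar="^",
            lineterminator="\r\n"):
    """Stream records from a comma-delimited CSV to pipe-delimited text.

    Rows are processed one at a time, so memory use does not grow with
    the number of records. Returns the number of rows written,
    including the header row.
    """
    count = 0
    for row in csv.reader(in_stream):
        fields = (escape_field(f, delimiter, escapechar) for f in row)
        out_stream.write(delimiter.join(fields))
        # Assumption: each output record is terminated with CRLF.
        out_stream.write(lineterminator)
        count += 1
    return count


def convert_file(input_path, output_path, encoding="utf-8"):
    """Convert a file on disk; newline='' disables newline translation."""
    with open(input_path, encoding=encoding, newline="") as src, \
            open(output_path, "w", encoding=encoding, newline="") as dst:
        return convert(src, dst)


if __name__ == "__main__":
    sample = '"ID","NAME","ADDRESS"\r\n"1","Vendor | 3","line one\nline two"\r\n'
    out = io.StringIO()
    convert(io.StringIO(sample), out)
    print(repr(out.getvalue()))
```

Opening both files with newline='' matters here: it stops Python from translating the LF inside a field or the CRLF at the end of a record. If carets in the data also need escaping, escape_field would have to replace the caret first and the pipe second.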

Please find below some sample data:

"VENDOR ID","VENDOR NAME","ORGANIZATION NUMBER","ADDRESS 1","CITY","COUNTRY","ZIP","PRIMARY PHONE","FAX","EMAIL","LMS RECORD CREATED DATE","LMS RECORD MODIFY DATE","DELETE FLAG","LMS RECORD ID"
"a0E6D000001Fag8UAC","Test 'Vendor' 1","","This Vendor contains a single (') quote.","","","","","","test@test.com","2020-4-1 06:32:29","2020-4-1 06:34:43","false",""
"a0E6D000001FagDUAS","Test ""Vendor"" 2","","This Vendor contains a double("") quote.","","","","","","test@test.com","2020-4-1 06:33:38","2020-4-1 06:35:18","false",""
"a0E6D000001FagIUAS","Test Vendor | 3","","This Vendor contains a Pipe (|).","","","","","","test@test.com","2020-4-1 06:38:45","2020-4-1 06:38:45","false",""
"a0E6D000001FagNUAS","Test Vendor 4","","This Vendor contains a
carriage return, i.e 
data in new line.","","","","","","test@test.com","2020-4-1 06:43:08","2020-4-1 06:43:08","false",""

NOTE: If you copy the data above, please make sure that the 5th and 6th lines end with LF only (i.e. newline, \n), as shown in the images. Otherwise, please try to replicate those two lines, because this question is specifically about not escaping them, as highlighted in the image below.

The code above is the final outcome of everything I found on the internet. I've even tried the pandas library, and its final output is the same.

The code below is an alternative way to get my expected output, but it has its own problem: when run on 9 million records, the script takes more than 12 hours and still does not finish, so I eventually have to kill the process.

Batch wrapper for the JScript code:

0</* :
    @echo off

        cscript /nologo /E:jscript "%~f0" %*

    exit /b %errorlevel% */0;

        var ARGS = WScript.Arguments;

        if (ARGS.Length < 3 ) {
            WScript.Echo("Wrong arguments");
            WScript.Echo(WScript.ScriptName + " path_to_file search replace [search replace[search replace [...]]]");
            WScript.Echo(WScript.ScriptName + " e?path_to_file search replace [search replace[search replace [...]]]");
            WScript.Echo("if filename starts with \"e?\" search and replace string will be evaluated for special characters ")
            WScript.Quit(1);
        }

        if (ARGS.Item(0).toLowerCase() == "-help" || ARGS.Item(0).toLowerCase() == "-h") {
            WScript.Echo(WScript.ScriptName + " path_to_file search replace [search replace[search replace [...]]]");
            WScript.Echo(WScript.ScriptName + " e?path_to_file search replace [search replace[search replace [...]]]");
            WScript.Echo("if filename starts with \"e?\" search and replace string will be evaluated for special characters ")
            WScript.Quit(0);
        }



        if (ARGS.Length % 2 !== 1 ) {
            WScript.Echo("Wrong arguments");
            WScript.Quit(2);
        }

        var jsEscapes = {
          'n': '\n',
          'r': '\r',
          't': '\t',
          'f': '\f',
          'v': '\v',
          'b': '\b'
        };


        //string evaluation
        //http://stackoverflow.com/questions/24294265/how-to-re-enable-special-character-sequneces-in-javascript

        function decodeJsEscape(_, hex0, hex1, octal, other) {
          var hex = hex0 || hex1;
          if (hex) { return String.fromCharCode(parseInt(hex, 16)); }
          if (octal) { return String.fromCharCode(parseInt(octal, 8)); }
          return jsEscapes[other] || other;
        }

        function decodeJsString(s) {
          return s.replace(
              // Matches an escape sequence with UTF-16 in group 1, single byte hex in group 2,
              // octal in group 3, and arbitrary other single-character escapes in group 4.
              /\\(?:u([0-9A-Fa-f]{4})|x([0-9A-Fa-f]{2})|([0-3][0-7]{0,2}|[4-7][0-7]?)|(.))/g,
              decodeJsEscape);
        }

        function convertToPipe(find, replace, str) {        
          return str.replace(new RegExp('\\|','g'),"^|");
        }

        function removeStartingQuote(find, replace, str) {      
          return str.replace(new RegExp('^"', 'g'), '');
        }

        function removeEndQuote(find, replace, str) {       
          return str.replace(new RegExp('"\r\n$', 'g'), '\r\n');
        }

        function removeLeadingAndTrailingQuotes(find, replace, str) {       
          return str.replace(new RegExp('"\r\n"', 'g'), '\r\n');
        }

        function replaceDelimiter(find, replace, str) {     
          return str.replace(new RegExp('","', 'g'), '|');
        }

        function convertSFDCDoubleQuotes(find, replace, str) {      
          return str.replace(new RegExp('""', 'g'), '"');
        }


      function getContent(file) {
            // :: http://www.dostips.com/forum/viewtopic.php?f=3&t=3855&start=15&p=28898  ::
            var ado = WScript.CreateObject("ADODB.Stream");
            ado.Type = 2;  // adTypeText = 2

            ado.CharSet = "iso-8859-1";  // code page with minimum adjustments for input
            ado.Open();
            ado.LoadFromFile(file);

            var adjustment = "\u20AC\u0081\u201A\u0192\u201E\u2026\u2020\u2021" +
                             "\u02C6\u2030\u0160\u2039\u0152\u008D\u017D\u008F" +
                             "\u0090\u2018\u2019\u201C\u201D\u2022\u2013\u2014" +
                             "\u02DC\u2122\u0161\u203A\u0153\u009D\u017E\u0178" ;


            var fs = new ActiveXObject("Scripting.FileSystemObject");
            var size = (fs.getFile(file)).size;

            var lnkBytes = ado.ReadText(size);
            ado.Close();
            var chars=lnkBytes.split('');
            for (var indx=0;indx<size;indx++) {
                if ( chars[indx].charCodeAt(0) > 255 ) {
                   chars[indx] = String.fromCharCode(128 + adjustment.indexOf(chars[indx]));
                }
            }
            return chars.join("");
       }

       function writeContent(file,content) {
            var ado = WScript.CreateObject("ADODB.Stream");
            ado.Type = 2;  // adTypeText = 2
            ado.CharSet = "iso-8859-1";  // right code page for output (no adjustments)
            //ado.Mode=2;
            ado.Open();

            ado.WriteText(content);
            ado.SaveToFile(file, 2);
            ado.Close();    
       }

        if (typeof String.prototype.startsWith != 'function') {
          // see below for better implementation!
          String.prototype.startsWith = function (str){
            return this.indexOf(str) === 0;
          };
        }


        var evaluate=false;
        var filename=ARGS.Item(0);
        if(filename.toLowerCase().startsWith("e?")) {
            filename=filename.substring(2,filename.length);
            evaluate=true;
        }
        var content=getContent(filename);
        var newContent=content;
        var find="";
        var replace="";

        for (var i=1;i<ARGS.Length-1;i=i+2){
            find=ARGS.Item(i);
            replace=ARGS.Item(i+1);
            if(evaluate){
                find=decodeJsString(find);
                replace=decodeJsString(replace);
            }
            newContent=convertToPipe(find,replace,newContent);
            newContent=removeStartingQuote(find,replace,newContent);        
            newContent=removeEndQuote(find,replace,newContent);
            newContent=removeLeadingAndTrailingQuotes(find,replace,newContent);
            newContent=replaceDelimiter(find,replace,newContent);       
            newContent=convertSFDCDoubleQuotes(find,replace,newContent);        
        }

        writeContent(filename,newContent);

Execution Steps:

> replace.bat <file_name or full_path_to_file> "." "."

This batch file is meant as a general-purpose tool for manipulating any file according to our requirements.

I put it together from a lot of Google searches. It is still a work in progress, as I've hardcoded my regular expressions in the file. You can change the functions I've written to suit your needs, or write your own by copying an existing function and calling it at the end.

Another alternative for what I want to achieve is this Windows PowerShell script.

((Get-Content -path $args[0] -Raw) -replace '\|', '^|') | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace '^"', '') | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace "`"\r\n$", "") | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace '"\r\n"', "`r`n") | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace '","', '|') | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace '""', '"' ) | Set-Content -Path $args[0]

Execution Ways:

  1. Using Powershell

    replace.ps1 '< path_to_file >'

  2. Using a Batch Script

    C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe -ExecutionPolicy ByPass -command "& '< path_to_ps_script >\replace.ps1' '< path_to_csv_file >.csv'"

NOTE: PowerShell 5.0 or later is required.

This can process 1 million records in a minute or so.

What I've figured out is that we have to split bulky CSV files into multiple files of 1 million records each and then process them all separately.

Please correct me if I'm wrong, or if there is another alternative.
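If splitting does turn out to be necessary, the split has to happen on record boundaries rather than on physical lines. Otherwise a record with embedded newlines could be cut in half. A minimal Python sketch of record-aware batching follows. The names iter_batches and split_records are hypothetical, and repeating the header row in every chunk is an assumption made for illustration.

```python
import csv
import io
import itertools


def iter_batches(rows, batch_size):
    """Yield lists of at most batch_size rows from any iterable of rows."""
    iterator = iter(rows)
    while True:
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def split_records(in_stream, batch_size=1_000_000):
    """Yield chunks of parsed records, each starting with the header row.

    csv.reader does the parsing, so a record whose fields contain
    embedded newlines always stays whole inside one chunk.
    """
    reader = csv.reader(in_stream)
    header = next(reader, None)
    if header is None:
        return  # empty input: nothing to yield
    for batch in iter_batches(reader, batch_size):
        # Assumption: every chunk repeats the header row.
        yield [header] + batch


if __name__ == "__main__":
    sample = '"H1","H2"\r\n"a","x\ny"\r\n"b","z"\r\n"c","w"\r\n'
    for number, chunk in enumerate(split_records(io.StringIO(sample), 2)):
        print(number, chunk)
```

Each chunk holds at most batch_size records in memory at a time, and could then be written to its own file or processed directly.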
