Regex: Remove all commas inside a quoted string [python]

What would be an appropriate regex to remove all commas inside the quoted parts of a string such as this:

12, 1425073747, "test", "1, 2, 3, ... "

Result:

12, 1425073747, "test", "1 2 3 ... "

What I have that matches correctly:

"((\d+), )+\d+"

However, I obviously can't replace this with $1 $2, because the repeated group only keeps its last capture. I can't use "\\d+, \\d+" either, because it will also match 12, 1425073747, which is not what I want. If someone can explain how to recursively parse out the values, that would be appreciated as well.

This should work for you:

>>> import re
>>> text = '12, 1425073747, "test", "1, 2, 3, ... "'
>>> print(re.sub(r'(?!(([^"]*"){2})*[^"]*$),', "", text))
12, 1425073747, "test", "1 2 3 ... "

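The negative lookahead (?!(([^"]*"){2})*[^"]*$) lets a comma match only when it is inside quotes: it rejects any comma that is followed by an even number of quotes up to the end of the string, which is the case for every comma outside a quoted part. This assumes the quotes in the input are balanced.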
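To see which commas the lookahead accepts, the following sketch lists the match positions for the example string. The names INSIDE_QUOTES_COMMA and commas_inside_quotes are made up for illustration, and balanced quotes are assumed.

```python
import re

# Hypothetical name; the pattern itself is the one used in the answer.
INSIDE_QUOTES_COMMA = re.compile(r'(?!(([^"]*"){2})*[^"]*$),')

def commas_inside_quotes(text):
    """Return the index of every comma the pattern matches."""
    return [m.start() for m in INSIDE_QUOTES_COMMA.finditer(text)]

text = '12, 1425073747, "test", "1, 2, 3, ... "'
for pos in commas_inside_quotes(text):
    # An odd number of quotes after the comma means it sits inside a quoted part.
    print(pos, text[pos:].count('"'))
```

Every reported comma is followed by an odd number of quotes. Note that the lookahead rescans the rest of the string each time it is tried, so the work grows much faster than linearly on long inputs.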

You can use re.sub with the simple pattern r'"[^"]*"' and pass a callable as the replacement argument. The callable receives each match object, so you can manipulate the matched text further:

import re
text = '12, 1425073747, "test", "1, 2, 3, ... "'
print(re.sub(r'"[^"]*"', lambda x: x.group().replace(",", ""), text))


If the string between quotes may contain escaped quotes, use:

re.sub(r'(?s)"[^"\\]*(?:\\.[^"\\]*)*"', lambda x: x.group().replace(",", ""), text)

Here, (?s) is the inline version of a re.S / re.DOTALL flag and the rest is the double quoted string literal matching pattern.
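A small sketch of the escaped-quote pattern in action. The helper name strip_commas_in_quotes and the sample string are invented for this example; the pattern is the one shown above.

```python
import re

# Double-quoted string literal that may contain backslash escapes such as \".
QUOTED = re.compile(r'(?s)"[^"\\]*(?:\\.[^"\\]*)*"')

def strip_commas_in_quotes(text):
    """Remove commas that occur inside double-quoted substrings."""
    return QUOTED.sub(lambda m: m.group().replace(",", ""), text)

# Raw string, so the backslashes reach the regex engine unchanged.
sample = r'1, "a, \"b, c\", d", 2'
print(strip_commas_in_quotes(sample))  # 1, "a \"b c\" d", 2
```

With the simpler r'"[^"]*"' pattern, the first escaped quote would be treated as the end of the quoted string, so the span would be cut short.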

Bonus

  • Remove all whitespace in between double quotes: re.sub(r'"[^"]*"', lambda x: ''.join(x.group().split()), text)
  • Remove all non-digit chars inside double quotes (the quotes themselves are non-digits, so they are removed too): re.sub(r'"[^"]*"', lambda x: ''.join(c for c in x.group() if c.isdigit()), text)
  • Remove all digit chars inside double quotes: re.sub(r'"[^"]*"', lambda x: ''.join(c for c in x.group() if not c.isdigit()), text)
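The variations in this list share one shape: match the quoted span, then transform the matched text. A possible generalization, assuming you want to keep the surrounding quotes, captures only the contents and re-adds the quotes. The helper name sub_in_quotes is hypothetical.

```python
import re

def sub_in_quotes(text, transform):
    """Apply `transform` to the contents of each "..." span, keeping the quotes."""
    return re.sub(r'"([^"]*)"', lambda m: '"' + transform(m.group(1)) + '"', text)

text = '12, 1425073747, "test", "1, 2, 3, ... "'

# Remove whitespace inside quotes.
print(sub_in_quotes(text, lambda s: "".join(s.split())))
# 12, 1425073747, "test", "1,2,3,..."

# Keep only digits inside quotes; the quotes survive because they are re-added.
print(sub_in_quotes(text, lambda s: "".join(c for c in s if c.isdigit())))
# 12, 1425073747, "", "123"
```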

The solution offered by anubhava was very useful. In fact, it was the only one among the guides I found that worked, meaning that it reliably removed the semicolons in quoted text. However, running it on a 640 kB text file (yes, 640) took about 3 minutes, which was not acceptable even on an oldish i5.

The solution for me was to implement a C++ function:

#include <cstdlib>
#include <cstring>

// Returns a newly malloc'ed copy of s in which every ';' inside double
// quotes is replaced by ','. The caller owns the returned buffer.
extern "C" // C linkage so that ctypes can find the symbol by name
    const char *
    erasesemi(const char *s)
{
    bool WeAreIn = false; // true while we are between double quotes
    size_t sl = strlen(s);
    char *r = (char *) malloc(sl + 1);
    if (r == NULL)
    {
        return NULL;
    }
    for (size_t i = 0; i < sl; i++)
    {
        if (s[i] == '"')
        {
            WeAreIn = !WeAreIn;
        }
        if ((s[i] == ';') && WeAreIn)
        {
            r[i] = ',';
        }
        else
        {
            r[i] = s[i];
        }
    }
    r[sl] = '\0';
    return r;
}

Based on what I found online, I used this setup.py:

from setuptools import setup, Extension

# Compile *erasesemi.cpp* into a shared library
setup(
    # ...
    ext_modules=[Extension('erasesemi', ['erasesemi.cpp'],), ],
)

After that, you have to run:

python3 setup.py build

The relevant lines in the main code were:

import ctypes
import glob
libfile = glob.glob(
    'build/lib.linux-x86_64-3.8/erasesemi.cpython-38-x86_64-linux-gnu.so')[0]

mylib = ctypes.CDLL(libfile)

mylib.erasesemi.restype = ctypes.c_char_p
mylib.erasesemi.argtypes = [ctypes.c_char_p]

..

data3 = mylib.erasesemi(str(data2).encode('latin-1'))

With this setup, it produced the desired result in under 1 second. The trickiest part was finding out how to pass strings with German characters to the C++ function. Naturally, you can use any encoding you want, as long as the quote and semicolon characters are encoded as single bytes.
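For comparison, the same toggle-on-quote scan can be sketched in pure Python, without a compiled extension. This is only an illustration under the same assumptions as the C++ function (balanced quotes, no escaped quotes); the function names are made up and no timings were measured for it.

```python
import re

def replace_semicolons_in_quotes(text, replacement=","):
    """Single pass: replace ';' with `replacement` while between double quotes."""
    out = []
    in_quotes = False
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes
        out.append(replacement if (ch == ";" and in_quotes) else ch)
    return "".join(out)

def replace_semicolons_in_quotes_re(text, replacement=","):
    """Same result using the callback approach from the earlier answer."""
    return re.sub(r'"[^"]*"', lambda m: m.group().replace(";", replacement), text)

sample = 'a;b;"c;d";e'
print(replace_semicolons_in_quotes(sample))     # a;b;"c,d";e
print(replace_semicolons_in_quotes_re(sample))  # a;b;"c,d";e
```

Both variants visit each character a bounded number of times, so they avoid the repeated rescanning that the lookahead-based pattern performs.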
