What would be an appropriate regex to remove all commas inside the quoted parts of a string such as:
12, 1425073747, "test", "1, 2, 3, ... "
Result:
12, 1425073747, "test", "1 2 3 ... "
What I have that matches correctly:
"((\d+), )+\d+"
However, I obviously can't replace this with $1 $2. I can't use "\\d+, \\d+" because it will also match 12, 1425073747, which is not what I want. If someone could explain how to recursively parse out values, that would be appreciated as well.
This should work for you:
>>> import re
>>> text = '12, 1425073747, "test", "1, 2, 3, ... "'
>>> print(re.sub(r'(?!(([^"]*"){2})*[^"]*$),', "", text))
12, 1425073747, "test", "1 2 3 ... "
(?!(([^"]*"){2})*[^"]*$)
is a negative lookahead that lets the comma match only when it is inside quotes: the match is rejected whenever an even number of quotes follows the comma.
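To make the quote-parity idea concrete, here is a minimal sketch that counts the quotes after each comma explicitly and checks that the lookahead regex picks the same commas. The helper name commas_inside_quotes is made up for this illustration, and the sketch assumes quotes are balanced and never escaped.

```python
import re

def commas_inside_quotes(text):
    """Return the indexes of commas followed by an odd number of double quotes."""
    return [i for i, ch in enumerate(text)
            if ch == "," and text.count('"', i) % 2 == 1]

text = '12, 1425073747, "test", "1, 2, 3, ... "'
pattern = re.compile(r'(?!(([^"]*"){2})*[^"]*$),')

# The regex should match exactly the commas found by the explicit count.
regex_hits = [m.start() for m in pattern.finditer(text)]
assert regex_hits == commas_inside_quotes(text)
print(regex_hits)
```

Both approaches rescan the rest of the string for every comma, so the work grows roughly quadratically with the input length.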
You may use re.sub with a simple r'"[^"]*"' regex and pass a callable as the replacement argument. The callable receives the match object, which you may manipulate further:
import re
text = '12, 1425073747, "test", "1, 2, 3, ... "'
print( re.sub(r'"[^"]*"', lambda x: x.group().replace(",", ""), text) )
If the string between quotes may contain escaped quotes, use
re.sub(r'(?s)"[^"\\]*(?:\\.[^"\\]*)*"', lambda x: x.group().replace(",", ""), text)
Here, (?s) is the inline version of the re.S / re.DOTALL flag, and the rest is the double-quoted string literal matching pattern.
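As a small illustration of the escaped-quote pattern, the sketch below runs it on a sample string containing backslash-escaped quotes. The sample string and the helper name strip_commas_in_quotes are assumptions made for this example only.

```python
import re

# Double-quoted string literal that may contain backslash escapes such as \"
QUOTED = re.compile(r'(?s)"[^"\\]*(?:\\.[^"\\]*)*"')

def strip_commas_in_quotes(text):
    """Remove commas only inside double-quoted literals."""
    return QUOTED.sub(lambda m: m.group().replace(",", ""), text)

sample = r'1, "a, \"b, c\", d", 2'
print(strip_commas_in_quotes(sample))
# 1, "a \"b c\" d", 2
```

The escaped quotes do not end the literal, so every comma between the outer quotes is removed while the commas outside stay in place.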
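Bonus
# remove all whitespace inside double quotes
re.sub(r'"[^"]*"', lambda x: ''.join(x.group().split()), text)
# keep only the digits of each quoted substring (the quotes themselves are dropped too)
re.sub(r'"[^"]*"', lambda x: ''.join(c for c in x.group() if c.isdigit()), text)
# remove the digits inside double quotes
re.sub(r'"[^"]*"', lambda x: ''.join(c for c in x.group() if not c.isdigit()), text)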
The solution offered by anubhava was very useful; in fact, it was the only one among the guides I found that worked, meaning that it reliably removed the semicolons in quoted text. However, running it on a 640 kB text file (yes, 640) took about 3 minutes, which was not acceptable even on an oldish i5.
The solution for me was to implement a C++ function:
#include <cstdlib>   // malloc
#include <cstring>   // strlen

// extern "C" disables C++ name mangling so that ctypes can find the symbol
extern "C" const char *erasesemi(const char *s)
{
    bool inQuotes = false;
    size_t len = strlen(s);
    // note: the buffer is allocated here and never freed by the caller
    char *r = (char *) malloc(len + 1);
    if (r == NULL)
    {
        return NULL;
    }
    for (size_t i = 0; i < len; i++)
    {
        if (s[i] == '"')
        {
            inQuotes = !inQuotes;
        }
        // replace a semicolon inside quotes with a comma, copy everything else
        r[i] = (s[i] == ';' && inQuotes) ? ',' : s[i];
    }
    r[len] = '\0';
    return r;
}
Based on what I found online, I used this setup.py:
from setuptools import setup, Extension
# Compile erasesemi.cpp into a shared library
setup(
    # ...
    ext_modules=[Extension('erasesemi', ['erasesemi.cpp'])],
)
After that, you have to run
python3 setup.py build
The relevant lines in the main code were:
import ctypes
import glob
libfile = glob.glob(
'build/lib.linux-x86_64-3.8/erasesemi.cpython-38-x86_64-linux-gnu.so')[0]
mylib = ctypes.CDLL(libfile)
mylib.erasesemi.restype = ctypes.c_char_p
mylib.erasesemi.argtypes = [ctypes.c_char_p]
..
data3 = mylib.erasesemi(str(data2).encode('latin-1'))
With this, the desired result was produced in under 1 second. The trickiest part was finding out how to pass strings with German characters to the C++ function. Naturally, you can use any encoding you want.
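For comparison, the same single-pass scan can be sketched in pure Python, without a compiled extension. This is not the code from the answer above: the function name remove_in_quotes is made up for this sketch, it drops the character instead of replacing it with a comma, it assumes quotes are never escaped, and it has not been benchmarked against the C++ version.

```python
def remove_in_quotes(text, char=","):
    """Drop every occurrence of `char` that sits inside double quotes."""
    out = []
    in_quotes = False
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes   # toggle on every double quote
        if in_quotes and ch == char:
            continue                    # skip the unwanted character
        out.append(ch)
    return "".join(out)

print(remove_in_quotes('12, 1425073747, "test", "1, 2, 3, ... "'))
# 12, 1425073747, "test", "1 2 3 ... "
```

Each character is visited once, so the running time grows linearly with the input instead of rescanning the rest of the string for every match.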