Python 3.7.4
I have a *.csv file that contains numerous instances of the character string
High School
and numerous instances of the hexadecimal byte pair
C3 82
which I'd like to remove.
import re

def findem(fn, patt):
    p = re.compile(patt)
    # Text mode: each line is decoded to str before it is searched.
    with open(fn, newline='\n') as fp:
        for line in fp:
            if p.search(line):
                print('found {0}'.format(line))

fn_inn = "Contacts_prod.csv"
patt_hs = "High School"
patt_C382 = r'\\xC3\\x82'
print('trying patt_hs')
findem(fn_inn, patt_hs)    # <------- finds all rows containing High School, great
print('trying patt_C382')
findem(fn_inn, patt_C382)  # <------- doesn't find anything and should
As written, it should print the rows that contain the pattern. With patt = "High School" everything works as expected. With patt = r'\xc3\x82' nothing is found.
Any ideas?
The trick was to 1) stop thinking in terms of finding and displaying each occurrence and remember that the goal is to remove all occurrences, and 2) think in terms of binary (bytes) rather than text. Then it became simple, but with some subtleties:
import re

def findem(patt):
    p = re.compile(patt)
    with open(fn_out, 'wb') as fp_out:       # binary output
        with open(fn_inn, 'rb') as fp_inn:   # binary input
            slurp_i = fp_inn.read()          # slurp_i is of type bytes
            slurp_o = p.sub(b'', slurp_i)    # notice the b'', very subtle
            fp_out.write(slurp_o)

fn_inn = "Contacts_prod.csv"
fn_out = "Contacts_prod.fixed.dat"
patt = b'\xC3\x82'  # notice the b'' instead of r'', very subtle
findem(patt)
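For anyone wondering why the text-mode search came up empty, here is a minimal sketch. It assumes the file decodes as UTF-8 (the platform default encoding may differ), in which case the byte pair C3 82 becomes the single character U+00C2 and never appears as two separate characters. The sample row and the helper count_matches are made up for illustration; they are not taken from the real file.

```python
import re

# Hypothetical sample row: "High School" followed by the stray bytes C3 82.
raw = b'Central High School\xc3\x82,Springfield\n'

# What a text-mode read would hand to the regex, assuming UTF-8 decoding:
# the two bytes C3 82 collapse into one character, U+00C2.
text = raw.decode('utf-8')

def count_matches(pattern, data):
    """Count non-overlapping matches of pattern in data (str or bytes)."""
    return len(re.findall(pattern, data))

# Searches for a literal backslash followed by "xC3", which is not in the text.
text_hits_escaped = count_matches(r'\\xC3\\x82', text)

# Searches for the two characters U+00C3 and U+0082, also not in the text.
text_hits_raw = count_matches(r'\xc3\x82', text)

# Searches for the decoded character itself.
text_hits_char = count_matches('\u00c2', text)

# Searches the undecoded bytes with a bytes pattern.
bytes_hits = count_matches(b'\xc3\x82', raw)

# For a fixed byte sequence, bytes.replace does the removal without a regex.
cleaned = raw.replace(b'\xc3\x82', b'')

print(text_hits_escaped, text_hits_raw, text_hits_char, bytes_hits)
print(cleaned)
```

Under that assumption, only the last two searches find anything, which is why switching to 'rb' and a b'' pattern did the job.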
Thanks to all that responded. All Hail SO!
Still-learning Steve