Regex not finding a specific pair of hexadecimal characters

python 3.7.4

I've a *.csv that contains numerous instances of the character string

High School

and numerous instances of the hexadecimal-pair

C3 82

which I'd like to remove.

import re

def findem( fn, patt):
  p = re.compile(patt)
  # text mode: each line is decoded to str before the search
  with open( fn, newline = '\n') as fp:
    for line in fp:
      m = p.search( line)
      if m:
        print('found {0}'.format(line))

fn_inn = "Contacts_prod.csv"

patt_hs   = "High School"
patt_C382 = r'\xC3\x82'

print('trying patt_hs')
findem( fn_inn, patt_hs)    # <------- finds all rows containing High School, great

print('trying patt_C382')
findem( fn_inn, patt_C382)  # <------- doesn't find anything, but should

As written it should print out which rows contain the pattern. With patt = "High School" everything works as expected. With patt = r'\xc3\x82' nothing gets found.

Any ideas?

The trick was to 1) stop thinking in terms of finding and displaying each occurrence, and remember that the goal is to remove all occurrences, and 2) think in terms of binary. Then it became simple, but with some subtleties:

import re

def findem( patt):
  p = re.compile(patt)
  with open( fn_out, 'wb') as fp_out:   # binary output
    with open( fn_inn, 'rb') as fp_inn: # binary input
      slurp_i = fp_inn.read()           # slurp_i is of type bytes
      slurp_o = p.sub( b'', slurp_i)    # notice the b'' , very subtle
      fp_out.write( slurp_o)

fn_inn = "Contacts_prod.csv"
fn_out = "Contacts_prod.fixed.dat"

patt = re.compile(b'\xC3\x82')         # notice the b'' instead of r'', very subtle
findem( patt)
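
For anyone wondering why the text-mode search came up empty, here is a small in-memory sketch. The sample row is made up, and it assumes the text is decoded as UTF-8, in which case the two bytes C3 82 become the single character U+00C2, so a str pattern looking for U+00C3 followed by U+0082 has nothing to match. A bytes pattern is compared against the raw bytes and finds the pair.

```python
import re

# Made-up sample row containing the C3 82 byte pair (not taken from the real CSV).
raw = b"Lincoln\xc3\x82 High School\n"

# Text view: under UTF-8 the two bytes decode to ONE character, U+00C2.
text = raw.decode("utf-8")
str_match = re.search(r"\xC3\x82", text)   # searches for U+00C3 then U+0082

# Binary view: a bytes pattern is matched against the raw bytes themselves.
bytes_match = re.search(b"\xC3\x82", raw)

# Removing the pair works the same way as in findem(): bytes in, bytes out.
cleaned = re.sub(b"\xC3\x82", b"", raw)

print(str_match)                 # no match on the decoded text
print(bytes_match is not None)   # the bytes pattern does match
print(cleaned)
```

Note that a bytes pattern can only be applied to bytes and a str pattern only to str, which is why both the pattern and the replacement need the b'' prefix.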

Thanks to all that responded. All Hail SO!

Still-learning Steve
