Regex + Python to remove specific leading and trailing characters from a value in a tab-delimited file

It's been years (and years) since I've done any regex, so turning to experts on here since it's likely a trivial exercise :)

I have a tab-delimited file, and on each line certain fields have values such as:

  • foo
  • bar
  • b"foo's bar"
  • b'bar foo'
  • b'carbar'

A complete line in the file might look something like:

123\t b'bar foo' \tabc\t123\r\n

I want to get rid of all the leading b', b" and trailing ", ' from that field on every line. So given the example line above, after running the regex, I'd get:

123\t bar foo \tabc\t123\r\n

Bonus points if you can give me the python blurb to run this over the file.

(^|\t)b["'] should match the leading part, and for the trailing part:

["'](\t|$) should do it.

In Python, you do:

import re
r1 = re.compile("(^|\t)b[\"']")
r2 = re.compile("[\"'](\t|$)")

then just use

# sub() returns a new string, so chain the two calls
result = r1.sub("\\1", yourString)
result = r2.sub("\\1", result)
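
A minimal sketch of the same idea applied to the sample line from the question. It assumes the spaces around the field in that line are really present in the data, so the two patterns are widened to allow (and keep) that padding; the names `lead`, `trail` and `strip_bytes_repr` are made up for illustration.

```python
import re

# Leading b' or b" at the start of a field, allowing (and keeping) padding spaces.
lead = re.compile(r"""(^|\t)( *)b["']""")
# Trailing ' or " at the end of a field; \r? lets this work before a \r\n ending.
trail = re.compile(r"""["']( *)(\t|\r?$)""")

def strip_bytes_repr(line):
    # Each sub() returns a new string, so feed the first result into the second.
    return trail.sub(r"\1\2", lead.sub(r"\1\2", line))

print(repr(strip_bytes_repr("123\t b'bar foo' \tabc\t123\r\n")))
# '123\t bar foo \tabc\t123\r\n'
```

One caveat: the trailing pattern removes a quote at the end of any field, including fields that never had the `b` prefix.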

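For each line you can use the following (the lazy `.*?` keeps a match from running across several quoted fields on the same line):

re.sub(r'''(?<![^\t\n])\W*b(["'])(.*?)\1\W*?(?![^\t\n])''', r'\2', line)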

and for bonus points:

import re

pattern = re.compile(r'''(?<![^\t\n])\W*b(["'])(.*?)\1\W*?(?![^\t\n])''')
# open both files in one with-statement so both are closed afterwards
with open('infile') as infile, open('outfile', 'w') as outfile:
    for line in infile:
        outfile.write(pattern.sub(r'\2', line))
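
To see what this pattern does on the sample line from the question, a quick check along these lines may help. Because of the `\W*` on both sides, the padding spaces around the field are removed together with the quotes, which differs slightly from the output asked for in the question.

```python
import re

pattern = re.compile(r'''(?<![^\t\n])\W*b(["'])(.*?)\1\W*?(?![^\t\n])''')

# Sample line from the question, with real tab and CRLF characters.
sample = "123\t b'bar foo' \tabc\t123\r\n"
cleaned = pattern.sub(r'\2', sample)
print(repr(cleaned))  # '123\tbar foo\tabc\t123\r\n'
```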

Or, without a regex:

>>> "b\"foo's bar\"".replace('b"',"").replace("b'","").rstrip("\"'")
"foo's bar"
>>> "b'bar foo'".replace('b"',"").replace("b'","").rstrip("\"'")
'bar foo'
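
One thing to watch with the chained `replace` calls: they remove `b'` wherever it occurs, so a value like `crab's` would lose characters. A rough sketch of a stricter non-regex variant splits each line on tabs and only touches fields that look like `b'...'` or `b"..."` as a whole. The helper names `clean_field` and `clean_line` are made up here, and the sketch assumes any padding around a field is plain spaces.

```python
def clean_field(field):
    core = field.strip(" ")
    # Only touch values that start with b' or b" and end with the same quote.
    if len(core) >= 3 and core[0] == "b" and core[1] in "'\"" and core[-1] == core[1]:
        # Swap the quoted form for its contents, keeping the padding spaces.
        return field.replace(core, core[2:-1], 1)
    return field

def clean_line(line):
    # Set the line ending aside so it is not treated as part of the last field.
    body = line.rstrip("\r\n")
    ending = line[len(body):]
    return "\t".join(clean_field(f) for f in body.split("\t")) + ending

print(repr(clean_line("123\t b'bar foo' \tabc\t123\r\n")))
# '123\t bar foo \tabc\t123\r\n'
```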
