
Regex + Python to remove specific trailing and ending characters from value in tab delimited file

It's been years (and years) since I've done any regex, so turning to experts on here since it's likely a trivial exercise :)

I have a tab delimited file, and on each line certain fields have values such as:

  • foo
  • bar
  • b"foo's bar"
  • b'bar foo'
  • b'carbar'

A complete line in the file might be something like:

123\t b'bar foo' \tabc\t123\r\n

I want to get rid of all the leading b', b" and trailing ", ' from that field on every line. So given the example line above, after running the regex, I'd get:

123\t bar foo \tabc\t123\r\n

Bonus points if you can give me the Python blurb to run this over the file.

(^|\t)b[\"'] should match the leading part, and for the trailing part:

[\"'](\t|$) should do it.

In Python, you do:

import re
r1 = re.compile("(^|\t)b[\"']")
r2 = re.compile("[\"'](\t|$)")

Then apply both in sequence, keeping the result of each substitution:

yourString = r1.sub("\\1", yourString)
yourString = r2.sub("\\1", yourString)
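Putting the two patterns together, a minimal sketch might look like the following. The function name strip_wrappers and the sample line are made up for illustration. The sketch assumes the line ending has already been removed and that the quotes sit directly next to the tab delimiters, with no padding spaces.

```python
import re

# Leading b' or b" at the start of the line or right after a tab.
leading = re.compile(r"""(^|\t)b["']""")
# Trailing ' or " right before a tab or at the end of the string.
trailing = re.compile(r"""["'](\t|$)""")

def strip_wrappers(line):
    """Apply both substitutions in sequence, keeping the captured delimiter."""
    line = leading.sub(r"\1", line)
    return trailing.sub(r"\1", line)

# Hypothetical sample line, without its \r\n terminator.
sample = "123\tb'bar foo'\tabc\t123"
print(repr(strip_wrappers(sample)))  # '123\tbar foo\tabc\t123'
```

Because $ does not match in front of \r\n, a line read with its terminator still attached would keep the trailing quote on its last field, so stripping the terminator first is the safer order.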

For each line you can use:

re.sub(r'''(?<![^\t\n])\W*b(["'])(.*)\1\W*(?![^\t\n])''', r'\2', line)

And for bonus points:

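import re

pattern = re.compile(r'''(?<![^\t\n])\W*b(["'])(.*?)\1\W*?(?![^\t\n])''')
# newline='' leaves the original \r\n line endings untouched in both files.
with open('infile', newline='') as infile, open('outfile', 'w', newline='') as outfile:
    for line in infile:
        outfile.write(pattern.sub(r'\2', line))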
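As a quick sanity check, the same pattern can be tried on an in-memory string before running it over a whole file. The sample below is the example line from the question, with the padding spaces around the field kept. The variable names sample and cleaned are illustrative only.

```python
import re

pattern = re.compile(r'''(?<![^\t\n])\W*b(["'])(.*?)\1\W*?(?![^\t\n])''')

# Example line from the question, including the spaces around the field.
sample = "123\t b'bar foo' \tabc\t123\r\n"
cleaned = pattern.sub(r'\2', sample)
print(repr(cleaned))  # '123\tbar foo\tabc\t123\r\n'
```

Note that this pattern also removes the padding spaces around the field, since \W* matches them. For the same reason it can swallow a tab when the preceding field is empty, so it is worth testing against real data first.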

Plain string methods also work on a single field value:

>>> "b\"foo's bar\"".replace('b"',"").replace("b'","").rstrip("\"'")
"foo's bar"
>>> "b'bar foo'".replace('b"',"").replace("b'","").rstrip("\"'")
'bar foo'
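A field-by-field variant of the string-method idea may be easier to reason about than one regex over the whole line. This is a sketch of a different technique, not something taken from the answers above. The helper names unwrap and clean_line are hypothetical, and it assumes the fields carry no padding spaces around the b'...' wrapper.

```python
def unwrap(field):
    """Return field without a surrounding b'...' or b"..." wrapper, if present."""
    for quote in ("'", '"'):
        # Only unwrap when the same quote opens and closes the value.
        if len(field) >= 3 and field.startswith("b" + quote) and field.endswith(quote):
            return field[2:-1]
    return field

def clean_line(line):
    """Unwrap every tab-separated field and keep the original line ending."""
    body = line.rstrip("\r\n")
    ending = line[len(body):]
    return "\t".join(unwrap(f) for f in body.split("\t")) + ending

print(repr(clean_line("123\tb'bar foo'\tabc\t123\r\n")))  # '123\tbar foo\tabc\t123\r\n'
```

Unlike chained replace calls, this only touches a b' or b" at the very start of a field, so a value that happens to contain b' in the middle is left alone, and empty fields keep their tabs.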
