Regex to remove specific tabs between two parts of a string

Question

I need to remove tabs between two parts of a line. I have a large text file of around 500k lines that I need to convert to CSV. I am trying to do this with pandas using a '\t' delimiter, but there are '\t' characters inside the 'Body' column, which corrupts the data when it is converted to CSV.

So before using pandas, I am now trying to loop through each line and replace every '\t' with a space in the region between the last digit of From (including the trailing '\t') and the 'SM' of the Type column (including the '\t' before 'SM'). Example below:

ID      TO          FROM        BODY                        TYPE  OTHER COLUMNS
2501    12345678910 12345678910 40m Test Content     Here.  SMxx  x x x x x x x
2502    1234567891  1234567891  Varying Content  Here.      SMxx  x x x x x x x

So far I have managed to write a regex that finds the '\tSM', with the goal of replacing any tab before this match:

(?<![\w\d])\tSM(?![\w\d])

I then tried to write one that looks at the content after any number longer than 9 but shorter than 13 digits, but I wasn't able to get it to work.

I am not sure what the best way is to find and replace all the '\t' in the 'Body' part of the text file only, while leaving all the other '\t' delimiters alone.

Any help appreciated :)

Answer 1

You may use

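import re

# Group 1: the first three TAB-terminated fields; Group 2: Body, up to the TAB before SM
rx = re.compile(r'^((?:[^\t]*\t){3})(.*?)(?=\t\s*SM)')
# filepath is the path to the input text file, defined elsewhere
with open(filepath, 'r') as f:
    with open(f"{filepath}.out", 'w', newline="\n", encoding="utf-8") as fw:
        for line in f:
            # Keep Group 1 as is and strip every TAB from Group 2
            fw.write(rx.sub(lambda x: f"{x.group(1)}{x.group(2).replace(chr(9), '')}", line))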

See the Python demo:

import re
# Field separators are written as explicit \t escapes so the sample survives copy-paste
line = '2501\t12345678910\t12345678910\t40m Test Content\t Here.\tSMxx  x x x x x x x'
rx = re.compile(r'^((?:[^\t]*\t){3})(.*?)(?=\t\s*SM)')
print(repr(rx.sub(lambda x: f"{x.group(1)}{x.group(2).replace(chr(9), '<TAB WAS HERE>')}", line)))
# => '2501\t12345678910\t12345678910\t40m Test Content<TAB WAS HERE> Here.\tSMxx  x x x x x x x'

See the regex demo. Details:

^ - start of string
((?:[^\t]*\t){3}) - Group 1: three occurrences of zero or more chars other than TAB, each followed by a TAB char
(.*?) - Group 2: any zero or more chars other than line break chars, as few as possible
(?=\t\s*SM) - a positive lookahead that requires a TAB, zero or more whitespace chars and then SM immediately to the right of the current location.
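
To make the breakdown above concrete, here is a small sketch that prints what each part of the pattern captures. The sample line is a hypothetical one modelled on the question's example, with the delimiters written as explicit \t escapes; the variable names prefix, body and rest are only illustrative.

```python
import re

rx = re.compile(r'^((?:[^\t]*\t){3})(.*?)(?=\t\s*SM)')

# Hypothetical sample line; the Body field contains one stray TAB.
line = "2501\t12345678910\t12345678910\t40m Test Content\t Here.\tSMxx  x x x"

m = rx.match(line)
prefix = m.group(1)    # the first three fields, each followed by its TAB
body = m.group(2)      # everything up to the TAB that precedes SM
rest = line[m.end():]  # the lookahead consumes nothing, so "\tSM..." is left in place

print(repr(prefix))  # '2501\t12345678910\t12345678910\t'
print(repr(body))    # '40m Test Content\t Here.'
print(repr(rest))    # '\tSMxx  x x x'
```

Because the lookahead is zero-width, the TAB before SM is not part of the match, which is why that delimiter survives the substitution.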

The replacement is the concatenation of the Group 1 value (x.group(1)) and the Group 2 value with all TABs replaced with an empty string (x.group(2).replace(chr(9), '')).
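
The question asks for the stray TABs to become spaces rather than be dropped, so a possible variation is to make the replacement string a parameter. The following is a minimal sketch under a few assumptions: clean_line and its filler argument are hypothetical names, the sample rows are invented for illustration with six TAB-separated columns, and the standard csv module stands in for pandas so that the example has no third-party dependency.

```python
import csv
import io
import re

# Same pattern as in the answer above.
RX = re.compile(r'^((?:[^\t]*\t){3})(.*?)(?=\t\s*SM)')

def clean_line(line: str, filler: str = " ") -> str:
    """Replace TABs inside the fourth field (Body) with `filler`; keep all other TABs."""
    return RX.sub(lambda m: m.group(1) + m.group(2).replace("\t", filler), line)

# Hypothetical sample rows with stray TABs inside Body.
rows = [
    "2501\t12345678910\t12345678910\t40m Test Content\t Here.\tSMxx\tx x x",
    "2502\t1234567891\t1234567891\tVarying\tContent\tHere.\tSMxx\tx x x",
]

cleaned = [clean_line(r) for r in rows]
fields = [c.split("\t") for c in cleaned]  # every row now has exactly six fields

# Write the cleaned rows as comma-separated values (stdlib csv instead of pandas).
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerows(fields)
csv_text = buf.getvalue()
print(csv_text)
```

Once the lines are cleaned this way, the file can be read with a plain '\t' delimiter. One caveat: the lazy match stops at the first TAB that is followed by SM, so a Body that itself contains a TAB directly before the letters SM would be cut short.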

Solution 1
1 · Accepted · 2022-02-15 11:14:00
