
Regex and removing specific tabs between two parts of a string

I need to remove tabs between two parts of a line. I have a large text file of around 500k lines that I need to convert to a CSV. I am trying to do this with pandas using a '\t' delimiter, but there are also '\t' characters inside the 'Body' column, which corrupts the data when it is converted to CSV.

So before using pandas, I am trying to loop through each line and replace every '\t' with a space in the region that starts after the last digit of From (including its trailing '\t') and ends at the 'SM' of the Type column (including the '\t' before the 'SM'). Example below:

ID      TO          FROM        BODY                        TYPE  OTHER COLUMNS
2501    12345678910 12345678910 40m Test Content     Here.  SMxx  x x x x x x x
2502    1234567891  1234567891  Varying Content  Here.      SMxx  x x x x x x x 

So far I have written a regex that finds the '\tSM', with the goal of replacing any tab that comes before this match:

(?<![\w\d])\tSM(?![\w\d])

I then tried to write one that would match the content after any number with more than 9 but fewer than 13 digits, but I wasn't able to get it to work.

I am not sure what the best way is to find and replace all the '\t' in the 'Body' part of the text file only, while leaving all the other '\t' delimiters alone.

Any help appreciated:)

You may use

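import re

# Group 1: the first three tab-terminated fields (ID, TO, FROM).
# Group 2: the Body, up to (but not including) the tab that precedes SM.
rx = re.compile(r'^((?:[^\t]*\t){3})(.*?)(?=\t\s*SM)')

# filepath must hold the path to the source text file.
with open(filepath, 'r') as f:
    with open(f"{filepath}.out", 'w', newline="\n", encoding="utf-8") as fw:
        for line in f:
            # Remove every tab inside the Body; use ' ' instead of '' to leave a space.
            fw.write(rx.sub(lambda x: f"{x.group(1)}{x.group(2).replace(chr(9), '')}", line))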

See the Python demo:

import re
# Tabs are written as \t escapes so the field boundaries are unambiguous.
line = '2501\t12345678910\t12345678910\t40m Test Content\t Here.\tSMxx  x x x x x x x'
rx = re.compile(r'^((?:[^\t]*\t){3})(.*?)(?=\t\s*SM)')
print(repr(rx.sub(lambda x: f"{x.group(1)}{x.group(2).replace(chr(9), '<TAB WAS HERE>')}", line)))
# => '2501\t12345678910\t12345678910\t40m Test Content<TAB WAS HERE> Here.\tSMxx  x x x x x x x'

See the regex demo. Details:

  • ^ - start of string
  • ((?:[^\t]*\t){3}) - Group 1: three occurrences of zero or more chars other than TAB and then a TAB char
  • (.*?) - Group 2: any zero or more chars other than line break chars as few as possible
  • (?=\t\s*SM) - a positive lookahead that requires a TAB, zero or more whitespaces and then SM immediately to the right of the current location.
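
To see how the pattern divides a line, here is a minimal sketch that applies it to a single row and prints each part. The row is an assumption: it is modelled on the first example row in the question, with tabs written as explicit \t escapes, because the exact tab positions in the real file are not shown.

```python
import re

rx = re.compile(r'^((?:[^\t]*\t){3})(.*?)(?=\t\s*SM)')

# Hypothetical row modelled on the question's example; tabs written as \t.
sample = '2501\t12345678910\t12345678910\t40m Test\tContent Here.\tSMxx\tx'

m = rx.match(sample)
prefix = m.group(1)      # ID, TO and FROM, each with its trailing tab
body = m.group(2)        # Body text, which may still contain stray tabs
rest = sample[m.end():]  # begins at the tab before SM (the lookahead consumes nothing)

print(repr(prefix))
print(repr(body))
print(repr(rest))
```

Because Group 2 is lazy, it stops at the first tab that is followed by SM, so a Body that itself contains a tab directly before the letters SM would be cut short.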

The replacement is a concatenation of the Group 1 value (x.group(1)) and the Group 2 value with all TABs replaced with an empty string (x.group(2).replace(chr(9), '')).
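
The question asked for a space, rather than nothing, in place of each stray tab. A minimal sketch of that variant follows. The helper name clean_body_tabs and its repl parameter are hypothetical and not part of the original answer, and the sample row is made up from the question's second example row.

```python
import re

RX = re.compile(r'^((?:[^\t]*\t){3})(.*?)(?=\t\s*SM)')

def clean_body_tabs(line: str, repl: str = ' ') -> str:
    """Replace tabs inside the Body field with `repl` (hypothetical helper)."""
    return RX.sub(lambda m: m.group(1) + m.group(2).replace('\t', repl), line)

# Hypothetical row: one stray tab inside the Body, between "Content" and "Here."
row = '2502\t1234567891\t1234567891\tVarying Content\tHere.\tSMxx\tx x\n'
cleaned = clean_body_tabs(row)

print(repr(cleaned))
print(len(cleaned.split('\t')))  # number of tab-separated fields after cleaning
```

A line that has fewer than three leading fields, or no tab followed by SM, does not match and is returned unchanged, so counting the fields per line afterwards is a cheap way to spot rows that still will not split cleanly.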
