sed to python replace extra delimiters in a

sed 's/\t/_tab_/3g'

I have a sed command that replaces every excess tab delimiter in my text documents. Each document is supposed to have 3 columns, but occasionally a line contains an extra delimiter. I don't have control over the files.

I use the above command to clean up the documents. However, all my other operations on these files are in Python. Is there a way to do what the sed command does in Python?

sample input:

Column1   Column2         Column3
James     1,203.33        comment1
Mike      -3,434.09       testing testing 123
Sarah     1,343,342.23    there   here

sample output:

Column1   Column2         Column3
James     1,203.33        comment1
Mike      -3,434.09       testing_tab_testing_tab_123
Sarah     1,343,342.23    there_tab_here

You may read the file line by line, split each line on tabs, and if there are more than 3 items, keep the first two and join the third and all later items with _tab_:

lines = []
with open('inputfile.txt', 'r') as fr:
    for line in fr:
        split = line.split('\t')
        if len(split) > 3:
            tmp = split[:2]                      # Slice the first two items
            tmp.append("_tab_".join(split[2:]))  # Append the rest joined with _tab_
            lines.append("\t".join(tmp))         # Use the updated line
        else:
            lines.append(line)                   # Else, put the line as is
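A more compact variant of the same idea, as a minimal sketch: str.split takes a maxsplit argument, so the line can be split at the first two tabs only and any remaining tabs replaced in the last field. The function name replace_extra_tabs and its columns and marker parameters are made up for this illustration and do not come from the answer above.

```python
def replace_extra_tabs(line, columns=3, marker="_tab_"):
    """Keep the first (columns - 1) tabs; replace any later ones with marker."""
    # maxsplit caps the number of splits, so everything after the
    # (columns - 1)th tab stays together in the last element.
    fields = line.split("\t", columns - 1)
    fields[-1] = fields[-1].replace("\t", marker)
    return "\t".join(fields)


print(repr(replace_extra_tabs("Mike\t-3,434.09\ttesting\ttesting\t123\n")))
# 'Mike\t-3,434.09\ttesting_tab_testing_tab_123\n'
```

Lines with three or fewer columns pass through unchanged, so no length check is needed before calling it.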

The lines variable will contain something like

Mike    -3,434.09   testing_tab_testing_tab_123
Mike    -3,434.09   testing_tab_256
No  operation   here

import subprocess
# Pass the arguments as a list so file_path needs no shell quoting
subprocess.run(["sed", "-i", r"s/\t/_tab_/3g", file_path], check=True)

Does this work for you? Note the -i option in the sed command above: it makes sed modify the input file in place.
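If calling an external sed is not an option, the in-place effect of -i can be approximated in pure Python with the standard-library fileinput module. This is only a sketch: the function name fix_file_in_place and its marker parameter are hypothetical, and it assumes the 3-column layout from the question.

```python
import fileinput


def fix_file_in_place(path, marker="_tab_"):
    """Rewrite the file at path so each line keeps only its first two tabs."""
    # With inplace=True, fileinput redirects print() into the file being read.
    with fileinput.input(files=[path], inplace=True) as lines:
        for line in lines:
            fields = line.split("\t", 2)
            fields[-1] = fields[-1].replace("\t", marker)
            # The line already ends with its newline, so suppress print's own.
            print("\t".join(fields), end="")
```

Like sed -i, this overwrites the original file, so it may be worth testing on a copy first.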

You can mimic the sed behavior in Python:

import re

pattern = re.compile(r'\t')
string = 'Mike\t3,434.09\ttesting\ttesting\t123'
replacement = '_tab_'
count = -1
spans = []
start = 2 # Starting index of matches to replace (0 based)
for match in re.finditer(pattern, string):
    count += 1
    if count >= start:
        spans.append(match.span())
spans.reverse()
new_str = string
for sp in spans:
    new_str = new_str[0:sp[0]] + replacement + new_str[sp[1]:]

And now new_str is 'Mike\t3,434.09\ttesting_tab_testing_tab_123'.

You can wrap it in a function and repeat for every line. However, note that this GNU sed behavior isn't standard:

'NUMBER' Only replace the NUMBERth match of the REGEXP.

Note: the POSIX standard does not specify what should happen when you mix the 'g' and NUMBER modifiers, and currently there is no widely agreed upon meaning across 'sed' implementations. For GNU 'sed', the interaction is defined to be: ignore matches before the NUMBERth, and then match and replace all matches from the NUMBERth on.
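The GNU semantics quoted above (skip matches before the NUMBERth, replace that one and all later ones) can also be wrapped in a function by giving re.sub a replacement callback that counts matches. This is a sketch, not a drop-in sed emulator: the name sub_from_nth is invented here, and the replacement is inserted literally, so sed escapes such as & or \1 are not interpreted.

```python
import itertools
import re


def sub_from_nth(pattern, replacement, text, n):
    """Replace the n-th match (1-based) and every later one in text."""
    counter = itertools.count(1)

    def keep_or_replace(match):
        # Matches before the n-th are returned unchanged.
        return replacement if next(counter) >= n else match.group(0)

    return re.sub(pattern, keep_or_replace, text)


row = "Mike\t3,434.09\ttesting\ttesting\t123"
print(repr(sub_from_nth(r"\t", "_tab_", row, 3)))
# 'Mike\t3,434.09\ttesting_tab_testing_tab_123'
```

Calling it once per line gives each line its own counter, which matches how sed applies the s command line by line.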
