
How to efficiently strip tab characters from a txt file with Python

The objective is to strip the tab characters that exist between two strings.

Specifically, I would like to remove the tab characters between *Generic and h_two, which are highlighted in yellow as depicted below.

[Image: enter image description here]

The expected output, as viewed in a Microsoft Office application with Show Paragraph Marks enabled, is as below.

[Image: enter image description here]

The input is a txt file.

One naive approach is:

f_output.write(line.replace('*Generic \t \t', ','))

However, this did not work as intended.
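
A likely reason, assuming the first line of the file is written by pandas with sep='\t' and therefore contains bare tabs: str.replace matches its argument literally, and the pattern above contains spaces that the file does not. A minimal check, with the line content assumed rather than read from the real file:

```python
# Assumed content of the first line: the two empty cells are written as
# bare tab separators, with no spaces around them.
line = '*Generic\t\t\n'

# The pattern contains spaces, so it never occurs in the line and
# str.replace returns the line unchanged.
unchanged = line.replace('*Generic \t \t', ',')

# Without the spaces the pattern matches; here only the tabs are dropped.
fixed = line.replace('*Generic\t\t', '*Generic')

print(repr(unchanged))  # '*Generic\t\t\n'
print(repr(fixed))      # '*Generic\n'
```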

So, there are two issues:

  1. The code below replaces all the tab characters, not only those between the *Generic and h_two strings.
  2. How can I efficiently replace only the tab characters between the two sub-strings?

The full code to replicate this issue is:

import pandas as pd

fname = 'endnote_csv_help'
'''
Step 1) Create mock df and save to csv
'''
my_list = ['col_one', 'col_two', 'col_three']
combine_list = [{'h_one', 'h_two', 'h_three'}, my_list, my_list]
df = pd.DataFrame(combine_list)
df.to_csv(f'{fname}.csv', index=False, header=False)

'''
Step 2) Read the csv and convert to txt format
'''

df_shifted = pd.read_csv(f'{fname}.csv', header=None).shift(1, axis=0)
df_shifted.at[0, 0] = '*Generic'
df_shifted.fillna('').to_csv(f'{fname}.txt', sep='\t', index=False, header=False)

'''
Step 3) Read the txt and replace the tab character
'''



with open('endnote_csv_help.txt') as f_input, open('new_endnote_csv_help.txt', 'w') as f_output:
    for line in f_input:
        f_output.write(line.replace('*Generic \t \t', ','))

Note: The thread has been updated slightly following the response by @Kuldeep.

Input: endnote_csv_help.txt

*Generic        
h_one   h_three h_two
col_one col_two col_three

Output: new_endnote_csv_help.txt

*Generic,,
h_one,h_three,h_two
col_one,col_two,col_three

Read each line from the input, replace the tabs, then write it to the output:

with open('endnote_csv_help.txt') as f_input, open('new_endnote_csv_help.txt', 'w') as f_output:
    for line in f_input:
        f_output.write(line.replace('\t', ','))

As it appears, there are two tab characters between *Generic and h_two.

Hence, they can be removed simply with

replace('\t\t', '')

The complete code is then as below:

with open('endnote_csv_help.txt') as f_input, open('new_endnote_csv_help.txt', 'w') as f_output:
    for line in f_input:
        f_output.write(line.replace('\t\t', ''))

Note that there should be no spaces between the two tab characters in '\t\t'.
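
Because replace('\t\t', '') runs on every line, it would also remove a pair of tabs in a data row that happens to have an empty cell. A minimal sketch that touches only the marker line, assuming that line always starts with *Generic; the helper name strip_marker_tabs and the third sample row are hypothetical:

```python
def strip_marker_tabs(line, marker='*Generic'):
    """Drop trailing tabs from the marker line; return other lines as-is.

    `strip_marker_tabs` and `marker` are illustrative names, not part of
    the original code.
    """
    if not line.startswith(marker):
        return line
    # Preserve the line ending (if any) and strip only the trailing tabs.
    body = line.rstrip('\r\n')
    ending = line[len(body):]
    return body.rstrip('\t') + ending


# The last row has an empty middle cell, added here for illustration.
lines = ['*Generic\t\t\n', 'h_one\th_three\th_two\n', 'col_one\t\tcol_three\n']
cleaned = [strip_marker_tabs(line) for line in lines]
print(cleaned)
```

Inside the loop above, this would be used as f_output.write(strip_marker_tabs(line)), so rows other than the marker line keep their column layout.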

Thanks to @Kuldeep for the suggestion, which provided a major hint. As a result, his comment will be accepted as the answer.

As the other answer notes, your error occurs because you are reading from a file that you have opened for writing. If you want to remove multiple consecutive tabs, use a regular expression. The expression below replaces every run of two or more consecutive tabs with an empty string:

import re
data = '*Generic\t\t\nh_three\th_one\th_two\ncol_one\tcol_two\tcol_three\n'
re.sub("([\t][\t]+)", "", data)

Output:

'*Generic\nh_three\th_one\th_two\ncol_one\tcol_two\tcol_three\n'
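
The pattern above removes any run of two or more tabs, wherever it occurs. If only the tabs after *Generic should go, one option is to anchor the pattern to that marker. This is a sketch under the assumption that the marker sits at the start of a line and is followed by nothing but tabs; the sample data is modified for illustration:

```python
import re

# Sample data from above, with an empty middle cell added to the last row
# to show that tabs elsewhere are left alone.
data = '*Generic\t\t\nh_three\th_one\th_two\ncol_one\t\tcol_three\n'

# With re.MULTILINE, ^ and $ match at line boundaries. The marker is
# captured and put back, so only the tabs that follow it are removed.
pattern = re.compile(r'^(\*Generic)\t+$', flags=re.MULTILINE)
result = pattern.sub(r'\1', data)

print(repr(result))
# '*Generic\nh_three\th_one\th_two\ncol_one\t\tcol_three\n'
```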

To avoid the exception, read from a file opened for reading and write to a file opened for writing.

import pandas as pd
import re

fname = 'endnote_csv_help'
'''
Create mock df and save to csv
'''
my_list = ['col_one', 'col_two', 'col_three']
combine_list = [{'h_one', 'h_two', 'h_three'}, my_list, my_list]
df = pd.DataFrame(combine_list)
df.to_csv(f'{fname}.csv', index=False, header=False)

'''
Read the csv and convert to txt format
'''

df_shifted = pd.read_csv(f'{fname}.csv', header=None).shift(1, axis=0)
df_shifted.at[0, 0] = '*Generic'
df_shifted.fillna('').to_csv(f'{fname}.txt', sep='\t', index=False, header=False)

'''
Read the txt and replace the tab character
'''

with open(f'{fname}.txt', 'r') as file:
    data = re.sub("([\t][\t]+)", "", file.read())
with open(f'{fname}.txt', 'w') as file:
    file.write(data)
