I have a text file with a table-like structure; each line contains 0 to 4 words separated by an arbitrary number of spaces.
hello world this is
an example file
is there a good
way to clean this
your help is
highly appreciated
My goal is to reformat this file so that elements start at the same position across lines, for example:
hello world this is
an example file
is there a good
way to clean this
your help is
highly appreciated
The number of spaces between columns is arbitrary. I would prefer that lines starting with a space skip the first column, but this is not a strict requirement.
I believe there are a lot of ways to do this; my preference order is:
Since this is part of a data prep/validation process, I do not need a perfect method; I will do a manual check afterwards anyway. I am looking for a way that does, say, 80 to 90% of the work.
Can someone suggest an efficient approach?
If useful, an example file is here.
Here's a way to get column to respect leading whitespace: change a leading space to some other character, align the columns, then change that character back to spaces.
sed 's/^ /_ /' file | column -t | sed 's/^_ /  /'
hello world this is
an example file
is there a good
way to clean this
your help is
highly appreciated
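For comparison, the same idea as the sed and column pipeline above can be sketched in Python. This is only an illustration, not part of the original answer: the function name align_columns and the gap parameter are hypothetical, and instead of a placeholder character it gives a line that starts with whitespace an empty first field. Each column is as wide as its longest entry, which is roughly what column -t does.

```python
def align_columns(lines, gap=2):
    """Align whitespace-separated words into columns (hypothetical helper).

    A line that starts with whitespace gets an empty first field,
    so its words begin in the second column.
    """
    rows = []
    for line in lines:
        fields = line.split()
        if fields and line[0].isspace():
            fields.insert(0, "")  # stands in for the skipped first column
        rows.append(fields)

    # Width of each column is the longest field found in that column.
    n_cols = max((len(row) for row in rows), default=0)
    widths = [
        max((len(row[i]) for row in rows if i < len(row)), default=0)
        for i in range(n_cols)
    ]

    sep = " " * gap
    return [
        sep.join(field.ljust(width) for field, width in zip(row, widths)).rstrip()
        for row in rows
    ]


sample = ["hello   world this is", " an example  file", "highly appreciated"]
for aligned in align_columns(sample):
    print(aligned)
```

Unlike a single fixed width for the whole file, per-column widths keep narrow columns compact.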
Python's re module and .format() offer a good approach to 4.
The column width is the length of the longest non-whitespace string in your file plus the column_pad value.
You can play around with column_pad to vary the actual column width.
If you pass in rename_file=True, you'll get a new file named 'cleaned_<filename>'. Otherwise, the script will replace the original file with the cleaned file.
#!/usr/bin/env python
import re


def clean_columns(filename, rename_file=False, column_pad=4):
    # Write to 'cleaned_<filename>' if requested, otherwise overwrite the input.
    if rename_file:
        cleaned_filename = 'cleaned_' + filename
    else:
        cleaned_filename = filename

    with open(filename, 'r') as dirty_file:
        text = dirty_file.readlines()

    # Column width = length of the longest word in the file + padding.
    words = {word for line in text for word in line.split()}
    max_string_length = max((len(word) for word in words), default=0)
    column_width = max_string_length + column_pad
    formatting_string = '{: <' + str(column_width) + '}'

    cleaned_text = ''
    for line in text:
        # Collapse runs of whitespace. A leading space leaves an empty first
        # field, so that line skips the first column.
        fields = re.sub(r'\s+', ' ', line.rstrip()).split(' ')
        formatting = formatting_string * len(fields)
        cleaned_text += formatting.format(*fields).rstrip() + '\n'

    with open(cleaned_filename, 'w') as cleaned:
        cleaned.write(cleaned_text)


clean_columns('sample.txt', rename_file=True, column_pad=8)
Output:
hello world this is
an example file
is there a good
way to clean this
your help is
highly appreciated
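The width rule described above can be checked on its own. The following is a small sketch rather than part of the script: the word list is typed in from the sample lines, and column_pad is set to 4, the default of the function's parameter.

```python
# Words taken from the sample file (typed in here for illustration).
words = "hello world this is an example file highly appreciated".split()
column_pad = 4  # assumed value; matches the script's default

# Longest word is "appreciated" (11 characters), so the width is 11 + 4.
column_width = len(max(words, key=len)) + column_pad
spec = "{: <" + str(column_width) + "}"  # left-align, pad with spaces

# Repeating the spec once per field pads every field to the same width.
row = (spec * 3).format("an", "example", "file")
print(column_width)
print(repr(row))
```

Because every field is padded to the same width, the second field always starts at column_width and the third at twice that, whatever the words are.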
You can use the https://github.com/junegunn/vim-easy-align plugin to align on various delimiters.
Just select the lines and press:
<CR>: mapped to <Plug>(EasyAlign)
<C-P>: live preview (optional)
*: align all delimiters
<C-D>: toggle until delimiters are left-aligned
<C-X>\s\@<=\S\+: use each non-space run that follows a space as the delimiter
Or use the command: '<,'>EasyAlign */\s\@<=\S\+/dl
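To see what the Vim pattern above selects (a run of non-space characters immediately preceded by whitespace), here is a rough Python counterpart. The translation to Python's lookbehind syntax is an assumption made for illustration; Vim and Python regex dialects differ in other respects.

```python
import re

# Assumed Python spelling of the Vim pattern: \s\@<= becomes the
# lookbehind (?<=\s), and \S\+ becomes \S+.
pattern = re.compile(r"(?<=\s)\S+")

# The first word of a line has no whitespace before it, so it is not matched.
print(pattern.findall("hello world  this is"))  # ['world', 'this', 'is']

# On a line that starts with a space, every word is matched.
print(pattern.findall(" an example file"))      # ['an', 'example', 'file']
```

This is why a word at the very start of a line stays where it is, while each later word is treated as an alignment point.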