Efficient way to edit text tabular file so each cell starts at the same position

Question

I have a text file with a table-like structure; each line contains 0 to 4 words separated by an arbitrary number of spaces.

hello     world  this  is
     an   example  file
is   there a   good
way to    clean this
  your help is   
highly      appreciated

My goal is to reformat this file so that elements start at the same position on every line, for example:

hello    world        this     is
         an           example  file
is       there        a        good
way      to           clean    this
         your         help     is       
highly   appreciated

The number of spaces between columns is arbitrary. I would prefer that lines starting with a space skip the first column, but this is not a strict requirement.

I believe there are many ways to do this; my order of preference is:

1. In Vim, with some neat trick
2. With a bash command
3. In a text editor with such functionality
4. With a scripting language (perhaps Python)

Since this is part of a data prep/validation process, I do not need a perfect method; I will do a manual check afterwards anyway. I am looking for something that does, say, 80 to 90% of the work.

Can someone suggest an efficient approach?

If useful, an example file is here.

Answer 1

Here's a way to get column to respect leading whitespace: change a leading space to some other character (here an underscore) before running column -t, then change it back afterwards:

sed 's/^ /_ /' file | column -t | sed 's/^_ /  /'

hello   world        this     is
        an           example  file
is      there        a        good
way     to           clean    this
        your         help     is
highly  appreciated
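
For readers who want the same behaviour without column, here is a minimal Python sketch of the idea, not part of the original answer. The function name align_columns and the gap parameter are hypothetical; an empty first cell plays the role of the "_" placeholder, and each column is made as wide as its longest cell, which is what column -t does.

```python
def align_columns(lines, gap=2):
    """Left-align whitespace-separated cells using per-column widths."""
    rows = []
    for line in lines:
        cells = line.split()
        # A line starting with whitespace gets an empty first cell,
        # standing in for the "_" placeholder of the sed command.
        if cells and line[0].isspace():
            cells.insert(0, "")
        rows.append(cells)

    # Each column is as wide as the longest cell found in it.
    n_cols = max((len(row) for row in rows), default=0)
    widths = [
        max(len(row[i]) for row in rows if i < len(row))
        for i in range(n_cols)
    ]

    separator = " " * gap
    return [
        separator.join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in rows
    ]


sample = [
    "hello     world  this  is",
    "     an   example  file",
    "is   there a   good",
    "way to    clean this",
    "  your help is   ",
    "highly      appreciated",
]
for aligned in align_columns(sample):
    print(aligned)
```

With gap=2 this should reproduce the spacing of the column -t output above; a larger gap only widens the space between columns.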

Answer 2

Python's re module and str.format() offer a good approach to option 4 (a scripting language).

The column width is the length of the longest non-whitespace string in your file plus the column_pad value.

You can play around with column_pad to vary the actual column width.

If you pass in rename_file=True, you'll get a new file named cleaned_<filename>. Otherwise, the script will replace the original file with the cleaned file.

#!/usr/bin/env python
import os
import re


def clean_columns(filename, rename_file=False, column_pad=4):
    # Write to 'cleaned_<filename>' beside the original, or overwrite it.
    if rename_file:
        directory, basename = os.path.split(filename)
        cleaned_filename = os.path.join(directory, 'cleaned_' + basename)
    else:
        cleaned_filename = filename

    with open(filename, 'r') as dirty_file:
        text = dirty_file.readlines()

    # Column width: longest word in the file plus the padding.
    words = [word for line in text for word in line.split()]
    max_string_length = max(len(word) for word in words) if words else 0
    column_width = max_string_length + column_pad
    formatting_string = '{: <' + str(column_width) + '}'

    cleaned_text = ''
    for line in text:
        # Leading whitespace yields an empty first cell, so the line's
        # words start in the second column.
        cells = re.sub(r'\s+', ' ', line.rstrip()).split(' ')
        formatting = formatting_string * len(cells)
        # rstrip() drops the padding after the last cell.
        cleaned_text += formatting.format(*cells).rstrip() + '\n'

    with open(cleaned_filename, 'w') as cleaned:
        cleaned.write(cleaned_text)


clean_columns('sample.txt', rename_file=True, column_pad=8)

Output:

hello              world              this               is
                   an                 example            file
is                 there              a                  good
way                to                 clean              this
                   your               help               is
highly             appreciated
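
To see where the 19-character columns in this output come from, here is a small sketch of the width arithmetic. The word list is typed inline for illustration (an assumption, rather than being read from sample.txt), and the variable names are only for this example.

```python
# Words of the sample file, typed inline instead of read from disk (assumption).
words = (
    "hello world this is an example file is there a good "
    "way to clean this your help is highly appreciated"
).split()

column_pad = 8
longest = max(words, key=len)              # "appreciated", 11 characters
column_width = len(longest) + column_pad   # 11 + 8
cell_format = "{: <" + str(column_width) + "}"

# Two cells, each left-aligned and padded with spaces to column_width.
row = (cell_format * 2).format("hello", "world")
print(column_width, repr(row))
```

Every column shares this one width, so a single long word widens the whole table; that is the main difference from column -t, which sizes each column separately.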

Answer 3

You can use the vim-easy-align plugin (https://github.com/junegunn/vim-easy-align) to align on various delimiters.

Just select the lines, then press:

<CR> : start alignment (assuming <CR> is mapped to <Plug>(EasyAlign))
<C-P> : live preview, optional
* : align all delimiters
<C-D> : toggle until delimiters are left-aligned
<C-X>\s\@<=\S\+ : select non-space text that follows a space as the delimiter

Or use the command: '<,'>EasyAlign */\s\@<=\S\+/dl
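
To make it clearer what the delimiter pattern in this answer selects, here is a rough Python sketch. It assumes the Vim lookbehind (written with single backslashes, \s\@<=\S\+) corresponds to the Python pattern (?<=\s)\S+; the two regex dialects differ in details such as which characters count as whitespace, so treat it as an illustration only.

```python
import re

# Assumed Python spelling of the Vim pattern \s\@<=\S\+ :
# a run of non-space characters preceded by a whitespace character.
pattern = re.compile(r"(?<=\s)\S+")


def delimiters(line):
    """Return (start_column, text) for every match in the line."""
    return [(m.start(), m.group()) for m in pattern.finditer(line)]


# The first word of a line that starts at column 0 is not a delimiter...
print(delimiters("is   there a   good"))
# ...but in a line with leading spaces, every word is one.
print(delimiters("     an   example  file"))
```

Because a word at the very start of a line never matches, lines with leading spaces have their first word aligned with the second word of the other lines, which is the behaviour the question asks for.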

solution1 3 ACCEPTED 2019-02-13 19:36:01

solution2 2 2019-02-13 20:31:39

solution3 2 2019-02-14 08:22:24
