Efficient way to edit text tabular file so each cell starts at the same position
I have a text file with a table-like structure. Each line contains 0 to 4 words separated by an arbitrary number of spaces.
hello world this is
an example file
is there a good
way to clean this
your help is
highly appreciated
My goal is to reformat this file so that elements start at the same position on every line, for example:
hello world this is
an example file
is there a good
way to clean this
your help is
highly appreciated
The number of spaces is arbitrary. I would prefer that lines starting with a space skip the first element, but this is not a strict requirement.
I believe there are a lot of ways to do this, my preference order is:
Since this is part of a data prep/validation process, I do not need a perfect method; I will do a manual check afterwards anyway. I am looking for a way that does, say, 80 to 90% of the work.
Can someone suggest an efficient approach?
Here's a way to get column to respect leading whitespace: change a leading space to some other character
sed 's/^ /_ /' file | column -t | sed 's/^_ / /'
hello world this is
an example file
is there a good
way to clean this
your help is
highly appreciated
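For comparison, the same idea can be sketched in Python without sed or column. This is only a minimal sketch, not a description of what column does internally: the function name align_columns, the two-space gap, and the sample lines are assumptions made for illustration. A line that starts with whitespace gets an empty first cell, which plays the role of the _ placeholder.

```python
def align_columns(lines, gap=2):
    """Left-align whitespace-separated cells, one width per column."""
    rows = []
    for line in lines:
        cells = line.split()
        # A leading space means "skip the first column".
        if line[:1].isspace() and cells:
            cells.insert(0, '')
        rows.append(cells)

    # Width of each column = longest cell found in that column.
    n_cols = max((len(row) for row in rows), default=0)
    widths = [0] * n_cols
    for row in rows:
        for i, cell in enumerate(row):
            widths[i] = max(widths[i], len(cell))

    out = []
    for row in rows:
        padded = (cell.ljust(widths[i] + gap) for i, cell in enumerate(row))
        out.append(''.join(padded).rstrip())
    return out


# Made-up sample lines with irregular spacing.
sample = ['hello    world this  is', ' example   file', 'way to  clean this']
for aligned in align_columns(sample):
    print(aligned)
```

Unlike a single global width, each column here is only as wide as its own longest cell.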
Python's re module and .format() offer a good approach to 4.
The column width is the length of the longest non-whitespace string in your file plus the column_pad value.
You can play around with column_pad to vary the actual column width.
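To make the width rule concrete, here is a small sketch of how such a format string can be assembled. The word list is made up for illustration. The format spec '{: <N}' left-aligns a value and pads it with spaces to N characters.

```python
words = ['hello', 'appreciated', 'is']   # made-up sample words
column_pad = 4

# Longest word (11 characters) plus padding gives the shared cell width.
column_width = len(max(words, key=len)) + column_pad
cell_format = '{: <' + str(column_width) + '}'

# One format field per word, so every word starts on a multiple of the width.
row = (cell_format * len(words)).format(*words)
print(repr(row))
```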
If you pass in rename_file=True, you'll get a new file named 'cleaned_<filename>'. Otherwise, the script will replace the original file with the cleaned file.
#!/usr/bin/env python
import re


def clean_columns(filename, rename_file=False, column_pad=4):
    # Write to a new file or overwrite the original.
    if rename_file:
        cleaned_filename = 'cleaned_' + filename
    else:
        cleaned_filename = filename

    cleaned_text = ''

    with open(filename, 'r') as dirty_file:
        text = dirty_file.readlines()

    # Unique words in the file; split() with no argument drops empty strings.
    string_list = list(
        {string
         for line in text
         for string in line.split()})

    # Every column gets the same width: longest word plus padding.
    max_string_length = len(max(string_list, key=len))
    column_width = max_string_length + column_pad
    formatting_string = '{: <' + str(column_width) + '}'

    for line in text:
        # Collapse whitespace runs. A leading space leaves an empty first
        # cell, so such lines skip the first column.
        line = re.sub(r'\s+', ' ', line.rstrip()).split(' ')
        formatting = formatting_string * len(line)
        line = formatting.format(*line).rstrip()
        cleaned_text += line + '\n'

    with open(cleaned_filename, 'w') as cleaned:
        cleaned.write(cleaned_text)


clean_columns('sample.txt', rename_file=True, column_pad=8)
Output:
hello world this is
an example file
is there a good
way to clean this
your help is
highly appreciated
You can use the https://github.com/junegunn/vim-easy-align plugin to align on various delimiters.
Just select the lines, then press:
<CR>: map to <Plug>(EasyAlign)
<C-P>: live preview, optional
*: align all delimiters
<C-D>: toggle until left align delimiters
<C-X>\s\@<=\S\+: select non-space after space as delimiter
or use the command: '<,'>EasyAlign */\s\@<=\S\+/dl
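The delimiter pattern used here is a Vim lookbehind: it matches a run of non-space characters that is preceded by whitespace. As a rough illustration only, the same match can be written in Python as (?<=\s)\S+. The sample line is made up.

```python
import re

# Python spelling of the Vim pattern \s\@<=\S\+ (a word preceded by whitespace).
word_after_space = re.compile(r'(?<=\s)\S+')

line = 'way  to   clean this'   # made-up sample line
print(word_after_space.findall(line))
```

The first word on a line is not matched unless the line starts with whitespace, which is why only the later columns act as alignment points.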