Fix columns indentation with Python

There is a file format called .xyz that helps visualize molecular bonds. The format requires a specific pattern:

The first line must contain the number of atoms, which in my case is 30. After that comes the data, where the first column is the name of the atom (in my case they are all carbon), the second column is the x coordinate, the third column is the y coordinate, and the last column is the z coordinate, which is 0 for every atom in my case. The indentation should be correct, so that corresponding columns start at the same position on every line. So something like this:

30
C x1 y1 z1 
C x2 y2 z2
...
...
...

and not:

30 
C x1 y1 z1
C   x2 y2  z2

since this is the wrong indentation.

My generated data is stored like this in a .txt file:

C       2.99996     7.31001e-05     0
C       2.93478     0.623697        0
C       2.74092     1.22011     0
C       2.42702     1.76343     0
C       2.0079      2.22961     0
C       1.50006     2.59812     0
C       0.927076        2.8532      0
C       0.313848        2.98349     0
C       -0.313623       2.9837      0
C       -0.927229       2.85319     0
C       -1.5003     2.5981      0
C       -2.00732        2.22951     0
C       -2.42686        1.76331     0
C       -2.74119        1.22029     0
C       -2.93437        0.623802        0
C       -2.99992        -5.5509e-05     0
C       -2.93416        -0.623574       0
C       -2.7409     -1.22022        0
C       -2.42726        -1.7634     0
C       -2.00723        -2.22941        0
C       -1.49985        -2.59809        0
C       -0.92683        -2.85314        0
C       -0.313899       -2.98358        0
C       0.31363     -2.98356        0
C       0.927096        -2.85308        0
C       1.50005     -2.59792        0
C       2.00734     -2.22953        0
C       2.4273      -1.76339        0
C       2.74031     -1.22035        0
C       2.93441     -0.623647       0

I want to correct the indentation by making every column start at the same position on each line. I tried to do this with AWK, to no avail, so I turned to Python. So far I have this:

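#!/usr/bin/env python
# Read the whitespace-separated data and write it back tab-separated.
text_file = open("output.txt", "r")
lines = text_file.readlines()
myfile = open("output.xyz", "w")
for line in lines:
    atom, x, y, z = line.split()
    # map() takes the function and the iterable as separate arguments
    x, y, z = map(float, (x, y, z))
    myfile.write("{}\t{}\t{}\t{}\n".format(atom, x, y, z))
myfile.close()
text_file.close()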

but I don't currently know how to add the indentation to this.

tl;dr: I have a data file in .txt and I want to convert it to the .xyz format described above, but I am running into problems with indentation.

It appears that I misinterpreted your requirement...

To achieve fixed-width output using awk, you could use printf with a format string like this:

$ awk '{printf "%-4s%12.6f%12.6f%5d\n", $1, $2, $3, $4}' data.txt 
C       2.999960    0.000073    0
C       2.934780    0.623697    0
C       2.740920    1.220110    0
C       2.427020    1.763430    0
C       2.007900    2.229610    0
C       1.500060    2.598120    0
C       0.927076    2.853200    0
C       0.313848    2.983490    0
C      -0.313623    2.983700    0
# etc.

Numbers after the % specify the width of the field. A negative number means that the output should be left-aligned (as in the first column). I have specified 6 decimal places for the floating-point numbers.
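
For comparison, Python's % operator accepts the same printf-style format string, so the layout above can be reproduced there too. This is only a sketch: the helper name format_atom_line is made up for illustration, and it assumes every input line has exactly four whitespace-separated fields with an integer-valued last field.

```python
def format_atom_line(line):
    """Format one 'atom x y z' line with the same widths as the awk command."""
    atom, x, y, z = line.split()
    # %-4s: left-aligned in 4 columns; %12.6f: right-aligned, 6 decimals;
    # %5d: right-aligned integer in 5 columns.
    return "%-4s%12.6f%12.6f%5d" % (atom, float(x), float(y), int(z))


sample = [
    "C       2.99996     7.31001e-05     0",
    "C       -0.313623       2.9837      0",
]
for row in sample:
    print(format_atom_line(row))
```

Because the numeric fields are right-aligned in a fixed width, the decimal points line up whether or not a value carries a minus sign.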


Original answer, in case it is useful:

To ensure that there is a tab character between each of the columns of your input, you could use this awk script:

awk '{$1=$1}1' OFS="\t" data.txt > output.xyz

$1=$1 just forces awk to touch each line, which makes sure that the new Output Field Separator (OFS) is applied.

awk scripts are built up from a series of condition { action } pairs. If no condition is supplied, the action is performed for every line. If a condition but no action is supplied, the default action is to print the line. 1 is a condition that always evaluates to true, so awk prints the line.

Note that even though the columns are all tab-separated, they are still not lined up, because the content of each column is of variable length.
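
The same effect can be seen from Python. The sketch below joins the fields with tabs (roughly what the awk one-liner does) and then uses str.expandtabs to mimic a viewer with tab stops every 8 columns; the helper name retab and the choice of 8 are assumptions made for the illustration.

```python
def retab(line):
    """Join the whitespace-separated fields of a line with single tabs."""
    return "\t".join(line.split())


rows = [
    "C       2.0079      2.22961     0",
    "C       0.927076        2.8532      0",
]
for row in rows:
    # expandtabs(8) shows what a viewer with 8-column tab stops would display
    print(retab(row).expandtabs(8))
```

The x value 0.927076 is exactly 8 characters long, so the tab after it jumps to the next tab stop and the y column starts at a different position than in the first row.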

Your data has already been badly formatted and converted to strings. To correctly align the numeric and non-numeric data, you need to parse the individual fields into their respective data types (possibly using duck typing) before formatting them with str.format:

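def convert(field):
    # Try int first, then float; fall back to the original string.
    try:
        return int(field)
    except ValueError:
        pass
    try:
        return float(field)
    except ValueError:
        pass
    return field

# st holds the raw text of the data file
for line in st.splitlines():
    print("{:8}{:12.5f}{:12.5f}{:5d}".format(*map(convert, line.split())))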


C            2.99996     0.00007    0
C            2.93478     0.62370    0
C            2.74092     1.22011    0
C            2.42702     1.76343    0
C            2.00790     2.22961    0
C            1.50006     2.59812    0
C            0.92708     2.85320    0
C            0.31385     2.98349    0
C           -0.31362     2.98370    0
C           -0.92723     2.85319    0
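
Putting the pieces together, a whole-file conversion might look like the sketch below. It is not a definitive implementation: the function name to_xyz, the column widths, and the choice to print z as a float are all assumptions. The optional comment line is also an assumption that goes beyond the question's description; many XYZ readers expect a comment on the second line, so check what your viewer needs.

```python
def to_xyz(text, comment=None):
    """Return XYZ-style text: atom count first, then aligned atom lines."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        atom, x, y, z = line.split()
        rows.append("{:<4}{:12.5f}{:12.5f}{:12.5f}".format(
            atom, float(x), float(y), float(z)))
    header = [str(len(rows))]  # the atom count is taken from the data
    if comment is not None:
        header.append(comment)  # optional second line; an assumption here
    return "\n".join(header + rows) + "\n"


data = """C 2.99996 7.31001e-05 0
C -0.313623 2.9837 0
"""
print(to_xyz(data), end="")
```

To convert a file, read output.txt into a string, pass it to to_xyz, and write the result to output.xyz.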

Using this:

awk '{printf "%s\t%10f\t%10f\t%i\n",$1,$2,$3,$4}' atoms

gives this output:

C         2.999960        0.000073      0
C         2.934780        0.623697      0
C         2.740920        1.220110      0
C         2.427020        1.763430      0
C         2.007900        2.229610      0
C         1.500060        2.598120      0
C         0.927076        2.853200      0
C         0.313848        2.983490      0
C        -0.313623        2.983700      0
C        -0.927229        2.853190      0
C        -1.500300        2.598100      0
C        -2.007320        2.229510      0
C        -2.426860        1.763310      0
C        -2.741190        1.220290      0
C        -2.934370        0.623802      0
C        -2.999920       -0.000056      0
C        -2.934160       -0.623574      0
C        -2.740900       -1.220220      0
C        -2.427260       -1.763400      0
C        -2.007230       -2.229410      0
C        -1.499850       -2.598090      0
C        -0.926830       -2.853140      0
C        -0.313899       -2.983580      0
C         0.313630       -2.983560      0
C         0.927096       -2.853080      0
C         1.500050       -2.597920      0
C         2.007340       -2.229530      0
C         2.427300       -1.763390      0
C         2.740310       -1.220350      0
C         2.934410       -0.623647      0

Is this what you meant, or did I misunderstand?

Side note: I used tabs (\t) for separation; a space would work too. The 10 in %10f sets a minimum field width of 10 characters (the number of decimal places stays at the default of 6). I didn't verify the length of your input.
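
In printf-style formats, the number before the f sets a minimum field width and the number after a dot sets the number of decimal places. Python uses the same conventions, so the difference can be checked quickly; the variable names below are just for illustration.

```python
value = 2.99996

width_only = "%10f" % value       # minimum width 10, default 6 decimals
precision_only = "%.10f" % value  # no minimum width, 10 decimals
both = "%14.10f" % value          # width 14 and 10 decimals

for label, text in [("%10f", width_only),
                    ("%.10f", precision_only),
                    ("%14.10f", both)]:
    # repr() makes the padding spaces visible
    print(label.ljust(8), repr(text))
```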

You can use string formatting to print values with consistent padding. For your case, you might write lines like this to the file:

>>> '%-12s %-12s %-12s %-12s\n' % ('C', '2.99996', '7.31001e-05', '0')
'C            2.99996      7.31001e-05  0           \n'

"%-12s" means "take the str() of the value and left-justify it in a field at least 12 characters wide".

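If you would rather not hard-code a width such as 12, one option is to derive each column's width from the data itself. The sketch below is only one way to do that; the function name align_columns and the gap parameter are invented for the example, and it assumes every line has the same number of fields.

```python
def align_columns(lines, gap=2):
    """Left-justify every column to the width of its longest entry."""
    table = [line.split() for line in lines if line.strip()]
    # zip(*table) transposes rows into columns (assumes equal field counts)
    widths = [max(len(cell) for cell in col) for col in zip(*table)]
    sep = " " * gap
    return [
        sep.join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in table
    ]


rows = [
    "C       2.99996     7.31001e-05     0",
    "C       -0.313623       2.9837      0",
]
for line in align_columns(rows):
    print(line)
```

Every column then starts at the same position on each line. Since the values are left-justified strings, a leading minus sign shifts the digits one place to the right; use numeric formats with a fixed width if the decimal points should line up as well.
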