
Python: Read and write a file with a complex, repeating format

To begin with, sorry for my poor English. I have a file with a repeating format, such as:

      326                                         Iteration:       0 #Bonds:       10
    1    6    7   14   54   70   77    0    0    0    0    0    1  0.693  0.632  0.847  0.750  0.644  0.000  0.000  0.000  0.000  0.000  3.566  0.000  0.028
    2    6    3    6   15   55    0    0    0    0    0    0    1  0.925  0.920  0.909  0.892  0.000  0.000  0.000  0.000  0.000  0.000  3.645  0.000 -0.040
    3    6    2    8   10   52    0    0    0    0    0    0    1  0.925  0.910  0.920  0.898  0.000  0.000  0.000  0.000  0.000  0.000  3.653  0.000  0.000
...
  324    8  323    0    0    0    0    0    0    0    0    0  100  0.871  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.871  3.000 -0.493
  325    2  326    0    0    0    0    0    0    0    0    0  101  0.930  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.930  0.000  0.334
  326    8  325    0    0    0    0    0    0    0    0    0  101  0.930  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.930  3.000 -0.611
   637.916060425841        306.094529423257        1250.10511927236
  6.782126993565285E-006
      326 (repeating from here)                   Iteration:     100 #Bonds:       10
    1    6    7   14   54   64   70   77    0    0    0    0    1  0.885  0.580  0.819  0.335  0.784  0.709  0.000  0.000  0.000  0.000  4.111  0.000  0.025
    2    6    3    6   15   55    0    0    0    0    0    0    1  0.812  0.992  0.869  0.966  0.000  0.000  0.000  0.000  0.000  0.000  3.639  0.000 -0.034
    3    6    2    8   10   52    0    0    0    0    0    0    1  0.812  0.966  0.989  0.926  0.000  0.000  0.000  0.000  0.000  0.000  3.692  0.000  0.004
  • As you can see, the first line is a header, lines 2 to 327 are the data I want to analyze, and lines 328 and 329 contain numbers I don't want to use. The next "frame" starts at line 330, with exactly the same format. This "frame" repeats more than 200000 times.
  • I want to use columns 1 to 13 of lines 2 to 327 of each frame. I also want to use the first number of the header.
  • I want to analyze columns 3 to 12 of lines 2 to 327 in every frame, printing the number of zeros and the number of non-zero values in that target matrix, row by row. I also want to print columns 1, 2, and 13. So the expected output file looks like:

     326
     1
     1 6 5 5 1
     2 6 4 6 1
     ...
     325 2 1 9 101
     326 8 1 9 101
     326 (Next frame starts from here)
     2
     1 6 5 5 1
     2 6 4 6 1
     ...
     326
     3
     1 6 5 5 1
     2 6 4 6 1
     ...
  • First line: the first number of the input header line.
  • Second line: the frame number.
  • Lines 3 to 328: column 1 of the input, column 2 of the input, the number of non-zero values in columns 3 to 12, the number of zeros in columns 3 to 12, and column 13 of the input.
  • From line 329: the same format repeats for the next frame.

So the result file has 2 header lines plus 326 lines of analyzed data, 328 lines in total per frame. The same format repeats for the next frame. I would prefer this layout for the result data (5 spaces each), so that the file can be used for other purposes.

My current approach is to create 13 arrays for the 13 columns, then store the data using nested for loops over the frames and over the lines of each frame. But I have no idea how to deal with the output.

Below is my trial code (unfinished, it only reads the input), but it has a lot of problems. linecache reads the whole line, not just the first number of each header line. Every frame has 326 + 3 = 329 lines, but my code does not seem to work properly frame by frame. I welcome any help in analyzing this data. Thank you very much in advance.

# Read the file
filename = raw_input("Enter the file name \n")
file = open(filename, 'r')

# Read the number of atom from header
import linecache
nnn = linecache.getline(filename, 1)
natoms = int(nnn)
singleframe = natoms + 3

# get number of frames
nlines = 0
for i1 in file:
    nlines = nlines +1
file.close()

nframes = nlines / singleframe

print 'no of lines are: ', nlines
print 'no of frames are: ', nframes
print 'no of atoms are:', natoms

# Create 1d string array
nrange = range(nlines)
data_lines = [None]*(nlines)

# Store whole input file into string array
file = open(filename, 'r')
i1=0
for i1 in nrange:
    data_lines[i1] = file.readline()
file.close()


# Create 1d array to store atomic data
at_index = [None]*natoms
at_type = [None]*natoms
n1 = [None]*natoms
n2 = [None]*natoms
n3 = [None]*natoms
n4 = [None]*natoms
n5 = [None]*natoms
n6 = [None]*natoms
n7 = [None]*natoms
n8 = [None]*natoms
n9 = [None]*natoms
n10 = [None]*natoms
molnr = [None]*natoms

nrange1= range(natoms)
nframe = range(nframes)

file = open('output_force','w')
print data_lines[9]
for j1 in nframe:
    start = j1*(natoms + 3) + 3
    for i1 in nrange1:
        line = data_lines[i1+start].split()  #Split each line based on spaces
        at_index[i1] = int(line[0])
        at_type[i1] = int(line[1])
        n1[i1]= int(line[2])
        n2[i1]= int(line[3])
        n3[i1]= int(line[4])
        n4[i1]= int(line[5])
        n5[i1]= int(line[6])
        n6[i1]= int(line[7])
        n7[i1]= int(line[8])
        n8[i1]= int(line[9])
        n9[i1]= int(line[10])
        n10[i1]= int(line[11])
        molnr[i1]= int(line[12])

When you are working with CSV files, you should look into the csv module. I wrote some code that should do the trick.

This code assumes "good data". If your data set may contain errors (such as fewer than 13 columns, or fewer than 326 data rows), some alterations will be needed.
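One possible alteration for malformed rows is sketched here in Python 3 syntax. The helper name summarize_row and its n_neighbors parameter are hypothetical and not part of the code in this answer. It assumes a row is acceptable when its first 13 columns are all integers, and it returns None otherwise so that the caller can decide whether to skip the row or stop.

```python
def summarize_row(line, n_neighbors=10):
    """Return (index, type, non_zeros, zeros, molecule), or None if malformed.

    Hypothetical helper: assumes 2 leading columns, n_neighbors neighbor
    columns, and 1 molecule column, all integers.
    """
    fields = line.split()
    needed = n_neighbors + 3  # 13 columns for the format in the question
    if len(fields) < needed:
        return None
    try:
        values = [int(f) for f in fields[:needed]]
    except ValueError:
        # A non-integer where an integer column was expected
        return None
    neighbors = values[2:2 + n_neighbors]
    zeros = neighbors.count(0)
    return values[0], values[1], n_neighbors - zeros, zeros, values[-1]


good = "    1    6    7   14   54   70   77    0    0    0    0    0    1  0.693  0.632"
print(summarize_row(good))                        # a well-formed data row
print(summarize_row("  6.782126993565285E-006"))  # too few columns
```

Returning None instead of raising keeps the decision with the caller, which seems convenient when one bad row should not abort a run over 200000 frames.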

(The code below was changed to comply with Python 2.6.6.)

import csv
with open('mydata.csv') as in_file:
    with open('outfile.csv', 'wb') as out_file:
        csv_reader = csv.reader(in_file, delimiter=' ', skipinitialspace=True)
        csv_writer = csv.writer(out_file, delimiter = '\t')

        # Iterate over all rows in the file
        for i, header in enumerate(csv_reader):
            # Get the header data
            num = header[0]
            csv_writer.writerow([num])

            # Write frame number, starting with 1 (hence the +1 part)
            csv_writer.writerow([i+1])

            # Iterate over all data rows
            for _ in xrange(326):

                # Call next(csv_reader) to get the next row
                # Put inside a try ... except to avoid StopIteration exception
                # if end of file is found before reaching 326 lines
                try:
                    row = next(csv_reader)
                except StopIteration:
                    break
                # Use list comprehension to extract number of zeros
                zeros = sum([1 for x in row[2:12] if x.strip() == '0'])
                not_zeros = 10 - zeros
                # Write the data to output file
                out = [row[0].strip(), row[1].strip(),not_zeros, zeros, row[12].strip()]
                csv_writer.writerow(out)
            # The else branch runs only if the loop finished without break,
            # i.e. all 326 data rows were read
            else:
                # Skip the last two lines of the frame
                next(csv_reader)
                next(csv_reader)

For the first three data rows of the first frame, this yields:

326
1
1   6   5   5   1
2   6   4   6   1
3   6   4   6   1
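For reference, the same idea can be sketched in Python 3 without the csv module, reading the atom count from each header instead of hard-coding 326. This is only a sketch under assumptions: the function name summarize_frames, the trailer_lines and width parameters, and the tiny two-atom sample are all made up for illustration, the input is assumed to be well formed, and right-aligning each value in 5 characters is a guess at the "5 spaces each" layout mentioned in the question.

```python
from itertools import islice


def summarize_frames(lines, trailer_lines=2, width=5):
    """Yield output lines frame by frame from an iterable of input lines."""
    lines = iter(lines)
    frame_no = 0
    for header in lines:
        if not header.strip():
            continue  # tolerate blank lines between frames
        frame_no += 1
        natoms = int(header.split()[0])  # first number of the header line
        yield str(natoms)
        yield str(frame_no)
        # Consume exactly natoms data rows from the same iterator
        for row in islice(lines, natoms):
            cols = [int(c) for c in row.split()[:13]]
            zeros = cols[2:12].count(0)
            out = (cols[0], cols[1], 10 - zeros, zeros, cols[12])
            yield "".join(str(v).rjust(width) for v in out)
        # Skip the trailing lines of the frame that are not analyzed
        for _ in islice(lines, trailer_lines):
            pass


# Made-up sample: two frames of two atoms each
sample = [
    "  2   Iteration: 0 #Bonds: 10\n",
    "  1  6  2  0  0  0  0  0  0  0  0  0  1  0.9  0.9\n",
    "  2  8  1  0  0  0  0  0  0  0  0  0  1  0.9  0.9\n",
    "  637.9  306.0  1250.1\n",
    "  6.78E-006\n",
    "  2   Iteration: 100 #Bonds: 10\n",
    "  1  6  2  0  0  0  0  0  0  0  0  0  1  0.9  0.9\n",
    "  2  8  1  3  0  0  0  0  0  0  0  0  1  0.9  0.9\n",
    "  637.9  306.0  1250.1\n",
    "  6.78E-006\n",
]

for out_line in summarize_frames(sample):
    print(out_line)
```

Because the function is a generator over an iterator of lines, an open file object can be passed in directly and the whole input never has to sit in memory, which may matter with more than 200000 frames.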
