Python: Issue reading data lines multiple times from a file

I am trying to write a Python 2.6 script on Win32 that reads all the text files stored in a directory and prints only the lines containing actual data. A sample file:

Set : 1 
Date: 10212009 
12 34 56 
25 67 90
End Set 
+++++++++
Set: 2 
Date: 10222009 
34 56 89 
25 67 89 
End Set

In the above example file, I want to print only lines 3, 4 and 9, 10 (the actual data values). The program does this iteratively on all txt files. I wrote the script below and am testing it on a single txt file as I go. My logic is to read the input files one by one and search for a start string. As soon as a match is found, start searching for the end string. When both are found, print the lines from the start string to the end string. Repeat on the rest of the file before opening another file. The problem I am having is that it successfully reads Set 1 of the data, but then goes wrong on subsequent sets in the file. For Set 2, it identifies the number of lines to read, but prints them starting at an incorrect line number. A little digging turned up the following suggestions:

1. Using seek and tell to reposition for the second iteration of the loop. This did not work, since the file is read through a buffer and that throws off the "tell" value.
2. Opening the file in binary mode. This helped someone else, but it is not working for me.
3. Opening the file with buffering set to 0. This did not work either.

The second problem I am having is that when it prints data from Set 1, it inserts a blank line between the 2 lines of data values. How can I get rid of it?

Note: Ignore all references to next_run in the code below. I was trying it out for repositioning the line read. Subsequent searches for the start string should begin from the last position of the end string.

#!C:/Python26 python 

# Import necessary modules 
import os, glob, string, sys, fileinput, linecache 
from goto import goto, label 

# Set working path 
path = 'C:\\System_Data' 


# -------------------- 
# PARSE DATA MODULE 
# -------------------- 

# Define the search strings for data 
start_search = "Set :" 
end_search ="End Set" 
# For Loop to read the input txt files one by one 
for inputfile in glob.glob( os.path.join( path, '*.txt' ) ): 
  inputfile_fileHandle = open ( inputfile, 'rb', 0 ) 
  print( "Current file being read: " +inputfile ) 
  # start_line initializes to first line 
  start_line = 0 
  # After first set of data is extracted, next_run will store the position to read the rest of the file 
  # next_run = 0 
  # start reading the input files, one line by one line 
  for line in inputfile: 
    line = inputfile_fileHandle.readline() 
    start_line += 1 
    # next_run+=1 
    # If a line matched with the start_search string 
    has_match = line.find( start_search ) 
    if has_match >= 0: 
      print ( "Start String found at line number: %d" %( start_line ) ) 
      # Store the location where the search will be restarted 
      # next_run = inputfile_fileHandle.tell() #inputfile_fileHandle.lineno() 
      print ("Current Position: %d" % next_run) 
      end_line = start_line 
      print ( "Start_Line: %d" %start_line ) 
      print ( "End_Line: %d" %end_line ) 
      #print(line) 
      for line in inputfile: 
        line = inputfile_fileHandle.readline() 
        #print (line) 
        end_line += 1 
        has_match = line.find(end_search) 
        if has_match >= 0: 
          print 'End   String found at line number: %d' % (end_line) 
          # total lines to print: 
          k=0 
          # for loop to print all the lines from start string to end string 
          for j in range(0,end_line-start_line-1): 
            print linecache.getline(inputfile, start_line +1+ j ) 
            k+=1 
          print ( "Number of lines Printed: %d " %k ) 
          # Using goto to get out of 2 loops at once 
          goto .re_search_start_string 
    label .re_search_start_string 
    #inputfile_fileHandle.seek(next_run,0) 

  inputfile_fileHandle.close ()

in_data = False
for line in open( 'data.txt' ):
    if line.startswith( 'Date:' ):
        in_data = True
    elif line.startswith( 'End Set' ):
        in_data = False
    elif in_data:
        print line.rstrip()

Just put something like that inside a loop over your files (e.g. using os.walk) and you should be good to go.
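A minimal sketch of what that outer loop could look like, written for Python 3 rather than the 2.6 used in the question. The function names `data_lines` and `data_lines_in_tree` are made up for illustration, and the `.txt` filter is an assumption taken from the question.

```python
import os


def data_lines(lines):
    """Yield the lines between a 'Date:' line and the following 'End Set'."""
    in_data = False
    for line in lines:
        if line.startswith('Date:'):
            in_data = True
        elif line.startswith('End Set'):
            in_data = False
        elif in_data:
            yield line.rstrip()


def data_lines_in_tree(top):
    """Apply data_lines() to every .txt file found under the directory top."""
    for dirpath, dirnames, filenames in os.walk(top):
        for name in sorted(filenames):
            if name.endswith('.txt'):
                with open(os.path.join(dirpath, name)) as handle:
                    for item in data_lines(handle):
                        yield item
```

Calling `list(data_lines_in_tree('C:\\System_Data'))` would then collect the data lines of every text file below that directory. Note that os.walk also descends into subdirectories, which the glob call in the question does not.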

I'd probably do something much simpler, like this:

import glob, os

start_search = "Set"  # Matches both "Set : 1" and "Set: 2" from the sample file.
end_search = "End Set"
path = '.'

for filename in glob.glob(os.path.join(path, '*.txt')):
    inputfile = open(filename, 'rb', 0)
    print("Current file being read: " + filename)
    is_in_set = False
    while True:
        line = inputfile.readline()
        if not line: break
        if line.startswith(start_search):
            is_in_set = True
            inputfile.readline() # Discard the next line (the Date line).
        elif line.startswith(end_search):
            is_in_set = False
            print('---')
        elif is_in_set:
            print(line.rstrip()) # The rstrip removes the extra blank lines.
    inputfile.close()

If you also want the line numbers, wrap the file object so that every call to readline() increments a line counter.
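One way such a wrapper might look, as a sketch for Python 3. The class name `LineCountingReader` and its `lineno` attribute are invented for this example; the demonstration reads from an in-memory file instead of a real one.

```python
import io


class LineCountingReader(object):
    """Wrap a file-like object and count the lines returned by readline()."""

    def __init__(self, handle):
        self.handle = handle
        self.lineno = 0  # number of the line most recently returned

    def readline(self):
        line = self.handle.readline()
        if line:  # an empty string means end of file, so do not count it
            self.lineno += 1
        return line


# Small demonstration: record the line number of every "End Set" line.
reader = LineCountingReader(
    io.StringIO("Set : 1\nDate: 10212009\n12 34 56\nEnd Set\n"))
found = []
while True:
    line = reader.readline()
    if not line:
        break
    if line.startswith("End Set"):
        found.append(reader.lineno)
```

If the loop is written as `for line in handle` instead of calling readline(), the built-in `enumerate(handle, 1)` gives the same numbering without a wrapper class.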

The first thing I'd do is build a generator that uses a simple state machine to extract the data from a sequence:

def extract_data(seq):
    state = "Waiting"
    for item in seq:
        if state == "Waiting":
            # Ignore everything until a line starting with "Set" shows up.
            if item.startswith("Set"):
                state = "SkippingDateLine"
                continue
        if state == "SkippingDateLine":
            # The line right after "Set" is the Date line; drop it.
            state = "EmittingData"
            continue
        if state == "EmittingData":
            if item.startswith("End Set"):
                state = "Waiting"
                continue
            yield item.rstrip()

Now I can test that generator to see if it really does what I think it's doing:

>>> data = """Set : 1 
Date: 10212009 
12 34 56 
25 67 90
End Set 
+++++++++
Set: 2 
Date: 10222009 
34 56 89 
25 67 89 
End Set""".split("\n")

>>> print list(extract_data(data))
['12 34 56', '25 67 90', '34 56 89', '25 67 89']

From here it's simple to make a generator that yields the data from a file given its name:

def extract_data_from_file(filename):
    with open(filename, 'rb') as f:
        for item in extract_data(f):
            yield item

...and to test it:

>>> list(extract_data_from_file(r'c:\temp\test\test1.txt'))
['12 34 56', '25 67 90', '34 56 89', '25 67 89']

Now build a generator that goes through all the text files in a directory:

import os

def extract_data_from_directory(path):
    for filename in os.listdir(path):
        if filename.endswith('.txt'):
            fullname = os.path.join(path, filename)
            for item in extract_data_from_file(fullname):
                yield item

...and then, after making a copy of test1.txt, test it:

>>> list(extract_data_from_directory(r'c:\temp\test'))
['12 34 56', '25 67 90', '34 56 89', '25 67 89', '12 34 56', '25 67 90', '34 56 89', '25 67 89']

Follow the presentation slides from David M. Beazley at http://www.dabeaz.com/generators/ (if you are certain you want to skip the intro, start at page 18).

That is indisputably (in my opinion :) the very best way to solve what you are trying to achieve.

Basically, what you want is generators and os.walk. A snippet from Beazley's code:

import os
import fnmatch
import re
import gzip, bz2

def gen_find(filepat,top):
    for path, dirlist, filelist in os.walk(top):
        for name in fnmatch.filter(filelist,filepat):
            yield os.path.join(path,name)

def gen_open(filenames):
    for name in filenames:
        if name.endswith(".gz"):
            yield gzip.open(name)
        elif name.endswith(".bz2"):
            yield bz2.BZ2File(name)
        else:
            yield open(name)

def gen_cat(sources):
    for s in sources:
        for item in s:
            yield item

def gen_grep(pat, lines):
    patc = re.compile(pat)
    for line in lines:
        if patc.search(line): yield line

# in your example you could set pat as "^[\d ]+$"
pat = r"^[\d ]+$"
lognames = gen_find("access-log*", "/usr/www")
logfiles = gen_open(lognames)
loglines = gen_cat(logfiles)
patlines = gen_grep(pat, loglines)
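To see the pipeline idea applied to the sample data from the question, here is a small self-contained sketch for Python 3. It is a simplification, not Beazley's code: `gen_cat_files` is a made-up helper that merges the open and concatenate steps, `demo` is a hypothetical driver, and the sample file is written to a temporary directory instead of being read from a real path.

```python
import fnmatch
import os
import re
import tempfile


def gen_find(filepat, top):
    """Yield the paths of all files under top whose name matches filepat."""
    for path, dirlist, filelist in os.walk(top):
        for name in fnmatch.filter(filelist, filepat):
            yield os.path.join(path, name)


def gen_cat_files(filenames):
    """Yield every line of every named file, closing each file when done."""
    for name in filenames:
        with open(name) as handle:
            for line in handle:
                yield line


def gen_grep(pat, lines):
    """Yield only the lines that match the regular expression pat."""
    patc = re.compile(pat)
    for line in lines:
        if patc.search(line):
            yield line


def demo():
    sample = ("Set : 1 \nDate: 10212009 \n12 34 56 \n25 67 90\nEnd Set \n"
              "+++++++++\nSet: 2 \nDate: 10222009 \n34 56 89 \n25 67 89 \n"
              "End Set\n")
    with tempfile.TemporaryDirectory() as top:
        with open(os.path.join(top, "test1.txt"), "w") as handle:
            handle.write(sample)
        names = gen_find("*.txt", top)
        lines = gen_cat_files(names)
        matches = gen_grep(r"^[\d ]+$", lines)
        # Nothing is read until the list comprehension pulls on the pipeline.
        return [line.rstrip() for line in matches]
```

The pattern keeps only lines made up entirely of digits and spaces, so the Set, Date, End Set and separator lines are dropped without tracking any state.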

You have many issues here.

  for line in inputfile: 
    line = inputfile_fileHandle.readline() 

inputfile is the name of your file, so this loop is going to be executed once for each character in the name of the file, which is definitely not what you want. You do this twice (nested), so too many lines are certainly being read.
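The difference is easy to check in isolation. In this Python 3 sketch the path is hypothetical and is never opened, and an in-memory file stands in for the real file handle.

```python
import io

filename = "C:\\System_Data\\test1.txt"  # hypothetical path, never opened here

# Iterating over a str yields one character per pass ...
chars = [item for item in filename]

# ... whereas iterating over a file object yields one line per pass.
handle = io.StringIO("Set : 1\nDate: 10212009\n12 34 56\n")
lines = [item for item in handle]
```

Here `chars` starts with `'C'`, `':'`, `'\\'` and has one entry per character of the path, while `lines` holds the three lines of the file.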

The goto module was a joke. Get rid of it.
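If the goal is only to leave two nested loops at once, one common alternative is to put the loops in a function and return from it. A sketch, assuming the lines have already been read into a list; the function name `find_first_set` is invented for this example.

```python
def find_first_set(lines):
    """Return (start, end) line numbers of the first 'Set' ... 'End Set' block.

    Line numbers are 1-based. Returns None when no complete block exists.
    """
    for start, line in enumerate(lines, 1):
        if line.startswith("Set"):
            # Scan the lines after the start line for the end marker.
            for end, later in enumerate(lines[start:], start + 1):
                if later.startswith("End Set"):
                    return start, end  # leaves both loops at once
    return None
```

For the sample file this returns `(1, 5)`. A single pass with a flag or a small state machine avoids the nested loops altogether.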

When you open the file, don't add 'b' to the mode. These are text files, so open them as text, not binary.

I'd probably do something even easier:

grep -E '[0-9][0-9] [0-9][0-9] [0-9][0-9]' *.txt

grep is available for Win32.
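If installing grep is not an option, roughly the same filter can be written with Python's re module. A sketch for Python 3; the names `DATA_LINE` and `grep_data` are made up for this example.

```python
import re

# Same pattern as the grep call: three two-digit numbers separated by spaces.
DATA_LINE = re.compile(r'[0-9][0-9] [0-9][0-9] [0-9][0-9]')


def grep_data(lines):
    """Return the lines that contain three space-separated two-digit numbers."""
    # search() matches anywhere in the line, as grep does.
    return [line.rstrip() for line in lines if DATA_LINE.search(line)]
```

The Date lines are not picked up because their digits contain no spaces. Like the grep call, this relies on every data value having exactly two digits.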

Several errors:

for line in inputfile: 

inputfile is your filename, so the for loop will iterate over each character of the filename.

You need to do

for line in inputfile_fileHandle:

Then, line already contains your current line, so the extra readline() call is not needed. Also, you probably don't need to open the file using 'rb' since it's a text file.

Then you nest an identical for loop inside the first loop (which is currently also completely wrong, iterating over the filename once again).

Not to mention the goto/label nonsense :)

kurosch has written a good version.

f=0
for line in open("file"):    
    if "End Set" in line: f=0
    if "Date" in line: f=1
    elif f: print line.strip()   

Output:

$ ./python.py
12 34 56
25 67 90
34 56 89
25 67 89
