Using python to write specific lines from one file to another file

I have ~200 short text files (50 KB) that all have a similar format. I want to find a line in each of those files that contains a certain string and then write that line plus the next three lines (but not the rest of the lines in the file) to another text file. I am trying to teach myself Python in order to do this and have written a very simple and crude little script to try this out. I am using version 2.6.5 and running the script from the Mac terminal:

#!/usr/bin/env python

f = open('Test.txt')

Lines=f.readlines()
searchquery = 'am\n'
i=0

while i < 500:
    if Lines[i] == searchquery:
        print Lines[i:i+3]
        i = i+1
    else:
        i = i+1
f.close()

This more or less works and prints the output to the screen. But I would like to write the lines to a new file instead, so I tried something like this:

f1 = open('Test.txt')
f2 = open('Output.txt', 'a')

Lines=f1.readlines()
searchquery = 'am\n'
i=0

while i < 500:
if Lines[i] == searchquery:
    f2.write(Lines[i])
    f2.write(Lines[i+1])
    f2.write(Lines[i+2])
    i = i+1
else:
    i = i+1
f1.close()
f2.close()

However, nothing is written to the file. I also tried

from __future__ import print_function
print(Lines[i], file='Output.txt')

and can't get that to work, either. If anyone can explain what I'm doing wrong or offer some suggestions about what I should try instead, I would be really grateful. Also, if you have any suggestions for making the search better, I would appreciate those as well. I have been using a test file where the string I want to find is the only text on the line, but in my real files the string that I need is still at the beginning of the line but followed by a bunch of other text, so I think the way I have things set up now won't really work, either.

Thanks, and sorry if this is a super basic question!

As pointed out by @ajon, I don't think there's anything fundamentally wrong with your code except the indentation. With the indentation fixed it works for me. However, there are a couple of opportunities for improvement.

1) In Python, the standard way of iterating over things is by using a for loop. When using a for loop, you don't need to define loop counter variables and keep track of them yourself in order to iterate over things. Instead, you write something like this

for line in lines:
    print line

to iterate over all the items in a list of strings and print them.

2) In most cases this is what your for loops will look like. However, there are situations where you actually do want to keep track of the loop count. Your case is such a situation, because you need not only that one line but also the next three, and therefore need to use the counter for indexing (lst[i]). For that there's enumerate(), which yields each item together with its index, so that you can loop over both.

for i, line in enumerate(lines):
    print i
    print line
    print lines[i+7]
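
As a small sketch of what enumerate() gives you, the snippet below pairs each line with its index and then collects the positions of the matching lines. It is written for Python 3 (the code in this thread uses Python 2 print statements), and the sample list is made up for illustration.

```python
lines = ['intro\n', 'am here\n', 'next\n', 'am again\n']  # made-up sample data

# enumerate() yields (index, item) pairs
pairs = list(enumerate(lines))

# Collect the positions of the lines that start with the search string
matches = [i for i, line in enumerate(lines) if line.startswith('am')]
print(matches)
```

With the positions in hand, lines[i + 1] and so on can be used to reach the lines that follow a match.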

If you were to manually keep track of the loop counter as in your example, there are two things:

3) That i = i+1 should be moved out of the if and else blocks. You're doing it in both cases, so put it after the if/else. In your case the else block then doesn't do anything any more and can be eliminated:

while i < 500:
    if Lines[i] == searchquery:
        f2.write(Lines[i])
        f2.write(Lines[i+1])
        f2.write(Lines[i+2])
    i = i+1

4) Now, this will cause an IndexError with files shorter than 500 lines. Instead of hard-coding a loop count of 500, you should use the actual length of the sequence you're iterating over. len(lines) will give you that length. But instead of using a while loop, use a for loop and range(len(lst)) to iterate over the indexes from zero to len(lst) - 1.

for i in range(len(lst)):
    print lst[i]

5) open() can be used as a context manager that takes care of closing files for you. Context managers are a rather advanced concept but are pretty simple to use if they're already provided for you. By doing something like this

with open('test.txt', 'w') as f:  # 'w' is needed in order to write
    f.write('foo')

the file will be opened and accessible to you as f inside that with block. After you leave the block the file will be closed automatically, so you can't end up forgetting to close the file.
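
To see the automatic closing in action, here is a minimal sketch, assuming Python 3. It writes to a throwaway file in a temporary directory (the path is hypothetical, chosen only so that the example doesn't touch real data) and checks the file object's closed attribute inside and outside the with block.

```python
import os
import tempfile

# Hypothetical scratch file so the example does not touch real data
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

with open(path, 'w') as f:
    f.write('foo\n')
    inside = f.closed    # False: the file is still open inside the block

outside = f.closed       # True: leaving the block closed the file

with open(path) as f:
    content = f.read()
```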

In your case you're opening two files. This can be done by just using two with statements and nesting them

with open('one.txt', 'w') as f1:
    with open('two.txt', 'w') as f2:
        f1.write('foo')
        f2.write('bar')

or, in Python 2.7 / Python 3.x, by listing two context managers in a single with statement:

with open('one.txt', 'w') as f1, open('two.txt', 'a') as f2:
    f1.write('foo')
    f2.write('bar')

6) Depending on the operating system the file was created on, line endings are different. On UNIX-like platforms it's \n, Macs before OS X used \r, and Windows uses \r\n. So that Lines[i] == searchquery will not match for Mac or Windows line endings. file.readline() can deal with all three, but because it keeps whatever line endings were there at the end of the line, the comparison will fail. This is solved by using str.strip(), which strips all whitespace from the beginning and the end of the string, and comparing the result to a search pattern without the line ending:

searchquery = 'am'
# ...
if line.strip() == searchquery:
    # ...

(Reading the file using file.read() and using str.splitlines() would be another alternative.)
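
The effect of the different line endings can be checked with a few made-up strings. This is a Python 3 sketch; the sample data is invented for illustration.

```python
query = 'am'

# The same logical line with Unix, Windows and old-Mac line endings
variants = ['am\n', 'am\r\n', 'am\r']

# Comparing against 'am\n' only works for the Unix variant
exact = [line == 'am\n' for line in variants]

# Stripping whitespace first makes all three compare equal to the query
stripped = [line.strip() == query for line in variants]

# str.splitlines() splits on all three conventions and drops the endings
text = 'am\nsecond\r\nthird\rfourth'
parts = text.splitlines()
```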

But since you mentioned that your search string actually appears at the beginning of the line, let's do that by using str.startswith():

if line.startswith(searchquery):
    # ...
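
A quick illustration of the difference between the two tests, again as a Python 3 sketch with a made-up line:

```python
query = 'am'
line = 'am followed by a bunch of other text\r\n'   # made-up example line

exact_match = line.strip() == query            # False: the line holds more than the query
prefix_match = line.startswith(query)          # True: only the beginning has to match
mid_match = 'I am here\n'.startswith(query)    # False: the query is not at the start
```

Note that startswith() would also match a line beginning with a longer word such as 'amber'. If that matters for your data, including a delimiter in the search string (for example 'am ') is one way around it.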

7) The official style guide for Python, PEP 8, recommends using CamelCase for classes and lowercase_underscore for pretty much everything else (variables, functions, attributes, methods, modules, packages). So instead of Lines use lines. This is definitely a minor point compared to the others, but still worth getting right early on.


So, considering all those things, I would write your code like this:

searchquery = 'am'

with open('Test.txt') as f1:
    with open('Output.txt', 'a') as f2:
        lines = f1.readlines()
        for i, line in enumerate(lines):
            if line.startswith(searchquery):
                f2.write(line)
                f2.write(lines[i + 1])
                f2.write(lines[i + 2])

As @TomK pointed out, all this code assumes that if your search string matches, there are at least two lines following it. If you can't rely on that assumption, dealing with that case by using a try...except block as @poorsod suggested is the right way to go.
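
Besides try...except, another option is to slice the list, since a slice is simply shorter when the match is near the end of the file and never raises IndexError. The following is only a sketch under a few assumptions: it targets Python 3, the function name copy_matches and its extra parameter are made up for illustration, and the demo runs on a throwaway file in a temporary directory.

```python
import os
import tempfile

def copy_matches(src_path, dst_path, query, extra=2):
    """Append each line starting with `query`, plus up to `extra` following lines, to dst_path."""
    with open(src_path) as src, open(dst_path, 'a') as dst:
        lines = src.readlines()
        for i, line in enumerate(lines):
            if line.startswith(query):
                # The slice is simply shorter when the match is near the end of the file
                dst.writelines(lines[i:i + 1 + extra])

# Demo on a throwaway file; the match is the second-to-last line
tmp = tempfile.mkdtemp()
src_path = os.path.join(tmp, 'Test.txt')
dst_path = os.path.join(tmp, 'Output.txt')
with open(src_path, 'w') as f:
    f.write('header\nam here\nlast\n')

copy_matches(src_path, dst_path, 'am')
with open(dst_path) as f:
    result = f.read()
```

Passing extra=3 would give the "line plus the next three" behaviour asked for in the question. For the ~200 input files, one possibility is to call the function once per path returned by glob.glob() with a suitable pattern.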

I think your problem is the indentation in the second script.

You need to indent everything from if Lines[i] through the last i=i+1, like this:

while i < 500:
    if Lines[i] == searchquery:
        f2.write(Lines[i])
        f2.write(Lines[i+1])
        f2.write(Lines[i+2])
        i = i+1
    else:
        i = i+1

ajon has the right answer, but since you are looking for guidance: your solution doesn't take advantage of the high-level constructs that Python can offer. How about:

searchquery = 'am\n'

with open('Test.txt') as f1:
  with open('Output.txt', 'a') as f2:

    Lines = f1.readlines()

    try:
      i = Lines.index(searchquery)
      for iline in range(i, i+3):
        f2.write(Lines[iline])
    except (ValueError, IndexError):
      print "not in file"

The two with statements will automatically close the files at the end, even if an exception happens.

A still better solution would be to avoid reading in the whole file at once (who knows how big it could be?) and instead process it line by line, iterating over the file object:

with open('Test.txt') as f1:
  with open('Output.txt', 'a') as f2:
    for line in f1:
      if line == searchquery:
        f2.write(line)
        # next() pulls the following line from the same file iterator
        f2.write(next(f1))
        f2.write(next(f1))

All of these assume that there are at least two additional lines beyond your target line.
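
If that assumption might not hold, one way to keep the line-by-line approach is itertools.islice, which takes at most a given number of further items from the iterator and just stops early if the input ends first. This is a sketch for Python 3; the function name copy_following is made up, and io.StringIO objects stand in for real files opened with open().

```python
import io
from itertools import islice

def copy_following(src, dst, query, extra=2):
    """Write each line starting with `query` and up to `extra` lines after it."""
    for line in src:
        if line.startswith(query):
            dst.write(line)
            # islice pulls at most `extra` more lines from the same iterator
            # and just stops early if the file ends first
            dst.writelines(islice(src, extra))

# io.StringIO objects stand in for real files opened with open()
src = io.StringIO('skip\nam 1\na\nb\nc\nam 2\nz\n')
dst = io.StringIO()
copy_following(src, dst, 'am')
result = dst.getvalue()
```

One thing to be aware of: lines consumed as "following lines" are not tested again, so a second match that falls within the lines right after a first match is copied but does not start a new block of its own.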

Have you tried using a file name other than 'Output.txt', to rule out filesystem-related issues as the cause?

What about using an absolute path, to avoid any funky unforeseen problems while diagnosing this?

This advice is simply from a diagnostic standpoint. Also check out the OS X tools dtrace and dtruss.

See: Equivalent of strace -feopen < command > on mac os X

Writing line by line can be slow when working with large data. You can speed up the read/write operations by reading/writing a bunch of lines at once.

from itertools import islice

f1 = open('Test.txt')
f2 = open('Output.txt', 'a')

bunch = 500
lines = list(islice(f1, bunch)) 
f2.writelines(lines)

f1.close()
f2.close()

In case your lines are very long, and depending on your system, you may not be able to hold 500 lines in a list. If that's the case, you should reduce the bunch size and use as many read/write steps as needed to write the whole thing.
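
The repeated read/write steps could look roughly like the loop below. It is a sketch for Python 3; the function name copy_in_chunks is made up for illustration, and io.StringIO objects are used in place of real files so the example is self-contained.

```python
import io
from itertools import islice

def copy_in_chunks(src, dst, bunch=500):
    """Copy all lines from src to dst, `bunch` lines at a time; returns the number of chunks."""
    chunks = 0
    while True:
        lines = list(islice(src, bunch))
        if not lines:        # an empty list means the input is exhausted
            break
        dst.writelines(lines)
        chunks += 1
    return chunks

# 1200 made-up lines are copied in steps of 500, 500 and 200
src = io.StringIO(''.join('line %d\n' % n for n in range(1200)))
dst = io.StringIO()
chunks = copy_in_chunks(src, dst)
```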
