Read between lines in text file

First of all, the contents of my example text file look like this:

Some Data
Nothing important
Start here
This is important
Grab this line too
And this ono too
End here
Text goes on, but isn't important
Next text
Blaah

Now I want to read in the text file and grab only the lines between "Start here" and "End here".

So my Python code looks like this:

filename = 'example_file.txt'

with open(filename, 'r') as input:
   for line in input: # First loop breaks at specific line
       if 'Start here' in line:
           break

   for line_1 in input: # Second loop grabs all lines
       print line_1.strip()

   for line_2 in input: # Third loop breaks at specific line
       if 'End here' in line_2:
           break

But it doesn't work.

Here is my output when I run it:

This is important
Grab this line too
And this ono too
End here
Text goes on, but isn't important
Next text
Blaah

As you can see, my script doesn't break at End here. The program begins at the correct line, but it doesn't stop at the correct line.

What's wrong?

It's the second loop that needs the break...

for line_1 in input:
    if 'End here' in line_1:
        break
    print(line_1.strip())

The problem is that you need to check for 'End here' in your second loop, because the second and third loops don't run at the same time: the second loop consumes the rest of the file before the third one starts. In fact, the body of the third loop never executes, because the file has already been read to the end.
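A minimal sketch of why the third loop has nothing left to read: a file object is its own iterator, so each `for` loop over it continues where the previous one stopped. Here `io.StringIO` stands in for the opened file, and the sample text is made up for illustration, so the snippet runs without a file on disk.

```python
import io

# io.StringIO stands in for the opened file; it iterates line by line like a file object.
f = io.StringIO("Start here\nkeep me\nEnd here\ntrailing\n")

for line in f:  # first loop stops right after the start marker
    if 'Start here' in line:
        break

second = [line.strip() for line in f]  # consumes everything that is left
third = [line.strip() for line in f]   # nothing remains to iterate over

print(second)
print(third)
```

The second list holds every remaining line, including the end marker, and the third list is empty.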

With that in mind, this code will work:

filename = 'mydata.txt'

with open(filename, 'r') as f:
    for line in f:
        if 'Start here' in line:
            break

    for line_1 in f:
        if 'End here' in line_1:
            break
        else:
            print(line_1.strip())

However, there are still some simplifications we can make:

  • a for loop simply rebinds its variable on every iteration (the variable is not scoped to the loop), and we no longer need the value left over from the first loop, so we can reuse the name line;
  • any code after break won't run anyway, so we can get rid of the else;
  • open uses read mode by default.

With this in mind, your final code would look like this:

filename = 'mydata.txt'

with open(filename) as f:
    for line in f:
        if 'Start here' in line:
            break

    for line in f:
        if 'End here' in line:
            break
        print(line.strip())

Run that, and you'll get the desired output:

This is important
Grab this line too
And this ono too
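If the same extraction is needed in more than one place, the two loops can be wrapped in a generator. This is only a sketch: the function name lines_between and its default marker arguments are hypothetical, and `io.StringIO` again stands in for a real file.

```python
import io

def lines_between(lines, start='Start here', end='End here'):
    """Yield stripped lines found after `start` and before `end`."""
    it = iter(lines)  # one shared iterator, so the second loop resumes after the first
    for line in it:   # skip everything up to and including the start marker
        if start in line:
            break
    for line in it:   # yield lines until the end marker shows up
        if end in line:
            break
        yield line.strip()

sample = io.StringIO(
    "Some Data\nStart here\nThis is important\nGrab this line too\n"
    "End here\nNext text\n"
)
result = list(lines_between(sample))
print(result)
```

Because the function calls iter() on its argument, it accepts an open file or a plain list of lines. If the start marker never appears, it yields nothing.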

You can use regular expressions (the re module) with the re.DOTALL option, so that the dot also matches newline characters.

import re

source = """Some Data
Nothing important
Start here
This is important
Grab this line too
And this ono too
End here
Text goes on, but isn't important
Next text
Blaah"""

# or else:
# source = open(filename, 'r').read() # or similar

result = re.search("Start here(.*)End here", source, re.DOTALL).group(1).strip()

print(result)

> This is important
> Grab this line too
> And this ono too

Why it works:

  • re.search looks for the first place in the string where the pattern matches;
  • Parentheses create capture groups. Group 0 is the whole match, and group 1 is the text matched by the first pair of parentheses. Groups are numbered by their opening parenthesis and can be nested;
  • .* means "any character, any number of times". It captures everything between the two hard-coded markers (namely Start here and End here);
  • re.DOTALL is the key: by default the dot matches any character except a newline, and with re.DOTALL it matches newlines as well, so the match can span several lines;
  • group(1) returns the first capture group, which is the text inside the parentheses.
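The points above can be checked with a short sketch. The sample text is made up and deliberately contains two End here markers, to show a caveat of the greedy .* used above: it runs to the last marker, whereas the lazy form .*? stops at the first one.

```python
import re

text = "skip\nStart here\nfirst\nsecond\nEnd here\nother\nEnd here\n"

# Without re.DOTALL the dot stops at newlines, so the pattern cannot span several lines.
no_flag = re.search(r"Start here(.*)End here", text)

# Greedy .* runs to the LAST "End here"; lazy .*? stops at the first one.
greedy = re.search(r"Start here(.*)End here", text, re.DOTALL)
lazy = re.search(r"Start here(.*?)End here", text, re.DOTALL)

print(no_flag)                                # None: no match without the flag
print(greedy.group(1).strip().splitlines())   # includes the first "End here" line
print(lazy.group(1).strip().splitlines())     # only the lines up to the first marker
print(lazy.group(0))                          # group 0 is the whole match, markers included
```

Note that re.search returns None when nothing matches, so real code should check the result before calling group().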

You can read all lines first and iterate over them by index:

filename = 'example_file.txt'

useful_content = []
with open(filename, 'r') as f:
    all_lines = f.readlines()  # read all lines into a list

for idx in range(len(all_lines)):  # scan for the start marker
    if 'Start here' in all_lines[idx]:
        idx = idx + 1  # skip the marker line itself
        # found the start of the useful content; collect lines until the
        # end marker (or the end of the file, if the marker is missing)
        while idx < len(all_lines) and 'End here' not in all_lines[idx]:
            useful_content.append(all_lines[idx].strip())
            idx = idx + 1
        break

for line in useful_content:
    print(line)
