Extracting data from a file in Python

Question

I am trying to extract the blocks of text that contain a specific line from a text file:

----
data1
data1
data1
extractme
----
data2
data2
data2
----
data3
data3
extractme
----

and then dump them to a text file so that it looks like this:

----
data1
data1
data1
extractme
---
data3
data3
extractme
---

Thanks for the help.

Answer 1

This works well enough for me. It assumes your sample data is in a file called "data.txt" and writes the output to "result.txt":

with open("data.txt") as inFile, open("result.txt", "w") as outFile:
    buffer = []
    # starts as True, so the first separator line (and anything before it) is written
    keepCurrentSet = True
    for line in inFile:
        buffer.append(line)
        if line.startswith("----"):
            # "----" closes the current data set and starts a new one
            if keepCurrentSet:
                outFile.write("".join(buffer))
            # now reset our state
            keepCurrentSet = False
            buffer = []
        elif line.startswith("extractme"):
            keepCurrentSet = True
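
One caveat about the loop above: a data set is only written when the next "----" line arrives, so a final data set that is not closed by a separator would be dropped. Below is a minimal sketch of the same idea with a final flush. The function name filter_blocks and its sep and marker parameters are my own, not part of the answer, and it takes already-open file objects so that it can be tried on in-memory data.

```python
import io


def filter_blocks(in_file, out_file, sep="----", marker="extractme"):
    """Copy data sets that contain a marker line from in_file to out_file."""
    buffer = []
    keep = True  # mirrors the answer: the leading separator is kept
    for line in in_file:
        buffer.append(line)
        if line.startswith(sep):
            # A separator closes the current data set.
            if keep:
                out_file.write("".join(buffer))
            keep = False
            buffer = []
        elif line.startswith(marker):
            keep = True
    # Flush a final data set that was not closed by a separator.
    if buffer and keep:
        out_file.write("".join(buffer))


# Sample input whose last data set has no closing "----" line.
sample = "----\ndata1\nextractme\n----\ndata2\n----\ndata3\nextractme\n"
out = io.StringIO()
filter_blocks(io.StringIO(sample), out)
print(out.getvalue(), end="")
```

With real files, the two arguments would be the objects returned by open("data.txt") and open("result.txt", "w").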

Answer 2

I imagine the change in the number of dashes (4 in the input, sometimes 4 and sometimes 3 in the output) is an error and not actually desired, since nothing even hints at a rule for how many dashes should be output on different occasions.

I would structure the task in terms of reading and yielding one block of lines at a time:

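def readbyblock(f):
    block = []
    for line in f:
        if line == '----\n':
            # a separator ends the current block; skip empty blocks
            # (e.g. the one before a leading separator)
            if block: yield block
            block = []
        else:
            block.append(line)
    # a final block not followed by a separator
    if block: yield block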

so that the (selective) output can be neatly separated from the input:

with open('infile.txt') as fin:
    with open('outfile.txt', 'w') as fou:
        for block in readbyblock(fin):
            if 'extractme\n' in block:
                fou.writelines(block)
                fou.write('----\n')

This is not optimal performance-wise if the blocks are large, since the 'in' test in the if clause implies a separate loop over all lines in the block. So, a good refactoring might be:

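def selectivereadbyblock(f, marker='extractme\n'):
    block = []
    extract = False
    for line in f:
        if line == '----\n':
            # a separator ends the current block; yield it only if marked
            if extract: yield block
            block = []
            extract = False
        else:
            block.append(line)
            if line == marker: extract = True
    # a final block not followed by a separator
    if extract: yield block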

with open('infile.txt') as fin:
    with open('outfile.txt', 'w') as fou:
        for block in selectivereadbyblock(fin):
            fou.writelines(block)
            fou.write('----\n')

Parameterizing the separators (now hard-coded as '----\n' for both input and output) is another reasonable coding tweak.
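
As a rough illustration of that tweak, here is a minimal sketch with the input and output separators passed in as arguments. The names select_blocks, write_blocks, in_sep and out_sep are hypothetical, chosen only for this example, and lines are compared with the trailing newline stripped, which is an assumption on my part rather than something the answer specifies.

```python
import io


def select_blocks(lines, marker="extractme", in_sep="----"):
    """Yield each block of lines that contains a line equal to marker.

    Blocks are delimited by lines equal to in_sep; the comparison ignores
    the trailing newline.
    """
    block = []
    keep = False
    for line in lines:
        text = line.rstrip("\n")
        if text == in_sep:
            if block and keep:
                yield block
            block, keep = [], False
        else:
            block.append(line)
            if text == marker:
                keep = True
    # A final block that is not followed by a separator.
    if block and keep:
        yield block


def write_blocks(blocks, out, out_sep="----"):
    """Write each block followed by the output separator line."""
    for block in blocks:
        out.writelines(block)
        out.write(out_sep + "\n")


sample = "----\ndata1\nextractme\n----\ndata2\n----\ndata3\nextractme\n----\n"
out = io.StringIO()
write_blocks(select_blocks(io.StringIO(sample)), out)
result = out.getvalue()
print(result, end="")
```

Keeping the two separators as independent parameters means the output can use a different delimiter from the input, should the varying dash counts in the question turn out to be intentional.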

Answer 3

For Python 2:

#!/usr/bin/env python

with open("infile.txt") as infile:
    with open("outfile.txt","w") as outfile:
        collector = []
        for line in infile:
            if line.startswith("----"):
                collector = []
            collector.append(line)
            if line.startswith("extractme"):
                for outline in collector:
                    outfile.write(outline)

For Python 3:

#!/usr/bin/env python3

with open("infile.txt") as infile, open("outfile.txt","w") as outfile:
    collector = []
    for line in infile:
        if line.startswith("----"):
            collector = []
        collector.append(line)
        if line.startswith("extractme"):
            for outline in collector:
                outfile.write(outline)

Answer 4

# split the whole file on the separator and keep the chunks mentioning "extractme"
with open("file") as f:
    data = f.read().split("----")
print('----'.join([i for i in data if "extractme" in i]))
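
A few properties of the split-based approach are worth keeping in mind: it reads the whole file into memory, the substring test also matches lines such as "dontextractme", and the joined result has no separator before the first or after the last chunk. The sketch below is one possible variant that matches the marker as a whole line and restores the outer separators. The helper names extract_blocks and render are made up for this example, and it assumes every line, including the last, ends with a newline.

```python
def extract_blocks(text, sep="----", marker="extractme"):
    """Return the separator-delimited chunks that have marker as a whole line."""
    kept = []
    for chunk in text.split(sep + "\n"):
        # splitlines() gives whole lines, so "dontextractme" does not match.
        if marker in chunk.splitlines():
            kept.append(chunk)
    return kept


def render(chunks, sep="----"):
    """Put a separator line before each chunk and one after the last."""
    if not chunks:
        return ""
    return "".join(sep + "\n" + chunk for chunk in chunks) + sep + "\n"


sample = "----\ndata1\nextractme\n----\ndata2\n----\ndata3\nextractme\n----\n"
print(render(extract_blocks(sample)), end="")
```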

Answer 1
5 accepted 2010-03-19 01:30:07

Answer 2
5 2010-03-19 01:41:46

Answer 3
2 2010-03-19 00:22:23

Answer 4
1 2010-03-19 01:44:38
