Parsing a string pattern (Python)

I have a file with the following data:

<<row>>12|xyz|abc|2.34<</row>>
<<eof>>

The file may contain several rows like this. I am trying to design a parser that parses each row in the file and returns an array of all rows. What would be the best way to do it? The code has to be written in Python. It should either skip lines that do not start with <<row>> or raise an error.

=======> UPDATE <========

I just found out that a single <<row>> can span multiple lines, so my code and the code below no longer work. Can someone please suggest an efficient solution?

The data files can contain anywhere from hundreds to several thousand rows.

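import re

def parseFile(fileName):
  with open(fileName) as f:

    def parseLine(line):
      # Four '|'-separated fields: digits, word, word, decimal number.
      m = re.match(r'<<row>>(\d+)\|(\w+)\|(\w+)\|([\d.]+)<</row>>$', line)
      if m:
        return m.groups()

    # Keep only lines that start with <<row>> and match the full pattern.
    return [ values for values in (
      parseLine(line)
        for line in f
        if line.startswith('<<row>>')) if values ]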

And? Am I different? ;-)

A simple way without regular expressions:

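output = []
with open('input.txt', 'r') as f:
    for line in f:
        # Strip first: lines read from a file end with '\n',
        # so an unstripped line never equals '<<eof>>'.
        line = line.strip()
        if line == '<<eof>>':
            break
        elif not line.startswith('<<row>>'):
            continue
        else:
            # Cut off '<<row>>' (7 characters) and '<</row>>' (8 characters).
            output.append(line[7:-8].split('|'))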

This uses every line that starts with <<row>> until it reaches a line containing only <<eof>>.
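The question also allows raising an error instead of skipping bad lines. A minimal sketch of that stricter variant follows, assuming every row fits on one line; the function name `parse_rows_strict` and the choice to tolerate blank lines are assumptions made for illustration, not part of the original answer.

```python
def parse_rows_strict(lines):
    """Parse an iterable of lines; raise ValueError on any malformed line."""
    start, end = '<<row>>', '<</row>>'
    rows = []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if line == '<<eof>>':
            break  # end-of-file marker: stop reading
        if not line:
            continue  # blank lines are tolerated (an assumption)
        if not (line.startswith(start) and line.endswith(end)):
            raise ValueError(f'line {number}: not a complete row: {line!r}')
        # Keep the text between the two markers and split it into columns.
        rows.append(line[len(start):-len(end)].split('|'))
    return rows
```

Because it accepts any iterable of lines, it can be fed an open file directly, for example `with open('input.txt') as f: rows = parse_rows_strict(f)`.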

A good practice is to test for unwanted cases first and skip them. Once you are sure that the line is compliant, you process it. Note that the actual processing is not nested inside an if statement. When rows are not split across several lines, you need only two tests:

rows = []
with open('newfile.txt') as file:
    for line in file:
        line = line.strip()
        if not line.startswith('<<row>>'):
            continue  # not the start of a row
        if not line.endswith('<</row>>'):
            continue  # row is not closed on this line
        row = line[7:-8]  # text between '<<row>>' and '<</row>>'
        rows.append(row)

When rows are split across several lines, you also need to save the unfinished part of the row in some situations:

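rows = []
prev = None  # unfinished row carried over from earlier lines
with open('newfile.txt') as file:
    for line in file:
        line = line.strip()
        if not line.startswith('<<row>>') and prev is not None:
            line = prev + line  # continuation of an unfinished row
        if not line.startswith('<<row>>'):
            continue  # stray line outside any row
        if not line.endswith('<</row>>'):
            prev = line  # row is still open: remember it and read on
            continue
        row = line[7:-8]  # text between '<<row>>' and '<</row>>'
        rows.append(row)
        prev = None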

If needed, you can split the columns with: cols = row.split('|')
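For files of a few thousand rows, another option is to read the whole text at once and let a regular expression find each row, including rows that span several lines. The sketch below is an alternative to the line-by-line loop, not the approach used in the answer; the names `ROW_RE` and `parse_text` are hypothetical, and it assumes that line breaks inside a row should simply be dropped and that anything after <<eof>> can be ignored.

```python
import re

# re.DOTALL lets '.' match newlines, so one row may span several lines.
# The non-greedy '.*?' stops at the first closing marker.
ROW_RE = re.compile(r'<<row>>(.*?)<</row>>', re.DOTALL)

def parse_text(text):
    """Return a list of column lists, one per <<row>>...<</row>> block."""
    # Ignore everything after the end-of-file marker, if there is one.
    text = text.split('<<eof>>', 1)[0]
    rows = []
    for match in ROW_RE.finditer(text):
        # Strip each physical line and join the pieces without a separator.
        body = ''.join(part.strip() for part in match.group(1).splitlines())
        rows.append(body.split('|'))
    return rows
```

Text outside the markers is silently skipped here, so this version suits the "ignore bad lines" requirement rather than the "raise an error" one. It would typically be called as `parse_text(open_file.read())` on an already opened file.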
