How to parse this text file with regexp, in Python?

I need to parse a text file that contains the logins and IDs of users:

+----+---------------+---------------+
| Id | Login         | Name          |
+----+---------------+---------------+
| 1  | admin         | admin         |
| 2  | admin2        | admin2        |
| 3  | ekaterina     | Ekaterina     |
| 4  | commarik      | commarik      |
| 5  | basildrescher | BasilDrescher |
| 6  | danielalynn   | DanielaLynn   |
| 7  | rosez13yipfj  | RoseZ13yipfj  |
| 8  | veolanoyes    | VeolaNoyes    |
| 9  | angel         | Angel         |
| 10 | michalea44    | MichaleA44    |
+----+---------------+---------------+

So I use re, like this:

import re
fh = open('test1.txt')
lines = fh.readlines()
for line in lines:
        #print line
        p = re.compile(r"|(.*?)|")
        m2 = p.search(line)
        if m2:
                print m2.group(0)

The problem is that I can't get the result I need! I've tried various combinations with spaces and tabs, but nothing worked. I solved this with split(), but I still want to understand where I went wrong. Any help would be appreciated. Thank you!

You have multiple errors:

  • The | is not escaped.
  • You only have one group, so you are extracting only the first column.

The regex should be like this:

\|(.*?)\|(.*?)\|(.*?)\|

You can see a demo here.
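A minimal sketch of how the corrected pattern might be applied line by line, written in Python 3 syntax. The `sample` string is an assumed stand-in for the contents of test1.txt, and the `rows` name is introduced here for illustration only.

```python
import re

# Stand-in for the file contents (an assumption; the real data lives in test1.txt).
sample = """\
+----+---------------+---------------+
| Id | Login         | Name          |
+----+---------------+---------------+
| 1  | admin         | admin         |
| 2  | admin2        | admin2        |
+----+---------------+---------------+"""

# Escaped pipes match a literal "|"; three groups capture the three columns.
pattern = re.compile(r"\|(.*?)\|(.*?)\|(.*?)\|")

rows = []
for line in sample.splitlines():
    m = pattern.search(line)
    if m:  # border lines contain no "|" and are skipped
        rows.append(tuple(field.strip() for field in m.groups()))

print(rows)
```

The header row also matches this pattern, so it ends up as the first entry of `rows`; drop it with `rows[1:]` if only the data is wanted.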

If you don't expect fancy data, you can just match word characters and digits:

r"([\d\w]+)"

Sample usage below:

In [27]: data = """+----+---------------+---------------+
....:     | Id | Login         | Name          |
....:     +----+---------------+---------------+
....:     | 1  | admin         | admin         |
....:     | 2  | admin2        | admin2        |
....:     | 3  | ekaterina     | Ekaterina     |
....:     | 4  | commarik      | commarik      |
....:     | 5  | basildrescher | BasilDrescher |
....:     | 6  | danielalynn   | DanielaLynn   |
....:     | 7  | rosez13yipfj  | RoseZ13yipfj  |
....:     | 8  | veolanoyes    | VeolaNoyes    |
....:     | 9  | angel         | Angel         |
....:     | 10 | michalea44    | MichaleA44    |
....:     +----+---------------+---------------+"""

In [32]: matches = re.findall(r"([\d\w]+)", data)
In [36]: matches
Out[36]: ['Id', 'Login', 'Name', '1', 'admin', 'admin', '2', 'admin2', 'admin2', '3', 'ekaterina', 'Ekaterina', '4', 'commarik', 'commarik', '5', 'basildrescher', 'BasilDrescher', '6', 'danielalynn', 'DanielaLynn', '7', 'rosez13yipfj', 'RoseZ13yipfj', '8', 'veolanoyes', 'VeolaNoyes', '9', 'angel', 'Angel', '10', 'michalea44', 'MichaleA44']
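The flat list above loses the row boundaries. One possible way to recover them, assuming every cell is a single word with no spaces and every row has exactly three columns, is to slice the list in steps of three. This is a sketch in Python 3 syntax with a shortened `data` string; the `header` and `records` names are illustrative.

```python
import re

# Shortened stand-in for the table (borders omitted for brevity).
data = """| Id | Login  | Name   |
| 1  | admin  | admin  |
| 2  | admin2 | admin2 |"""

# \w already covers digits, so \d inside the character class is redundant.
matches = re.findall(r"\w+", data)

# Regroup the flat list into tuples of three columns each.
header, *records = [tuple(matches[i:i + 3]) for i in range(0, len(matches), 3)]

print(header)
print(records)
```

If a name could contain a space or punctuation, this grouping would drift out of alignment, so a pattern anchored on the pipe delimiters is the safer choice for less regular data.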

| is a special character in regular expressions for "or"ing two expressions together. You need to escape it as \| to match the actual character. Also, search() finds only one match. You may want to look at other methods such as findall.
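A short sketch (Python 3 syntax) of both points, using one row of the table as an assumed input. The lookahead in the last pattern is a choice made here, not part of the answer above: it leaves the closing pipe unconsumed so that it can also open the next cell.

```python
import re

line = "| 1  | admin         | admin         |"

# Unescaped: "|" means alternation and the first alternative is empty,
# so the search succeeds immediately with an empty match.
unescaped = re.search(r"|(.*?)|", line)
print(repr(unescaped.group(0)))

# Escaped: "\|" matches a literal pipe, but search() still returns
# only the first match.
first = re.search(r"\|(.*?)\|", line)
print(repr(first.group(1)))

# findall() returns every non-overlapping match; the lookahead (?=\|)
# checks for the closing pipe without consuming it.
cells = [c.strip() for c in re.findall(r"\|(.*?)(?=\|)", line)]
print(cells)
```

This also explains the original symptom: the match object is truthy, so the `if` branch runs, but `group(0)` is an empty string and nothing useful is printed.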

Try this regex, which captures each column of a row in its own capture group and checks each column against its expected syntax:

\|\s*([0-9]+)\s*\|\s*([\w]+)\s*\|\s*([\w]+)\s*\|

Or, use this one to capture the same way you were trying to above (which will also match the header row):

\|\s*(.*?)\s*\|\s*(.*?)\s*\|\s*(.*?)\s*\|

Here's a demo of the first.

As two other people have already said, you didn't escape your pipe character, which was messing things up.

Also, you weren't taking into account the whitespace around the words, so I added the \s pattern and kept it outside the capture groups so that the captured values come out already trimmed.
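To illustrate the difference between the two patterns, here is a small sketch in Python 3 syntax. The names `strict` and `loose` are labels chosen here, and the two input lines are copied from the table in the question.

```python
import re

# First pattern: the Id column must be digits, so the header row is rejected.
strict = re.compile(r"\|\s*([0-9]+)\s*\|\s*([\w]+)\s*\|\s*([\w]+)\s*\|")
# Second pattern: any content is accepted, so the header row matches too.
loose = re.compile(r"\|\s*(.*?)\s*\|\s*(.*?)\s*\|\s*(.*?)\s*\|")

header = "| Id | Login         | Name          |"
row = "| 10 | michalea44    | MichaleA44    |"

print(strict.search(header))          # no match for the header
print(strict.search(row).groups())    # trimmed values, no strip() needed
print(loose.search(header).groups())  # header captured as well
```

Because \s* sits outside the groups, the captured strings carry no padding, which removes the need for a separate strip() call.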

Yes, something like the below would work:

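import re

# Compile once, outside the loop; each named group captures one column.
p = re.compile(r"\|(?P<id>.*)\|(?P<login>.*)\|(?P<name>.*)\|")

with open('test1.txt') as fh:
    lines = fh.readlines()

for line in lines[2:]:  # skip the top border and the header row
    m = p.search(line)  # match once and reuse the result
    if m:
        user_id = m.group('id').strip()  # avoid shadowing the built-in id()
        login = m.group('login').strip()
        name = m.group('name').strip()
        print(" ".join([user_id, login, name]))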
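As a variation on the named-group idea, the whole text can be scanned in one pass with re.MULTILINE and finditer(), collecting each row as a dictionary. This is only a sketch in Python 3 syntax: the `text` string is an assumed stand-in for the file contents, and the `row_re` and `users` names are hypothetical.

```python
import re

# Stand-in for the file contents (an assumption, shortened for brevity).
text = """\
+----+--------+--------+
| Id | Login  | Name   |
+----+--------+--------+
| 1  | admin  | admin  |
| 2  | admin2 | admin2 |
+----+--------+--------+"""

# With re.MULTILINE, ^ and $ anchor at the start and end of every line.
# Requiring digits in the id column skips the header row automatically.
row_re = re.compile(
    r"^\|\s*(?P<id>\d+)\s*\|\s*(?P<login>\w+)\s*\|\s*(?P<name>\w+)\s*\|$",
    re.MULTILINE,
)

# groupdict() maps each group name to its captured text.
users = [m.groupdict() for m in row_re.finditer(text)]
print(users)
```

Each entry in `users` then has the keys `id`, `login`, and `name`, with the id kept as a string.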
