Python: all lines, with line numbers, in which a string occurs in the input file

Question

I want to print all the lines in which a string occurs in the input file, along with the line numbers. So far I wrote the code shown below. It is working, but not in the way I wanted:

def index(filepath, keyword):

    with open(filepath) as f:
        for lineno, line in enumerate(f, start=1):
            matches = [k for k in keyword if k in line]
            if matches:
                result = "{:<15} {}".format(','.join(matches), lineno)
                print(result)
                print (line)

index('deneme.txt', ['elma'])

Output is as follows:

elma            15
Sogan+Noun ,+Punc domates+Noun ,+Punc patates+Noun ,+Punc elma+Noun ve+Conj turunçgil+Noun+A3pl ihracat+Noun+P3sg+Dat devlet+Noun destek+Noun+P3sg ver+Verb+Pass+Prog2+Cop .+Punc

So far so good. But when I enter a keyword like "Sog", it also finds Sogan, which I don't want; I only want to check tokens between whitespace. I think I need a regex for this, and I have one, but I don't know how to add it to this code.

r'[\w+]+'

Answer 1

You could use the following regex:

import re

lines = [
    'Sogan+Noun ,+Punc domates+Noun ,+Punc patates+Noun ,+Punc elma+Noun ve+Conj turunçgil+Noun+A3pl ihracat+Noun+P3sg+Dat devlet+Noun destek+Noun+P3sg ver+Verb+Pass+Prog2+Cop .+Punc',
    'Sog+Noun ,+Punc domates+Noun ,+Punc patates+Noun ,+Punc elma+Noun ve+Conj turunçgil+Noun+A3pl ihracat+Noun+P3sg+Dat devlet+Noun destek+Noun+P3sg ver+Verb+Pass+Prog2+Cop .+Punc',
]

keywords = ['Sog']
pattern = re.compile(r'(\w+)\+')  # raw string, so the backslashes reach the regex engine unchanged

for lineno, line in enumerate(lines):
    words = set(m.group(1) for m in pattern.finditer(line))  # convert to set for efficiency
    matches = [keyword for keyword in keywords if keyword in words]
    if matches:
        result = "{:<15} {}".format(','.join(matches), lineno)
        print(result)
        print(line)

Output

Sog             1
Sog+Noun ,+Punc domates+Noun ,+Punc patates+Noun ,+Punc elma+Noun ve+Conj turunçgil+Noun+A3pl ihracat+Noun+P3sg+Dat devlet+Noun destek+Noun+P3sg ver+Verb+Pass+Prog2+Cop .+Punc

Explanation

The pattern '(\w+)\+' matches any group of word characters followed by a + character. + is a special character, so you need to escape it in order to match it literally. Then use group to extract the matching group (i.e. the group of word characters).
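To make the explanation concrete, here is a minimal sketch of what the pattern extracts from one line. The sample line is a shortened version of the one in the question; the variable names are only illustrative.

```python
import re

# Same pattern as in the answer: a run of word characters followed by a literal '+'.
pattern = re.compile(r'(\w+)\+')

line = 'Sogan+Noun ,+Punc elma+Noun ve+Conj ver+Verb+Pass+Prog2+Cop .+Punc'

# group(1) is the text captured by the parentheses, without the trailing '+'.
words = [m.group(1) for m in pattern.finditer(line)]
print(words)
```

Note that tags which are themselves followed by a + (such as Verb or Pass) are captured as well, so a keyword equal to one of those tags would also be reported.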

Further

Regular expression syntax

Answer 2

You will probably want to use the word boundary marker \b. This is an empty match for transitions between \w and \W. If you want your keywords to be literal strings, you will have to escape them first. You can combine everything into one regular expression using |:

pattern = re.compile(r'\b(' + '|'.join(map(re.escape, keyword)) + r')\b')

OR

pattern = re.compile(r'\b(?:' + '|'.join(re.escape(k) for k in keyword) + r')\b')

Computing the matches is a bit easier now, since you can use finditer instead of writing your own comprehension:

matches = pattern.finditer(line)

Since each match is enclosed in a group, printing is not much more difficult:

result = "{:<15} {}".format(','.join(m.group() for m in matches), lineno)

OR

result = "{:<15} {}".format(','.join(map(re.Match.group, matches)), lineno)

Of course, don't forget to

import re
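Putting the pieces above together, a minimal sketch could look like the following. The function name find_keyword_lines, the returned tuple layout, and the in-memory sample are assumptions made for illustration; they are not part of the original answer.

```python
import re

def find_keyword_lines(lines, keywords):
    """Return (lineno, matched keywords, line) for each line containing a keyword as a whole word."""
    # Escape each keyword so it is matched literally, then wrap the alternation in \b ... \b.
    pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, keywords)) + r')\b')
    results = []
    for lineno, line in enumerate(lines, start=1):
        matches = [m.group() for m in pattern.finditer(line)]
        if matches:
            results.append((lineno, matches, line.rstrip('\n')))
    return results

# Hypothetical sample data standing in for the contents of the input file.
sample = [
    'Sogan+Noun ,+Punc domates+Noun\n',
    'Sog+Noun ,+Punc elma+Noun\n',
]
for lineno, matches, line in find_keyword_lines(sample, ['Sog', 'elma']):
    print("{:<15} {}".format(','.join(matches), lineno))
    print(line)
```

Because the function only iterates over lines, an open file object can be passed in place of the list, as in the question's with open(filepath) as f block.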

Corner Case

If you have keywords that are subsets of each other with the same prefix, make sure the longer ones come first. For example, if you have

keyword = ['foo', 'foobar']

The regex will be

\b(foo|foobar)\b

When you encounter a line with foobar in it, foo will match successfully against it first and then fail against \b. Alternatives separated by | are tried from left to right, which is documented behavior of |. The solution is to pre-sort all your keywords by decreasing length before constructing the expression:

keywords.sort(key=len, reverse=True)

Or, if non-list inputs are possible:

keywords = sorted(keywords, key=len, reverse=True)

If you don't like this order, you can always print them in some other order after you match.
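A small sketch of how the order of alternatives affects the result. As an assumption for illustration, the \b anchors are left out here so that the left-to-right behavior of | is directly visible; the variable names are hypothetical.

```python
import re

keywords = ['foo', 'foobar']

# Alternatives are tried left to right, so 'foo' is found before 'foobar' is attempted.
unsorted_pattern = re.compile('|'.join(map(re.escape, keywords)))
first = unsorted_pattern.search('foobar').group()
print(first)

# Longest keywords first, so the longer alternative is tried before its own prefix.
by_length = sorted(keywords, key=len, reverse=True)
sorted_pattern = re.compile('|'.join(map(re.escape, by_length)))
second = sorted_pattern.search('foobar').group()
print(second)
```

The first pattern reports foo for the input foobar, while the length-sorted pattern reports foobar.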

Answer 3

Question: a keyword like "Sog" it also finds the Sogan ... I only want tokens between whitespaces. ... how can i add that regex to this code.

Build a regex with your keywords, and use the or separator | for multiple keywords.

For example:

import re

def index(lines, keyword):
    # Raw string, so \+ and \s are passed to the regex engine unchanged.
    rc = re.compile(r".*?(({})\+.+?\s)".format(keyword))

    for i, line in enumerate(lines):
        match = rc.match(line)
        if match:
            print("lines[{}] match:{}\n{}".format(i, match.groups(), line))

if __name__ == "__main__":
    lines = [
        'Sogan+Noun ,+Punc domates+Noun ,+Punc patates+Noun ,+Punc elmaro+Noun ve+Conj ... (omitted for brevity)',
        'Sog+Noun ,+Punc domates+Noun ,+Punc patates+Noun ,+Punc elma+Noun ve+Conj ... (omitted for brevity)',
    ]
    index(lines, 'elma')
    index(lines, 'Sog|elma')

Output:

lines[1] match:('elma+Noun ', 'elma')
Sog+Noun ,+Punc domates+Noun ,+Punc patates+Noun ,+Punc elma+Noun ve+Conj ... (omitted for brevity)
lines[1] match:('Sog+Noun ', 'Sog')
Sog+Noun ,+Punc domates+Noun ,+Punc patates+Noun ,+Punc elma+Noun ve+Conj ... (omitted for brevity)

Tested with Python: 3.5
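Since the question asks for tokens between whitespace, the same check can also be sketched without a regex. This is a different technique from the answer above: it uses str.split and assumes every token has the form root+Tag+..., as in the sample lines. The function name match_roots is hypothetical.

```python
def match_roots(line, keywords):
    """Return the keywords that equal the root (text before the first '+') of some token."""
    # Split on whitespace, then keep the part of each token before the first '+'.
    roots = {token.split('+', 1)[0] for token in line.split()}
    return [k for k in keywords if k in roots]

print(match_roots('Sogan+Noun ,+Punc elma+Noun', ['Sog', 'elma']))
print(match_roots('Sog+Noun ,+Punc elma+Noun', ['Sog', 'elma']))
```

With this approach Sog no longer matches Sogan, because each keyword is compared against the whole root of a token rather than searched as a substring.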

3 answers

Answer 1
1 2018-10-27 13:44:29

Answer 2
1 accepted 2018-10-27 13:56:17

Answer 3
1 2018-10-27 14:24:46
