Python - All the lines and line numbers in which string occurs in the input file
I want to print all the lines in which a string occurs in the input file, along with their line numbers. So far I have written the code shown below. It works, but not in the way I wanted:
def index(filepath, keyword):
    with open(filepath) as f:
        for lineno, line in enumerate(f, start=1):
            matches = [k for k in keyword if k in line]
            if matches:
                result = "{:<15} {}".format(','.join(matches), lineno)
                print(result)
                print(line)

index('deneme.txt', ['elma'])
Output is as follows:
elma 15
Sogan+Noun ,+Punc domates+Noun ,+Punc patates+Noun ,+Punc elma+Noun ve+Conj turunçgil+Noun+A3pl ihracat+Noun+P3sg+Dat devlet+Noun destek+Noun+P3sg ver+Verb+Pass+Prog2+Cop .+Punc
So far so good. But when I enter a keyword like "Sog", it also finds Sogan. I don't want that; I only want to check tokens between whitespace. I think I need a regex for this, and I have one, but I don't know how to add it to this code:
r'[\w+]+'
You could use the following regex:
import re

lines = [
    'Sogan+Noun ,+Punc domates+Noun ,+Punc patates+Noun ,+Punc elma+Noun ve+Conj turunçgil+Noun+A3pl ihracat+Noun+P3sg+Dat devlet+Noun destek+Noun+P3sg ver+Verb+Pass+Prog2+Cop .+Punc',
    'Sog+Noun ,+Punc domates+Noun ,+Punc patates+Noun ,+Punc elma+Noun ve+Conj turunçgil+Noun+A3pl ihracat+Noun+P3sg+Dat devlet+Noun destek+Noun+P3sg ver+Verb+Pass+Prog2+Cop .+Punc',
]
keywords = ['Sog']

# Raw string so that \w and \+ reach the regex engine unchanged.
pattern = re.compile(r'(\w+)\+')

for lineno, line in enumerate(lines):
    words = set(m.group(1) for m in pattern.finditer(line))  # convert to set for efficiency
    matches = [keyword for keyword in keywords if keyword in words]
    if matches:
        result = "{:<15} {}".format(','.join(matches), lineno)
        print(result)
        print(line)
Output
Sog 1
Sog+Noun ,+Punc domates+Noun ,+Punc patates+Noun ,+Punc elma+Noun ve+Conj turunçgil+Noun+A3pl ihracat+Noun+P3sg+Dat devlet+Noun destek+Noun+P3sg ver+Verb+Pass+Prog2+Cop .+Punc
Explanation
The pattern '(\w+)\+' matches any group of word characters (letters, digits, underscore) followed by a + character. + is a special character, so you need to escape it in order to match it literally. Then use group to extract the matching group (i.e. the group of letters).
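To see what the pattern extracts, here is a small sketch run on a shortened sample line. The sample string and the variable names are made up for illustration. Note that a tag followed by another + (such as Noun in turunçgil+Noun+A3pl) is captured as well, so the set holds more than just the first part of each token.

```python
import re

pattern = re.compile(r'(\w+)\+')

# Shortened, made-up sample in the same format as the question's data.
line = 'Sogan+Noun ,+Punc turunçgil+Noun+A3pl'

# findall returns the contents of group 1 for every match.
words = set(pattern.findall(line))

print(sorted(words))
print('Sog' in words)    # False: only whole captured words are in the set
print('Sogan' in words)  # True
```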
Further
You will probably want to use the word boundary marker \b. This is an empty match for transitions between \w and \W. If you want your keywords to be literal strings, you will have to escape them first. You can combine everything into one regular expression using |:
pattern = re.compile(r'\b(' + '|'.join(map(re.escape, keyword)) + r')\b')
OR
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(k) for k in keyword) + r')\b')
Computing the matches is a bit easier now, since you can use finditer instead of making your own comprehension:
matches = pattern.finditer(line)
Since each match is enclosed in a group, printing is not much more difficult:
result = "{:<15} {}".format(','.join(m.group() for m in matches), lineno)
OR
result = "{:<15} {}".format(','.join(map(re.Match.group, matches)), lineno)
Of course, don't forget to
import re
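Putting the pieces together, a minimal sketch of the whole function might look like the following. The function signature, the generator style, and the shortened sample lines are assumptions made for illustration, not part of the original code. It accepts any iterable of lines, so a file object from open(filepath) can be passed in the same way as the list used here.

```python
import re

def index(lines, keywords):
    """Yield (lineno, matched_keywords, line) for lines containing a whole-word keyword."""
    # Escape each keyword so it is treated literally, then join with |.
    alternation = '|'.join(re.escape(k) for k in keywords)
    pattern = re.compile(r'\b(?:' + alternation + r')\b')
    for lineno, line in enumerate(lines, start=1):
        matches = [m.group() for m in pattern.finditer(line)]
        if matches:
            yield lineno, matches, line.rstrip('\n')

# Shortened, made-up sample lines in the question's format.
sample = [
    'Sogan+Noun ,+Punc elma+Noun ve+Conj',
    'Sog+Noun ,+Punc elma+Noun ve+Conj',
]

for lineno, matches, line in index(sample, ['Sog']):
    print("{:<15} {}".format(','.join(matches), lineno))
    print(line)
```

Only the second line is reported for 'Sog', because \b does not match between the g and the a of Sogan.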
Corner Case
If you have keywords that are subsets of each other with the same prefix, make sure the longer ones come first. For example, if you have
keyword = ['foo', 'foobar']
The regex will be
\b(foo|foobar)\b
When you encounter a line with foobar in it, foo will match successfully against it and then fail against \b. This is documented behavior of |: alternatives are tried from left to right. With the trailing \b the engine then backtracks and tries foobar, so the match still succeeds, but the order does decide the result when nothing after the group forces that backtracking. To avoid depending on it, pre-sort all your keywords by decreasing length before constructing the expression:
keywords.sort(key=len, reverse=True)
Or, if non-list inputs are possible:
keywords = sorted(keywords, key=len, reverse=True)
If you don't like this order, you can always print them in some other order after you match.
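The effect of alternative order can be checked with a short sketch. The test string and variable names are made up for illustration. It shows that order decides the result when no trailing \b follows the group, and that with the trailing \b the engine backtracks to the longer keyword.

```python
import re

keywords = ['foo', 'foobar']
text = 'a foobar b'

unsorted_alt = '|'.join(map(re.escape, keywords))
sorted_alt = '|'.join(map(re.escape, sorted(keywords, key=len, reverse=True)))

# Without a trailing \b, the first alternative that matches wins.
no_anchor_unsorted = re.search(r'\b(' + unsorted_alt + r')', text).group()
no_anchor_sorted = re.search(r'\b(' + sorted_alt + r')', text).group()
print(no_anchor_unsorted)  # foo
print(no_anchor_sorted)    # foobar

# With a trailing \b, foo fails at the boundary and the engine tries foobar.
anchored_unsorted = re.search(r'\b(' + unsorted_alt + r')\b', text).group()
print(anchored_unsorted)   # foobar
```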
Question: a keyword like "Sog" it also finds the Sogan ... I only want tokens between whitespaces. ... how can i add that regex to this code.
Build a regex with your keywords, using the or separator | for multiple keywords.
For example:
import re

def index(lines, keyword):
    # keyword may hold several alternatives separated by |, e.g. 'Sog|elma'.
    # Raw string so that \+ and \s reach the regex engine unchanged.
    rc = re.compile(r".*?(({})\+.+?\s)".format(keyword))
    for i, line in enumerate(lines):
        match = rc.match(line)
        if match:
            print("lines[{}] match:{}\n{}".format(i, match.groups(), line))

if __name__ == "__main__":
    lines = [
        'Sogan+Noun ,+Punc domates+Noun ,+Punc patates+Noun ,+Punc elmaro+Noun ve+Conj ... (omitted for brevity)',
        'Sog+Noun ,+Punc domates+Noun ,+Punc patates+Noun ,+Punc elma+Noun ve+Conj ... (omitted for brevity)',
    ]
    index(lines, 'elma')
    index(lines, 'Sog|elma')
Output:
lines[1] match:('elma+Noun ', 'elma')
Sog+Noun ,+Punc domates+Noun ,+Punc patates+Noun ,+Punc elma+Noun ve+Conj ... (omitted for brevity)
lines[1] match:('Sog+Noun ', 'Sog')
Sog+Noun ,+Punc domates+Noun ,+Punc patates+Noun ,+Punc elma+Noun ve+Conj ... (omitted for brevity)
Tested with Python: 3.5
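As an alternative that avoids regular expressions, the line can be split on whitespace and each token compared by the text before its first +. This is only a sketch of a different technique, not the method used above. It assumes every token has the form word+Tag+..., and the function name index_tokens and the shortened sample lines are made up for illustration.

```python
def index_tokens(lines, keywords):
    """Yield (lineno, matched_keywords) using whitespace-separated tokens."""
    wanted = set(keywords)
    for lineno, line in enumerate(lines, start=1):
        # Assumption: the word is everything before the first '+' of a token.
        words = {token.split('+', 1)[0] for token in line.split()}
        matches = sorted(wanted & words)
        if matches:
            yield lineno, matches

# Shortened, made-up sample lines in the question's format.
sample = [
    'Sogan+Noun ,+Punc elma+Noun ve+Conj',
    'Sog+Noun ,+Punc elma+Noun ve+Conj',
]

for lineno, matches in index_tokens(sample, ['Sog', 'elma']):
    print("{:<15} {}".format(','.join(matches), lineno))
```

Because whole tokens are compared, Sog does not match Sogan, and tags such as Noun are never treated as words.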