
Python - All the lines and line numbers in which a string occurs in the input file

I want to print all the lines in which a string occurs in the input file, along with their line numbers. So far I have written the code shown below. It works, but not in the way I want:

def index(filepath, keyword):

    with open(filepath) as f:
        for lineno, line in enumerate(f, start=1):
            matches = [k for k in keyword if k in line]
            if matches:
                result = "{:<15} {}".format(','.join(matches), lineno)
                print(result)
                print (line)

index('deneme.txt', ['elma'])

Output is as follows:

elma            15
Sogan+Noun ,+Punc domates+Noun ,+Punc patates+Noun ,+Punc elma+Noun ve+Conj turunçgil+Noun+A3pl ihracat+Noun+P3sg+Dat devlet+Noun destek+Noun+P3sg ver+Verb+Pass+Prog2+Cop .+Punc  

So far so good. But when I enter a keyword like "Sog", it also finds Sogan, which I don't want; I only want to check tokens between whitespaces. I think I need a regex for this, and I have one, but I don't know how to add it to this code.

r'[\w+]+'

You could use the following regex:

import re

lines = [
    'Sogan+Noun ,+Punc domates+Noun ,+Punc patates+Noun ,+Punc elma+Noun ve+Conj turunçgil+Noun+A3pl ihracat+Noun+P3sg+Dat devlet+Noun destek+Noun+P3sg ver+Verb+Pass+Prog2+Cop .+Punc',
    'Sog+Noun ,+Punc domates+Noun ,+Punc patates+Noun ,+Punc elma+Noun ve+Conj turunçgil+Noun+A3pl ihracat+Noun+P3sg+Dat devlet+Noun destek+Noun+P3sg ver+Verb+Pass+Prog2+Cop .+Punc',
]

keywords = ['Sog']
pattern = re.compile(r'(\w+)\+')  # raw string, so \w and \+ reach the regex engine unchanged

for lineno, line in enumerate(lines):
    words = set(m.group(1) for m in pattern.finditer(line))  # convert to set for efficiency
    matches = [keyword for keyword in keywords if keyword in words]
    if matches:
        result = "{:<15} {}".format(','.join(matches), lineno)
        print(result)
        print(line)

Output

Sog             1
Sog+Noun ,+Punc domates+Noun ,+Punc patates+Noun ,+Punc elma+Noun ve+Conj turunçgil+Noun+A3pl ihracat+Noun+P3sg+Dat devlet+Noun destek+Noun+P3sg ver+Verb+Pass+Prog2+Cop .+Punc

Explanation

The pattern '(\w+)\+' matches any run of word characters followed by a + character. + is a special character, so you need to escape it in order to match it literally. Then use group(1) to extract the captured group (i.e. the run of word characters).
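
As a small sketch of what this pattern captures (the sample line is shortened from the one above, and the variable names are only illustrative):

```python
import re

# Capture every run of word characters that is directly followed by '+'.
pattern = re.compile(r'(\w+)\+')

line = 'Sogan+Noun ,+Punc elma+Noun ve+Conj'
words = {m.group(1) for m in pattern.finditer(line)}

print('elma' in words)   # True: 'elma' is one of the captured words
print('Sog' in words)    # False: only 'Sogan' was captured, not 'Sog'
```

Because membership is tested against whole captured words rather than against the raw line, a keyword that is only a prefix of a word no longer counts as a hit.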

Further

  1. Regular expression syntax

You will probably want to use the word boundary marker \b. This is an empty match for transitions between \w and \W. If you want your keywords to be literal strings, you will have to escape them first. You can combine everything into one regular expression using |:

pattern = re.compile(r'\b(' + '|'.join(map(re.escape, keyword)) + r')\b')

OR

pattern = re.compile(r'\b(?:' + '|'.join(re.escape(k) for k in keyword) + r')\b')

Computing the matches is a bit easier now, since you can use finditer instead of writing your own comprehension:

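# finditer returns an iterator, which is always truthy; a list keeps "if matches:" meaningful
matches = list(pattern.finditer(line))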

Since each match is available through group(), printing is not much more difficult:

result = "{:<15} {}".format(','.join(m.group() for m in matches), lineno)

OR

result = "{:<15} {}".format(','.join(map(re.Match.group, matches)), lineno)

Of course, don't forget to

import re
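
Putting the pieces above together, a minimal sketch might look like the following. The function name find_keywords and the shortened sample lines are assumptions made for illustration, and the function works on a list of lines rather than a file path so that it can be run as-is:

```python
import re

def find_keywords(lines, keywords):
    """Return (lineno, matched keywords, line) for every line with a whole-word hit."""
    # Escape each keyword so it is matched literally, then require word
    # boundaries on both sides of the alternation.
    pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, keywords)) + r')\b')
    hits = []
    for lineno, line in enumerate(lines, start=1):
        found = [m.group() for m in pattern.finditer(line)]
        if found:
            hits.append((lineno, found, line))
    return hits

sample = [
    'Sogan+Noun ,+Punc elma+Noun ve+Conj',
    'Sog+Noun ,+Punc domates+Noun',
]
for lineno, found, line in find_keywords(sample, ['Sog', 'elma']):
    print("{:<15} {}".format(','.join(found), lineno))
    print(line)
```

With these sample lines, Sog is reported only for the second line, because in Sogan the g is followed by another word character and \b cannot match there.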

Corner Case

If some of your keywords are prefixes of others, make sure the longer ones come first. For example, if you have

keyword = ['foo', 'foobar']

The regex will be

\b(foo|foobar)\b

When you encounter a line with foobar in it, foo will match first and then fail against the trailing \b, so the engine has to backtrack before it tries foobar. The | operator tries alternatives from left to right and accepts the first one that works; this is documented behavior of |. If the trailing \b is ever dropped, foo wins and foobar is never tried. A robust solution is to pre-sort all your keywords by decreasing length before constructing the expression:

keywords.sort(key=len, reverse=True)

Or, if non-list inputs are possible:

keywords = sorted(keywords, key=len, reverse=True)

If you don't like this order, you can always print them in some other order after you match.
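
To make the effect of ordering visible, here is a small sketch that deliberately leaves out the \b anchors (an assumption made purely for the demonstration), so the first alternative that fits is the one reported:

```python
import re

keywords = ['foo', 'foobar']

# Alternatives are tried left to right, so 'foo' is accepted first.
unsorted_pattern = re.compile('|'.join(map(re.escape, keywords)))
print(unsorted_pattern.search('foobar').group())   # foo

# Sorting longest-first lets the longer keyword be tried before its prefix.
by_length = sorted(keywords, key=len, reverse=True)
sorted_pattern = re.compile('|'.join(map(re.escape, by_length)))
print(sorted_pattern.search('foobar').group())     # foobar
```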

Question: a keyword like "Sog" it also finds the Sogan ... I only want tokens between whitespaces. ... how can i add that regex to this code.

Build a regex from your keywords, using the or separator | for multiple keywords.

For example:

import re

def index(lines, keyword):
    # Raw string, so \+ and \s reach the regex engine unchanged
    rc = re.compile(r".*?(({})\+.+?\s)".format(keyword))

    for i, line in enumerate(lines):
        match = rc.match(line)
        if match:
            print("lines[{}] match:{}\n{}".format(i, match.groups(), line))

if __name__ == "__main__":
    lines = [
        'Sogan+Noun ,+Punc domates+Noun ,+Punc patates+Noun ,+Punc elmaro+Noun ve+Conj ... (omitted for brevity)',
        'Sog+Noun ,+Punc domates+Noun ,+Punc patates+Noun ,+Punc elma+Noun ve+Conj ... (omitted for brevity)',
    ]
    index(lines, 'elma')
    index(lines, 'Sog|elma')

Output:

lines[1] match:('elma+Noun ', 'elma')
Sog+Noun ,+Punc domates+Noun ,+Punc patates+Noun ,+Punc elma+Noun ve+Conj ... (omitted for brevity)
lines[1] match:('Sog+Noun ', 'Sog')
Sog+Noun ,+Punc domates+Noun ,+Punc patates+Noun ,+Punc elma+Noun ve+Conj ... (omitted for brevity)

Tested with Python: 3.5
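
One possible variation, offered only as a sketch: the pattern above requires a + right after the keyword, but it does not require the keyword to start a token, so a keyword such as gan would still hit Sogan+Noun. Assuming every token has the form word+Tag and tokens are separated by whitespace, lookarounds can anchor both ends. The name index_tokens and the list-of-keywords argument are hypothetical and not part of the answer above:

```python
import re

def index_tokens(lines, keywords):
    # (?<!\S) : the keyword must start a whitespace-delimited token
    # (?=\+)  : and must be followed directly by the '+' that begins the tags
    alternatives = '|'.join(re.escape(k) for k in keywords)
    rc = re.compile(r'(?<!\S)(?:{})(?=\+)'.format(alternatives))
    # Keep (index, matched keywords) only for lines with at least one hit.
    return [(i, rc.findall(line)) for i, line in enumerate(lines) if rc.search(line)]

lines = [
    'Sogan+Noun ,+Punc elmaro+Noun',
    'Sog+Noun ,+Punc elma+Noun',
]
print(index_tokens(lines, ['Sog', 'elma', 'gan']))   # [(1, ['Sog', 'elma'])]
```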
