
Find which lines in a file contain certain characters

Is there a way to find out whether a string contains any one of the characters in a set in Python?

It's straightforward to do it with a single character, but I need to check whether a string contains any one of a set of bad characters.

Specifically, suppose I have a string:

s = 'amanaplanacanalpanama~012345'

and I want to see if the string contains any vowels:

bad_chars = 'aeiou'

and do this in a for loop for each line in a file:

if [any one or more of the bad_chars] in s:
    do something

I am scanning a large file, so a faster method would be ideal. Also, not every bad character has to be checked; as long as one is encountered, that is enough to end the search.

I'm not sure if there is a builtin function or an easy way to implement this, but I haven't come across anything yet. Any pointers would be much appreciated!

any((c in badChars) for c in yourString)

or

any((c in yourString) for c in badChars)  # extensionally equivalent, slower

or

set(yourString) & set(badChars)  # extensionally equivalent, slower

"so long as one is encountered that is enough to end the search." This will be true if you use the first method.

You say you are concerned with performance: performance should not be an issue unless you are dealing with a huge amount of data. If you encounter issues, you can try:


Regexes

Edit: Previously I had written a section here on using regexes via the re module: programmatically generating a regex consisting of a single character class [...] and using .finditer, with the caveat that simply putting a backslash before every character might not work correctly. Indeed, after testing it, that is the case, and I would definitely not recommend this method. Using it would require reverse engineering the entire (slightly complex) sub-grammar of regex character classes (e.g. you might have characters like \ followed by w, like ] or [, or like -, and merely escaping some of them, as in \w, may give them a new meaning).
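If a regex is still wanted, one way to avoid hand-written escaping is the standard library's re.escape. This is a different technique from the manual backslashing tested above, not the code from the removed section. A minimal sketch, assuming Python 3; the helper name make_char_class is hypothetical:

```python
import re


def make_char_class(bad_chars):
    """Compile a character class that matches any single character in bad_chars."""
    if not bad_chars:
        # "[]" is not a valid pattern, so reject an empty set up front
        raise ValueError("bad_chars must not be empty")
    # re.escape backslash-escapes regex metacharacters such as ], -, ^ and \
    # and leaves letters and digits alone, so "w" stays a literal w
    return re.compile("[" + re.escape(bad_chars) + "]")


tricky = make_char_class("]-\\^w")
print(bool(tricky.search("abc-def")))   # contains "-"
print(bool(tricky.search("hello123")))  # contains none of the five characters
```

Whether this is faster than any() depends on the data, so it is worth timing on real input.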


Sets

Because the in operation on a str (str.__contains__) is a linear scan, O(N), while on a set it is O(1) on average, it may be justifiable to first build a set so that each in test is O(1), if you have many badChars:

badCharSet = set(badChars)
any((c in badCharSet) for c in yourString)

(it may be possible to make that a one-liner, any((c in set(yourString)) for c in badChars), depending on how smart the Python compiler is; CPython rebuilds the set on every iteration, so building it once beforehand is the safer choice)
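A related built-in worth knowing is set.isdisjoint, which accepts any iterable, so the string does not have to be converted to a set first. The sketch below is an alternative to the any() form, not part of the original answer; the names bad_char_set and has_bad_char are made up for illustration:

```python
# Built once and reused for every string that gets checked
bad_char_set = frozenset("aeiou")


def has_bad_char(text, bad=bad_char_set):
    """Return True if text shares at least one character with bad."""
    # isdisjoint() is True when the two have nothing in common, so negate it
    return not bad.isdisjoint(text)


print(has_bad_char("amanaplanacanalpanama~012345"))
print(has_bad_char("rhythm~012345"))
```

In CPython the scan can stop at the first shared character, which matches the requirement that one hit is enough to end the search.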


Do you really need to do this line-by-line?

It may be faster to do this once for the entire file, O(#badchars), than once for every line in the file, O(#lines*#badchars), though the constant factors may be such that it won't matter.
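One way to combine the two ideas is to test the whole text first and fall back to per-line work only when something was found. This is a rough sketch under the assumption that the file fits in memory; find_bad_lines is a hypothetical name:

```python
def find_bad_lines(text, bad_chars):
    """Return 1-based numbers of the lines in text that contain a bad character."""
    bad = set(bad_chars)
    # Single pass over the whole text; a clean file skips the per-line loop
    if not any(c in bad for c in text):
        return []
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if any(c in bad for c in line)
    ]


print(find_bad_lines("xyz\nqueue\nrhythm\nab", "aeiou"))
print(find_bad_lines("xyz\nrhythm\n", "aeiou"))
```

The whole-file check only says whether a bad character exists somewhere; the per-line pass is still needed to learn which lines contain it.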

Use Python's any function.

if any((bad_char in my_string) for bad_char in bad_chars):
    # do something 
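Applied to the original question, the same test can be run for each line of a file. The following is a minimal sketch; the function name lines_with_bad_chars and the sample data are illustrative assumptions:

```python
def lines_with_bad_chars(lines, bad_chars):
    """Yield (line_number, line) for every line containing any bad character."""
    for number, line in enumerate(lines, start=1):
        # any() stops at the first bad character found in the line
        if any(bad_char in line for bad_char in bad_chars):
            yield number, line.rstrip("\n")


# A list of strings stands in for an open file, which also iterates line by line
sample = ["amanaplanacanalpanama~012345\n", "xyz~123\n", "rhythm\n", "queue\n"]
hits = list(lines_with_bad_chars(sample, "aeiou"))
print(hits)
```

With a real file, passing the open file object as lines should behave the same way, since iterating over a file yields one line at a time.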

This should be very efficient and clear. It uses sets:

#!/usr/bin/python

bad_chars = set('aeiou')

with open('/etc/passwd', 'r') as file_:
   file_string = file_.read()
file_chars = set(file_string)

if file_chars & bad_chars:
   print('found something bad')

In my minimal testing, this regular expression is twice as fast as any. You should try it with your own data.

import re

r = re.compile('[aeiou]')
if r.search(s):
    # do something
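To try this on your own data, the standard timeit module can compare the approaches side by side. The sketch below uses the string from the question; the labels and the call count of 10,000 are arbitrary choices, and the results will vary by machine and input:

```python
import re
import timeit

s = "amanaplanacanalpanama~012345"
bad_chars = "aeiou"
bad_set = set(bad_chars)
pattern = re.compile("[aeiou]")

# Each candidate answers the same question: does s contain a bad character?
candidates = {
    "any + str": lambda: any(c in bad_chars for c in s),
    "any + set": lambda: any(c in bad_set for c in s),
    "regex search": lambda: pattern.search(s) is not None,
}

timings = {name: timeit.timeit(fn, number=10000) for name, fn in candidates.items()}
for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name:>12}: {seconds:.4f} s for 10,000 calls")
```

A string whose first character is already a vowel favors every short-circuiting method, so lines that resemble the real file give a more honest comparison.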

The following Python code should print out any character in bad_chars if it exists in s:

for c in bad_chars:
    if c in s:
        print(c)

You could also use Python's built-in any, as in this example:

>>> any(e for e in bad_chars if e in s)
True
