Python: how to use a for loop with findall

Question

I am trying to use a for loop with re.findall() in a Jupyter notebook. I want to extract all the sentences that contain 'California', 'Colorado', or 'Florida'. I can write each search out by hand like this:

import re

f =open("C:/Users/uib57309/Desktop/test.txt",mode='rt')
lines = f.read()
f.close()

re.findall(r"([^.]*?California[^.]*\.)",lines)

re.findall(r"([^.]*?Colorado[^.]*\.)",lines)

re.findall(r"([^.]*?Florida[^.]*\.)",lines)

But how can I shorten my code with a for loop? I tried the following, but it seems to be wrong:

test_list = ['California', 'Colorado', 'Florida'] 

for i in test_list: 

     result = re.findall(r"([^.]*?i[^.]*\.)",lines)

print(result)

Answer 1

In your for loop, the pattern contains the literal character "i", not the value of the variable i, so findall searches for the letter "i". Use an f-string (Python 3.6+) to insert the variable; string concatenation or formatting works too:

result = re.findall(f"([^.]*?{i}[^.]*\\.)", lines) # works in Python 3.6+
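A minimal sketch of the full loop, assuming a short sample string in place of the file contents. The sample text and the `results` dictionary are illustrative and not part of the question; `re.escape` is an extra precaution so that a search term containing regex metacharacters is matched literally.

```python
import re

# Sample text standing in for the file contents (an assumption for illustration).
lines = "I live in California. Colorado is cold. Texas is big. Florida is warm."

test_list = ['California', 'Colorado', 'Florida']

# Store the matches per state so each iteration does not overwrite the previous one.
results = {}
for state in test_list:
    # rf"..." is a raw f-string: {…} is interpolated, backslashes stay literal.
    pattern = rf"([^.]*?{re.escape(state)}[^.]*\.)"
    results[state] = re.findall(pattern, lines)

print(results)
```

Note that matches after the first sentence keep the leading space that follows the previous period; call `.strip()` on each match if that matters.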

Answer 2

If you really want to do it cleanly, use NLTK to split the text into sentences. Your code relies on the assumption that a period always separates sentences, but in general that is not true.

import nltk
import re

lines = "Hello, California! Hello, e.g., Florida? Bye Massachusetts"

states = ['California', 'Colorado', 'Florida'] 

# Create a regex from the list of states
states_re = re.compile("|".join(states)) 

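# Keep only the sentences that mention one of the states.
# Note: sent_tokenize needs NLTK's Punkt tokenizer data to be installed.
results = [sent for sent in nltk.sent_tokenize(lines)
           if states_re.search(sent)]
# ['Hello, California!', 'Hello, e.g., Florida?']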
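To see why the period assumption matters, here is a small sketch that runs the question's period-based pattern on the same sample text. The trailing period on the sample is an assumption added here so that the pattern has something to terminate on.

```python
import re

# Sample text from the answer, with a final period appended (an assumption).
lines = "Hello, California! Hello, e.g., Florida? Bye Massachusetts."

# The period-based pattern from the question, applied to 'Florida'.
period_based = re.findall(r"([^.]*?Florida[^.]*\.)", lines)

print(period_based)  # [', Florida? Bye Massachusetts.']
```

The match starts in the middle of a sentence (right after "e.g.") and runs on into the next one, because the abbreviation contains periods and the sentence itself ends with a question mark.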

Answer 3

You don't need a loop; just build a single regex with "|".join:

test_list = ['California', 'Colorado', 'Florida']
result = re.findall(r"([^.]*?(?:{})[^.]*\.)".format("|".join(test_list)),lines)

To make sure the words aren't matched as substrings of longer words, use word boundaries (not really necessary for these particular words, but it is in the general case). The expression then wraps each word in \b:

r"([^.]*?(?:{})[^.]*\.)".format("|".join([r"\b{}\b".format(x) for x in test_list]))
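A runnable sketch of the single-regex approach, assuming a short sample string in place of the file contents. The alternation is wrapped in a non-capturing group `(?:...)`; without it, `|` would split the whole pattern into three separate alternatives instead of choosing between the three words. The sample text and the `re.escape` call are additions for illustration.

```python
import re

# Sample text standing in for the file contents (an assumption for illustration).
lines = "I live in California. Colorado is cold. Texas is big. Florida is warm."
test_list = ['California', 'Colorado', 'Florida']

# Escape each term, add word boundaries, and join the terms with "|".
alternation = "|".join(r"\b{}\b".format(re.escape(x)) for x in test_list)

# (?:...) groups the alternatives so "|" applies only to the state names.
pattern = r"([^.]*?(?:{})[^.]*\.)".format(alternation)

result = re.findall(pattern, lines)
print(result)  # ['I live in California.', ' Colorado is cold.', ' Florida is warm.']
```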

Answer 4

Use word boundaries for this task, and create a list to store the results.

In your loop, the result variable is overwritten on each iteration.

test_list = ['California', 'Colorado', 'Florida'] 
x = []

for i in test_list: 
    pattern = r"\b"+i+r"\b"
    result = re.findall(pattern,lines)
    x.append(result)

print(x)
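A small sketch of what the word boundary changes, using a made-up sentence (not from the question) in which one state name also appears inside a longer word:

```python
import re

# Hypothetical sample: "Coloradoan" contains "Colorado" as a substring.
text = "A Coloradoan moved from Colorado to Florida."

# Without \b the substring inside "Coloradoan" is matched as well.
without_boundary = re.findall(r"Colorado", text)

# With \b only the standalone word is matched.
with_boundary = re.findall(r"\bColorado\b", text)

print(without_boundary)  # ['Colorado', 'Colorado']
print(with_boundary)     # ['Colorado']
```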

4 answers

Answer 1
1 2019-01-15 05:38:54

Answer 2
1 2019-01-15 05:58:01

Answer 3
1 (accepted) 2019-01-15 05:58:10

Answer 4
0 2019-01-15 05:37:43
