I am trying to use a for loop with re.findall() in a Jupyter notebook. I want to extract all the sentences that contain 'California', 'Colorado', or 'Florida'. I can write it out like this:
import re
f =open("C:/Users/uib57309/Desktop/test.txt",mode='rt')
lines = f.read()
f.close()
re.findall(r"([^.]*?California[^.]*\.)",lines)
re.findall(r"([^.]*?Colorado[^.]*\.)",lines)
re.findall(r"([^.]*?Florida[^.]*\.)",lines)
But how can I shorten my code with a for loop? I tried the following, but it seems to be wrong.
test_list = ['California', 'Colorado', 'Florida']
for i in test_list:
    result = re.findall(r"([^.]*?i[^.]*\.)",lines)
print(result)
In your for loop, the i inside the pattern is the literal character "i", not the loop variable, so re.findall() looks for sentences containing the letter i. Interpolate the variable with an f-string (Python 3.6+); string concatenation or str.format() works too:
result = re.findall(f"([^.]*?{i}[^.]*\\.)", lines) # works in Python 3.6+
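A minimal sketch of the corrected loop, assuming a short sample string stands in for the file contents (the sample text and the name sentences_by_state are made up for illustration). re.escape() is added as a precaution so that a search term containing regex metacharacters would still be matched literally.

```python
import re

# Hypothetical sample text standing in for the contents of test.txt
lines = "I live in California. Colorado is cold. I like tea. Florida is warm."

test_list = ['California', 'Colorado', 'Florida']

# Collect the matches per state so that no iteration overwrites another
sentences_by_state = {}
for state in test_list:
    # Raw f-string: the backslash stays a regex escape, the braces are substituted
    pattern = rf"([^.]*?{re.escape(state)}[^.]*\.)"
    sentences_by_state[state] = re.findall(pattern, lines)

for state, sentences in sentences_by_state.items():
    print(state, sentences)
```

Note that every match except the first one in the text keeps the space that followed the previous period; calling strip() on each match removes it.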
If you want a more robust solution, consider using NLTK to split the text into sentences. Your code assumes that a period always ends a sentence, which is not true in general: abbreviations such as "e.g." contain periods, and sentences can end in "!" or "?".
import nltk
import re
lines = "Hello, California! Hello, e.g., Florida? Bye Massachusetts"
states = ['California', 'Colorado', 'Florida']
# Create a regex from the list of states
states_re = re.compile("|".join(states))
results = [sent for sent in nltk.sent_tokenize(lines)
           if states_re.search(sent)]  # Keep only the sentences that mention a state
# ['Hello, California!', 'Hello, e.g., Florida?']
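You don't need a loop; just build a single regex with "|".join. Wrap the alternation in a non-capturing group (?:...) so that it applies only to the state names and not to the rest of the pattern:
test_list = ['California', 'Colorado', 'Florida']
result = re.findall(r"([^.]*?(?:{})[^.]*\.)".format("|".join(test_list)), lines)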
To make sure the words are not matched as substrings of longer words, use word boundaries (not really necessary for these particular words, but needed in the general case). The expression then wraps each word in r"\b", with the alternation kept inside a non-capturing group:
r"([^.]*?(?:{})[^.]*\.)".format("|".join([r"\b{}\b".format(x) for x in test_list]))
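The single-regex approach can be sketched as follows, again assuming a made-up sample string in place of the file contents. The names alternation and pattern are illustrative only, and the strip() call is an addition to tidy up leading spaces.

```python
import re

# Hypothetical sample text standing in for the contents of test.txt
lines = "I live in California. Colorado is cold. I like tea. Florida is warm."
test_list = ['California', 'Colorado', 'Florida']

# Wrap each word in \b...\b and join the alternatives with "|"
alternation = "|".join(rf"\b{re.escape(s)}\b" for s in test_list)

# The non-capturing group (?:...) keeps "|" from splitting the whole pattern
pattern = re.compile(rf"([^.]*?(?:{alternation})[^.]*\.)")

# strip() removes the space left over after the previous sentence's period
result = [m.strip() for m in pattern.findall(lines)]
print(result)
```

Unlike the loop, this returns the matching sentences in the order they appear in the text rather than grouped by state.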
Use word boundaries for this task, and also create a list to store the results, because the result variable is overwritten on each iteration of the loop.
test_list = ['California', 'Colorado', 'Florida']
x = []
for i in test_list:
    pattern = r"\b"+i+r"\b"
    result = re.findall(pattern,lines)
    x.append(result)
print(x)
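To illustrate what the word boundary changes, here is a small sketch that combines it with the sentence pattern from the question. The sample text is hypothetical and chosen so that one word contains a state name as a substring.

```python
import re

# Hypothetical text: "Coloradoans" contains "Colorado" as a substring
lines = "Coloradoans love skiing. Colorado is cold."

# Without \b the substring inside "Coloradoans" also counts as a hit
without_boundary = re.findall(r"[^.]*?Colorado[^.]*\.", lines)

# With \b only the standalone word "Colorado" matches
with_boundary = re.findall(r"[^.]*?\bColorado\b[^.]*\.", lines)

print(without_boundary)
print(with_boundary)
```

The first list contains both sentences, while the second contains only the sentence with the standalone word.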