I am trying to find out whether any substring from a list of substrings occurs in a given string. To do so, I loop over the items of the list and check each one against the string using Python's in operator. I am getting False values even though I am sure one of the substrings exists in the string. I have tried every method I know to normalize the text and the substrings: replacing all " " with "", using the casefold() method, strip(), and even unidecode. Still, the substring is not found.
My code:
from unidecode import unidecode
example_string = '''available at www.sciencedirect.com
journal homepage: www.elsevier.com/locate/nanotoday
REVIEW
Synthesis, properties and applications of Janus
nanoparticles
Marco Lattuada a, T. Alan Hatton b,''' # as extracted from PDF file using fitz's `doc.load_page(0)` and then `.get_text()`
list_of_titles = ["Synthesis, properties and applications of Janus nanoparticles", "another_title", "another_title"]
example_string = example_string.casefold()
example_string = example_string.replace(" ", "")
for title in list_of_titles:
    title = title.replace(" ", "")
    title = title.casefold()
    if unidecode(title) in unidecode(example_string):
        print("Yes")
# Outputs nothing
Try this:
example_string = example_string.replace("\n", " ")
example_string = example_string.casefold()
for title in list_of_titles:
    if title.casefold() in example_string:  # casefold() the title as well
        print("Yes")
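I think the \n characters are the problem: the title is split across two lines in the extracted text, so a newline sits between "Janus" and "nanoparticles", and replace(" ", "") does not remove it.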
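If the extracted text can also contain tabs or runs of several spaces, a slightly more general variant is to collapse every whitespace run to a single space on both sides before comparing. This is only a sketch: normalize and find_titles are helper names I made up for illustration, and page_text stands in for whatever get_text() returned.

```python
def normalize(text):
    """Casefold and collapse each whitespace run (spaces, tabs, newlines) to one space."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces.
    return " ".join(text.casefold().split())


def find_titles(page_text, titles):
    """Return the titles that occur in page_text after normalization."""
    haystack = normalize(page_text)
    return [title for title in titles if normalize(title) in haystack]


# Stand-in for the text extracted from the PDF page.
page_text = """available at www.sciencedirect.com
journal homepage: www.elsevier.com/locate/nanotoday
REVIEW
Synthesis, properties and applications of Janus
nanoparticles
Marco Lattuada a, T. Alan Hatton b,"""

titles = [
    "Synthesis, properties and applications of Janus nanoparticles",
    "another_title",
]

matches = find_titles(page_text, titles)
print(matches)
# ['Synthesis, properties and applications of Janus nanoparticles']
```

When a match unexpectedly fails, printing repr(example_string) is a quick way to see hidden characters such as \n that a plain print() renders invisibly.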