
Regex: Sentence Beginning With Keyword Ending with Sentence Blank Output

If I print m, the text contains a sentence that begins with "Histology" and ends with a period. Despite that, the output is empty.

import re
from bs4 import BeautifulSoup
from googlesearch import search 
import requests
from goose3 import Goose
def search_google(query):
    parent_=[]  
    for j in search(query, tld="co.in", num=10, stop=5, pause=2):
        child_=[]
        link_=j
        site_name=link_.split("/")[2]
        child_.append(site_name)
        child_.append(link_)
        parent_.append(child_)  
        g = Goose()
        article = g.extract(link_)
        m = article.cleaned_text
    Answer = re.findall(r'\bHistology\s+([^.]*)',m)  
    print(Answer)

f = search_google("""'Histology'""")

Output: []

It looks like the Answer assignment is indented incorrectly, and the last search result has no matches in its cleaned text. That is why print shows an empty list.

Because the print call sits outside the loop, it runs only once, after the loop has finished. At that point m holds the text of the last article only, and since that text has no matches, Answer is an empty list.
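The effect can be reproduced without any network access. The following is a minimal sketch in which the hypothetical list pages stands in for the cleaned_text of each search result; the sample sentences are invented for illustration.

```python
import re

# Hypothetical stand-ins for article.cleaned_text of each search result.
pages = [
    "Histology is the study of tissues. It relies on microscopy.",
    "Histology slides are stained before viewing.",
    "An unrelated page with no match.",
]

pattern = r'\bHistology\s+[^.]*'

# Search placed after the loop: m only holds the last page's text.
for m in pages:
    pass
after_loop = re.findall(pattern, m)

# Search placed inside the loop: every page is searched.
inside_loop = [re.findall(pattern, m) for m in pages]

print(after_loop)
print(inside_loop)
```

With this sample data, the search after the loop finds nothing because the last page has no match, while the search inside the loop finds the sentences on the first two pages.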

Indent the Answer assignment and the print call by one level so that they run inside the loop, and the matches for each article will be printed.

Your regex also returns only the text that follows Histology, not the word itself. This is because the pattern contains a capturing group that excludes Histology, and re.findall returns the group's contents when a pattern has exactly one group. You can resolve this by removing the capturing group:

r'\bHistology\s+[^.]*'
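The difference between the two patterns can be checked on a small string. This is a sketch with an invented sample sentence; the third pattern, which appends \. to include the closing period, is an optional variation and not part of the original answer.

```python
import re

# Invented sample text for illustration.
text = "Histology is the study of tissues. It relies on microscopy."

# With a capturing group, findall returns only the group's contents.
with_group = re.findall(r'\bHistology\s+([^.]*)', text)

# Without the group, findall returns the whole match.
without_group = re.findall(r'\bHistology\s+[^.]*', text)

# Optional variation: also consume the period that ends the sentence.
with_period = re.findall(r'\bHistology\s+[^.]*\.', text)

print(with_group)     # ['is the study of tissues']
print(without_group)  # ['Histology is the study of tissues']
print(with_period)    # ['Histology is the study of tissues.']
```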

import re

from bs4 import BeautifulSoup
from googlesearch import search
import requests
from goose3 import Goose

def search_google(query):
    parent_ = []
    for j in search(query, tld="co.in", num=10, stop=5, pause=2):
        child_ = []
        link_ = j
        site_name = link_.split("/")[2]
        child_.append(site_name)
        child_.append(link_)
        parent_.append(child_)
        g = Goose()
        article = g.extract(link_)
        m = article.cleaned_text
        # Run the regex inside the loop so every article is searched, not just the last one.
        Answer = re.findall(r'\bHistology\s+[^.]*', m)
        print(Answer)


f = search_google("""'Histology'""")

To print each match on its own line, change print(Answer) to print('\n'.join(Answer)).
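As a quick illustration of the join, here is a sketch that uses an invented string in place of article.cleaned_text:

```python
import re

# Invented sample text standing in for article.cleaned_text.
m = "Histology is the study of tissues. Histology slides are stained. The end."

Answer = re.findall(r'\bHistology\s+[^.]*', m)

# Join the matches with newlines so each one prints on its own line.
joined = '\n'.join(Answer)
print(joined)
```

Note that when Answer is empty, the joined string is empty too, so print emits a blank line instead of [].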
