
Python: print/get first sentence of each paragraph

This is my code, but it prints the whole paragraph. How can I print only the first sentence, up to the first period?

from bs4 import BeautifulSoup
import urllib.request,time

article = 'https://www.theguardian.com/science/2012/\
oct/03/philosophy-artificial-intelligence'

req = urllib.request.Request(article, headers={'User-agent': 'Mozilla/5.0'})
html = urllib.request.urlopen(req).read()

soup = BeautifulSoup(html,'lxml')

def print_intro():
    if len(soup.find_all('p')[0].get_text()) > 100:
        print(soup.find_all('p')[0].get_text())

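This code prints:

To state that the human brain has capabilities that are, in some respects, far superior to those of all other known objects in the cosmos would be uncontroversial. The brain is the only kind of object capable of understanding that the cosmos is even there, or why there are infinitely many prime numbers, or that apples fall because of the curvature of space-time, or that obeying its own inborn instincts can be morally wrong, or that it itself exists. Nor are its unique abilities confined to such cerebral matters. The cold, physical fact is that it is the only kind of object that can propel itself into space and back without harm, or predict and prevent a meteor strike on itself, or cool objects to a billionth of a degree above absolute zero, or detect others of its kind across galactic distances.

But I only want it to print:

To state that the human brain has capabilities that are, in some respects, far superior to those of all other known objects in the cosmos would be uncontroversial.

Thanks for the help.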

Split the text on that period. When you only need a single split, str.partition() is faster than str.split() with a limit:

text = soup.find_all('p')[0].get_text()
if len(text) > 100:
    text = text.partition('.')[0] + '.'
print(text)

If you only need to process the first <p> element, use soup.find() instead:

text = soup.find('p').get_text()
if len(text) > 100:
    text = text.partition('.')[0] + '.'
print(text)

For the URL you gave, however, the sample text is found in the second paragraph:

>>> soup.find_all('p')[1]
<p><span class="drop-cap"><span class="drop-cap__inner">T</span></span>o state that the human brain has capabilities that are, in some respects, far superior to those of all other known objects in the cosmos would be uncontroversial. The brain is the only kind of object capable of understanding that the cosmos is even there, or why there are infinitely many prime numbers, or that apples fall because of the curvature of space-time, or that obeying its own inborn instincts can be morally wrong, or that it itself exists. Nor are its unique abilities confined to such cerebral matters. The cold, physical fact is that it is the only kind of object that can propel itself into space and back without harm, or predict and prevent a meteor strike on itself, or cool objects to a billionth of a degree above absolute zero, or detect others of its kind across galactic distances.</p>
>>> text = soup.find_all('p')[1].get_text()
>>> text.partition('.')[0] + '.'
'To state that the human brain has capabilities that are, in some respects, far superior to those of all other known objects in the cosmos would be uncontroversial.'
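
Since the title asks about each paragraph, the partition() approach above can be extended to a list of paragraph texts. This is only a minimal sketch: the helper names first_sentence and first_sentences are made up for illustration, and the sample strings stand in for what [p.get_text() for p in soup.find_all('p')] would return. It also treats every '.' as a sentence end, so abbreviations and decimals would cut a sentence short.

```python
def first_sentence(text):
    """Return the text up to and including the first period.

    str.partition() returns (head, separator, tail). When no period is
    found, the separator is '' and head is the whole string, so
    head + sep gives back the unchanged text in that case.
    """
    head, sep, _tail = text.strip().partition('.')
    return head + sep


def first_sentences(paragraphs):
    """Apply first_sentence() to every non-empty paragraph text."""
    return [first_sentence(p) for p in paragraphs if p.strip()]


# Hypothetical sample data standing in for the scraped paragraph texts.
sample_paragraphs = [
    "To state this would be uncontroversial. The brain is unique.",
    "No period here",
    "   ",
]

for sentence in first_sentences(sample_paragraphs):
    print(sentence)
```

Because the separator is appended only when it was actually found, the helper does not add a stray period to text that has none.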
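
def print_intro():
    if len(soup.find_all('p')[0].get_text()) > 100:
        paragraph = soup.find_all('p')[0].get_text()
        # split on every period and keep the first piece
        phrase_list = paragraph.split('.')
        # split() drops the separator, so put the period back
        print(phrase_list[0] + '.')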

Split the paragraph at the first period. Passing 1 as the maxsplit argument saves you the unnecessary extra splits.

def print_intro():
    if len(soup.find_all('p')[0].get_text()) > 100:
        my_paragraph = soup.find_all('p')[0].get_text()
        my_list = my_paragraph.split('.', 1)
        print(my_list[0])
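
To make the effect of the maxsplit argument concrete, here is a small comparison on a made-up sample string (not taken from the article). It shows what split('.'), split('.', 1) and partition('.') each return; note that both split() variants drop the period itself.

```python
# Hypothetical sample text, used only to show the return values.
text = "First sentence. Second sentence. Third."

all_parts = text.split('.')       # splits on every period
two_parts = text.split('.', 1)    # stops after the first period
three_parts = text.partition('.') # always a 3-tuple: (head, sep, tail)

print(all_parts)
print(two_parts)
print(three_parts)
```

In all three cases the first element is the same, so the choice mainly affects how much extra work is done on the rest of the string.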

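You can use find('.'), which returns the index of the first occurrence of the substring you are looking for.

So, if the paragraph is stored in a variable named paragraph: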

sentence_index = paragraph.find('.')
# move one past the '.' so the slice includes it
sentence_index += 1
print(paragraph[0:sentence_index])

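Obviously the checks are missing here, such as checking whether the string in the paragraph variable contains a '.' at all, and so on. In any case, find() returns -1 if it cannot find the substring you are looking for.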
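
One way the missing check could look is sketched below. The function name first_sentence_via_find is hypothetical, and returning the whole paragraph when no period is found is an assumption about the desired behaviour, not something stated in the thread.

```python
def first_sentence_via_find(paragraph):
    """Return the paragraph up to and including the first period."""
    end = paragraph.find('.')
    if end == -1:
        # find() returns -1 when there is no period at all;
        # assumed fallback: hand back the paragraph unchanged.
        return paragraph
    # end is the index of the period, so slice one past it.
    return paragraph[:end + 1]


print(first_sentence_via_find("Nor are its abilities confined. The cold fact is this."))
print(first_sentence_via_find("no period in this text"))
```

Without the -1 check, the slice would end at index 0 after the increment and return an empty string for text without a period.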

