
Python: print/get first sentence of each paragraph

This is my code, but it prints the whole paragraph. How can I print only the first sentence, up to the first period?

from bs4 import BeautifulSoup
import urllib.request,time

article = 'https://www.theguardian.com/science/2012/\
oct/03/philosophy-artificial-intelligence'

req = urllib.request.Request(article, headers={'User-agent': 'Mozilla/5.0'})
html = urllib.request.urlopen(req).read()

soup = BeautifulSoup(html,'lxml')

def print_intro():
    if len(soup.find_all('p')[0].get_text()) > 100:
        print(soup.find_all('p')[0].get_text())

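This code prints:

To state that the human brain has capabilities that are, in some respects, far superior to those of all other known objects in the cosmos would be uncontroversial. The brain is the only kind of object capable of understanding that the cosmos is even there, or why there are infinitely many prime numbers, or that apples fall because of the curvature of space-time, or that obeying its own inborn instincts can be morally wrong, or that it itself exists. Nor are its unique abilities confined to such cerebral matters. The cold, physical fact is that it is the only kind of object that can propel itself into space and back without harm, or predict and prevent a meteor strike on itself, or cool objects to a billionth of a degree above absolute zero, or detect others of its kind across galactic distances.

But I only want it to print:

To state that the human brain has capabilities that are, in some respects, far superior to those of all other known objects in the cosmos would be uncontroversial.

Thanks for the help.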

Split the text on the period. When you only need a single split, str.partition() is faster than str.split() with a limit:

text = soup.find_all('p')[0].get_text()
if len(text) > 100:
    text = text.partition('.')[0] + '.'
print(text)
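To make the behaviour of str.partition() concrete, here is a minimal sketch on a plain string rather than the scraped page. The helper name first_sentence and the sample strings are illustrative assumptions, not part of the original answer. Unlike the snippet above, it leaves text without a period unchanged instead of appending a stray '.'.

```python
def first_sentence(text, sep='.'):
    """Return text up to and including the first `sep`.

    Hypothetical helper for illustration; if `sep` is absent,
    the text is returned unchanged.
    """
    head, found, _rest = text.partition(sep)
    # partition() returns (text, '', '') when sep is missing,
    # so `found` is empty and nothing extra is appended.
    return head + found


sample = "First sentence. Second sentence. Third."
print(first_sentence(sample))            # First sentence.
print(first_sentence("No period here"))  # No period here
```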

If you only need to process the first <p> element, use soup.find() instead:

text = soup.find('p').get_text()
if len(text) > 100:
    text = text.partition('.')[0] + '.'
print(text)

For the URL you gave, however, the sample text is found in the second paragraph:

>>> soup.find_all('p')[1]
<p><span class="drop-cap"><span class="drop-cap__inner">T</span></span>o state that the human brain has capabilities that are, in some respects, far superior to those of all other known objects in the cosmos would be uncontroversial. The brain is the only kind of object capable of understanding that the cosmos is even there, or why there are infinitely many prime numbers, or that apples fall because of the curvature of space-time, or that obeying its own inborn instincts can be morally wrong, or that it itself exists. Nor are its unique abilities confined to such cerebral matters. The cold, physical fact is that it is the only kind of object that can propel itself into space and back without harm, or predict and prevent a meteor strike on itself, or cool objects to a billionth of a degree above absolute zero, or detect others of its kind across galactic distances.</p>
>>> text = soup.find_all('p')[1].get_text()
>>> text.partition('.')[0] + '.'
'To state that the human brain has capabilities that are, in some respects, far superior to those of all other known objects in the cosmos would be uncontroversial.'
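Since the title asks about each paragraph, the same idea can be applied in a loop. The sketch below works on a list of plain strings so that it runs without a network request. The function name first_sentences, the min_length parameter (mirroring the 100-character check in the question), and the sample data are assumptions for illustration.

```python
def first_sentences(paragraphs, min_length=100):
    """Return the first sentence of every paragraph longer than min_length."""
    result = []
    for text in paragraphs:
        text = text.strip()
        if len(text) <= min_length:
            continue  # skip short paragraphs such as captions or bylines
        head, found, _rest = text.partition('.')
        result.append(head + found)
    return result


paragraphs = [
    "Short caption.",
    "To state that the brain is remarkable would be uncontroversial. "
    "It is the only kind of object capable of understanding that the cosmos is even there.",
]
for sentence in first_sentences(paragraphs):
    print(sentence)
```

With the soup object from the question, the input list could be built as [p.get_text() for p in soup.find_all('p')].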
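Alternatively, split the paragraph on every period and print the first piece (note that split() drops the period itself):

def print_intro():
    paragraph = soup.find_all('p')[0].get_text()
    if len(paragraph) > 100:
        # split() removes the separator, so the printed text has no trailing '.'
        phrase_list = paragraph.split('.')
        print(phrase_list[0])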

Split the paragraph at the first period. Passing 1 as the maxsplit argument saves you the time of unnecessary extra splits.

def print_intro():
    if len(soup.find_all('p')[0].get_text()) > 100:
        my_paragraph = soup.find_all('p')[0].get_text()
        my_list = my_paragraph.split('.', 1)
        print(my_list[0])
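The effect of the maxsplit argument is easy to see on a small string. This is only a sketch with a made-up sample. Note that split() drops the separator, so the period has to be added back if it is wanted.

```python
sample = "One. Two. Three."

# Without a limit, every period produces a split.
print(sample.split('.'))      # ['One', ' Two', ' Three', '']

# With maxsplit=1, only the first period is used and the rest stays intact.
print(sample.split('.', 1))   # ['One', ' Two. Three.']

# split() discards the separator, so re-append it for a complete sentence.
first = sample.split('.', 1)[0] + '.'
print(first)                  # One.
```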

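You can use find('.'), which returns the index of the first occurrence of the substring you are looking for.

So, if the paragraph is stored in a variable named paragraph: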

sentence_index = paragraph.find('.')
# add 1 so the slice includes the '.'
sentence_index += 1
print(paragraph[0:sentence_index])

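Obviously, the error handling is missing here, such as checking whether the string in the paragraph variable contains a '.' at all. In any case, find() returns -1 if it cannot find the substring you are looking for.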
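The missing check could look like the following sketch. The function name and the choice to return the whole paragraph when no period is found are assumptions, not part of the original answer.

```python
def first_sentence_by_index(paragraph):
    """Slice up to and including the first '.', using str.find()."""
    index = paragraph.find('.')
    if index == -1:
        # No period at all: fall back to the whole paragraph (an assumed policy).
        return paragraph
    # +1 so the slice includes the period itself.
    return paragraph[:index + 1]


print(first_sentence_by_index("Hello world. More text."))  # Hello world.
print(first_sentence_by_index("No period"))                # No period
```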

