
Python: print/get first sentence of each paragraph

This is the code I have, but it prints the whole paragraph. How do I print only the first sentence, up to the first dot?

from bs4 import BeautifulSoup
import urllib.request

article = 'https://www.theguardian.com/science/2012/\
oct/03/philosophy-artificial-intelligence'

req = urllib.request.Request(article, headers={'User-agent': 'Mozilla/5.0'})
html = urllib.request.urlopen(req).read()

soup = BeautifulSoup(html,'lxml')

def print_intro():
    if len(soup.find_all('p')[0].get_text()) > 100:
        print(soup.find_all('p')[0].get_text())

This code prints:

To state that the human brain has capabilities that are, in some respects, far superior to those of all other known objects in the cosmos would be uncontroversial. The brain is the only kind of object capable of understanding that the cosmos is even there, or why there are infinitely many prime numbers, or that apples fall because of the curvature of space-time, or that obeying its own inborn instincts can be morally wrong, or that it itself exists. Nor are its unique abilities confined to such cerebral matters. The cold, physical fact is that it is the only kind of object that can propel itself into space and back without harm, or predict and prevent a meteor strike on itself, or cool objects to a billionth of a degree above absolute zero, or detect others of its kind across galactic distances.

But I only want it to print:

To state that the human brain has capabilities that are, in some respects, far superior to those of all other known objects in the cosmos would be uncontroversial.

Thanks for help

Split the text on that dot; for a single split, using str.partition() is faster than str.split() with a limit:

text = soup.find_all('p')[0].get_text()
if len(text) > 100:
    text = text.partition('.')[0] + '.'
print(text)
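To make the behaviour of str.partition() concrete, here is a minimal sketch on a plain string. The helper name first_sentence and the sample text are made up for illustration. partition() always returns a 3-tuple (head, separator, tail), and the separator comes back empty when it is not found, so reusing it avoids appending a period that was never there:

```python
def first_sentence(text):
    """Return text up to and including the first '.', or all of text if there is none."""
    head, sep, _tail = text.partition('.')
    # sep is '' when no period was found, so nothing spurious is appended
    return head + sep


sample = "The brain is unique. Nor are its abilities confined."
print(sample.partition('.'))
print(first_sentence(sample))
print(first_sentence("No period here"))
```

Note that this treats any '.' as a sentence end, so text such as "e.g." or "3.14" would cut the sentence short.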

If you only need to process the first <p> element, use soup.find() instead:

text = soup.find('p').get_text()
if len(text) > 100:
    text = text.partition('.')[0] + '.'
print(text)
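Since the title asks about each paragraph rather than only the first, the same idea can be applied in a loop. This is only a sketch: the function name first_sentences and the paragraphs list are hypothetical, with the list standing in for something like [p.get_text() for p in soup.find_all('p')]. It keeps the question's rule of only handling paragraphs longer than 100 characters:

```python
def first_sentences(paragraphs, min_length=100):
    """Yield the first sentence of each paragraph longer than min_length characters."""
    for text in paragraphs:
        if len(text) > min_length:
            head, sep, _tail = text.partition('.')
            yield head + sep


# Hypothetical stand-in for [p.get_text() for p in soup.find_all('p')]
paragraphs = [
    "Short caption.",
    "A long paragraph opens here. " + "It then continues with filler text. " * 3,
]

for sentence in first_sentences(paragraphs):
    print(sentence)
```

Short paragraphs such as captions are skipped entirely here; drop the length check if every paragraph should be included.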

For your given URL, however, the sample text is found as the second paragraph:

>>> soup.find_all('p')[1]
<p><span class="drop-cap"><span class="drop-cap__inner">T</span></span>o state that the human brain has capabilities that are, in some respects, far superior to those of all other known objects in the cosmos would be uncontroversial. The brain is the only kind of object capable of understanding that the cosmos is even there, or why there are infinitely many prime numbers, or that apples fall because of the curvature of space-time, or that obeying its own inborn instincts can be morally wrong, or that it itself exists. Nor are its unique abilities confined to such cerebral matters. The cold, physical fact is that it is the only kind of object that can propel itself into space and back without harm, or predict and prevent a meteor strike on itself, or cool objects to a billionth of a degree above absolute zero, or detect others of its kind across galactic distances.</p>
>>> text = soup.find_all('p')[1].get_text()
>>> text.partition('.')[0] + '.'
'To state that the human brain has capabilities that are, in some respects, far superior to those of all other known objects in the cosmos would be uncontroversial.'

def print_intro():
    paragraph = soup.find_all('p')[0].get_text()
    if len(paragraph) > 100:
        # split('.') drops the separator, so add the period back
        phrase_list = paragraph.split('.')
        print(phrase_list[0] + '.')

Split the paragraph at the first period. The second argument, 1, specifies maxsplit and saves you from unnecessary extra splitting.

def print_intro():
    my_paragraph = soup.find_all('p')[0].get_text()
    if len(my_paragraph) > 100:
        # maxsplit=1: stop after the first period
        my_list = my_paragraph.split('.', 1)
        # split() drops the separator, so add the period back
        print(my_list[0] + '.')
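To see what the maxsplit argument changes, here is a small comparison on a made-up string (the sample text is an assumption, not taken from the article):

```python
paragraph = "First sentence. Second sentence. Third sentence."

# Without a limit, every period produces a split (plus a trailing empty string).
all_parts = paragraph.split('.')
print(all_parts)

# With maxsplit=1, splitting stops after the first period.
limited = paragraph.split('.', 1)
print(limited)

# Either way, element 0 is the first sentence without its period.
print(limited[0] + '.')
```

Both calls give the same first element; the limit just avoids building list entries that are never used.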

You can use find('.'); it returns the index of the first occurrence of what you're looking for.

So if the paragraph is stored in a variable called paragraph:

sentence_index = paragraph.find('.')
# move one past the '.' so the slice includes it
sentence_index += 1
print(paragraph[:sentence_index])

Obviously this is missing the checks, such as whether the string in paragraph contains a '.' at all. Note that find() returns -1 if it does not find the substring you're looking for.
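The missing check could look something like the sketch below. The function name first_sentence_with_find is made up for illustration, and returning the whole paragraph when there is no period is just one possible choice:

```python
def first_sentence_with_find(paragraph):
    """Return paragraph up to and including the first '.', using str.find()."""
    end = paragraph.find('.')
    if end == -1:
        # No period at all: fall back to the whole paragraph (an assumption;
        # returning '' or raising an error would be equally valid choices).
        return paragraph
    # end is the index of the '.', so slice one past it to keep the period
    return paragraph[:end + 1]


print(first_sentence_with_find("One sentence. Another one."))
print(first_sentence_with_find("No full stop"))
```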
