
How do I get the first 3 sentences of a webpage in Python?

I have an assignment where one of the things I can do is find the first 3 sentences of a webpage and display them. Finding the webpage text is easy enough, but I'm having problems figuring out how to find the first 3 sentences.

import requests
from bs4 import BeautifulSoup

url = 'https://www.troyhunt.com/the-773-million-record-collection-1-data-reach/'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
text = soup.find_all(text=True)

output = ''
blacklist = [
      '[document]',
      'noscript',
      'header',
      'html',
      'meta',
      'head',
      'input',
      'script'
]

for t in text:
  if (t.parent.name not in blacklist):
    output += '{} '.format(t)

tempout = output.split('.')
for i in range(tempout):
  if (i >= 3):
    tempout.remove(i)

output = '.'.join(tempout)

print(output)
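
A note on the final loop above: `range(tempout)` raises a `TypeError` because `tempout` is a list, not an integer, and `tempout.remove(i)` would try to remove the value `i` rather than the item at index `i`. A minimal sketch of the intended truncation, assuming a plain split on `'.'` is acceptable, is to slice the list instead. The helper name `first_n_by_period` is made up for illustration, and the sample string stands in for the scraped page text.

```python
def first_n_by_period(text, n=3):
    """Naively keep the first n period-delimited chunks of text."""
    parts = text.split('.')
    # Slice instead of removing items while looping; drop empty chunks.
    kept = [p.strip() for p in parts[:n] if p.strip()]
    return '. '.join(kept) + '.' if kept else ''

sample = "One. Two. Three. Four."
print(first_n_by_period(sample))
```

This still treats every period as a sentence boundary, so abbreviations and decimal numbers will be cut in the wrong place.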

Finding sentences in text is difficult. Normally you would look for characters that might end a sentence, such as '.' and '!'. But a period ('.') could also appear in the middle of a sentence, as in an abbreviation of a person's name, for example. I use a regular expression to look for a period or exclamation mark followed by either a single space or the end of the string, which works for the first three sentences of this page, but not for any arbitrary sentence.

import requests
from bs4 import BeautifulSoup
import re

url = 'https://www.troyhunt.com/the-773-million-record-collection-1-data-reach/'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')

paragraphs = soup.select('section.article_text p')
sentences = []
for paragraph in paragraphs:
    matches = re.findall(r'(.+?[.!])(?: |$)', paragraph.text)
    needed = 3 - len(sentences)
    found = len(matches)
    n = min(found, needed)
    for i in range(n):
        sentences.append(matches[i])
    if len(sentences) == 3:
        break
print(sentences)

Prints:

['Many people will land on this page after learning that their email address has appeared in a data breach I\'ve called "Collection #1".', "Most of them won't have a tech background or be familiar with the concept of credential stuffing so I'm going to write this post for the masses and link out to more detailed material for those who want to go deeper.", "Let's start with the raw numbers because that's the headline, then I'll drill down into where it's from and what it's composed of."]
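
To see where this pattern holds and where it breaks, here is a small offline sketch that applies the same regular expression to two hard-coded strings instead of a downloaded page. The helper name `first_sentences` and both sample strings are made up for illustration.

```python
import re

# Same pattern as in the answer above: lazily match up to a '.' or '!'
# that is followed by a single space or the end of the string.
SENTENCE_RE = re.compile(r'(.+?[.!])(?: |$)')

def first_sentences(text, n=3):
    """Return up to n sentence-like chunks from text."""
    return SENTENCE_RE.findall(text)[:n]

plain = "It rained all day. We stayed in! Nobody minded. The end."
abbrev = "Dr. Smith arrived. He was late!"

print(first_sentences(plain))   # ['It rained all day.', 'We stayed in!', 'Nobody minded.']
print(first_sentences(abbrev))  # ['Dr.', 'Smith arrived.', 'He was late!']
```

The second call shows the limitation mentioned above: the abbreviation "Dr." is followed by a space, so it is counted as a sentence of its own.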

To scrape the first three sentences, just add these lines to your code:

section = soup.find('section',class_ = "article_text post") #Finds the section tag with class "article_text post"

txt = section.p.text #Gets the text within the first p tag within the variable section (the section tag)

print(txt)

Output:

Many people will land on this page after learning that their email address has appeared in a data breach I've called "Collection #1". Most of them won't have a tech background or be familiar with the concept of credential stuffing so I'm going to write this post for the masses and link out to more detailed material for those who want to go deeper.

Hope that this helps!

Actually, using Beautiful Soup you can filter by the class "article_text post", which you can see in the page's source code:

myData=soup.find('section',class_ = "article_text post")
print(myData.p.text)

And get the inner text of the p element.

Use this instead of soup = BeautifulSoup(html_page, 'html.parser')
