How to collect text data from a URL link with or without ".html" in the link?

I am trying to collect some text data from a URL like https://scikit-learn.org/stable/modules/linear_model.html.

I would like to get the following text from the HTML:

 1.1. Linear Models¶
 The following are a set of methods intended for regression in which the target value is 
 expected to be a linear combination of the features. In mathematical notation, if 
 is the predicted value.

My code:

import urllib.request
from bs4 import BeautifulSoup
link = "https://scikit-learn.org/stable/modules/linear_model.html"
f = urllib.request.urlopen(link)
html = f.read()
soup = BeautifulSoup(html)
print(soup.prettify()) 

How do I navigate into the body of the HTML to get the text shown above?

I also need to do the same for some links that do not end in ".html". I use the same code, but none of the text data comes back from those links.

I cannot see any of the text data when I print the result with

 print(soup.prettify())

The return status is

  200

What could be the reason?

Thanks.

When creating a BeautifulSoup object, you should specify the parser you want to use; if you omit it, Beautiful Soup picks the best parser it finds installed and issues a warning. I would also recommend using requests instead of urllib, but that is entirely up to you. Here is how to extract the text you want:

div = soup.find('div', class_="section")  # Find the first div with class "section"

print(div.h1.text)  # Print the text of the first h1 tag inside that div

print(div.p.text)  # Print the text of the first p tag inside that div

Output:

1.1. Linear Models¶
The following are a set of methods intended for regression in which
the target value is expected to be a linear combination of the features.
In mathematical notation, if \(\hat{y}\) is the predicted
value.

Here is the full code:

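import urllib.request  # import the submodule explicitly; a bare "import urllib" may not expose it
from bs4 import BeautifulSoup

link = "https://scikit-learn.org/stable/modules/linear_model.html"
f = urllib.request.urlopen(link)
html = f.read()
# Name the parser explicitly; 'html5lib' has to be installed separately
soup = BeautifulSoup(html, 'html5lib')

# First div with class "section"
div = soup.find('div', class_="section")

print(div.h1.text)  # heading text

print(div.p.text)  # text of the first paragraph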
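
The code above raises AttributeError if the page has no div with class "section", because soup.find then returns None. A minimal sketch of a more defensive variant follows. It keeps urllib.request from the original code, and as an assumption it swaps in the built-in "html.parser" so no extra parser package is needed. The helper names first_heading_and_paragraph and fetch_html and the sample markup are hypothetical, made up for illustration.

```python
import urllib.request

from bs4 import BeautifulSoup


def first_heading_and_paragraph(html, parser="html.parser"):
    """Return (heading, paragraph) from the first div.section, or None.

    "html.parser" is an assumption made here so the sketch runs without
    html5lib; pass parser="html5lib" to match the answer above.
    """
    soup = BeautifulSoup(html, parser)
    div = soup.find("div", class_="section")
    # find() returns None when nothing matches, so check before using it.
    if div is None or div.h1 is None or div.p is None:
        return None
    heading = div.h1.get_text(strip=True)
    paragraph = div.p.get_text(" ", strip=True)
    return heading, paragraph


def fetch_html(url):
    """Download the raw bytes of a page (hypothetical helper, not called below)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


# Hypothetical sample markup standing in for a downloaded page.
sample = """
<html><body>
<div class="section">
<h1>1.1. Linear Models</h1>
<p>The following are a set of methods intended for regression.</p>
</div>
</body></html>
"""

print(first_heading_and_paragraph(sample))
# A page without the expected div yields None instead of an exception.
print(first_heading_and_paragraph("<html><body><p>no section</p></body></html>"))
```

For a live page you would call first_heading_and_paragraph(fetch_html(link)). If it returns None even though the status is 200, one possible reason is that the text you see in the browser is not in the downloaded HTML at all, for example because the page builds it with JavaScript, which urllib does not execute. Searching the raw HTML for a phrase you expect is a quick way to check.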
