Retrieving source code from a website with Python

I've been trying to extract links from a website, with no luck. From what I've read it can be done easily, but the links are inside a pop-up dialog within the website. The only way I can grab the links is to press Ctrl-A and view the source to copy them.

Is there a way to select all before grabbing the entire content?

I'd appreciate any information or pointers!

EDIT: I would like to avoid downloading anything beyond what Python already ships with, e.g. BS/Scrapy etc.

As far as retrieving links is concerned, it can be done using requests and bs4, as suggested by jonrsharpe. I am excited to answer this because I wrote a script like this 1 or 2 days ago.

from sys import argv
from time import sleep

import notify2
import requests
from bs4 import BeautifulSoup


def send_message(title, message):
    # Show a desktop notification through notify2.
    notify2.init("Init")
    notice = notify2.Notification(title, message)
    notice.show()


# Feed of the newest questions for the tag given on the command line.
url = "http://stackoverflow.com/feeds/tag?tagnames=%s&sort=newest" % argv[1]
while True:
    r = requests.get(url)
    # Retry until the server answers with 200 OK.
    while r.status_code != 200:
        r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    data = soup.find_all("link")
    # The third <link> is assumed to point at the newest question.
    question = data[2].get('href')
    # Skip "questions/<8-digit id>/" (19 characters) to keep the title slug.
    question = question[question.find('questions') + 19:]
    send_message("Question %s: " % argv[1].upper(), question)
    sleep(60)

Basically, this is a script which gives you a desktop notification every minute. The data shown is the first question for the specified Stack Overflow tag (in most cases it works just fine; in others you have to check that the URL is correct).
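The cases where the URL has to be checked come from the fixed offset in `question.find('questions') + 19`: it skips `questions/` (10 characters), an 8-digit id, and one slash, so a shorter or longer id shifts the result. A minimal sketch of a more tolerant version, assuming question URLs have the form `.../questions/<id>/<slug>`; the helper name `question_slug` and the example URL are made up for illustration:

```python
import re

# Assumed URL layout: .../questions/<numeric id>/<slug>
QUESTION_URL = re.compile(r"/questions/(\d+)/([^/?#]+)")


def question_slug(url):
    """Return the slug part of a question URL, or None if the layout differs."""
    match = QUESTION_URL.search(url)
    return match.group(2) if match else None


# Made-up example URL, used only for illustration.
example = "http://stackoverflow.com/questions/12345678/how-to-parse-html"
print(question_slug(example))
# The fixed-offset slice gives the same slug only because the id has 8 digits.
print(example[example.find("questions") + 19:])
```

Returning `None` instead of a shifted string makes it easier to notice when a link does not follow the assumed layout.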
In the script, requests.get() fetches everything at a URL, and the methods provided by bs4 parse the returned data.
By the way, the repo of this code is here. Any contributions would be appreciated.
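Since the question asks to avoid third-party packages, the same fetch-then-parse idea can also be sketched with only the standard library (`urllib.request` and `html.parser`). This is a rough sketch, not a drop-in replacement for the script: the names `LinkCollector`, `extract_links` and `fetch_links` are made up here, and decoding the page as UTF-8 is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against base_url."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url=""):
    """Parse an HTML string and return the links found in it."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


def fetch_links(url):
    """Download a page and return its links (performs a network request)."""
    with urlopen(url) as response:
        # Assumption: the page is UTF-8; undecodable bytes are replaced.
        html = response.read().decode("utf-8", errors="replace")
    return extract_links(html, url)


# Offline demonstration on a small made-up HTML snippet.
sample = '<p><a href="/questions/1">one</a> <a href="http://example.com/x">two</a></p>'
print(extract_links(sample, "http://example.com/"))
# ['http://example.com/questions/1', 'http://example.com/x']
```

Calling `fetch_links(url)` on a real page returns only the links present in the downloaded HTML. If the pop-up dialog's links are added by JavaScript after the page loads, they will probably not appear there.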
