
Python web scrape with Beautiful Soup

I am able to scrape this site's tables with no issue; however, to get access to the tables I customize, I need to log in first and then scrape, because otherwise I get the default output. I feel like I am close, but I am relatively new to Python. Looking forward to learning more about mechanize and BeautifulSoup.

It seems to be logging in correctly, since I get an "incorrect password" error if I purposely enter a wrong password below, but how do I connect the login to the URL I want to scrape?

from bs4 import BeautifulSoup
import urllib
import csv
import mechanize
import cookielib

cj = cookielib.CookieJar()
br = mechanize.Browser()
br.set_cookiejar(cj)
br.open("http://www.barchart.com/login.php")

br.select_form(nr=0)
br.form['email'] = 'username'
br.form['password'] = 'password'
br.submit()

#print br.response().read()

r = urllib.urlopen("http://www.barchart.com/stocks/sp500.php?view=49530&_dtp1=0").read()

soup = BeautifulSoup(r, "html.parser")

tables = soup.find("table", attrs={"class" : "datatable ajax"})

headers = [header.text for header in tables.find_all('th')]

rows = []

for row in tables.find_all('tr'):
    rows.append([val.text.encode('utf8') for val in row.find_all('td')])


with open('snp.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(row for row in rows if row)

#from pymongo import MongoClient
#import datetime
#client = MongoClient('localhost', 27017)

print soup.table.get_text()

I am not sure that you actually need to log in to retrieve the URL in your question; I get the same results whether logged in or not.

However, if you do need to be logged in to access other data, the problem is that you are logging in with mechanize, but then using urllib.urlopen() to access the page. There is no connection between the two, so any session data gathered by mechanize is not available to urlopen when it makes its request.

In this case you don't need to use urlopen(), because you can open the URL and access the HTML with mechanize:

r = br.open("http://www.barchart.com/stocks/sp500.php?view=49530&_dtp1=0")
soup = BeautifulSoup(r.read(), "html.parser")
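The table-parsing logic from the question then works unchanged on the mechanize response. A minimal self-contained sketch of that parsing step, using inline sample HTML (hypothetical symbols and prices, not the live page's actual columns) so it runs without a network connection:

```python
from bs4 import BeautifulSoup

# Sample HTML standing in for the page's "datatable ajax" table
# (hypothetical data; the real page's columns will differ).
html = """
<table class="datatable ajax">
  <tr><th>Symbol</th><th>Last</th></tr>
  <tr><td>AAPL</td><td>150.00</td></tr>
  <tr><td>MSFT</td><td>300.00</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", attrs={"class": "datatable ajax"})

# Header cells come from <th>, data cells from <td>, as in the question.
headers = [th.text for th in table.find_all("th")]
rows = [[td.text for td in tr.find_all("td")] for tr in table.find_all("tr")]
rows = [r for r in rows if r]  # drop the header row, which has no <td> cells
```

The final filter is the same trick as the question's `row for row in rows if row`: the header `<tr>` yields an empty list because it contains no `<td>` cells, so filtering out empty rows keeps only the data rows.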
