python web-scraping through a login website

Looking for some help scraping a website that requires a login. Essentially the website provides trading card prices (which I believe come from eBay), but in a format that allows searching beyond the 90-day window available on eBay's own site. The login URL is https://members.pwccmarketplace.com/login and the URL I search from is https://members.pwccmarketplace.com/. I searched previous posts and found one I thought I could replicate, but with no success. Below is the code; any help on whether it could work or not would be appreciated.

#https://stackoverflow.com/questions/47438699/scraping-a-website-with-python-3-that-requires-login
import requests
from lxml import html
from bs4 import BeautifulSoup
import unicodecsv as csv
import os
import sys
import io
import time
import datetime
from datetime import datetime
from datetime import date
import pandas as pd
import numpy as np
from time import sleep
from random import randint
from urllib.parse import quote

Product_name = []
Price = []
Date_sold = []

url = "https://www.pwccmarketplace.com/login"
values = {"email": "xyz@abc.com",
          "password": "password"}

session = requests.Session()

r = session.post(url, data=values)

Search_name = input("Search for: ")
Exclude_terms = input("Exclude these terms (- in front of all, no spaces): ")
qstr = quote(Search_name)
qstrr = quote(Exclude_terms)
Number_pages = int(input("Number of pages you want searched (Number -1): "))

pages = np.arange(1, Number_pages)

for page in pages:

    params = {"Category": 6, "deltreeid": 6, "do": "Delete Tree"}
    url = "https://www.pwccmarketplace.com/market-price-research?q=" + qstr + "+" + qstrr + "&year_min=2004&year_max=2020&price_min=0&price_max=10000&sort_by=date_desc&sale_type=auction&items_per_page=250&page=" + str(page)

    result = session.get(url, data=params)

    soup = BeautifulSoup(result.text, "lxml")

    search = soup.find_all('tr')

    sleep(randint(2,10))

    for container in search:

Code continues but is not relevant to this question.

There is a token sent in the payload when you perform the POST to https://members.pwccmarketplace.com/login. This token is located in a hidden input tag on the login page and can be scraped with BeautifulSoup:

import requests
from bs4 import BeautifulSoup

session = requests.Session()

email = "your@email.com"
password = "your_password"

# fetch the login page first so the hidden CSRF token can be read from it
r = session.get("https://members.pwccmarketplace.com/login")

soup = BeautifulSoup(r.text, "html.parser")
token = soup.find("input", {"name": "_token"})["value"]

# post the credentials together with the token; the Session keeps the auth cookies
r = session.post(
    "https://members.pwccmarketplace.com/login",
    data={
        "_token": token,
        "redirect": "",
        "email": email,
        "password": password,
        "remember": "true"
    }
)
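
If the login succeeds, the same session object can be reused for the search request from the question. Below is a minimal sketch that assumes the market-price-research URL and query parameters taken from the asker's code, and that results are rendered as tr rows as the question's parsing implies; the search term is just a placeholder:

from urllib.parse import quote

query = quote("example card name")  # placeholder search term, replace with your own
search_url = (
    "https://members.pwccmarketplace.com/market-price-research"
    "?q=" + query +
    "&year_min=2004&year_max=2020&price_min=0&price_max=10000"
    "&sort_by=date_desc&sale_type=auction&items_per_page=250&page=1"
)

result = session.get(search_url)  # the login cookies are sent automatically by the session
soup = BeautifulSoup(result.text, "html.parser")

for row in soup.find_all("tr"):  # same row-based parsing the question uses
    print(row.get_text(" ", strip=True))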
