
How to webscrape this website

I have a website here:

BSE SmallCap

About 100 companies are listed here. How can I save the next 100 companies programmatically using Python (or C#)? At the bottom of this page,

Showing 1 - 100 of 528    << Previous | Next >>

is seen. How can I access the link

Next >>

programmatically? This link appears as the base URL + '#' (http://money.rediff.com/indices/bse/bsesmallcap#). How can I save all 528 companies' details (as separate webpages: 1-100, 101-200, etc.)? Are there any special tailor-made programs for this kind of task?

You don't even need Scrapy or anything like that: there's no link to find behind that "Next" link, since it's actually javascript:

javascript:nextPage(document.paging.totalPages.value)

I used Chrome's developer tools to see what request it was actually making, and it turns out it's just a simple unauthenticated POST request. You can get any page you want with the following:

import requests

# Fetch page 3 of the index, 100 rows per page
r = requests.post('http://money.rediff.com/indices/bse/bsesmallcap',
                  data={'currentPageNo': 3, 'RowPerPage': 100})
print(r.text)

All you have to do is change the 'currentPageNo' argument to get whichever page you're looking for. You could probably also change the number of rows per page, but I didn't experiment with that. Update: You can't; I tried.
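If the goal is just to save each block of 100 rows as its own webpage, as the question asks, a minimal sketch might look like this (the page_N.html naming is my own illustration, and 6 pages assumes 528 rows at 100 per page):

import requests

for page in range(1, 7):
    r = requests.post('http://money.rediff.com/indices/bse/bsesmallcap',
                      data={'currentPageNo': page, 'RowPerPage': 100})
    # Save each page of results as a separate HTML file
    with open('page_{}.html'.format(page), 'w', encoding='utf-8') as f:
        f.write(r.text)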

In terms of actually saving the information, you can use BeautifulSoup to grab the data from each request and store it or save it. Given that the table regularly has the 'dataTable' class on each page, it's pretty easy to find. So, given that there are 6 pages, you'd end up with code that looks something like:

import requests
from bs4 import BeautifulSoup as BS

for page in range(1, 7):
    r = requests.post('http://money.rediff.com/indices/bse/bsesmallcap',
                      data={'currentPageNo': page, 'RowPerPage': 100})
    soup = BS(r.text, 'html.parser')  # name the parser explicitly
    table = soup.find(class_='dataTable')
    # Add table information to whatever output you plan to use
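To make that last comment concrete, here is one hedged way to flatten every page's table into a single CSV, assuming the table uses ordinary tr/td markup (I haven't verified the exact structure):

import csv
import requests
from bs4 import BeautifulSoup as BS

with open('bsesmallcap.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    for page in range(1, 7):
        r = requests.post('http://money.rediff.com/indices/bse/bsesmallcap',
                          data={'currentPageNo': page, 'RowPerPage': 100})
        table = BS(r.text, 'html.parser').find(class_='dataTable')
        # Write each row's cell text out as one CSV row
        for row in table.find_all('tr'):
            cells = [c.get_text(strip=True) for c in row.find_all(['td', 'th'])]
            if cells:
                writer.writerow(cells)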

The full link to "each page" is: http://money.rediff.com/indices/bse/bsesmallcap&cTab=12&sortBy=&sortDesc=&pageType=indices_wise&currentPageNo=1&RowPerPage=100&bTab=12

(I've removed the totalPages aspect, since you'll need to scrape this bit yourself.)

Once you know the number of pages (from scraping), you can increment currentPageNo until you have all the rows.

You can increase RowPerPage, but there seems to be an internal limit of 200 rows (even if you change it to, say, 500).
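For the scraping step, the javascript: link above reads document.paging.totalPages.value, which suggests the page count is exposed as a form input named totalPages. A minimal sketch under that assumption:

import requests
from bs4 import BeautifulSoup as BS

r = requests.post('http://money.rediff.com/indices/bse/bsesmallcap',
                  data={'currentPageNo': 1, 'RowPerPage': 100})
soup = BS(r.text, 'html.parser')

# Assumption: the value behind document.paging.totalPages.value is a
# (likely hidden) <input name="totalPages"> you can read directly.
total_pages = int(soup.find('input', {'name': 'totalPages'})['value'])

for page in range(2, total_pages + 1):
    r = requests.post('http://money.rediff.com/indices/bse/bsesmallcap',
                      data={'currentPageNo': page, 'RowPerPage': 100})
    # ...parse or save each page as above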

A spin on jdotjdot's answer using PyQuery instead of BeautifulSoup; I like it for the jQuery-esque notation for traversing. By default it will use urllib, or requests (if installed) for scraping.

from pyquery import PyQuery as pq

for page in range(1, 3):
    # POST request for the given page
    d = pq(url="http://money.rediff.com/indices/bse/bsesmallcap",
           data={"currentPageNo": page, "RowPerPage": 50},
           method="post")
    # jQuery-esque notation for selecting elements
    print(d("table.dataTable").text())
