How to go to next page using beautiful soup?

I have to extract information from 5 pages of a website. At the end of every page there is a "NEXT PAGE" button. This is the HTML for the next button:

<li class="pagination__next" data-reactid=".0.3.0.0.1.1.1.3.2">
    <span class="icon-arrowright-thin--pagination" data-reactid=".0.3.0.0.1.1.1.3.2.0">
        ::before
    </span>
</li>

I am using beautifulsoup4 to extract the information. How do I navigate to the next page? Can I use mechanize for this sort of navigation?

You can mimic the POST to https://colleges.niche.com/entity-search/, but a much simpler way is to get the total number of pages from the first page and then loop from 2 up to that number. All that gets added to the start URL is &page=page_number:

import requests
from bs4 import BeautifulSoup

start = "https://colleges.niche.com/?degree=4-year&sort=best"
url = "https://colleges.niche.com/?degree=4-year&sort=best&page={}"
# Fetch the first page; the last <option> of the page selector holds the page count.
soup = BeautifulSoup(requests.get(start).content, "html.parser")
pages = int(soup.select("select.pagination__pages__selector option")[-1].text.split(None, 1)[1])
print([a.text for a in soup.select("a.search__results__list__item__entity")])

# Pages 2..pages inclusive; range() excludes its stop value, hence pages + 1.
for page in range(2, pages + 1):
    soup = BeautifulSoup(requests.get(url.format(page)).content, "html.parser")
    print([a.text for a in soup.select("a.search__results__list__item__entity")])

If we run the code for a few iterations, you can see that we get each page:

In [1]: import requests
   ...: from bs4 import BeautifulSoup
   ...: start = "https://colleges.niche.com/?degree=4-year&sort=best"
   ...: url = "https://colleges.niche.com/?degree=4-year&sort=best&page={}"
   ...: soup = BeautifulSoup(requests.get(start).content, "html.parser")
   ...: pages = int(soup.select("select.pagination__pages__selector option")[-1].text.split(None, 1)[1])
   ...: print([a.text for a in soup.select("a.search__results__list__item__entity")])
   ...: for page in range(2, pages + 1):
   ...:     soup = BeautifulSoup(requests.get(url.format(page)).content, "html.parser")
   ...:     print([a.text for a in soup.select("a.search__results__list__item__entity")])
   ...:
[u'Stanford University', u'Massachusetts Institute of Technology', u'Yale University', u'Harvard University', u'Princeton University', u'Rice University', u'Bowdoin College', u'University of Pennsylvania', u'Washington University in St. Louis', u'Brown University', u'Duke University', u'Columbia University', u'Dartmouth College', u'Vanderbilt University', u'Pomona College', u'California Institute of Technology', u'University of Southern California', u'University of Notre Dame', u'University of Chicago', u'Washington & Lee University', u'Carleton College', u'Colgate University', u'University of Michigan - Ann Arbor', u'Northwestern University', u'Tufts University']
[u'Williams College', u'Georgetown University', u'Amherst College', u'Cornell University', u'Thomas Jefferson University', u'University of Texas - Health Science Center at Houston', u'Barnard College', u'Haverford College', u'Carnegie Mellon University', u'Emory University', u'University of California - Los Angeles', u'Harvey Mudd College', u'Medical University of South Carolina', u'Franklin W. Olin College of Engineering', u'Claremont McKenna College', u'Middlebury College', u'Swarthmore College', u'Bates College', u'University of Virginia', u'University of Texas - Austin', u'University of California - Berkeley', u'Virginia Tech', u'University of North Carolina at Chapel Hill', u'University of Texas - Medical Branch at Galveston', u'Davidson College']
[u'Colby College', u'Hamilton College', u'Samuel Merritt University', u'Georgia Institute of Technology', u'University of Richmond', u'Lehigh University', u'Grinnell College', u'Northeastern University', u'University of Illinois at Urbana-Champaign', u'New York University', u'University of Wisconsin', u'Wake Forest University', u'Reed College', u'Bucknell University', u'Oregon Health & Science University', u'Johns Hopkins University', u'Lafayette College', u'University of Texas - Health Science Center at San Antonio', u'Smith College', u'Wellesley College', u'University of Rochester', u'Scripps College', u'College of William & Mary', u'University of Florida', u'The Curtis Institute of Music']
[u'United States Coast Guard Academy', u'College of the Holy Cross', u'Penn State', u'Bryn Mawr College', u'Wesleyan University', u'Ohio State University', u'Colorado School of Mines', u'Texas A&M University', u'University of Maryland - Baltimore', u'Purdue University', u'University of California - Santa Barbara', u'University of Georgia', u'University of Miami', u'Tulane University', u'University of Tulsa', u'Boston College', u'The Juilliard School', u'Texas Tech University Health Sciences Center', u'Worcester Polytechnic Institute', u'Franklin & Marshall College', u'Brigham Young University', u'Southern Methodist University', u'Mount Holyoke College', u'Kenyon College', u'University of Washington']
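
As a side note, the page URLs can also be built with the standard library instead of string formatting. This is only a minimal sketch: `page_urls` is a hypothetical helper name, and it assumes, as described above, that the first page is the start URL itself and every later page just adds a `page` query parameter.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_urls(start, pages):
    """Yield the URL of every results page, from 1 to pages inclusive."""
    parts = urlsplit(start)
    query = parse_qsl(parts.query)
    for page in range(1, pages + 1):
        # Page 1 is the start URL unchanged; later pages append page=N.
        extra = [] if page == 1 else [("page", str(page))]
        yield urlunsplit(parts._replace(query=urlencode(query + extra)))


for u in page_urls("https://colleges.niche.com/?degree=4-year&sort=best", 3):
    print(u)
```

Each yielded URL can then be passed to requests.get() and parsed with BeautifulSoup exactly as in the loop above.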

If you were to mimic the POST, the following would work. Depending on what data you want, this may actually be preferable, since you get JSON back:

import requests
from bs4 import BeautifulSoup

start = "https://colleges.niche.com/?degree=4-year&sort=best"
post = "https://colleges.niche.com/entity-search/"

data = {"degreeType": ["4-year"], "sort": "best", "page": 1, "vertical": "colleges"}

soup = BeautifulSoup(requests.get(start).content, "html.parser")
pages = int(soup.select("select.pagination__pages__selector option")[-1].text.split(None, 1)[1])
for page in range(1, pages + 1):
    data["page"] = page
    r = requests.post(post, json=data)
    print(r.json())

That gives you data like:

{u'count': 2854, u'results': [{u'reviewCount': 258, u'netPrice': 20315, u'reviewAvg': 3.7713178294573644, u'totalStudents': 2034, u'grade': 4.33, u'tagline': u'4 Year &middot; Williamstown, MA', u'SATRange': u'1350-1560', u'label': u'Williams College', u'url': u'https://colleges.niche.com/williams-college/', u'ACTRange': u'31-34', u'location': {u'lat': 42.7117, u'lng': -73.2059}, u'guid': u'465D4A73-875C-498E-9C8F-E47568E156F2', u'type': u'College'}, {u'reviewCount': 1081, u'netPrice': 25786, u'reviewAvg': 3.698427382053654, u'totalStudents': 7226, u'grade': 4.33, u'tagline': u'4 Year &middot; Washington, DC', u'SATRange': u'1320-1520', u'label': u'Georgetown University', u'url': u'https://colleges.niche.com/georgetown-university/', u'ACTRange': u'30-33', u'location': {u'lat': 38.9088, u'lng': -77.0735}, u'guid': u'34AF6312-6F20-4D90-B512-AC5CD720AB25', u'type': u'College'}, {u'reviewCount': 247, u'netPrice': 14687, u'reviewAvg': 3.8259109311740893, u'totalStudents': 1792, u'grade': 4.33, u'tagline': u'4 Year &middot; Amherst, MA', u'SATRange': u'1350-1548', u'label': u'Amherst College', u'url': u'https://colleges.niche.com/amherst-college/', u'ACTRange': u'30-34', u'location': {u'lat': 42.3725, u'lng': -72.5185}, u'guid': u'127EC524-4BAC-4A5C-A7F5-1EAD9C309F44', u'type': u'College'}, {u'reviewCount': 1730, u'netPrice': 28537, u'reviewAvg': 3.654913294797688, u'totalStudents': 14269, u'grade': 4.33, u'tagline': u'4 Year &middot; Ithaca, NY', u'SATRange': u'1330-1510', u'label': u'Cornell University', u'url': u'https://colleges.niche.com/cornell-university/', u'ACTRange': u'30-34', u'location': {u'lat': 42.4453, u'lng': -76.4827}, u'guid': u'C35E497B-10BC-4482-92E5-F27941433B02', u'type': u'College'}, {u'reviewCount': 254, u'netPrice': None, u'reviewAvg': 3.8149606299212597, u'totalStudents': 649, u'grade': 4.33, u'tagline': u'4 Year &middot; Philadelphia, PA', u'SATRange': None, u'label': u'Thomas Jefferson University', u'url': 
u'https://colleges.niche.com/thomas-jefferson-university/', u'ACTRange': None, u'location': {u'lat': 39.9491, u'lng': -75.1581}, u'guid': u'E8C9EBC6-90C5-4CDF-A324-2CCE16060B61', u'type': u'College'}, {u'reviewCount': 131, u'netPrice': None, u'reviewAvg': 3.740458015267176, u'totalStudents': 539, u'grade': 4.33, u'tagline': u'4 Year &middot; Houston, TX', u'SATRange': None, u'label': u'University of Texas - Health Science Center at Houston', u'url': u'https://colleges.niche.com/university-of-texas----health-science-center-at-houston/', u'ACTRange': None, u'location': {u'lat': 29.7029, u'lng': -95.4032}, u'guid': u'43EEDD7D-8204-4014-961B-BEDDBD4C6417', u'type': u'College'}, {u'reviewCount': 390, u'netPrice': 21791, u'reviewAvg': 3.776923076923077, u'totalStudents': 2537, u'grade': 4.33, u'tagline': u'4 Year &middot; New York, NY', u'SATRange': u'1250-1440', u'label': u'Barnard College', u'url': u'https://colleges.niche.com/barnard-college/', u'ACTRange': u'28-32', u'location': {u'lat': 40.8091, u'lng': -73.964}, u'guid': u'DD4FCD82-8E4E-4F4C-A7DC-FADCEBB49681', u'type': u'College'}, {u'reviewCount': 190, u'netPrice': 22409, u'reviewAvg': 3.789473684210526, u'totalStudents': 1189, u'grade': 4.33, u'tagline': u'4 Year &middot; Haverford, PA', u'SATRange': u'1330-1490', u'label': u'Haverford College', u'url': u'https://colleges.niche.com/haverford-college/', u'ACTRange': u'31-34', u'location': {u'lat': 40.0134, u'lng': -75.3026}, u'guid': u'271075B3-07A0-450B-B4F3-78EB1FC7C03A', u'type': u'College'}, {u'reviewCount': 1310, u'netPrice': 33670, u'reviewAvg': 3.6068702290076335, u'totalStudents': 5699, u'grade': 4.33, u'tagline': u'4 Year &middot; Pittsburgh, PA', u'SATRange': u'1340-1540', u'label': u'Carnegie Mellon University', u'url': u'https://colleges.niche.com/carnegie-mellon-university/', u'ACTRange': u'30-34', u'location': {u'lat': 40.4446, u'lng': -79.9429}, u'guid': u'D8A17C0F-CC25-4D2A-B231-0303EA016427', u'type': u'College'}, {u'reviewCount': 1392, 
u'netPrice': 28203, u'reviewAvg': 3.757183908045977, u'totalStudents': 7732, u'grade': 4.33, u'tagline': u'4 Year &middot; Atlanta, GA', u'SATRange': u'1280-1460', u'label': u'Emory University', u'url': u'https://colleges.niche.com/emory-university/', u'ACTRange': u'29-32', u'location': {u'lat': 33.7988, u'lng': -84.3258}, u'guid': u'86AD5853-ED72-4EFD-855C-4746FF698941', u'type': u'College'}, {u'reviewCount': 4465, u'netPrice': 12510, u'reviewAvg': 3.838521836506159, u'totalStudents': 29033, u'grade': 4.33, u'tagline': u'4 Year &middot; Los Angeles, CA', u'SATRange': u'1190-1460', u'label': u'University of California - Los Angeles', u'url': u'https://colleges.niche.com/university-of-california----los-angeles/', u'ACTRange': u'27-33', u'location': {u'lat': 34.0689, u'lng': -118.444}, u'guid': u'1D1D82CF-C659-49F0-A526-7AFB85BD3A4F', u'type': u'College'}, {u'reviewCount': 122, u'netPrice': 33137, u'reviewAvg': 3.6639344262295084, u'totalStudents': 802, u'grade': 4.33, u'tagline': u'4 Year &middot; Claremont, CA', u'SATRange': u'1418-1570', u'label': u'Harvey Mudd College', u'url': u'https://colleges.niche.com/harvey-mudd-college/', u'ACTRange': u'33-35', u'location': {u'lat': 34.1061, u'lng': -117.711}, u'guid': u'20D662BE-8428-4DE2-BF0D-72D22F0A04B5', u'type': u'College'}, {u'reviewCount': 71, u'netPrice': None, u'reviewAvg': 4.014084507042253, u'totalStudents': 281, u'grade': 4.33, u'tagline': u'4 Year &middot; Charleston, SC', u'SATRange': None, u'label': u'Medical University of South Carolina', u'url': u'https://colleges.niche.com/medical-university-of-south-carolina/', u'ACTRange': None, u'location': {u'lat': 32.786, u'lng': -79.9469}, u'guid': u'7CD7C977-D16A-4399-8D7E-3B1FA0DFAB7D', u'type': u'College'}, {u'reviewCount': 115, u'netPrice': 29979, u'reviewAvg': 4.095652173913043, u'totalStudents': 350, u'grade': 4.33, u'tagline': u'4 Year &middot; Needham, MA', u'SATRange': u'1410-1550', u'label': u'Franklin W. 
Olin College of Engineering', u'url': u'https://colleges.niche.com/franklin-w-olin-college-of-engineering/', u'ACTRange': u'32-34', u'location': {u'lat': 42.2928, u'lng': -71.264}, u'guid': u'88A3438F-9304-481E-8022-0AE353991161', u'type': u'College'}, {u'reviewCount': 399, u'netPrice': 23982, u'reviewAvg': 3.87468671679198, u'totalStudents': 1298, u'grade': 4.33, u'tagline': u'4 Year &middot; Claremont, CA', u'SATRange': u'1350-1520', u'label': u'Claremont McKenna College', u'url': u'https://colleges.niche.com/claremont-mckenna-college/', u'ACTRange': u'30-33', u'location': {u'lat': 34.1023, u'lng': -117.707}, u'guid': u'DAE7241A-4D00-4C50-B1A5-F33BAF3A6C3B', u'type': u'College'}, {u'reviewCount': 458, u'netPrice': 20903, u'reviewAvg': 3.7139737991266375, u'totalStudents': 2492, u'grade': 4.33, u'tagline': u'4 Year &middot; Middlebury, VT', u'SATRange': u'1260-1470', u'label': u'Middlebury College', u'url': u'https://colleges.niche.com/middlebury-college/', u'ACTRange': u'30-33', u'location': {u'lat': 44.0091, u'lng': -73.1761}, u'guid': u'0E72BF23-A3CF-4995-9585-33B5BD0F9222', u'type': u'College'}, {u'reviewCount': 401, u'netPrice': 22557, u'reviewAvg': 3.56857855361596, u'totalStudents': 1534, u'grade': 4.33, u'tagline': u'4 Year &middot; Swarthmore, PA', u'SATRange': u'1360-1540', u'label': u'Swarthmore College', u'url': u'https://colleges.niche.com/swarthmore-college/', u'ACTRange': u'29-34', u'location': {u'lat': 39.9041, u'lng': -75.3561}, u'guid': u'891F20E2-4B6F-4626-83F3-15D502B2E7C1', u'type': u'College'}, {u'reviewCount': 320, u'netPrice': 22062, u'reviewAvg': 3.878125, u'totalStudents': 1773, u'grade': 4.33, u'tagline': u'4 Year &middot; Lewiston, ME', u'SATRange': None, u'label': u'Bates College', u'url': u'https://colleges.niche.com/bates-college/', u'ACTRange': None, u'location': {u'lat': 44.1053, u'lng': -70.2033}, u'guid': u'2C036559-5EBB-4C00-B3B8-6679A91FB040', u'type': u'College'}, {u'reviewCount': 1995, u'netPrice': 14069, u'reviewAvg': 
3.800501253132832, u'totalStudents': 15622, u'grade': 4.33, u'tagline': u'4 Year &middot; Charlottesville, VA', u'SATRange': u'1250-1460', u'label': u'University of Virginia', u'url': u'https://colleges.niche.com/university-of-virginia/', u'ACTRange': u'28-33', u'location': {u'lat': 38.0365, u'lng': -78.5026}, u'guid': u'9EA86CB5-E8A6-47E6-A219-FDCABC31AE51', u'type': u'College'}, {u'reviewCount': 5513, u'netPrice': 16832, u'reviewAvg': 3.8824596408489027, u'totalStudents': 36309, u'grade': 4.33, u'tagline': u'4 Year &middot; Austin, TX', u'SATRange': u'1170-1410', u'label': u'University of Texas - Austin', u'url': u'https://colleges.niche.com/university-of-texas----austin/', u'ACTRange': u'26-32', u'location': {u'lat': 30.2847, u'lng': -97.7373}, u'guid': u'BC90E2B6-E112-43ED-AC5C-3548829EA3DD', u'type': u'College'}, {u'reviewCount': 3718, u'netPrice': 16655, u'reviewAvg': 3.5922538999462077, u'totalStudents': 26320, u'grade': 4.33, u'tagline': u'4 Year &middot; Berkeley, CA', u'SATRange': u'1240-1500', u'label': u'University of California - Berkeley', u'url': u'https://colleges.niche.com/university-of-california----berkeley/', u'ACTRange': u'29-34', u'location': {u'lat': 37.8715, u'lng': -122.26}, u'guid': u'09E8CD9A-F401-4C8B-A79C-F02E10AC0201', u'type': u'College'}, {u'reviewCount': 3382, u'netPrice': 18398, u'reviewAvg': 3.8793613246599645, u'totalStudents': 23685, u'grade': 4.33, u'tagline': u'4 Year &middot; Blacksburg, VA', u'SATRange': u'1110-1320', u'label': u'Virginia Tech', u'url': u'https://colleges.niche.com/virginia-tech/', u'ACTRange': None, u'location': {u'lat': 37.2286, u'lng': -80.4233}, u'guid': u'EEB0E829-996A-45B1-9671-3EF4AF096423', u'type': u'College'}, {u'reviewCount': 2138, u'netPrice': 10936, u'reviewAvg': 3.7787652011225443, u'totalStudents': 17570, u'grade': 4.33, u'tagline': u'4 Year &middot; Chapel Hill, NC', u'SATRange': u'1220-1420', u'label': u'University of North Carolina at Chapel Hill', u'url': 
u'https://colleges.niche.com/university-of-north-carolina-at-chapel-hill/', u'ACTRange': u'28-32', u'location': {u'lat': 35.9122, u'lng': -79.051}, u'guid': u'5712B0C1-3A40-4EA1-A324-9C4F76FEFD10', u'type': u'College'}, {u'reviewCount': 110, u'netPrice': None, u'reviewAvg': 3.8545454545454545, u'totalStudents': 586, u'grade': 4.33, u'tagline': u'4 Year &middot; Galveston, TX', u'SATRange': None, u'label': u'University of Texas - Medical Branch at Galveston', u'url': u'https://colleges.niche.com/university-of-texas----medical-branch-at-galveston/', u'ACTRange': None, u'location': {u'lat': 29.3113, u'lng': -94.7764}, u'guid': u'5FEEDB69-A566-4671-B821-28304A74F474', u'type': u'College'}, {u'reviewCount': 264, u'netPrice': 22457, u'reviewAvg': 3.8333333333333335, u'totalStudents': 1770, u'grade': 4.33, u'tagline': u'4 Year &middot; Davidson, NC', u'SATRange': u'1230-1440', u'label': u'Davidson College', u'url': u'https://colleges.niche.com/davidson-college/', u'ACTRange': u'28-32', u'location': {u'lat': 35.5, u'lng': -80.8452}, u'guid': u'1AD50A05-6325-4392-B428-A08C944E61EF', u'type': u'College'}], u'page': 1, u'pageSize': 25, u'pageCount': 40}

This probably includes dynamically created content that you would not get in the page source.
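
Since the JSON shown above carries its own pageCount, the loop could in principle stop based on the response rather than on a number scraped from the HTML. The following is a rough sketch under that assumption. `iter_results` and `fetch_page` are hypothetical names, and `fake_fetch` is a stand-in with made-up data so the snippet runs without network access; only the keys results, label, page and pageCount are taken from the output above.

```python
def iter_results(fetch_page):
    """Yield every entry of 'results' across all pages.

    fetch_page(page) is assumed to return the decoded JSON dict for that page.
    """
    page = 1
    while True:
        payload = fetch_page(page)
        for item in payload["results"]:
            yield item
        # Stop once the server-reported last page has been consumed.
        if page >= payload["pageCount"]:
            break
        page += 1


def fake_fetch(page):
    # Stand-in for the real request; the data here is invented for illustration.
    names = {1: ["A College", "B College"], 2: ["C College"]}
    return {"results": [{"label": n} for n in names[page]],
            "page": page, "pageCount": 2}


print([item["label"] for item in iter_results(fake_fetch)])
```

With the real endpoint, fetch_page would wrap the requests.post(post, json=data).json() call from the code above, setting data["page"] before each request.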

For the reviews URL https://colleges.niche.com/williams-college/reviews, you need to parse a token (the entityGuid) from the page source and then query the reviews API, much like before:

import re

import requests
from bs4 import BeautifulSoup

# The entity GUID is embedded in the script tag with id "dataLayerTag".
patt = re.compile('"entityGuid":"(.*?)"')
url = "https://colleges.niche.com/williams-college/reviews/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
data_tag = patt.search(soup.select_one("#dataLayerTag").text).group(1)

# Request a page of reviews from the API using the GUID.
params = {"e": data_tag, "page": 2, "limit": "20"}
url = "https://niche.com/api/entity-reviews/"
resp = requests.get(url, params=params)
print(resp.json())

Which gives you:

{u'reviews': [{u'body': u'I enjoy being in classes here, but the work gets overwhelming. People are great but very cliquy.', u'rating': 4, u'guid': u'35b6faeb-95b2-4385-b3ee-19e6c7984e1b', u'created': u'2016-04-20T22:24:56Z', u'author': u'College Sophomore'}, {u'body': u'The alumni network is great. Easy to use. But the career center sucks.', u'rating': 4, u'guid': u'beddcae1-d860-4a8a-a431-45bf7e7087e6', u'created': u'2016-04-20T22:24:56Z', u'author': u'College Sophomore'}, {u'body': u"It's hard for sophomores to get good housing. Even as a senior, the good housings are far away from campus. But almost everyone has singles, even freshman.", u'rating': 3, u'guid': u'fff99560-0b4f-499d-a95b-7b3b3f9826f0', u'created': u'2016-04-20T22:19:27Z', u'author': u'College Sophomore'}, {u'body': u"We don't have greek life.", u'rating': 1, u'guid': u'69e60cf0-ff3c-4b34-acf1-6315d878c205', u'created': u'2016-04-20T22:17:35Z', u'author': u'College Sophomore'}, {u'body': u"There's not a lot of team spirit here. Athletes are nice, but they tend to hang among themselves.", u'rating': 3, u'guid': u'b31ee366-1b68-4c0f-b262-ff628243887c', u'created': u'2016-04-20T22:17:02Z', u'author': u'College Sophomore'}, {u'body': u'Williams offer a lot of chances to study abroad, but the social scene is very very limited.', u'rating': 4, u'guid': u'11a3feb2-21fa-45d9-8ee0-e6e1e8cea0c0', u'created': u'2016-04-20T22:15:35Z', u'author': u'College Sophomore'}, {u'body': u"Most people will live on campus all four years. It's not a bad deal!", u'rating': 4, u'guid': u'4a845124-7cfd-4059-8d63-cb1d414ce0cc', u'created': u'2016-04-08T13:58:30Z', u'author': u'College Senior'}, {u'body': u'The facilities have everything you could need as a varsity or non-varsity athlete. With our new football/lacrosse field and track, we have it made! 
Still, with an active there is always competition for prime field time, and IM sports are relegated either to early/late hours or ungroomed fields.', u'rating': 4, u'guid': u'31c89c4d-91ee-4b92-a198-3e12c304d7e1', u'created': u'2016-04-08T13:55:12Z', u'author': u'College Senior'}, {u'body': u'I have loved my time at Williams! The best part of my experience has been the people here, and as a senior trying to figure out post graduate plans, I am comforted by the willingness to help and commitment to the College from alumni. Go Ephs!', u'rating': 4, u'guid': u'4458ed87-4183-4784-908a-6ae67582e82c', u'created': u'2016-04-08T13:51:51Z', u'author': u'College Senior'}, {u'body': u'Could be better but overall good.', u'rating': 4, u'guid': u'08327955-2698-4fe6-ac1f-13108327cc21', u'created': u'2016-01-01T22:51:16Z', u'author': u'College Junior'}, {u'body': u'Better this year than past years.', u'rating': 3, u'guid': u'1892de02-eb45-42b5-b728-34912499e5eb', u'created': u'2016-01-01T22:43:54Z', u'author': u'College Junior'}, {u'body': u'Could have better facilities. Otherwise, great.', u'rating': 4, u'guid': u'2dc48cb2-d21f-4fd6-a9c7-19a5e513e6d6', u'created': u'2016-01-01T22:40:45Z', u'author': u'College Junior'}, {u'body': u'Awesome experience. Very community-oriented school. I love this place. Great people. Everyone wants to help you, the professors are amazing.', u'rating': 5, u'guid': u'5fa28a31-9391-4db7-b70d-5e2aa58708b3', u'created': u'2016-01-01T22:39:06Z', u'author': u'College Junior'}, {u'body': u"Williams has been the perfect place for me. My professors have been incredible mentors--I've gone to three professors' houses for dinner. The location is beautiful, and perfect for focusing on academics. I've been able to get very involved in all my clubs and really find what makes me passionate. But best of all is the people. They're all smart and talented and wonderful. 
I am so lucky.", u'rating': 5, u'guid': u'81ff499b-4721-4625-bee1-acf1e9b21916', u'created': u'2015-08-25T13:08:28Z', u'author': u'College Junior'}, {u'body': u"I don't know much, only seniors can live off campus.", u'rating': 3, u'guid': u'd9dc2e2f-a08d-4a01-8fe2-410623f93d7a', u'created': u'2015-04-27T19:31:06Z', u'author': u'College Freshman'}, {u'body': u"Everything closes really early, but there's some good food. No chains really.", u'rating': 3, u'guid': u'5993a99e-a936-40c8-ae0d-4581c8d089ef', u'created': u'2015-04-27T19:30:01Z', u'author': u'College Freshman'}, {u'body': u"It's kind of sad. There's never more than a handful of things happening on fridays or satudays and there's nothing for the rest of the week", u'rating': 3, u'guid': u'65c83983-2f6f-4b08-b870-06c35fd2b0e9', u'created': u'2015-04-27T19:27:34Z', u'author': u'College Freshman'}, {u'body': u"Having visitors is pretty easy. One of the officers is the worst but otherwise they're generally lenient about weed and alcohol.", u'rating': 4, u'guid': u'bcd95788-22b7-4a23-b942-2493206d1734', u'created': u'2015-04-27T19:21:34Z', u'author': u'College Freshman'}, {u'body': u"They usually give you a good package, but a lot of it is work-study and students don't have the free time for that here.", u'rating': 3, u'guid': u'1a87483c-952c-479b-9a57-65fb09895e75', u'created': u'2015-04-27T19:19:35Z', u'author': u'College Freshman'}, {u'body': u"Food is kind of repetitive. Pretty much all the kitchens are very wasteful. We can't use meal plans anywhere off campus.", u'rating': 3, u'guid': u'361b725f-bedc-4452-843d-5dc284c18dcd', u'created': u'2015-04-27T19:17:22Z', u'author': u'College Freshman'}], u'total': 246, u'limit': 20, u'page': 2}

You should be able to figure the rest out yourself based on the other parts of the answer.
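
As a hint for that remaining part, the two pieces of logic involved can be sketched without any network access: pulling the GUID out of the script text, and working out how many review pages there are from the total and limit fields in the response. `extract_entity_guid` and `review_page_count` are hypothetical helper names, and the sample string is invented to match the regular expression used above.

```python
import math
import re

ENTITY_GUID = re.compile(r'"entityGuid":"(.*?)"')


def extract_entity_guid(script_text):
    """Return the entityGuid found in the script text, or None if absent."""
    match = ENTITY_GUID.search(script_text)
    return match.group(1) if match else None


def review_page_count(total, limit):
    """Number of pages needed to cover `total` reviews at `limit` per page."""
    return math.ceil(total / limit)


sample = '{"pageType":"reviews","entityGuid":"465D4A73-875C-498E-9C8F-E47568E156F2"}'
print(extract_entity_guid(sample))
print(review_page_count(246, 20))  # total and limit as in the response above
```

Looping page from 1 to that count and passing it in params would then fetch every page of reviews.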

If the "next page" button relies on JavaScript, then yes, you have to automate a browser. You can do it with Selenium:

import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

# Assumes chromedriver is on PATH (Selenium 4.6+ can also locate it automatically).
browser = webdriver.Chrome()

url = "https://www.example.com"  # browser.get() needs the scheme
browser.get(url)
# Wait until you see some element that signals the page is completely loaded
WebDriverWait(browser, timeout=10).until(lambda x: x.find_element(By.CLASS_NAME, 'Even'))

# Do your things with the first page
content = browser.page_source.encode('ascii', 'ignore').decode("utf-8")

# Now, if you are sure there is a next page
next_button_class = 'icon-arrowright-thin--pagination'  # insert the class of the 'next' button here
browser.find_element(By.CLASS_NAME, next_button_class).click()
time.sleep(3)

# Wait until you see some element that signals the page is completely loaded
WebDriverWait(browser, timeout=10).until(lambda x: x.find_element(By.CLASS_NAME, 'Even'))

content = browser.page_source.encode('ascii', 'ignore').decode("utf-8")

BeautifulSoup is an HTML parser, not a web browser; it can't navigate or download pages. For that you'd typically use an HTTP library such as urllib or requests to fetch the HTML from a particular URL and feed it to BeautifulSoup. In your case, mechanize could be used to do this.

Unfortunately, the HTML you supplied for the pagination button isn't a link, so it doesn't have an href attribute. If it did, you could easily parse the URL from it and tell your HTTP library to go fetch it.

Instead, you'll need to simulate a click event on that button, wait a short amount of time, assume that the new page has loaded, and then pass the resulting HTML to BeautifulSoup. Note that mechanize does not execute JavaScript, so if the button is script-driven this takes a browser driver such as Selenium.
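
For the simpler case mentioned above, where the pagination button does contain a real link, resolving the next URL might look roughly like this. The sketch uses only the standard library so that it runs without extra packages; `NextLinkFinder` and `next_page_url` are hypothetical names, and the sample markup with an <a href> inside the li is an assumption, not the site's actual HTML. With BeautifulSoup the same lookup would be something like soup.select_one("li.pagination__next a").

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Record the href of the first <a> inside <li class="pagination__next">."""

    def __init__(self):
        super().__init__()
        self.in_next = False
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "li" and "pagination__next" in classes:
            self.in_next = True
        elif tag == "a" and self.in_next and self.href is None:
            self.href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_next = False


def next_page_url(current_url, html):
    """Return the absolute URL of the next page, or None if there is no link."""
    finder = NextLinkFinder()
    finder.feed(html)
    return urljoin(current_url, finder.href) if finder.href else None


# Hypothetical markup in which the button wraps a real link.
html = '<ul><li class="pagination__next"><a href="/list?page=2">Next</a></li></ul>'
print(next_page_url("https://www.example.com/list?page=1", html))
```

For the markup in the question, which has no <a> element, next_page_url returns None, which is the signal that a click has to be simulated instead.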


 