
Python Web-scraping Solution

I'm new to Python and am working on an exercise in which I scrape the page numbers from a list of published papers at this URL.

When I inspect the element I want to scrape, I find this HTML:

<div class="src">
        Foreign Affairs, Vol. 79, No. 4 (Jul. - Aug., 2000), pp. 53-63
    </div>

The part I want to extract is the text between the div tags. This is what I attempted so far:

import requests
from bs4 import BeautifulSoup

url = "http://www.jstor.org/action/doAdvancedSearch?c4=AND&c5=AND&q2=&pt=&q1=nuclear&f3=all&f1=all&c3=AND&c6=AND&q6=&f4=all&q4=&f0=all&c2=AND&q3=&acc=off&c1=AND&isbn=&q0=china+&f6=all&la=&f2=all&ed=2001&q5=&f5=all&group=none&sd=2000"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
links = soup.find_all("div", class_="src")
for link in links:
    print(link)

I know that this code is unfinished and that's because I don't know where to go from here :/. Can anyone help me here?

An alternative to Tales Pádua's answer is this:

from bs4 import BeautifulSoup

html = """<div class="src">
    Foreign Affairs, Vol. 79, No. 4 (Jul. - Aug., 2000), pp. 53-63
</div>
<div class="src">
    Other Book, Vol. 1, No. 1 (Jul. - Aug., 2000), pp. 1-23
</div>"""
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("div", class_="src")
for link in links:
    print(link.text.strip())

This outputs:

Foreign Affairs, Vol. 79, No. 4 (Jul. - Aug., 2000), pp. 53-63
Other Book, Vol. 1, No. 1 (Jul. - Aug., 2000), pp. 1-23

This answer uses the class_ parameter, which is recommended in the documentation.
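Because `class` is a reserved word in Python, BeautifulSoup accepts either the `class_` keyword argument or a plain attribute dictionary. A minimal self-contained sketch (using made-up placeholder divs, not the JSTOR page) showing the two forms are interchangeable:

```python
from bs4 import BeautifulSoup

html = '<div class="src">A</div><div class="src">B</div>'
soup = BeautifulSoup(html, "html.parser")

# These two lookups match the same elements: the keyword form avoids
# the reserved word, the dict form spells the attribute out explicitly.
by_keyword = soup.find_all("div", class_="src")
by_dict = soup.find_all("div", {"class": "src"})

print([d.text for d in by_keyword])  # ['A', 'B']
print([d.text for d in by_dict])     # ['A', 'B']
```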


If you are looking to get the page number, and everything follows the format above (comma separated), you can change the for loop to grab the last element of the string:

print(link.text.split(",")[-1].strip())

This outputs:

pp. 53-63
pp. 1-23

If I understand you correctly, you want the page numbers inside every div with class="src".

If so, then you need to do:

import requests
import re
from bs4 import BeautifulSoup

url = "http://www.jstor.org/action/doAdvancedSearch?c4=AND&c5=AND&q2=&pt=&q1=nuclear&f3=all&f1=all&c3=AND&c6=AND&q6=&f4=all&q4=&f0=all&c2=AND&q3=&acc=off&c1=AND&isbn=&q0=china+&f6=all&la=&f2=all&ed=2001&q5=&f5=all&group=none&sd=2000"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
links = soup.find_all('div', {'class': 'src'})
for link in links:
    pages = re.search(r'(pp\.\s*\d+-\d+)', link.text)
    if pages:
        print(pages.group(1))

Note that I have used a regex to extract the page numbers. This may look strange to people unfamiliar with regular expressions, but I think it's more elegant than string operations like strip and split.
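As a minimal sketch of how that pattern behaves, here it is applied to the sample citation from the question, with the start and end pages additionally captured as separate groups (a variation not in the answer above):

```python
import re

src = "Foreign Affairs, Vol. 79, No. 4 (Jul. - Aug., 2000), pp. 53-63"

# pp\.  -> the literal "pp." (dot escaped), \s* -> optional whitespace,
# (\d+)-(\d+) -> start and end page numbers as separate capture groups
match = re.search(r'pp\.\s*(\d+)-(\d+)', src)
if match:
    print(match.group(0))               # pp. 53-63
    print(match.group(1), match.group(2))  # 53 63
```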
