
Text missing after scraping a website using BeautifulSoup

I'm writing a Python script to get the number of pull requests opened by a particular user during the ongoing Hacktoberfest event. Here's the official Hacktoberfest website: https://hacktoberfest.digitalocean.com/.
Here's my code:

url = 'https://hacktoberfest.digitalocean.com/stats/user'
import urllib.request
from bs4 import BeautifulSoup

# request the stats page with a browser-like User-Agent
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urllib.request.urlopen(req).read()

# parse the HTML and pull out the progress block
soup = BeautifulSoup(html, 'html.parser')
name_box = soup.find('div', attrs={'class': 'userstats--progress'})
print(name_box)

Where 'user' in the first line of the code should be replaced by the user's GitHub handle (e.g. BAJUKA).

Below is the HTML tag I'm aiming to scrape:

<div class="userstats--progress">
        <p>
          Progress (<span data-js="userPRCount">5</span>/5)
        </p>
          <div class="ProgressBar ProgressBar--three u-mb--regular ProgressBar--full" data-js="progressBar"></div>
      </div>

This is what I get after I run my code:

<div class="userstats--progress">
<p>
          Progress (<span data-js="userPRCount"></span>/5)
        </p>
<div class="ProgressBar ProgressBar--three u-mb--regular" data-js="progressBar"></div>
</div>

The difference is on the third line, where the number of pull requests is missing (i.e. in the span tag the 5 is missing).
These are the questions I want to ask:
1. Why is the number of pull requests (i.e. 5 in this case) missing from the scraped output?
2. How can I solve this issue, i.e. get the number of pull requests successfully?

The data you're looking for is not in the original HTML that the Hacktoberfest server sends and that your script downloads for Beautiful Soup to parse; it's inserted into the page by the JavaScript that runs in your browser after that original HTML has loaded.

If you use this shell command to download the data that's actually served as the page, you'll see that the span tag you're looking at starts off empty:

curl -s 'https://hacktoberfest.digitalocean.com/stats/BAJUKA' | grep -3 Progress
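If you'd rather make the same check from Python, here's a minimal sketch using the libraries already in the question (BAJUKA is just an example handle):

import urllib.request
from bs4 import BeautifulSoup

url = 'https://hacktoberfest.digitalocean.com/stats/BAJUKA'
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urllib.request.urlopen(req).read()
soup = BeautifulSoup(html, 'html.parser')

# the span that should hold the PR count comes back empty from the server
print(soup.find('span', attrs={'data-js': 'userPRCount'}))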

What's the JavaScript that fills that tag? Well, it's minified, so it's very hard to unpick what's going on. You can find it included at the very bottom of the original data, here:

curl -s 'https://hacktoberfest.digitalocean.com/stats/BAJUKA' | grep -3 "script src=" | tail -n5

which, when I run it, outputs this:

 <script src="https://go.digitalocean.com/js/forms2/js/forms2.min.js"></script> <script src="/assets/application-134859a20456d7d32be9ea1bc32779e87cad0963355b5372df99a0cff784b7f0.js"></script> 

That crazy-looking source URL points to a minified piece of JavaScript, which means it has been automatically shrunk, which also means it's almost unreadable. But if you open that URL and scroll right down to the bottom, you can see some garbled JavaScript which you can try to decode.

I noticed this bit:

var d="2018-09-30T10%3A00%3A00%2B00%3A00",f="2018-11-01T12%3A00%3A00%2B00%3A00";$.getJSON("https://api.github.com/search/issues?q=-label:invalid+created:"+d+".."+f+"+type:pr+is:public+author:"+t+"&per_page=300"

Which I think is where it gets the data to fill that div. If you request and parse that URL, I think you'll find the data you need. You'll need to fill in the dates for that search, and the author; a rough sketch of that request is below. Good luck!
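Here is a minimal sketch of that request in Python, using the event dates and query string taken from the minified snippet above (the username BAJUKA is just an example; unauthenticated GitHub API requests are rate-limited):

import json
import urllib.request

user = 'BAJUKA'  # GitHub handle to look up (an example, as in the question)
# URL-encoded event window copied from the minified JavaScript above
start = '2018-09-30T10%3A00%3A00%2B00%3A00'
end = '2018-11-01T12%3A00%3A00%2B00%3A00'

api_url = ('https://api.github.com/search/issues'
           '?q=-label:invalid+created:' + start + '..' + end +
           '+type:pr+is:public+author:' + user + '&per_page=300')

req = urllib.request.Request(api_url, headers={'User-Agent': 'Mozilla/5.0'})
data = json.loads(urllib.request.urlopen(req).read())

# total_count is the number of matching pull requests, i.e. the value the
# page's JavaScript writes into the userPRCount span
print(data['total_count'])

If total_count matches what the stats page shows in your browser, you've found the same data source the page's own JavaScript uses.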
