Text missing after scraping a website using BeautifulSoup
I'm writing a Python script to get the number of pull requests opened by a particular user during the ongoing Hacktoberfest event. Here's a link to the official website of Hacktoberfest.
Here's my code:
url= 'https://hacktoberfest.digitalocean.com/stats/user'
import urllib.request
from bs4 import BeautifulSoup
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urllib.request.urlopen(req).read()
soup = BeautifulSoup(html, 'html.parser')
name_box = soup.find('div', attrs={'class': 'userstats--progress'})
print(name_box)
Where 'user' in the first line of the code should be replaced by the user's GitHub handle (e.g. BAJUKA).
Below is the HTML tag I'm aiming to scrape:
<div class="userstats--progress">
<p>
Progress (<span data-js="userPRCount">5</span>/5)
</p>
<div class="ProgressBar ProgressBar--three u-mb--regular ProgressBar--full" data-js="progressBar"></div>
</div>
This is what I get after I run my code:
<div class="userstats--progress">
<p>
Progress (<span data-js="userPRCount"></span>/5)
</p>
<div class="ProgressBar ProgressBar--three u-mb--regular" data-js="progressBar"></div>
</div>
The difference is on the third line, where the number of pull requests is missing (i.e. the 5 inside the span tag).
These are the questions that I want to ask:
1. Why is the number of pull requests (i.e. 5 in this case) missing from the scraped output?
2. How can I solve this issue, i.e. successfully get the number of pull requests?
The data you're looking for is not in the original data that the Hacktoberfest server sends and Beautiful Soup downloads and parses; it's inserted into the HTML by the Javascript code that runs on that page in your browser after the original data is loaded.
If you use this shell command to download the data that's actually served as the page, you'll see that the span tag you're looking at starts off empty:
curl -s 'https://hacktoberfest.digitalocean.com/stats/BAJUKA' | grep -3 Progress
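You can run the same check from Python. As a minimal sketch, parsing the served markup quoted in the question shows that Beautiful Soup only ever sees an empty span, because no Javascript runs during scraping:

```python
from bs4 import BeautifulSoup

# The markup the server actually returns, reproduced from the
# question's output: note the span arrives with no text inside it.
served_html = """
<div class="userstats--progress">
<p>
Progress (<span data-js="userPRCount"></span>/5)
</p>
<div class="ProgressBar ProgressBar--three u-mb--regular" data-js="progressBar"></div>
</div>
"""

soup = BeautifulSoup(served_html, 'html.parser')
span = soup.find('span', attrs={'data-js': 'userPRCount'})
print(repr(span.get_text()))  # '' -- empty until the browser's Javascript fills it
```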
What's the Javascript that fills that tag? Well, it's minified, so it's very hard to unpick what's going on. You can find it included at the very bottom of the original data, here:
curl -s 'https://hacktoberfest.digitalocean.com/stats/BAJUKA' | grep -3 "script src=" | tail -n5
which, when I run it, outputs this:
<script src="https://go.digitalocean.com/js/forms2/js/forms2.min.js"></script> <script src="/assets/application-134859a20456d7d32be9ea1bc32779e87cad0963355b5372df99a0cff784b7f0.js"></script>
That crazy-looking source URL points to a minified piece of Javascript, which means it's been automatically shrunk, which also means that it's almost unreadable. But if you go to that page and scroll right down to the bottom, you can see some garbled Javascript which you can try to decode.
I noticed this bit:
var d="2018-09-30T10%3A00%3A00%2B00%3A00",f="2018-11-01T12%3A00%3A00%2B00%3A00";$.getJSON("https://api.github.com/search/issues?q=-label:invalid+created:"+d+".."+f+"+type:pr+is:public+author:"+t+"&per_page=300"
Which I think is where it gets the data to fill that div. If you load up and parse that URL, I think you'll find the data you need. You'll need to fill in the dates for that search, and the author. Good luck!
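To sketch what that could look like in Python: the helper below rebuilds the same search URL the minified script constructs, using the 2018 event window from the snippet above (`build_search_url` and `pr_count` are hypothetical names, not part of any official API client):

```python
import json
import urllib.parse
import urllib.request

# Hypothetical helper: rebuild the GitHub search URL that the page's
# minified Javascript constructs. The default dates are the 2018 event
# window taken from the script quoted above.
def build_search_url(author,
                     start='2018-09-30T10:00:00+00:00',
                     end='2018-11-01T12:00:00+00:00'):
    query = '-label:invalid+created:{}..{}+type:pr+is:public+author:{}'.format(
        urllib.parse.quote(start, safe=''),
        urllib.parse.quote(end, safe=''),
        author)
    return 'https://api.github.com/search/issues?q=' + query + '&per_page=300'

# Fetch the JSON and read total_count, the number of matching pull requests.
def pr_count(author):
    req = urllib.request.Request(build_search_url(author),
                                 headers={'User-Agent': 'Mozilla/5.0'})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)['total_count']

print(build_search_url('BAJUKA'))
```

Calling `pr_count('BAJUKA')` should then return the same number that the page's Javascript writes into the span, with no HTML scraping involved.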