Text missing after scraping a website using BeautifulSoup

I'm writing a Python script to get the number of pull requests generated by a particular user during the ongoing Hacktoberfest event. Here's a link to the official website of Hacktoberfest.
Here's my code:

import urllib.request
from bs4 import BeautifulSoup

url = 'https://hacktoberfest.digitalocean.com/stats/user'
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urllib.request.urlopen(req).read()
soup = BeautifulSoup(html, 'html.parser')
name_box = soup.find('div', attrs={'class': 'userstats--progress'})
print(name_box)

Where 'user' in the URL should be replaced by the user's GitHub handle (e.g. BAJUKA).
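
For instance, the substitution described above would look like this (a minimal sketch; BAJUKA is just the example handle):

user = 'BAJUKA'  # any GitHub handle
url = 'https://hacktoberfest.digitalocean.com/stats/' + user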

Below is the HTML tag I'm aiming to scrape:

<div class="userstats--progress">
        <p>
          Progress (<span data-js="userPRCount">5</span>/5)
        </p>
          <div class="ProgressBar ProgressBar--three u-mb--regular ProgressBar--full" data-js="progressBar"></div>
      </div>

This is what I get after I run my code:

<div class="userstats--progress">
<p>
          Progress (<span data-js="userPRCount"></span>/5)
        </p>
<div class="ProgressBar ProgressBar--three u-mb--regular" data-js="progressBar"></div>
</div>

The difference is on the third line, where the number of pull requests is missing (i.e. in the span tag the 5 is missing).
These are the questions that I want to ask:
1. Why is the number of pull requests (i.e. 5 in this case) missing from the scraped lines?
2. How can I solve this issue? That is, how do I get the number of pull requests successfully?

The data you're looking for is not in the original HTML that the Hacktoberfest server sends, which is what urllib downloads and Beautiful Soup parses; it's inserted into the page by the Javascript code that runs in your browser after that original data is loaded.

If you use this shell command to download the data that's actually served as the page, you'll see that the span tag you're looking at starts off empty:

curl -s 'https://hacktoberfest.digitalocean.com/stats/BAJUKA' | grep -3 Progress

What's the Javascript that fills that tag? Well, it's minified, so it's very hard to unpick what's going on. You can find it included at the very bottom of the original data, here:

curl -s 'https://hacktoberfest.digitalocean.com/stats/BAJUKA' | grep -3 "script src=" | tail -n5

which, when I run it, outputs this:

 <script src="https://go.digitalocean.com/js/forms2/js/forms2.min.js"></script> <script src="/assets/application-134859a20456d7d32be9ea1bc32779e87cad0963355b5372df99a0cff784b7f0.js"></script> 

That crazy-looking source URL is a minified piece of Javascript, which means that it's been automatically shrunk, which also means that it's almost unreadable. But if you go to that page and scroll right down to the bottom, you can see some garbled Javascript which you can try to decode.

I noticed this bit:

var d="2018-09-30T10%3A00%3A00%2B00%3A00",f="2018-11-01T12%3A00%3A00%2B00%3A00";$.getJSON("https://api.github.com/search/issues?q=-label:invalid+created:"+d+".."+f+"+type:pr+is:public+author:"+t+"&per_page=300"

Which I think is where it gets the data to fill that div. If you load up and parse that URL, I think you'll find the data you need. You'll need to fill in the dates for that search, and the author. Good luck!
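
For example, here is a minimal sketch of querying that API directly, using the same standard-library tools as the script above. The URL-encoded dates and query parameters are the ones recovered from the minified Javascript (so they cover the 2018 event window), and BAJUKA is just the example handle from the question; adjust both for the user and year you care about:

import json
import urllib.request

author = 'BAJUKA'  # the GitHub handle to look up
# URL-encoded timestamps copied from the minified script (2018 event window)
start = '2018-09-30T10%3A00%3A00%2B00%3A00'
end = '2018-11-01T12%3A00%3A00%2B00%3A00'

# Same search query the page's Javascript builds: public PRs by the author,
# created inside the event window, excluding anything labelled "invalid"
query = ('https://api.github.com/search/issues?q=-label:invalid'
         '+created:' + start + '..' + end +
         '+type:pr+is:public+author:' + author + '&per_page=300')

with urllib.request.urlopen(query) as resp:
    data = json.load(resp)

# 'total_count' in the search response is the number of matching pull requests
print(data['total_count'])

The search response also contains an items list with one entry per matching pull request, in case you need more than the count.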
