
Issue with web-scraping a public GitHub repo

I am trying to scrape a public GitHub repo (https://github.com/stlrda/redb_python/tree/master/python/DAGs) to grab the name and datetime of each file. The code below works, but not every time. Sometimes I get an index out of range error on the line DAGs[counter]['age'] = x.find('.no-wrap')[0].attrs['datetime']. I'm confused about why this code sometimes works and other times fails to find the datetime. Any ideas on how I can fix this so that it finds the datetime on every run?

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://github.com/stlrda/redb_python/tree/master/python/DAGs')

div = r.html.find('tbody', first=True)
title = div.find('.content')

DAGs = []

#Grab the names of each DAG in the repo
for x in range((len(title))):

    if x == 0:
        continue
    else:
        info = {"name": title[x].text}
        DAGs.append(info)

#Update the dictionary with the age of the DAG
gitTable = div.find('.js-navigation-item')

counter = 0
for x in gitTable:
    DAGs[counter]['age'] = x.find('.no-wrap')[0].attrs['datetime']
#     print (x.find('.no-wrap')[0].attrs['datetime'])
    counter+=1

When the code fails, here is what the gitTable variable contains:

[<Element 'tr' class=('js-navigation-item',)>,
 <Element 'tr' class=('js-navigation-item',)>,
 <Element 'tr' class=('js-navigation-item',)>,
 <Element 'tr' class=('js-navigation-item',)>]

And the HTML of one of the items in the gitTable list is:

>>>gitTable[0].html
'<tr class="js-navigation-item">\n<td class="icon">\n<svg aria-label="file" class="octicon octicon-file" height="16" role="img" version="1.1" viewbox="0 0 12 16" width="12"><path d="M6 5H2V4h4v1zM2 8h7V7H2v1zm0 2h7V9H2v1zm0 2h7v-1H2v1zm10-7.5V14c0 .55-.45 1-1 1H1c-.55 0-1-.45-1-1V2c0-.55.45-1 1-1h7.5L12 4.5zM11 5L8 2H1v12h10V5z" fill-rule="evenodd"/></svg>\n<img alt="" class="spinner" height="16" src="https://github.githubassets.com/images/spinners/octocat-spinner-32.gif" width="16"/>\n</td>\n<td class="content">\n<span class="css-truncate css-truncate-target"><a class="js-navigation-open" href="/stlrda/redb_python/blob/master/python/DAGs/MigratetoPG_DAG.py" id="5554cd417ad3b8097206c9a0e81566d0-7416c3966dc565eb1b0115b89fa72116e4cc3ee6" title="MigratetoPG_DAG.py">MigratetoPG_DAG.py</a></span>\n</td>\n<td class="message">\n<span class="css-truncate css-truncate-target">\n</span>\n</td>\n<td class="age">\n<span class="css-truncate css-truncate-target"/>\n</td>\n</tr>'
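In the row above, the `<td class="age">` cell holds only an empty span, so `x.find('.no-wrap')` returns an empty list and indexing it with `[0]` raises the index error. One way to check a row before indexing is sketched below with only the standard library. The names `DatetimeFinder` and `row_datetime` are hypothetical, and the sketch assumes that a populated row carries a tag with class `no-wrap` and a `datetime` attribute, which is what the original selector implies.

```python
from html.parser import HTMLParser


class DatetimeFinder(HTMLParser):
    """Record the datetime attribute of the first tag whose classes include 'no-wrap'."""

    def __init__(self):
        super().__init__()
        self.datetime = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        # Keep only the first match, mirroring find('.no-wrap')[0].
        if self.datetime is None and "no-wrap" in classes:
            self.datetime = attributes.get("datetime")


def row_datetime(row_html):
    """Return the row's datetime string, or None when the age cell is empty."""
    finder = DatetimeFinder()
    finder.feed(row_html)
    return finder.datetime
```

Called on the failing row, `row_datetime(gitTable[0].html)` would return None instead of raising, so the caller can skip the row or retry the request.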

It looks like I was taking a much harder route by trying to scrape GitHub, and I completely overlooked their API.

The commits and contents endpoints gave me the file name and datetime info that I needed. Below are examples of the endpoints.

I could not find a single endpoint that returns both the file name and the datetime, so if anyone knows of one, please let me know.

Datetime --> https://api.github.com/repos/<github account>/<repo name>/commits?path=<path to folder>

Name --> https://api.github.com/repos/<github account>/<repo name>/contents/<path to folder>
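A minimal sketch of combining the two endpoints, using only the standard library. The helper names (`contents_url`, `commits_url`, `get_json`, `collect_dags`) are hypothetical, and the choices to query the commits endpoint once per file with `per_page=1` and to read the committer date are assumptions about what "datetime" should mean here, not part of the original answer.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://api.github.com"


def contents_url(owner, repo, folder):
    """Build the contents endpoint URL for a folder in a repository."""
    return f"{API_ROOT}/repos/{owner}/{repo}/contents/{urllib.parse.quote(folder)}"


def commits_url(owner, repo, path):
    """Build the commits endpoint URL for one path, asking for the newest commit only."""
    query = urllib.parse.urlencode({"path": path, "per_page": 1})
    return f"{API_ROOT}/repos/{owner}/{repo}/commits?{query}"


def get_json(url):
    """Fetch a URL and decode the JSON body (unauthenticated request)."""
    request = urllib.request.Request(
        url,
        headers={"Accept": "application/vnd.github+json", "User-Agent": "dag-age-sketch"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def collect_dags(owner, repo, folder, fetch=get_json):
    """Return [{'name': ..., 'age': ...}] for each file in the folder.

    'age' is the committer date of the newest commit touching the file,
    or None if the API returns no commits for that path.
    """
    dags = []
    for entry in fetch(contents_url(owner, repo, folder)):
        if entry.get("type") != "file":
            continue  # skip subdirectories and other non-file entries
        commits = fetch(commits_url(owner, repo, entry["path"]))
        age = commits[0]["commit"]["committer"]["date"] if commits else None
        dags.append({"name": entry["name"], "age": age})
    return dags
```

For the repo in the question this would be called as `collect_dags("stlrda", "redb_python", "python/DAGs")`. It makes one request per file plus one for the folder listing, and unauthenticated requests to the GitHub API are rate limited, so a large folder may need an access token.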
