
How to Access Private Github Repo File (.csv) in Python using Pandas or Requests

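I had to switch my public GitHub repository to private, and I can no longer access its files, not even with the access tokens that worked while the repository was public.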

I can access the CSV in my private repo with curl:

    curl -s https://{token}@raw.githubusercontent.com/username/repo/master/file.csv

However, I want to access this data from my Python file. When the repo was public, I could simply use:

    url = 'https://raw.githubusercontent.com/username/repo/master/file.csv'
    df = pd.read_csv(url, error_bad_lines=False)

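This no longer works now that the repo is private, and I cannot find a workaround for downloading this CSV in Python instead of pulling it from the terminal.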

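If I try:

    requests.get('https://{token}@raw.githubusercontent.com/username/repo/master/file.csv')

I get a 404 response, which is basically the same thing that happens with pd.read_csv(). If I click on the raw file in the browser, I see that a temporary token is created and the URL becomes:

    https://raw.githubusercontent.com/username/repo/master/file.csv?token=TEMPTOKEN

Is there a way to attach my permanent private access token so that I can always pull this data from GitHub?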

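Yes, you can download the CSV file in Python rather than pulling it from the terminal. To do that, you can use the GitHub API v3 with the help of the `requests` and `io` modules. A reproducible example follows.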

import numpy as np
import pandas as pd
import requests
from io import StringIO

# Create CSV file
df = pd.DataFrame(np.random.randint(2,size=10_000).reshape(1_000,10))
df.to_csv('filename.csv') 

# -> now upload file to private github repo

# define parameters for a request
token = 'paste-there-your-personal-access-token' 
owner = 'repository-owner-name'
repo = 'repository-name-where-data-is-stored'
path = 'filename.csv'

# send a request
r = requests.get(
    'https://api.github.com/repos/{owner}/{repo}/contents/{path}'.format(
    owner=owner, repo=repo, path=path),
    headers={
        'accept': 'application/vnd.github.v3.raw',
        'authorization': 'token {}'.format(token)
            }
    )

# convert string to StringIO object
string_io_obj = StringIO(r.text)

# Load data to df
df = pd.read_csv(string_io_obj, sep=",", index_col=0)

# optionally write df to CSV
df.to_csv("file_name_02.csv")
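
The request above does not check whether it succeeded, so with a bad token `r.text` would hold GitHub's error message and `pd.read_csv` would try to parse that instead of the CSV. A minimal sketch of the same call with the request details factored out and a status check added. The helper names `contents_request` and `fetch_csv_text` are made up for illustration and are not part of the answer; the optional `ref` argument maps to the contents API's `ref` query parameter for choosing a branch.

```python
from urllib.parse import quote


def contents_request(owner, repo, path, token, ref=None):
    """Build URL, headers and query params for a raw-file download
    through the GitHub contents API (no network access here)."""
    url = "https://api.github.com/repos/{}/{}/contents/{}".format(
        quote(owner), quote(repo), quote(path)  # quote() keeps "/" in the path
    )
    headers = {
        "Accept": "application/vnd.github.v3.raw",  # ask for the file body, not JSON
        "Authorization": "token {}".format(token),
    }
    params = {"ref": ref} if ref else {}  # branch, tag or commit SHA
    return url, headers, params


def fetch_csv_text(owner, repo, path, token, ref=None):
    """Download the file and return its text, raising on HTTP errors."""
    # Third-party package, imported here so contents_request() has no dependencies
    import requests

    url, headers, params = contents_request(owner, repo, path, token, ref)
    response = requests.get(url, headers=headers, params=params, timeout=30)
    response.raise_for_status()  # a 401/404 becomes an exception instead of bad CSV
    return response.text
```

The returned text can then be passed to `pd.read_csv(StringIO(text), index_col=0)` in the same way as `r.text` in the answer.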

This is what ended up working for me. Leaving it here in case anyone runs into the same problem. Thanks for the help!

    import io

    import pandas
    import requests

    user = 'my_github_username'
    pao = 'my_pao'  # your GitHub access token

    # every request made through this session is sent with these credentials
    github_session = requests.Session()
    github_session.auth = (user, pao)

    # providing raw url to download csv from github
    csv_url = 'https://raw.githubusercontent.com/user/repo/master/csv_name.csv'

    download = github_session.get(csv_url).content
    # on_bad_lines='skip' replaces error_bad_lines=False, which pandas 2.0 removed
    downloaded_csv = pandas.read_csv(io.StringIO(download.decode('utf-8')), on_bad_lines='skip')
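
Setting `github_session.auth = (user, pao)` makes `requests` use HTTP Basic authentication: the two values are joined with a colon, Base64-encoded, and sent in an `Authorization: Basic ...` header on every request made through the session. A small standard-library sketch of that encoding, for illustration only. The name `basic_auth_header` is made up here, `requests` builds this header for you, and the sketch assumes plain ASCII credentials.

```python
import base64


def basic_auth_header(user, token):
    """Return the Authorization header value that HTTP Basic auth produces."""
    credentials = "{}:{}".format(user, token).encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


# Placeholder credentials, not real ones
header = basic_auth_header("my_github_username", "my_pao")
print(header)
```

Base64 is an encoding, not encryption, so the token is protected only by the HTTPS connection.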

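Have you looked at pygithub? It is really useful for accessing repositories, files, pull requests, history and so on. See the PyGithub documentation for details. Here is an example script that creates a new branch off the base branch (you will need that access token, or generate a new one), deletes a file on it, and opens a pull request: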

from github import Github
my_reviewers = ['usernames', 'of_reviewers']
gh = Github("<token string>")
repo_name = '<my_org>/<my_repo>'
repo = gh.get_repo(repo_name)
default_branch_name = repo.default_branch
base = repo.get_branch(default_branch_name)
new_branch_name = "my_new_branchname"
new_branch = repo.create_git_ref(ref=f'refs/heads/{new_branch_name}',sha=base.commit.sha)
contents = repo.get_contents("some_script_in_repo.sh", ref=new_branch_name)
repo.delete_file(contents.path, "commit message", contents.sha, branch=new_branch_name)
pr = repo.create_pull(
    title="PR to Remove some_script_in_repo.sh",
    body="This is the text in the main body of your pull request",
    head=new_branch_name,
    base=default_branch_name,
)
pr.create_review_request(reviewers=my_reviewers)

Hope this helps, happy coding!
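
The script above works with branches and pull requests rather than reading a file. For the original question, PyGithub's `Repository.get_contents()` returns a `ContentFile` whose `decoded_content` attribute holds the file's bytes. A hedged sketch follows. `read_repo_csv` is a hypothetical helper, and it accepts any object with a compatible `get_contents` method so it can be tried without a network connection. It parses with the standard `csv` module; `pd.read_csv(io.BytesIO(data))` would be the pandas equivalent.

```python
import csv
import io


def read_repo_csv(repo, path, ref=None):
    """Read a CSV file from a PyGithub Repository-like object into a list of dicts."""
    # Only pass ref when given, so the repository's default branch is used otherwise
    kwargs = {"ref": ref} if ref else {}
    content_file = repo.get_contents(path, **kwargs)
    data = content_file.decoded_content  # raw bytes of the file
    return list(csv.DictReader(io.StringIO(data.decode("utf-8"))))


# Usage with a real repository (requires PyGithub and a token; not run here):
#   from github import Github
#   repo = Github("<token string>").get_repo("<my_org>/<my_repo>")
#   rows = read_repo_csv(repo, "file.csv")
```

This assumes `path` points at a single file; for a directory, `get_contents` returns a list instead.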

This approach works really well for me:

    import os

    import requests


    def _github(url: str, mode: str = "private"):
        # turn a github.com web URL into its raw.githubusercontent.com form
        url = url.replace("/blob/", "/")
        url = url.replace("/raw/", "/")
        url = url.replace("github.com/", "raw.githubusercontent.com/")

        if mode == "public":
            return requests.get(url)
        else:
            # read the token from the environment instead of hard-coding it
            token = os.getenv('GITHUB_TOKEN', '...')
            headers = {
                'Authorization': f'token {token}',
                'Accept': 'application/vnd.github.v3.raw'}
            return requests.get(url, headers=headers)
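
The three `replace` calls turn a browser URL such as `https://github.com/<user>/<repo>/blob/<branch>/<file>` into its `raw.githubusercontent.com` form. Because `str.replace` rewrites every match, a file that lives in a directory literally named `raw` or `blob` would lose that directory as well. A sketch of a stricter variant that only drops the segment directly after `<user>/<repo>`; `to_raw_url` is an illustrative name, not part of the answer.

```python
from urllib.parse import urlsplit, urlunsplit


def to_raw_url(url):
    """Convert a github.com blob/raw URL to a raw.githubusercontent.com URL."""
    parts = urlsplit(url)
    if parts.netloc != "github.com":
        return url  # already raw, or not a GitHub web URL: leave untouched
    segments = parts.path.lstrip("/").split("/")
    # Expected layout: <user>/<repo>/<blob|raw>/<branch>/<path...>
    if len(segments) > 3 and segments[2] in ("blob", "raw"):
        del segments[2]
    new_path = "/" + "/".join(segments)
    return urlunsplit(("https", "raw.githubusercontent.com", new_path, parts.query, ""))


print(to_raw_url("https://github.com/username/repo/blob/master/raw/file.csv"))
# https://raw.githubusercontent.com/username/repo/master/raw/file.csv
```

The result can be passed to `requests.get` with the same headers as in `_github`.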

Adding another working example:

import requests
from requests.structures import CaseInsensitiveDict

# Variables
GITHUB_TOKEN = "my-personal-access-token"  # placeholder: your personal access token
GH_PREFIX = "https://raw.githubusercontent.com"
ORG = "my-user-name"
REPO = "my-repo-name"
BRANCH = "main"
FOLDER = "some-folder"
FILE = "some-file.csv"
URL = GH_PREFIX + "/" + ORG + "/" + REPO + "/" + BRANCH + "/" + FOLDER + "/" + FILE

# Headers setup
headers = CaseInsensitiveDict()
headers["Authorization"] = "token " + GITHUB_TOKEN

# Execute and view status
resp = requests.get(URL, headers=headers)
if resp.status_code == 200:
    print(resp.content)
else:
    print("Request failed!")
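
A bare `Request failed!` hides the reason. GitHub commonly answers requests for private content with 404 rather than 403 when the credentials are missing or lack access, so that the existence of a private repository is not revealed, which is consistent with the 404 reported in the question. A small sketch that turns common status codes into hints; the function name and the wording of the hints are illustrative, and the list is not exhaustive.

```python
def describe_status(status_code):
    """Return a human-readable hint for an HTTP status code from GitHub."""
    hints = {
        200: "OK: the file was returned",
        401: "Unauthorized: the token is missing, malformed, or expired",
        403: "Forbidden: the token lacks permission, or a rate limit was hit",
        404: "Not found: wrong URL or branch, or no access to a private repo",
    }
    return hints.get(status_code, "Unexpected status {}".format(status_code))


print(describe_status(404))
```

In the example above, the failure branch could print `describe_status(resp.status_code)` instead of a fixed message.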

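Apparently, raw githubusercontent links nowadays also work with just the token, but with Python's requests they need a username:token combination, which used to be the norm before GitHub changed it so that a token alone is enough.

So:

https://{token}@raw.githubusercontent.com/username/repo/master/file.csv

becomes

https://{username}:{token}@raw.githubusercontent.com/username/repo/master/file.csv

Sample code for the above: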

from requests import get as rget

res = rget("https://<username>:<token>@raw.githubusercontent.com/<username>/<repo>/<branch>/file.csv")
with open('file.csv', 'wb+') as f:
    f.write(res.content)
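
The `username:token@` prefix is URL userinfo, which `requests` converts into the same Basic authentication header that an `auth=(username, token)` argument would produce. If the credentials are embedded in the URL, any reserved characters in them should be percent-encoded. A sketch with placeholder values throughout; `url_with_credentials` is a made-up helper, not a library function.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def url_with_credentials(url, username, token):
    """Insert percent-encoded username:token userinfo into an http(s) URL."""
    parts = urlsplit(url)
    userinfo = "{}:{}".format(quote(username, safe=""), quote(token, safe=""))
    netloc = "{}@{}".format(userinfo, parts.netloc)
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


raw_url = "https://raw.githubusercontent.com/username/repo/master/file.csv"
print(url_with_credentials(raw_url, "username", "my/token"))
# https://username:my%2Ftoken@raw.githubusercontent.com/username/repo/master/file.csv
```

Passing `auth=(username, token)` to `requests.get` keeps the token out of the URL altogether, where it could otherwise end up in logs or shell history.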
