
How to Access Private Github Repo File (.csv) in Python using Pandas or Requests

I had to switch my public GitHub repository to private, and now I cannot access its files from Python, even with an access token, the way I could while the repo was public.

I can access my private repo's CSV with curl:

    curl -s https://{token}@raw.githubusercontent.com/username/repo/master/file.csv

However, I want to access this information in my Python file. When the repo was public I could simply use:

    url = 'https://raw.githubusercontent.com/username/repo/master/file.csv'
    df = pd.read_csv(url, error_bad_lines=False)

This no longer works now that the repo is private, and I cannot find a workaround to download this CSV in Python instead of pulling it from the terminal.

If I try:

    requests.get('https://{token}@raw.githubusercontent.com/username/repo/master/file.csv')

I get a 404 response, which is basically what happens with pd.read_csv() as well. If I click on the raw file in the browser, I see that a temporary token is created and the URL becomes:

    https://raw.githubusercontent.com/username/repo/master/file.csv?token=TEMPTOKEN

Is there a way to attach my permanent personal access token so that I can always pull this data from GitHub?

Yes, you can download the CSV file in Python instead of pulling it from the terminal. To do that, call the GitHub API v3 with the 'requests' module and wrap the response text with 'io.StringIO' so pandas can read it. Reproducible example below.

import numpy as np
import pandas as pd
import requests
from io import StringIO

# Create CSV file
df = pd.DataFrame(np.random.randint(2,size=10_000).reshape(1_000,10))
df.to_csv('filename.csv') 

# -> now upload file to private github repo

# define parameters for a request
token = 'paste-there-your-personal-access-token' 
owner = 'repository-owner-name'
repo = 'repository-name-where-data-is-stored'
path = 'filename.csv'

# send a request
r = requests.get(
    'https://api.github.com/repos/{owner}/{repo}/contents/{path}'.format(
        owner=owner, repo=repo, path=path),
    headers={
        'accept': 'application/vnd.github.v3.raw',
        'authorization': 'token {}'.format(token)
    }
)
r.raise_for_status()  # stop on 401/404 instead of parsing an error body as CSV

# convert string to StringIO object
string_io_obj = StringIO(r.text)

# Load data to df
df = pd.read_csv(string_io_obj, sep=",", index_col=0)

# optionally write df to CSV
df.to_csv("file_name_02.csv")
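
If the raw media type is not requested, the contents endpoint replies with JSON in which the file body is base64-encoded. The sketch below uses only the standard library to show both variants: it builds the request without sending it and decodes a JSON body. The helper names `build_contents_request` and `decode_contents_payload` are made up for illustration and are not part of the answer above.

```python
import base64
import json
import urllib.request


def build_contents_request(owner, repo, path, token, raw=True):
    """Build (but do not send) a GET request for the repository contents endpoint."""
    url = f"https://api.github.com/repos/{owner}/{repo}/contents/{path}"
    # The raw media type returns the file itself; the JSON type returns metadata
    # plus a base64-encoded "content" field.
    accept = "application/vnd.github.v3.raw" if raw else "application/vnd.github.v3+json"
    return urllib.request.Request(
        url,
        headers={"Accept": accept, "Authorization": f"token {token}"},
    )


def decode_contents_payload(body):
    """Decode the JSON body returned when the raw media type was not requested."""
    payload = json.loads(body)
    if payload.get("encoding") != "base64":
        raise ValueError("unexpected encoding: {!r}".format(payload.get("encoding")))
    # b64decode ignores the line breaks embedded in the encoded content.
    return base64.b64decode(payload["content"]).decode("utf-8")
```

Sending the request with `urllib.request.urlopen(req)` and passing the decoded text through `StringIO` to `pd.read_csv` would mirror the flow of the example above.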

This is what ended up working for me - leaving it here if anyone runs into the same issue. Thanks for the help!

    import io

    import pandas
    import requests

    user = 'my_github_username'
    pao = 'my_pao'  # your access token

    github_session = requests.Session()
    github_session.auth = (user, pao)

    # providing raw url to download csv from github
    csv_url = 'https://raw.githubusercontent.com/user/repo/master/csv_name.csv'

    download = github_session.get(csv_url).content
    # on_bad_lines='skip' replaces error_bad_lines=False in pandas >= 1.3
    downloaded_csv = pandas.read_csv(io.StringIO(download.decode('utf-8')), on_bad_lines='skip')

Have you looked at PyGithub? It is very useful for accessing repos, files, pull requests, history, etc. Here's an example script that creates a new branch off a base branch, removes a file on that branch, and opens a pull request (you'll need that access token, or generate a new one):

from github import Github
my_reviewers = ['usernames', 'of_reviewers']
gh = Github("<token string>")
repo_name = '<my_org>/<my_repo>'
repo = gh.get_repo(repo_name)
default_branch_name = repo.default_branch
base = repo.get_branch(default_branch_name)
new_branch_name = "my_new_branchname"
new_branch = repo.create_git_ref(ref=f'refs/heads/{new_branch_name}',sha=base.commit.sha)
contents = repo.get_contents("some_script_in_repo.sh", ref=new_branch_name)
repo.delete_file(contents.path, "commit message", contents.sha, branch=new_branch_name)
pr = repo.create_pull(
    title="PR to Remove some_script_in_repo.sh",
    body="This is the text in the main body of your pull request",
    head=new_branch_name,
    base=default_branch_name,
)
pr.create_review_request(reviewers=my_reviewers)
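
For the CSV case in the question, the same `get_contents` call can be used to read a file rather than delete it. This is a minimal sketch, assuming `repo` behaves like the PyGithub repository object created above; the function name `read_csv_from_repo` is hypothetical, and the standard `csv` module stands in for pandas so the snippet has no extra dependencies.

```python
import csv
import io


def read_csv_from_repo(repo, path, ref=None):
    """Return the rows of a CSV file stored in a repository.

    `repo` is expected to behave like a PyGithub Repository: get_contents(path)
    must return an object whose `decoded_content` attribute holds the file bytes.
    """
    if ref is None:
        contents = repo.get_contents(path)
    else:
        contents = repo.get_contents(path, ref=ref)
    text = contents.decoded_content.decode("utf-8")
    return list(csv.reader(io.StringIO(text)))


# Hypothetical usage (needs PyGithub installed and a valid token):
#   from github import Github
#   repo = Github("<token string>").get_repo("<my_org>/<my_repo>")
#   rows = read_csv_from_repo(repo, "file.csv")
```

With pandas, passing `io.BytesIO(contents.decoded_content)` to `pd.read_csv` should give a DataFrame directly.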

Hope that helps, happy coding!

This approach works well for me:

    import os

    import requests


    def _github(url: str, mode: str = "private"):
        url = url.replace("/blob/", "/")
        url = url.replace("/raw/", "/")
        url = url.replace("github.com/", "raw.githubusercontent.com/")

        if mode == "public":
            return requests.get(url)
        else:
            token = os.getenv('GITHUB_TOKEN', '...')
            headers = {
                'Authorization': f'token {token}',
                'Accept': 'application/vnd.github.v3.raw'}
            return requests.get(url, headers=headers)
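
The URL rewriting in that function can be checked in isolation, without any network access. The sketch below repeats the same three replacements in a standalone helper; the name `to_raw_url` and the sample URL are illustrative only.

```python
def to_raw_url(url):
    """Rewrite a github.com file URL into its raw.githubusercontent.com form."""
    url = url.replace("/blob/", "/")
    url = url.replace("/raw/", "/")
    return url.replace("github.com/", "raw.githubusercontent.com/")


# A browser-style "blob" link and the raw link it maps to.
print(to_raw_url("https://github.com/username/repo/blob/master/file.csv"))
# prints: https://raw.githubusercontent.com/username/repo/master/file.csv
```

A URL that already points at raw.githubusercontent.com passes through unchanged, since none of the three patterns occur in it.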

Adding another working example:

import requests
from requests.structures import CaseInsensitiveDict

# Variables
GITHUB_TOKEN = "<your-personal-access-token>"  # placeholder, replace with your own token
GH_PREFIX = "https://raw.githubusercontent.com"
ORG = "my-user-name"
REPO = "my-repo-name"
BRANCH = "main"
FOLDER = "some-folder"
FILE = "some-file.csv"
URL = GH_PREFIX + "/" + ORG + "/" + REPO + "/" + BRANCH + "/" + FOLDER + "/" + FILE

# Headers setup
headers = CaseInsensitiveDict()
headers["Authorization"] = "token " + GITHUB_TOKEN

# Execute and view status
resp = requests.get(URL, headers=headers)
if resp.status_code == 200:
   print(resp.content)
else:
   print("Request failed!")

Apparently, raw.githubusercontent.com links nowadays work with just a token, but with Python's requests they need a username:token combination, which used to be the norm before GitHub changed things so that a token alone is sufficient.

So:

https://{token}@raw.githubusercontent.com/username/repo/master/file.csv

becomes

https://{username}:{token}@raw.githubusercontent.com/username/repo/master/file.csv

Sample code for the above:

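from requests import get as rget

# URL layout follows the pattern above: <username>/<repo>/<branch>/file.csv
res = rget("https://<username>:<token>@raw.githubusercontent.com/<username>/<repo>/<branch>/file.csv")
res.raise_for_status()  # do not write an error page to disk
with open('file.csv', 'wb') as f:
    f.write(res.content)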
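
Credentials embedded in a URL are normally sent as an HTTP Basic `Authorization` header. The sketch below reproduces that encoding with the standard library so the two URL forms can be compared offline. The helper name `basic_auth_header_from_url` is hypothetical, and exactly how a given client handles a URL that carries only a token is an assumption worth verifying against that client.

```python
import base64
from urllib.parse import unquote, urlsplit


def basic_auth_header_from_url(url):
    """Return the Basic Authorization value implied by a URL's userinfo, or None."""
    parts = urlsplit(url)
    if parts.username is None:
        return None
    user = unquote(parts.username)
    password = unquote(parts.password or "")
    credentials = f"{user}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


# username:token form versus token-only form (placeholder credentials)
with_user = basic_auth_header_from_url(
    "https://username:token@raw.githubusercontent.com/username/repo/master/file.csv")
token_only = basic_auth_header_from_url(
    "https://token@raw.githubusercontent.com/username/repo/master/file.csv")
print(with_user)
print(token_only)
```

In the token-only form the token ends up in the username position with an empty password, which may be why the two forms are treated differently.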


 