Downloading text data from an HTML link in Python

Question

Hi, I want to download delimited text that is hosted at an HTML link. (The link is accessible only on a private network, so I can't share it here.)

In R, the following function does the job (all other functions gave an "Unauthorized access" or "401" error):

url = 'https://dw-results.ansms.com/dw-platform/servlet/results? job_id=13802737&encoding=UTF8&mimeType=plain'
download.file(url, "~/insights_dashboard/testing_file.tsv")
a = read.csv("~/insights_dashboard/testing_file.tsv",header = T,stringsAsFactors = F,sep='\t')

I want to do the same thing in Python, for which I tried:

(A) urllib and requests.get()

import urllib.request

url_get = requests.get(url, verify=False)
urllib.request.urlretrieve(url_get, 'C:\\Users\\cssaxena\\Desktop\\24.tsv')

(B) requests.get() and pd.read_html()

url='https://dw-results.ansms.com/dw-platform/servlet/results? job_id=13802737&encoding=UTF8&mimeType=plain'
s = requests.get(url, verify=False)
a = pd.read_html(io.StringIO(s.decode('utf-8')))

(C) Using wget:

import wget
url = 'https://dw-results.ansms.com/dw-platform/servlet/results? job_id=13802737&encoding=UTF8&mimeType=plain'  
wget.download(url,--auth-no-challenge, 'C:\\Users\\cssaxena\\Desktop\\24.tsv')

OR

wget --server-response -owget.log "https://dw-results.ansms.com/dw-platform/servlet/results? job_id=13802737&encoding=UTF8&mimeType=plain"

NOTE: The URL doesn't ask for any credentials. It is accessible from a browser, and R can download it with download.file. I am looking for a solution in Python.

Answer 1

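def geturls(path):
    """Return the set of href targets found in the HTML file at `path`."""
    # Read the page as text (UTF-8 assumed; undecodable bytes are replaced).
    with open(path, encoding='utf-8', errors='replace') as f:
        html = f.read()

    out = set()
    # Split on '<a' and look for the first href="..." in each chunk.
    for chunk in html.split('<a'):
        start = chunk.find('href="')
        if start == -1:
            continue
        link = chunk[start + 6:]
        end = link.find('"')
        if end > -1:
            link = link[:end]
        # Drop surrounding spaces, dots and slashes (e.g. a leading "./").
        link = link.strip(' ./')
        # Skip very short targets and anything containing ';'.
        if len(link) > 2 and ';' not in link:
            out.add(link)
    return out


pg = "./test.html"  # path to your saved HTML page

urls = geturls(pg)

print(urls)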
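
The function above only collects link targets from a saved HTML page; it does not fetch the delimited text itself. For the download-and-parse step that the R snippet in the question performs, a minimal standard-library sketch might look like the following. It assumes that a plain GET with a browser-like User-Agent is enough, which has not been verified against the private server that returned 401 to other clients. The names fetch_to_file and read_tsv and the header value are illustrative, not from the original post.

```python
import csv
import shutil
import urllib.request


def fetch_to_file(url, dest, timeout=30):
    """Stream the response body for `url` into the local file `dest`."""
    # Assumption: some servers reject Python's default User-Agent, so a
    # browser-like one is sent. Whether this avoids the 401 is untested.
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=timeout) as resp, \
            open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)
    return dest


def read_tsv(path, encoding="utf-8"):
    """Parse a tab-separated file with a header row into a list of dicts."""
    with open(path, newline="", encoding=encoding) as fh:
        return list(csv.DictReader(fh, delimiter="\t"))
```

Usage would be along the lines of rows = read_tsv(fetch_to_file(url, "24.tsv")). If pandas is available, pd.read_csv(path, sep="\t") is the closer counterpart to R's read.csv call. If the server still answers 401, it is probably expecting something the browser sends (for example a cookie or a proxy setting), and that would need to be reproduced in the request headers.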


Solution 1
0 2018-08-27 15:40:04
