How to remove the start of a link and add a forward slash in Python

I have some web links that I scraped from a website. The problem is that the links are not quite correct: they won't download the data unless I make two changes:

1) Remove the VM300:1 at the start

2) Add a / after the .au

Is there a way to do this automatically? I have about a thousand links, so doing this by hand is not practical.

Below is an example of my URLs:

urls = [
    "VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0011/172775/Market_Information_System_Control_daily_trading_day_190130.xlsx",
    "VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0004/172732/Market_Information_System_Control_daily_trading_day_190129.xlsx",
    "VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0010/172675/Market_Information_System_Control_daily_trading_day_190128.xlsx",
    "VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0009/172674/Market_Information_System_Control_daily_trading_day_190127.xlsx",
    "VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0008/172673/Market_Information_System_Control_daily_trading_day_190126.xlsx",
    "VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0007/172672/Market_Information_System_Control_daily_trading_day_190125.xlsx",
    "VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0011/172595/Market_Information_System_Control_daily_trading_day_190124.xlsx"
]

EDIT1

from pathlib import Path

import requests

urls = [
    "VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0011/172775/Market_Information_System_Control_daily_trading_day_190130.xlsx",
    "VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0004/172732/Market_Information_System_Control_daily_trading_day_190129.xlsx",
    "VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0010/172675/Market_Information_System_Control_daily_trading_day_190128.xlsx",
    "VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0009/172674/Market_Information_System_Control_daily_trading_day_190127.xlsx",
    "VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0008/172673/Market_Information_System_Control_daily_trading_day_190126.xlsx",
    "VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0007/172672/Market_Information_System_Control_daily_trading_day_190125.xlsx",
    "VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0011/172595/Market_Information_System_Control_daily_trading_day_190124.xlsx"
]

urls = [x.replace('VM300:1 ','').replace('.au__', '.au/__') for x in urls]


for url in urls:
    r = requests.get(urls)
    with open(Path(urls).name, 'wb') as f:
        f.write(r.content)

ERROR:

Traceback (most recent call last):
  File "C:/Users/george/Desktop/NT/stack NT.py", line 19, in <module>
    r = requests.get(urls)
  File "C:\Python27\lib\site-packages\requests\api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Python27\lib\site-packages\requests\api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 640, in send
    adapter = self.get_adapter(url=request.url)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 731, in get_adapter
    raise InvalidSchema("No connection adapters were found for '%s'" % url)
InvalidSchema: No connection adapters were found for '['https://www.powerwater.com.au/__data/assets/excel_doc/0011/172775/Market_Information_System_Control_daily_trading_day_190130.xlsx', 'https://www.powerwater.com.au/__data/assets/excel_doc/0004/172732/Market_Information_System_Control_daily_trading_day_190129.xlsx', 'https://www.powerwater.com.au/__data/assets/excel_doc/0010/172675/Market_Information_System_Control_daily_trading_day_190128.xlsx', 'https://www.powerwater.com.au/__data/assets/excel_doc/0009/172674/Market_Information_System_Control_daily_trading_day_190127.xlsx', 'https://www.powerwater.com.au/__data/assets/excel_doc/0008/172673/Market_Information_System_Control_daily_trading_day_190126.xlsx', 'https://www.powerwater.com.au/__data/assets/excel_doc/0007/172672/Market_Information_System_Control_daily_trading_day_190125.xlsx', 'https://www.powerwater.com.au/__data/assets/excel_doc/0011/172595/Market_Information_System_Control_daily_trading_day_190124.xlsx']'

Thank you

Use a list comprehension with split and replace:

urls = [x.split()[1].replace('.au__', '.au/__') for x in urls]
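As a quick check of the split approach, here is a minimal sketch applied to one of the URLs from the question. It assumes every entry is a single prefix token, a space, and then the URL; the helper name `clean_url` is only for illustration.

```python
def clean_url(raw):
    """Drop the leading token (e.g. 'VM300:1') and insert the missing slash."""
    # split() breaks on whitespace; index 1 is the URL itself
    url = raw.split()[1]
    # add the slash between the host and the '__data' path
    return url.replace('.au__', '.au/__')


raw = ("VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0011/172775/"
       "Market_Information_System_Control_daily_trading_day_190130.xlsx")
print(clean_url(raw))
```

Because split() ignores the actual prefix text, this also works if the prefix is something other than VM300:1, but it would raise an IndexError on an entry that has no prefix at all.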

Another idea, with a double replace:

urls = [x.replace('VM300:1 ','').replace('.au__', '.au/__') for x in urls]
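Regarding the InvalidSchema error in EDIT1: the traceback shows that requests.get received the whole list, because the loop body uses `urls` instead of the loop variable `url`. A possible corrected version is sketched below, assuming Python 3 and that the third-party requests package is installed; the function names `clean`, `target_name` and `download_all` are made up for this example.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def clean(raw_urls):
    """Strip the 'VM300:1 ' prefix and add the slash after '.au'."""
    return [x.replace('VM300:1 ', '').replace('.au__', '.au/__') for x in raw_urls]


def target_name(url):
    """Use the last component of the URL path as the local file name."""
    return PurePosixPath(urlsplit(url).path).name


def download_all(urls):
    """Fetch each URL in turn and save it in the current directory."""
    # imported here so the helpers above work without requests installed
    import requests

    for url in urls:
        # pass the single url, not the whole list
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # stop on HTTP errors instead of saving an error page
        with open(target_name(url), 'wb') as f:
            f.write(r.content)
```

With the list from the question, `download_all(clean(urls))` would then request each file separately. The timeout value is an arbitrary choice.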
