
Python 3 clean and normalize URL list

I have a list of URLs in a text file, and I need to run a function in Python 3 so that every URL matches the format https://www.google.com/images/

An example of the list:

http://www.google.com/images/<text>
https://ca.google.com/images/<text>
https://www.google.com/images/<text>
http://uk.google.com/images/<text>
https://www.google.com/images/<text>

I need to write a script that reads through the file and cleans each URL. For example, http://www.google.com/images/ should change to https://www.google.com/images/, and a country code should be replaced with www as well. So http://ca.google.com should become https://www.google.com

What tools should I use to detect incorrect URLs so that I can locate them, fix them, and save the result to the file?

Any help will be appreciated, thank you!

Current code:

urls = open("urls.txt", "r", encoding='utf-8')
urls = [item.replace('http://', 'https://') for item in urls]
for item in urls:
    if not 'www' in item:
        old_item = item
        v = str(item[8:10])
        new_item = item.replace(v, 'www')
        urls.append(new_item)
        urls.remove(old_item)
print(urls)

Since strings are immutable in Python, we can't change characters in place and have to build new strings instead, hence the slight complication. First we replace http:// with https://. Then we check whether www is present in the link. If not, we replace the country code (two letters) with www.

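list1 = ['http://www.google.com/images', 'https://ca.google.com/images', 'https://www.google.com/images',
         'http://uk.google.com/images', 'https://www.google.com/images']
# Switch every URL to https (only the leading scheme is replaced).
list1 = [item.replace('http://', 'https://', 1) for item in list1]
cleaned = []
for item in list1:
    if 'www' not in item:
        # 'https://' is 8 characters long, so item[8:10] is the two-letter
        # country code; slicing swaps only those two characters for 'www'.
        item = item[:8] + 'www' + item[10:]
    # Build a new list instead of appending to and removing from list1
    # while iterating over it, which can skip items and reorders the list.
    cleaned.append(item)
list1 = cleaned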

print(list1)

Output: ['https://www.google.com/images', 'https://www.google.com/images', 'https://www.google.com/images', 'https://www.google.com/images', 'https://www.google.com/images']
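
The slicing approach above relies on every country code being exactly two letters and on www not appearing anywhere else in the URL. A more defensive variant, which also covers detecting incorrect URLs and saving the result, could be sketched with the standard library's urllib.parse. This is only a sketch under assumptions: the names normalize_url, clean_file, TARGET_HOST and BASE_DOMAIN are made up for illustration, and "incorrect" is taken to mean "not an http(s) URL on google.com or one of its subdomains".

```python
from urllib.parse import urlsplit, urlunsplit

TARGET_HOST = "www.google.com"  # assumed canonical host
BASE_DOMAIN = "google.com"      # assumed domain that every valid URL must belong to


def normalize_url(url):
    """Return url rewritten as https://www.google.com/..., or None if it looks incorrect."""
    try:
        parts = urlsplit(url.strip())
    except ValueError:
        # urlsplit raises ValueError for some malformed inputs (e.g. a broken IPv6 host).
        return None
    host = (parts.hostname or "").lower()
    if parts.scheme not in ("http", "https"):
        return None
    if host != BASE_DOMAIN and not host.endswith("." + BASE_DOMAIN):
        return None
    # Keep path, query and fragment; replace only the scheme and the host.
    return urlunsplit(("https", TARGET_HOST, parts.path, parts.query, parts.fragment))


def clean_file(src, dst):
    """Write normalized URLs from src to dst and return the lines that were rejected."""
    rejected = []
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            if not line.strip():
                continue  # skip blank lines
            fixed = normalize_url(line)
            if fixed is None:
                rejected.append(line.strip())
            else:
                fout.write(fixed + "\n")
    return rejected
```

Calling rejected = clean_file("urls.txt", "urls_clean.txt") would then leave the cleaned URLs in urls_clean.txt (a hypothetical output name) and hand back the lines that need manual attention. Writing to a separate file rather than overwriting urls.txt keeps the original list intact until the result has been checked.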


 