
Cleaning URLs and saving them to txt file Python3

I am trying to clean and normalize URLs in a text file.

Here is my current code:

import re

with open("urls.txt", encoding='utf-8') as f:
    content = f.readlines()
content = [x.strip() for x in content]

url_format = "https://www.google"
for item in content:
    if not item.startswith(url_format):
        old_item = item
        new_item = re.sub(r'.*google', url_format, item)
        content.append(new_item)
        content.remove(old_item)

with open('result.txt', mode='wt', encoding='utf-8') as myfile:
    myfile.write('\n'.join(content))

The issue is that if I print the old and new items inside the loop, it shows me that each URL has been cleaned. But when I print my list of URLs outside of the loop, the URLs are still not cleaned; some of them get removed and some do not.

Why are the bad URLs still in the list when I remove them in my for loop and add the cleaned URLs? Perhaps this should be resolved in a different way?

Also, I have noticed that with a big set of URLs the code takes a long time to run; perhaps I should use different tools?

Any help will be appreciated.

That is because you are removing items from the list while iterating over it, which is a bad thing to do. You could instead create another list and append the new values to it, modify the list in place using indexing, or simply use a list comprehension for this task:

content = [item if item.startswith(url_format) else re.sub(r'.*google', url_format, item) for item in content]
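To see why mutating the list during iteration misbehaves, here is a minimal demonstration with hypothetical placeholder strings: when an item is removed, everything after it shifts left by one, so the iterator's next step silently skips the element that moved into the freed slot.

```python
items = ["a", "bad1", "bad2", "b"]

for item in items:
    if item.startswith("bad"):
        # Removing shrinks the list, so the element that slides into
        # this position is never visited by the iterator.
        items.remove(item)

print(items)  # "bad2" survives because it was skipped
```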

Or, using another list:

new_content = []

for item in content:
    if item.startswith(url_format):
        new_content.append(item)
    else:
        new_content.append(re.sub(r'.*google', url_format, item))

Or, modifying the list in-place, using indexing:

for i, item in enumerate(content):
    if not item.startswith(url_format):
        content[i] = re.sub(r'.*google', url_format, item)
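As for the speed concern with a large set of URLs: the original code was quadratic because `list.remove` scans the list on every iteration, and it also recompiled the regex on each `re.sub` call. A single pass with a precompiled pattern is linear. A minimal sketch putting the pieces together (the sample file contents here are hypothetical, just to make the snippet runnable):

```python
import re

# Hypothetical sample data so the sketch is self-contained.
with open("urls.txt", "w", encoding="utf-8") as f:
    f.write("http://old.google/a\nhttps://www.google/b\n")

url_format = "https://www.google"
pattern = re.compile(r'.*google')  # compile the regex once, reuse per line

with open("urls.txt", encoding="utf-8") as f:
    content = [line.strip() for line in f]

# One linear pass: rewrite only the lines that need it.
content = [item if item.startswith(url_format)
           else pattern.sub(url_format, item)
           for item in content]

with open("result.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(content))
```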
