
How to properly replace the contents of a text file

I am trying to make an offline copy of this website: ieeghn. Part of this task is to download all the CSS/JS files it refers to, using Beautiful Soup, and to rewrite every external link so that it points to the newly downloaded resource.

At the moment I simply use the string replace method. But I don't think this is effective, as I do it inside a loop. Snippet below:

local_content = ''  # in the real script this holds the page's HTML source
for res in soup.find_all('link', {'rel': 'stylesheet'}):
    # skip embedded resources (data: URIs)
    if not str(res['href']).startswith('data:'):
        original_res = res['href']
        res['href'] = some_function_to_download_css()
        local_content = local_content.replace(original_res, res['href'])

I only download resources that are not embedded, i.e. whose href does not start with data:. But the problem is that local_content = local_content.replace(original_res, res['href']) only seems to turn one external resource into a local one. The rest still refer to the online version of the resource.

I am guessing that because local_content is a very long string (have a look at the ieeghn source), this didn't work out well.

How do you properly replace the content of a string for a given pattern? Or do I have to store it in a file first and modify it there?

EDITED: I found the problem was in this line of code:

 original_res = res['href']

BSoup somehow sanitizes the href string. In my case, &amp; is changed to &. As I am trying to replace the original href with the path of a newly downloaded local file, str.replace() simply won't find this original value. Either I have to find a way to get the original href, or simply handle this case. I have to say, having the original href would be the best way.
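One way to "simply handle this case" might be to re-escape the parsed href before searching the raw source. The sketch below uses only the standard library's html.escape and does not recover the original href from BeautifulSoup. The sample markup, the local path and the helper name replace_href are made up for illustration, and it assumes the source writes & as &amp; inside attributes.

```python
from html import escape

# Raw page source: the ampersand in the query string is written as an entity.
raw_html = '<link rel="stylesheet" href="style.css?a=1&amp;b=2">'

# What the parser hands back for res['href'] (entity already decoded).
parsed_href = 'style.css?a=1&b=2'

# Hypothetical local path returned by the download step.
local_path = 'css/style.css'


def replace_href(source, href, replacement):
    """Replace href in source, trying the decoded and the re-escaped form."""
    for candidate in (href, escape(href, quote=False)):
        if candidate in source:
            return source.replace(candidate, replacement)
    return source  # nothing matched, leave the source untouched


# A plain str.replace() with the decoded value finds nothing to replace.
unchanged = raw_html.replace(parsed_href, local_path)

# Re-escaping the value first lets the replacement succeed.
result = replace_href(raw_html, parsed_href, local_path)
```

This only covers the &amp; case described above; other differences between the parsed and the raw attribute value (for example numeric character references) would still be missed.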

You're already replacing the content, in a way...

res['href'] = some_function_to_download_css()

...updates the href attribute of the res node in BeautifulSoup's representation of the HTML tree.

To make it more efficient, you could cache the URLs of CSS files you've already downloaded, and consult the cache before downloading a file. Once you're done (and if you're OK with BS's attribute ordering/indentation/etc.), you can get the string representation of the tree with str(soup).
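A minimal sketch of that idea, assuming Beautiful Soup 4 with the built-in html.parser. Here download_css is a hypothetical stand-in for some_function_to_download_css() that only records the URL instead of fetching it, and the sample markup, the example.com URLs and the name localize_stylesheets are invented for illustration.

```python
from bs4 import BeautifulSoup

html = """<html><head>
<link rel="stylesheet" href="http://example.com/a.css?v=1&amp;t=2">
<link rel="stylesheet" href="http://example.com/a.css?v=1&amp;t=2">
<link rel="stylesheet" href="data:text/css;base64,AAAA">
</head><body></body></html>"""

calls = []  # records every simulated download


def download_css(url):
    """Hypothetical stand-in: a real version would fetch url and save it."""
    calls.append(url)
    return "css/local_%d.css" % len(calls)


def localize_stylesheets(soup, cache):
    """Point stylesheet links at local copies, downloading each URL once."""
    for res in soup.find_all('link', {'rel': 'stylesheet'}):
        href = res.get('href', '')
        if not href or href.startswith('data:'):
            continue  # embedded resource, nothing to download
        if href not in cache:
            cache[href] = download_css(href)
        res['href'] = cache[href]  # updates the node inside the tree
    return str(soup)  # serialize the modified tree back to a string


soup = BeautifulSoup(html, 'html.parser')
cache = {}
result = localize_stylesheets(soup, cache)
```

Because the new href is written into the tree and the page is serialized from the tree, no str.replace() on the original source is needed, so the entity-decoding mismatch never comes up.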

Reference: http://beautiful-soup-4.readthedocs.org/en/latest/#changing-tag-names-and-attributes
