How to find and replace URI fragments with regex in Python?

Question

Hi there!

I am trying to find and replace URI fragments in a text file, but I don't know how this can be done.

Some resources begin with a URL (e.g. http://www.example.com/{fragment}), others begin with a defined prefix (e.g. example:{fragment}). Both forms refer to the same object, so any change to one occurrence must be applied to every occurrence of both the prefixed and the URL form.

Here's an example:

Every time http://www.example.com/Example_1 or example:Example_1 shows up, I want to replace all occurrences of the fragment Example_1 in the file with a UUID (e.g. 186e4707_afc8_4d0d_8c56_26e595eba8f0), so that every occurrence becomes either http://www.example.com/186e4707_afc8_4d0d_8c56_26e595eba8f0 or example:186e4707_afc8_4d0d_8c56_26e595eba8f0.

This needs to happen for every unique fragment in the file, which means a different UUID for Example_2, Example_3 and so on.

So far I have found that this regex: (((?<=### http:\\/\\/archive\\.semantyk\\.com\\/).*)|(?<=archive:)([^\\s]+)) works for identifying the fragments, but I am stuck on the replacing part.

I believe this is not a difficult problem, but I do recognize its complexity.

I hope I explained myself well enough, but in case I didn't, please let me know.

Do you know how this can be solved?

Thank you so much for reading this far.

EDIT:

I tried re.sub with this input:

###  http://archive.semantyk.com/Abbreviation
archive:Abbreviation rdf:type owl:Class ;
                     rdfs:subClassOf archive:Word .


###  http://archive.semantyk.com/Ability
archive:Ability rdf:type owl:Class ;
                rdfs:subClassOf archive:Quality .

and it produces this result:

###  http://archive.semantyk.com/4f5b99bb_2bff_4166_8468_0134a1d864ae
archive:4f5b99bb_2bff_4166_8468_0134a1d864ae rdf:type owl:Class ;
                     rdfs:subClassOf archive:4f5b99bb_2bff_4166_8468_0134a1d864ae .


###  http://archive.semantyk.com/4f5b99bb_2bff_4166_8468_0134a1d864ae
archive:4f5b99bb_2bff_4166_8468_0134a1d864ae rdf:type owl:Class ;
                rdfs:subClassOf archive:4f5b99bb_2bff_4166_8468_0134a1d864ae .

But this is incorrect, since the UUID is the same everywhere while the resources (fragments) are different.

Any ideas?

EDIT: SOLVED!

xcan's code solved it! I just made some tweaks for it to work.

Here's the final code:

import re
import uuid

def generateUUID():
    # uuid4().hex is 32 hex characters; regroup them as 8_4_4_4_12 with underscores.
    identifier = uuid.uuid4().hex
    identifier = identifier[0:8] + '_' + identifier[8:12] + '_' + identifier[12:16] + '_' + identifier[16:20] + '_' + identifier[20:]
    print('Generated UUID: ' + identifier)
    return identifier

def main():
    with open('{path}', 'r') as file:
        text = file.read()
    # First, find what needs to be changed.
    rg = r"archive:([^\s]+)"
    matches = re.findall(rg, text, re.M)
    # Convert the list to a set to drop repeated matches,
    # then convert back to a list.
    unique_matches = list(set(matches))

    # Replace each unique fragment with its own UUID. The same fragment
    # always gets the same UUID.
    for match in unique_matches:
        # re.escape keeps regex metacharacters in the fragment literal;
        # (?!\S) stops a fragment from matching as a prefix of a longer one.
        pattern = r"(?<=archive:)" + re.escape(match) + r"(?!\S)"
        text = re.sub(pattern, generateUUID(), text)

    with open('{path}', 'w') as file:
        file.write(text)

main()

You just need to replace {path} with the path to your file and that's it! Hope this works for you too.

Cheers!

Answer 1

You can use the re (regex) module to replace a matching pattern. Its substitution function has this signature:

import re
re.sub(pattern, repl, string, count=0, flags=0)
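As a minimal sketch of how that call behaves, the snippet below replaces the token after the archive: prefix with a fixed placeholder. The sample line is taken from the question; the placeholder "X" and the variable names are only illustrative.

```python
import re

text = "archive:Ability rdf:type owl:Class ; rdfs:subClassOf archive:Quality ."

# Replace every whitespace-delimited token that directly follows "archive:".
replaced_all = re.sub(r"(?<=archive:)\S+", "X", text)

# count=1 limits the substitution to the first match only.
replaced_first = re.sub(r"(?<=archive:)\S+", "X", text, count=1)

print(replaced_all)
print(replaced_first)
```

Note that a plain string replacement is the same for every match, which is why it cannot by itself give each fragment a different UUID.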

Answer 2

You can pass a function to re.sub as the repl argument, as described in the re module documentation. The function is called once per match, so you can handle each match with your own set of rules.
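One way this could look, as a sketch rather than a definitive implementation: the replacement function keeps a dictionary so that a fragment seen before reuses its UUID. The names replace_fragments, repl and mapping are hypothetical, and the underscore-separated UUID format is assumed from the question.

```python
import re
import uuid


def replace_fragments(text):
    """Replace each archive:<fragment> with a UUID, one UUID per distinct fragment."""
    mapping = {}  # fragment -> UUID, filled lazily as fragments are seen

    def repl(match):
        fragment = match.group(1)
        if fragment not in mapping:
            # str(uuid4()) uses hyphens; the question uses underscores instead.
            mapping[fragment] = str(uuid.uuid4()).replace("-", "_")
        return "archive:" + mapping[fragment]

    # re.sub calls repl once for every match, in a single pass over the text.
    return re.sub(r"archive:(\S+)", repl, text), mapping


sample = (
    "archive:Ability rdfs:subClassOf archive:Quality .\n"
    "archive:Ability rdf:type owl:Class ."
)
result, mapping = replace_fragments(sample)
print(result)
```

Because the text is scanned only once, a freshly inserted UUID is never re-examined by a later substitution.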

EDIT

Edited according to the comments. The archive:.. matches are found first and then substituted one by one, so identical matches at different places in the file get the same UUID.

import uuid
import re


def main():
    text = """  ###  http://archive.semantyk.com/Abbreviation
archive:Abbreviation rdf:type owl:Class ;
                    rdfs:subClassOf archive:Word .
###  http://archive.semantyk.com/Ability
archive:Ability rdf:type owl:Class ;
            rdfs:subClassOf archive:Quality .
                ###  http://archive.semantyk.com/Abbreviation
archive:Abbreviation rdf:type owl:Class ;
                    rdfs:subClassOf archive:Word ."""

    # First, find what needs to be changed.
    rg = r"archive:([^\s]+)"
    matches = re.findall(rg, text, re.M)
    # Convert the list to a set to drop repeated matches,
    # then convert back to a list.
    unique_matches = list(set(matches))

    # Replace each unique match with its own UUID. Identical matches
    # always get the same UUID.
    for match in unique_matches:
        # re.escape keeps regex metacharacters in the match literal;
        # (?!\S) stops a match from also hitting a longer token that starts with it.
        pattern = r"(?<=archive:)" + re.escape(match) + r"(?!\S)"
        text = re.sub(pattern, str(uuid.uuid4()), text)

    print(text)


if __name__ == "__main__":
    main()
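The code above only rewrites fragments that follow the archive: prefix, while the question also asks for the URL form to change. A possible extension, sketched under the assumption that the base URL is http://archive.semantyk.com/ as in the question's sample, matches both forms in one pattern and shares one UUID per fragment. BASE_URL, PREFIX and anonymize_fragments are hypothetical names.

```python
import re
import uuid

BASE_URL = "http://archive.semantyk.com/"  # assumed from the question's sample
PREFIX = "archive:"

# Group 1 is whichever lead-in matched, group 2 is the fragment itself.
FRAGMENT_RE = re.compile(
    "(" + re.escape(BASE_URL) + "|" + re.escape(PREFIX) + r")(\S+)"
)


def anonymize_fragments(text):
    """Swap every fragment for a UUID, keeping its URL or prefix lead-in."""
    mapping = {}  # fragment -> UUID, shared by the URL form and the prefixed form

    def repl(match):
        lead_in, fragment = match.group(1), match.group(2)
        if fragment not in mapping:
            mapping[fragment] = str(uuid.uuid4()).replace("-", "_")
        return lead_in + mapping[fragment]

    return FRAGMENT_RE.sub(repl, text), mapping


sample = (
    "###  http://archive.semantyk.com/Abbreviation\n"
    "archive:Abbreviation rdf:type owl:Class ;\n"
    "                     rdfs:subClassOf archive:Word .\n"
)
result, mapping = anonymize_fragments(sample)
print(result)
```

Tokens with other prefixes, such as rdf:type or owl:Class, do not match the pattern and are left unchanged.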


2 answers

Answer 1
0 2020-01-27 07:55:26

Answer 2
0 (accepted) 2020-01-27 08:52:05
