How do I find and replace URI fragments with regular expressions in Python?

Question

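Hello there!

I am trying to find and replace URI fragments in a text file, but I can't figure out how to do it.

Some resources begin with a URL (e.g. http://www.example.com/{fragment}), while others begin with a defined prefix (e.g. example:{fragment}). Both forms refer to the same object, so any change made to one occurrence must be applied to every occurrence, whether it is written with the prefix or as a URL.

Here is an example:

Every time http://www.example.com/Example_1 or example:Example_1 appears, I want to replace all occurrences of the fragment Example_1 in the file with a UUID (e.g. 186e4707_afc8_4d0d_8c56_26e595eba8f0), so that every occurrence becomes http://www.example.com/186e4707_afc8_4d0d_8c56_26e595eba8f0 or example:186e4707_afc8_4d0d_8c56_26e595eba8f0.

This needs to be done for every unique fragment in the file, which means Example_2, Example_3, and so on each get a different UUID.

So far I have managed to come up with this regex: (((?<=### http:\/\/archive\.semantyk\.com\/).*)|(?<=archive:)([^\s]+)) to identify the fragments, but I am really stuck on the replacement part.

I believe this is not a difficult problem, but I do recognize its complexity.

I hope I have explained myself well; if not, please let me know.

Do you know how to solve this?

Thank you very much for reading this far.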

EDIT:

I tried re.sub with this input:

###  http://archive.semantyk.com/Abbreviation
archive:Abbreviation rdf:type owl:Class ;
                     rdfs:subClassOf archive:Word .


###  http://archive.semantyk.com/Ability
archive:Ability rdf:type owl:Class ;
                rdfs:subClassOf archive:Quality .

It produced this result:

###  http://archive.semantyk.com/4f5b99bb_2bff_4166_8468_0134a1d864ae
archive:4f5b99bb_2bff_4166_8468_0134a1d864ae rdf:type owl:Class ;
                     rdfs:subClassOf archive:4f5b99bb_2bff_4166_8468_0134a1d864ae .


###  http://archive.semantyk.com/4f5b99bb_2bff_4166_8468_0134a1d864ae
archive:4f5b99bb_2bff_4166_8468_0134a1d864ae rdf:type owl:Class ;
                rdfs:subClassOf archive:4f5b99bb_2bff_4166_8468_0134a1d864ae .

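But this is incorrect: the UUID is the same everywhere, even though the resources (fragments) are different.

Any ideas?

EDIT: SOLVED!

xcan's code solved it! I just made a few adjustments to get it working.

Here is the final code: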

import re
import uuid

def generateUUID():
    identifier = uuid.uuid4().hex
    identifier = identifier[0:8] + '_' + identifier[8:12] + '_' + identifier[12:16] + '_' + identifier[16:20] + '_' + identifier[20:]
    print('Generated UUID: ' + identifier)
    return identifier

def main():
    # Read the whole file into memory.
    with open('{path}', 'r') as f:
        text = f.read()
    # First find what needs to be changed.
    rg = r"archive:([^\s]+)"
    matches = re.findall(rg, text, re.M)
    # Convert the list to a set to get rid of repeated matches,
    # then convert it back to a list.
    unique_matches = list(set(matches))

    # Replace each unique fragment with its own UUID. The same fragment
    # never gets a different UUID.
    for match in unique_matches:
        # re.escape keeps regex metacharacters in the fragment literal;
        # (?!\S) stops a fragment from matching the start of a longer one.
        pattern = r"(?<=archive:)" + re.escape(match) + r"(?!\S)"
        text = re.sub(pattern, generateUUID(), text)

    # Write the modified text back to the same file.
    with open('{path}', 'w') as f:
        f.write(text)

main()

You only need to replace {path} with the path to your file! I hope this works for you too.

Cheers!

Answer 1

You can use the re (regex) module to replace the matched pattern. Let's take a look:

import re
re.sub(pattern, repl, string, count=0, flags=0)
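
As a minimal sketch of that call (the sample text and the placeholder ID are made up for illustration): when repl is a plain string, every match receives the same replacement. That may be why a single pre-generated UUID ends up on every resource, as in the result shown in the question.

```python
import re

# Sample text, assumed for illustration only.
text = "archive:Ability rdfs:subClassOf archive:Quality ."

# With a plain string as repl, every match gets the same replacement.
result = re.sub(r"(?<=archive:)\S+", "ID", text)
print(result)  # archive:ID rdfs:subClassOf archive:ID .
```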

Answer 2

You can pass a function as the repl argument of re.sub; see the re.sub documentation. That way you can handle each match with your own set of rules.
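
A rough sketch of the function-as-repl idea, assuming fragments are the non-whitespace runs after the archive: prefix. The helper name replace_fragments and the sample text are hypothetical. A dictionary remembers the UUID handed out for each fragment, so repeated fragments stay consistent while different fragments get different UUIDs.

```python
import re
import uuid

def replace_fragments(text):
    """Replace each archive:<fragment> with a UUID, one UUID per fragment."""
    mapping = {}  # fragment -> UUID, filled in as matches are seen

    def repl(match):
        fragment = match.group(0)
        if fragment not in mapping:
            # Underscore-separated form, as in the question's example.
            mapping[fragment] = str(uuid.uuid4()).replace("-", "_")
        return mapping[fragment]

    # re.sub calls repl once per match and inserts its return value.
    return re.sub(r"(?<=archive:)\S+", repl, text), mapping

# Sample text, assumed for illustration only.
text = ("archive:Ability rdfs:subClassOf archive:Quality .\n"
        "archive:Quality rdf:type owl:Class .")
new_text, mapping = replace_fragments(text)
print(new_text)
```

Because the whole text is processed in a single pass, a fragment that happens to be a prefix of another one is not touched twice.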

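EDIT

Revised according to the comments. It finds the archive:.. matches and then replaces them one by one, so identical matches at different positions in the file get the same uuid.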

import uuid
import re


def main():
    text = """  ###  http://archive.semantyk.com/Abbreviation
archive:Abbreviation rdf:type owl:Class ;
                    rdfs:subClassOf archive:Word .
###  http://archive.semantyk.com/Ability
archive:Ability rdf:type owl:Class ;
            rdfs:subClassOf archive:Quality .
                ###  http://archive.semantyk.com/Abbreviation
archive:Abbreviation rdf:type owl:Class ;
                    rdfs:subClassOf archive:Word ."""

    # First find what needs to be changed.
    rg = r"archive:([^\s]+)"
    matches = re.findall(rg, text, re.M)
    # Convert the list to a set to get rid of repeated matches,
    # then convert it back to a list.
    unique_matches = list(set(matches))

    # Replace each unique match with its own uuid. The same match never
    # gets a different uuid.
    for match in unique_matches:
        # re.escape keeps regex metacharacters in the match literal;
        # (?!\S) stops a match from hitting the start of a longer fragment.
        pattern = r"(?<=archive:)" + re.escape(match) + r"(?!\S)"
        text = re.sub(pattern, str(uuid.uuid4()), text)

    print(text)


if __name__ == "__main__":
    main()
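
The code above only rewrites the archive: form. If the ### http://archive.semantyk.com/... comment lines should receive the same UUIDs, as the question asks, one possible variant is sketched below. It assumes the base URL and prefix from the sample data, and the names PATTERN and rewrite are hypothetical. Both spellings are matched in one pass and share one fragment-to-UUID table.

```python
import re
import uuid

# Either spelling of a resource, followed by its fragment.
# The base URL and prefix are taken from the sample data in the question.
PATTERN = re.compile(r"(http://archive\.semantyk\.com/|archive:)(\S+)")

def rewrite(text):
    mapping = {}  # fragment -> UUID shared by both spellings

    def repl(match):
        prefix, fragment = match.groups()
        if fragment not in mapping:
            mapping[fragment] = str(uuid.uuid4()).replace("-", "_")
        # Keep the original prefix or URL, swap only the fragment.
        return prefix + mapping[fragment]

    return PATTERN.sub(repl, text), mapping

# Sample text, shortened from the question's input.
text = ("###  http://archive.semantyk.com/Abbreviation\n"
        "archive:Abbreviation rdf:type owl:Class ;\n"
        "    rdfs:subClassOf archive:Word .\n")
new_text, mapping = rewrite(text)
print(new_text)
```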


2 answers

Answer 1
0 2020-01-27 07:55:26

Answer 2
0 accepted 2020-01-27 08:52:05
