How to find & replace URI fragments with Regex in Python?

Hi there!

I am trying to do a find and replace of URI fragments in a text file but I just don't know how this can be done.

Some resources begin with a URL (e.g. http://www.example.com/{fragment} ), others with a defined prefix (e.g. example:{fragment} ). Both forms refer to the same object, so any change to a fragment must be applied to every occurrence of it, in both the URL form and the prefixed form.

Here's an example:

Every time http://www.example.com/Example_1 or example:Example_1 shows up, I want to replace the fragment Example_1 with a UUID (e.g. 186e4707_afc8_4d0d_8c56_26e595eba8f0 ), so that every occurrence becomes either http://www.example.com/186e4707_afc8_4d0d_8c56_26e595eba8f0 or example:186e4707_afc8_4d0d_8c56_26e595eba8f0 .

This needs to happen for every unique fragment in the file, which means a different UUID for Example_2 , Example_3 and so on.

So far I have found that this regex: (((?<=### http:\\/\\/archive\\.semantyk\\.com\\/).*)|(?<=archive:)([^\\s]+)) identifies the fragments, but I am stuck on the replacing part.

I believe this is not a difficult problem, but I do recognize its complexity.

I hope I explained myself well enough, but in case I didn't, please let me know.

Do you know how this can be solved?

Thank you so much for reading this far along.


EDIT:

I tried re.sub with this input:

###  http://archive.semantyk.com/Abbreviation
archive:Abbreviation rdf:type owl:Class ;
                     rdfs:subClassOf archive:Word .


###  http://archive.semantyk.com/Ability
archive:Ability rdf:type owl:Class ;
                rdfs:subClassOf archive:Quality .

and it produces this result:

###  http://archive.semantyk.com/4f5b99bb_2bff_4166_8468_0134a1d864ae
archive:4f5b99bb_2bff_4166_8468_0134a1d864ae rdf:type owl:Class ;
                     rdfs:subClassOf archive:4f5b99bb_2bff_4166_8468_0134a1d864ae .


###  http://archive.semantyk.com/4f5b99bb_2bff_4166_8468_0134a1d864ae
archive:4f5b99bb_2bff_4166_8468_0134a1d864ae rdf:type owl:Class ;
                rdfs:subClassOf archive:4f5b99bb_2bff_4166_8468_0134a1d864ae .

But this is incorrect: every fragment received the same UUID, even though the resources (fragments) are different.

Any ideas?


EDIT: SOLVED!

xcan's code solved it! I just made some tweaks for it to work.

Here's the final code:

import re
import uuid

def generateUUID():
    # str(uuid4()) is hyphen-separated; swap the hyphens for underscores.
    identifier = str(uuid.uuid4()).replace('-', '_')
    print('Generated UUID: ' + identifier)
    return identifier

def main():
    with open('{path}', 'r') as file:
        text = file.read()
    # First, find what needs to be changed.
    # Only the archive: prefixed form is matched here.
    rg = r"archive:([^\s]+)"
    matches = re.findall(rg, text, re.M)
    # Convert the list to a set to get rid of repeated matches,
    # then convert it back to a list.
    unique_matches = list(set(matches))

    # Replace each unique fragment with its own UUID. Every occurrence
    # of the same fragment gets the same UUID.
    for match in unique_matches:
        # re.escape keeps regex metacharacters in the fragment literal;
        # (?!\S) stops 'Word' from also matching inside 'WordList'.
        pattern = r"(?<=archive:)" + re.escape(match) + r"(?!\S)"
        text = re.sub(pattern, generateUUID(), text)

    with open('{path}', 'w') as file:
        file.write(text)

main()

You just need to replace the {path} with the path to your file and that's it! Hope this works for you too.

Cheers!

You can use the re (regex) module to replace text that matches a pattern. Let's look:

import re
re.sub(pattern, repl, string, count=0, flags=0)

You can pass a function to re.sub as the repl argument (see the re module documentation). It is called once per match and returns the replacement string, so you can handle each match with your own set of rules.
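
As a rough sketch of that idea (not the code the question ended up using), a callback can look each fragment up in a dictionary, so repeated fragments reuse one UUID while new fragments get a fresh one. The names PATTERN and replace_fragments are made up for illustration, and the two prefixes are assumed from the sample data in the question:

```python
import re
import uuid

# Assumed prefixes, taken from the question's sample input.
PATTERN = re.compile(r"(http://archive\.semantyk\.com/|archive:)([^\s]+)")

def replace_fragments(text):
    """Replace every fragment with a UUID, one UUID per unique fragment."""
    mapping = {}  # fragment -> UUID, so repeats reuse the same identifier

    def repl(match):
        prefix, fragment = match.group(1), match.group(2)
        if fragment not in mapping:
            mapping[fragment] = str(uuid.uuid4()).replace("-", "_")
        return prefix + mapping[fragment]

    # re.sub calls repl once for each match and inserts its return value.
    return PATTERN.sub(repl, text), mapping

sample = (
    "###  http://archive.semantyk.com/Abbreviation\n"
    "archive:Abbreviation rdf:type owl:Class ;\n"
    "    rdfs:subClassOf archive:Word .\n"
)
new_text, mapping = replace_fragments(sample)
print(new_text)
```

Because the URL form and the prefixed form go through the same dictionary, both end up with the same UUID for a given fragment, and the text is scanned only once.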

EDIT

Edited according to the comments. The archive:.. matches are found first and then substituted one fragment at a time, so the same fragment gets the same UUID wherever it appears in the file.

import uuid
import re


def main():
    text = """  ###  http://archive.semantyk.com/Abbreviation
archive:Abbreviation rdf:type owl:Class ;
                    rdfs:subClassOf archive:Word .
###  http://archive.semantyk.com/Ability
archive:Ability rdf:type owl:Class ;
            rdfs:subClassOf archive:Quality .
                ###  http://archive.semantyk.com/Abbreviation
archive:Abbreviation rdf:type owl:Class ;
                    rdfs:subClassOf archive:Word ."""

    # First, find what needs to be changed.
    # Only the archive: prefixed form is matched here.
    rg = r"archive:([^\s]+)"
    matches = re.findall(rg, text, re.M)
    # Convert the list to a set to get rid of repeated matches,
    # then convert it back to a list.
    unique_matches = list(set(matches))

    # Replace each unique match with its own UUID. Every occurrence of
    # the same match gets the same UUID.
    for match in unique_matches:
        # re.escape keeps regex metacharacters in the match literal;
        # (?!\S) stops 'Word' from also matching inside 'WordList'.
        pattern = r"(?<=archive:)" + re.escape(match) + r"(?!\S)"
        text = re.sub(pattern, str(uuid.uuid4()), text)

    print(text)


if __name__ == "__main__":
    main()
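
When a fragment is spliced into a pattern, as in the loop above, two details are worth checking: regex metacharacters inside the fragment, and fragments that are prefixes of longer ones. The sketch below uses made-up fragments (Word, WordList, C++), placeholder replacements (ID1, ID2) and a hypothetical helper replace_exact to show the difference:

```python
import re

text = "archive:Word archive:WordList archive:C++"

# Naive pattern: 'Word' also matches the start of 'WordList'.
naive = re.sub(r"(?<=archive:)(Word)", "ID1", text)

def replace_exact(fragment, replacement, text):
    # re.escape treats the fragment literally (so 'C++' is safe), and
    # (?!\S) requires whitespace or the end of the text after it.
    pattern = r"(?<=archive:)" + re.escape(fragment) + r"(?!\S)"
    return re.sub(pattern, replacement, text)

safe = replace_exact("Word", "ID1", text)
safe = replace_exact("C++", "ID2", safe)

print(naive)  # archive:ID1 archive:ID1List archive:C++
print(safe)   # archive:ID1 archive:WordList archive:ID2
```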
