Hi there!
I am trying to do a find and replace of URI fragments in a text file but I just don't know how this can be done.
Some resources begin with a URL (e.g. http://www.example.com/{fragment}), others begin with a defined prefix (e.g. example:{fragment}). Both forms represent the same object, so any alteration to one occurrence must be made to all occurrences of both the prefixed and the URL form, and vice versa.
Here's an example:
Every time http://www.example.com/Example_1 or example:Example_1 shows up, I want to replace all occurrences of the fragment Example_1 in the file with a UUID (e.g. 186e4707_afc8_4d0d_8c56_26e595eba8f0), so that every occurrence becomes either http://www.example.com/186e4707_afc8_4d0d_8c56_26e595eba8f0 or example:186e4707_afc8_4d0d_8c56_26e595eba8f0.
This would need to happen for every unique fragment in the file, which means a different UUID for Example_2, Example_3 and so on.
So far I have found that this regex: (((?<=### http:\\/\\/archive\\.semantyk\\.com\\/).*)|(?<=archive:)([^\\s]+)) works for identifying the fragments, but I am really stuck on the replacing part.
I believe this is not a difficult problem, but I do recognize its complexity.
I hope I explained myself well enough, but in case I didn't, please let me know.
Do you know how this can be solved?
Thank you so much for reading this far along.
I tried re.sub with this input:
### http://archive.semantyk.com/Abbreviation
archive:Abbreviation rdf:type owl:Class ;
rdfs:subClassOf archive:Word .
### http://archive.semantyk.com/Ability
archive:Ability rdf:type owl:Class ;
rdfs:subClassOf archive:Quality .
and it produces this result:
### http://archive.semantyk.com/4f5b99bb_2bff_4166_8468_0134a1d864ae
archive:4f5b99bb_2bff_4166_8468_0134a1d864ae rdf:type owl:Class ;
rdfs:subClassOf archive:4f5b99bb_2bff_4166_8468_0134a1d864ae .
### http://archive.semantyk.com/4f5b99bb_2bff_4166_8468_0134a1d864ae
archive:4f5b99bb_2bff_4166_8468_0134a1d864ae rdf:type owl:Class ;
rdfs:subClassOf archive:4f5b99bb_2bff_4166_8468_0134a1d864ae .
But this is incorrect: every fragment received the same UUID, even though the resources (fragments) are different.
Any ideas?
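A likely cause, assuming the call looked roughly like re.sub(pattern, str(uuid.uuid4()), text) (the actual call is not shown above): the replacement argument is evaluated once, before re.sub runs, so every match receives the same string. A minimal sketch of that effect, using a hypothetical two-line `sample`:

```python
import re
import uuid

# Hypothetical two-line sample; the real input file is larger.
sample = "archive:Abbreviation\narchive:Ability"

# uuid.uuid4() is called once here, so both matches get the same text.
replacement = str(uuid.uuid4())
result = re.sub(r"(?<=archive:)\S+", replacement, sample)

first, second = result.splitlines()
print(first == second)  # both lines now carry the identical UUID
```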
xcan's code solved it! I just made some tweaks for it to work.
Here's the final code:
import re
import uuid


def generateUUID():
    # uuid4().hex is 32 hex characters; regroup them as 8_4_4_4_12.
    identifier = uuid.uuid4().hex
    identifier = identifier[0:8] + '_' + identifier[8:12] + '_' + identifier[12:16] + '_' + identifier[16:20] + '_' + identifier[20:]
    print('Generated UUID: ' + identifier)
    return identifier


def main():
    with open('{path}', 'r') as file:
        text = file.read()
    # First find what needs to be changed.
    rg = r"archive:([^\s]+)"
    matches = re.findall(rg, text, re.M)
    # Convert the list to a set to get rid of repeated matches,
    # then convert it back to a list.
    unique_matches = list(set(matches))
    # Replace each unique word with its own UUID. The same word never gets
    # a different UUID.
    for match in unique_matches:
        # re.escape guards against regex metacharacters in the fragment;
        # (?!\S) stops "Word" from also matching inside "WordList".
        pattern = r"(?<=archive:)" + re.escape(match) + r"(?!\S)"
        text = re.sub(pattern, generateUUID(), text)
    with open('{path}', 'w') as file:
        file.write(text)


main()
You just need to replace the {path} with the path to your file and that's it! Hope this works for you too.
Cheers!
You can use the re (regex) module to replace a matching pattern. Here is its signature:
import re
re.sub(pattern, repl, string, count=0, flags=0)
You can pass a function to re.sub as the repl argument, as described in the re.sub documentation, so you can handle each match with your own set of rules.
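As a rough sketch of the callable form: re.sub calls the function once per match with a match object and inserts whatever string it returns. Here a dictionary remembers which UUID was handed out for each fragment, so repeated fragments stay consistent. The names `replace_fragments`, `repl` and `mapping` are made up for this illustration, and only the archive: prefix is handled.

```python
import re
import uuid

def replace_fragments(text):
    """Replace each distinct archive: fragment with its own UUID."""
    mapping = {}  # fragment -> UUID, filled lazily as matches are seen

    def repl(match):
        fragment = match.group(0)
        if fragment not in mapping:
            mapping[fragment] = str(uuid.uuid4())
        return mapping[fragment]

    # The lookbehind keeps the "archive:" prefix out of the match.
    new_text = re.sub(r"(?<=archive:)\S+", repl, text)
    return new_text, mapping

new_text, mapping = replace_fragments(
    "archive:Abbreviation rdfs:subClassOf archive:Word .\n"
    "archive:Ability rdfs:subClassOf archive:Quality .\n"
    "archive:Abbreviation rdf:type owl:Class ;"
)
print(sorted(mapping))  # the four distinct fragments found
```

This does the work in a single pass over the text instead of one re.sub call per fragment.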
Edited according to the comments: the archive:.. matches are found first and then substituted one fragment at a time, so the same match located anywhere else in the file gets the same UUID.
import uuid
import re


def main():
    text = """ ### http://archive.semantyk.com/Abbreviation
archive:Abbreviation rdf:type owl:Class ;
rdfs:subClassOf archive:Word .
### http://archive.semantyk.com/Ability
archive:Ability rdf:type owl:Class ;
rdfs:subClassOf archive:Quality .
### http://archive.semantyk.com/Abbreviation
archive:Abbreviation rdf:type owl:Class ;
rdfs:subClassOf archive:Word ."""
    # First find what needs to be changed.
    rg = r"archive:([^\s]+)"
    matches = re.findall(rg, text, re.M)
    # Convert the list to a set to get rid of repeated matches,
    # then convert it back to a list.
    unique_matches = list(set(matches))
    # Replace each unique match with its own UUID. The same match never gets
    # a different UUID.
    for match in unique_matches:
        # re.escape guards against regex metacharacters in the fragment;
        # (?!\S) stops "Word" from also matching inside "WordList".
        pattern = r"(?<=archive:)" + re.escape(match) + r"(?!\S)"
        text = re.sub(pattern, str(uuid.uuid4()), text)
    print(text)


if __name__ == "__main__":
    main()
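Note that the pattern above only touches the archive: form, so the "### http://archive.semantyk.com/..." lines keep their original fragment. A possible variant, sketched here under the assumption that both forms should share one UUID per fragment, captures the prefix and the fragment separately and substitutes through a callable. The names `FRAGMENT` and `anonymize` are hypothetical, and the UUID uses underscores as in the question's examples.

```python
import re
import uuid

# Either the full URL prefix or the short "archive:" prefix, then the fragment.
FRAGMENT = re.compile(r"(http://archive\.semantyk\.com/|archive:)(\S+)")

def anonymize(text):
    """Give every distinct fragment one UUID, shared by both prefix forms."""
    mapping = {}

    def repl(match):
        prefix, fragment = match.groups()
        if fragment not in mapping:
            # Underscore-separated UUID, as in the question's examples.
            mapping[fragment] = str(uuid.uuid4()).replace("-", "_")
        return prefix + mapping[fragment]

    return FRAGMENT.sub(repl, text), mapping

sample = (
    "### http://archive.semantyk.com/Abbreviation\n"
    "archive:Abbreviation rdf:type owl:Class ;\n"
    "rdfs:subClassOf archive:Word .\n"
)
result, mapping = anonymize(sample)
print(result)
```

Returning `mapping` alongside the text makes it possible to save the fragment-to-UUID table, which may be useful if the original names need to be looked up later.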