简体   繁体   中英

Python: Find and replace under certain conditions with regex

basically i want to write a script that will sanitize a URL, will replace the dots with "(dot)" string. For example if i have http://www.google.com after i run the script i would like it to be http://www(dot)google(dot) . Well this is pretty easy to achieve with .replace when my text file consist only of urls or other strings, but in my case i also have IP addresses inside my text file, and i dont want the dots in the IP address to change to "(dot)".

I tried to do this using regex, but my output is " http://ww(dot)oogl(dot)om 192.60.10.10 33.44.55.66 "

This is my code

from __future__ import print_function


import sys
import re

nargs = len(sys.argv)
if nargs < 2:

    sys.exit('You did not specify a file')
else:
    inputFile = sys.argv[1]
    fp = open(inputFile)
    content = fp.read()

replace = '(dot)'
regex = '[a-z](\.)[a-z]'
print(re.sub(regex, replace, content, re.M| re.I| re.DOTALL))

I guess i need to have a condition which checks that if the pattern is number.number - dont replace.

You can use lookahead and lookbehind assertions:

import  re

s = "http://www.google.com 127.0.0.1"

print(re.sub("(?<=[a-z])\.(?=[a-z])", "(dot)", s))
http://www(dot)google(dot)com 127.0.0.1

To work for letters and digits this should hopefully do the trick, making sure there is at least one letter:

s = "http://www.googl1.2com 127.0.0.1"

print(re.sub("(?=.*[a-z])(?<=\w)\.(?=\w)", "(dot)", s, re.I))

http://www(dot)googl1(dot)2com 127.0.0.1

For your file you need re.M :

In [1]: cat test.txt
google8.com
google9.com
192.60.10.10
33.44.55.66
google10.com
192.168.1.1
google11.com

In [2]: with open("test.txt") as f:
   ...:         import re
   ...:         print(re.sub("(?=.*[a-z])(?<=\w)\.(?=\w)", "(dot)", f.read(), re.I|re.M))
   ...:     
google8(dot)com
google9(dot)com
192.60.10.10
33.44.55.66
google10(dot)com
192.168.1.1
google11(dot)com

If the files were large and memory was an issue you could also do it line by line, either storing all the lines in a list or using each line as you go:

import re
with open("test.txt") as f:
    r = re.compile("(?=.*[a-z])(?<=\w)\.(?=\w)", re.I)
    lines = [r.sub("(?=.*[a-z])(?<=\w)\.(?=\w)", "(dot)") for line in f]

Judging by your code, you were hoping to replace the first group within your pattern . However, re.sub replaces the entire matching pattern, rather than a group. In your case this is the single character right before the period, the period itself and the single character after it.

Even if sub worked as you hoped, your regex would be missing edge cases where numbers are part of URLs, such as www.2048game.com . Defining what an IP looks like is far easier. It's always a set of four numbers with one, two or three digits each, separated by dots. (In the case of IPv4, anyway. But IPv6 does not use periods, so it doesn't matter here.)

Assuming you have only URLs and IPs in your text file, simply filter out all IPs and then replace the periods in the remaining URLs:

is_ip = re.compile('\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
urls = content.split(" ")
for i, url in enumerate(urls):
    if not is_ip.match(url):
        urls[i] = url.replace('.', '(dot)')
content = ' '.join(urls)

Of course, if you have regular text in content , this will also replace all regular periods, not just URLs. In that case, you would first require a more intricate URL detection. See In search of the perfect URL validation regex

You have to store the [az] content before and after the dot, to put it again in the replaced string. Here how I solved it:

from __future__ import print_function
import sys
import re

nargs = len(sys.argv)
if nargs < 2:
    sys.exit('You did not specify a file')
else:
    inputFile = sys.argv[1]
    fp = open(inputFile)
    content = fp.read()

replace = '\\1(dot)\\3'
regex = '(.*[a-z])(\.)([a-z].*)'
print(re.sub(regex, replace, content, re.M| re.I| re.DOTALL))
import re

content = "I tried to do this using regex, but my output is http://www.googl.com 192.60.10.10 33.44.55.66\nhttp://ya.ru\n..."

reg = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

all_urls = re.findall(reg, content, re.M| re.I| re.DOTALL)
repl_urls = [u.replace('.', '(dot)') for u in all_urls]

for u, r in zip(all_urls, repl_urls):
    content = content.replace(u, r)

print content

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM