Python: Find and replace under certain conditions with regex

Basically, I want to write a script that sanitizes a URL by replacing its dots with the string "(dot)". For example, given http://www.google.com, the script should output http://www(dot)google(dot)com. This is easy to do with .replace when the text file contains only URLs or other strings, but my text file also contains IP addresses, and I don't want the dots in the IP addresses changed to "(dot)".

I tried to do this using regex, but my output is "http://ww(dot)oogl(dot)om 192.60.10.10 33.44.55.66".

This is my code:

from __future__ import print_function

import sys
import re

nargs = len(sys.argv)
if nargs < 2:
    sys.exit('You did not specify a file')
else:
    inputFile = sys.argv[1]
    fp = open(inputFile)
    content = fp.read()

replace = '(dot)'
regex = '[a-z](\.)[a-z]'
print(re.sub(regex, replace, content, re.M| re.I| re.DOTALL))

I guess I need a condition that checks whether the pattern is number.number and, if so, does not replace it.

You can use lookahead and lookbehind assertions:

import re

s = "http://www.google.com 127.0.0.1"

print(re.sub(r"(?<=[a-z])\.(?=[a-z])", "(dot)", s))
http://www(dot)google(dot)com 127.0.0.1

To handle letters and digits, this should hopefully do the trick. The first lookahead makes sure at least one letter follows later on the same line, so a purely numeric address is left alone (an IP address that precedes a URL on the same line would still be rewritten):

s = "http://www.googl1.2com 127.0.0.1"

print(re.sub(r"(?=.*[a-z])(?<=\w)\.(?=\w)", "(dot)", s, flags=re.I))

http://www(dot)googl1(dot)2com 127.0.0.1

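For your file, pass the flags by keyword, because the fourth positional argument of re.sub is count, not flags. re.M is not needed here: the pattern has no ^ or $, and since . does not match a newline by default, the lookahead is already evaluated one line at a time:

In [1]: cat test.txt
google8.com
google9.com
192.60.10.10
33.44.55.66
google10.com
192.168.1.1
google11.com

In [2]: with open("test.txt") as f:
   ...:     import re
   ...:     print(re.sub(r"(?=.*[a-z])(?<=\w)\.(?=\w)", "(dot)", f.read(), flags=re.I))
   ...: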
google8(dot)com
google9(dot)com
192.60.10.10
33.44.55.66
google10(dot)com
192.168.1.1
google11(dot)com

If the file is large and memory is a concern, you can also process it line by line, either storing all the lines in a list or using each line as you go:

import re

with open("test.txt") as f:
    r = re.compile(r"(?=.*[a-z])(?<=\w)\.(?=\w)", re.I)
    # Pattern.sub takes the replacement first, then the string to search.
    lines = [r.sub("(dot)", line) for line in f]
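For the "using each line as you go" variant, a minimal sketch could look like the following. The names DOT and sanitize_lines are made up for illustration, and io.StringIO objects stand in for real files so the snippet runs on its own; with real files you would pass the open file handles instead.

```python
import io
import re

# Same pattern as in the answer above, compiled once and reused per line.
DOT = re.compile(r"(?=.*[a-z])(?<=\w)\.(?=\w)", re.I)

def sanitize_lines(lines):
    """Yield each line with qualifying dots replaced, one line at a time."""
    for line in lines:
        yield DOT.sub("(dot)", line)

# io.StringIO stands in for open files here; any iterable of lines works.
source = io.StringIO("google8.com\n192.60.10.10\ngoogle10.com\n")
out = io.StringIO()
for line in sanitize_lines(source):
    out.write(line)  # each line is written as soon as it is processed

print(out.getvalue(), end="")
```

Because the generator never holds more than one line, memory use stays flat regardless of file size.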

Judging by your code, you were hoping to replace only the first group within your pattern. However, re.sub replaces the entire match, not a single group. In your case the match is the character right before the period, the period itself, and the character right after it.
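To see this concretely, the short sketch below lists what each match of the question's pattern actually spans. The sample string is an assumption modelled on the question's output; the variable names are illustrative.

```python
import re

text = "http://www.google.com 192.60.10.10"
pattern = re.compile(r"[a-z](\.)[a-z]", re.I)

# group(0) is the full span that re.sub replaces; group(1) is only the dot.
spans = [(m.group(0), m.group(1)) for m in pattern.finditer(text)]
print(spans)  # [('w.g', '.'), ('e.c', '.')]

# Replacing group(0) therefore drops the letter on each side of the dot.
replaced = pattern.sub("(dot)", text)
print(replaced)  # http://ww(dot)oogl(dot)om 192.60.10.10
```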

Even if sub worked as you hoped, your regex would miss edge cases where digits are part of a URL, such as www.2048game.com. Defining what an IP looks like is far easier: it is always a set of four numbers of one, two, or three digits each, separated by dots. (For IPv4, anyway. IPv6 does not use periods, so it doesn't matter here.)

Assuming you have only URLs and IPs in your text file, simply filter out all IPs and then replace the periods in the remaining URLs:

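import re

# Dotted quad such as 192.60.10.10 (the octets are not range-checked).
is_ip = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
urls = content.split(" ")
for i, url in enumerate(urls):
    # Leave tokens that start with an IP untouched; rewrite everything else.
    if not is_ip.match(url):
        urls[i] = url.replace('.', '(dot)')
content = ' '.join(urls)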

Of course, if content contains regular text, this will also replace every ordinary period, not just those in URLs. In that case, you would first need more intricate URL detection. See "In search of the perfect URL validation regex".
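As a variation on the filtering idea, the sketch below uses the standard-library ipaddress module instead of a regex to decide whether a token is an IPv4 address, and splits on any whitespace while keeping the separators so that line breaks survive. The function names is_ipv4 and sanitize are hypothetical, and the same caveat applies: every non-IP token is rewritten, including ordinary text.

```python
import ipaddress
import re

def is_ipv4(token):
    """Return True if token parses as an IPv4 address."""
    try:
        ipaddress.IPv4Address(token)
        return True
    except ValueError:
        return False

def sanitize(content):
    # The capturing group makes re.split keep the whitespace separators.
    parts = re.split(r"(\s+)", content)
    return "".join(
        part if part.isspace() or is_ipv4(part) else part.replace(".", "(dot)")
        for part in parts
    )

print(sanitize("http://www.2048game.com 192.60.10.10\n33.44.55.66\n"), end="")
```

Unlike the three-digit pattern, ipaddress rejects out-of-range octets, so a token such as 999.1.1.1 would be treated as a non-IP and rewritten.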

You have to store the [a-z] content before and after the dot so that you can put it back into the replacement string. Here is how I solved it:

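from __future__ import print_function
import sys
import re

if len(sys.argv) < 2:
    sys.exit('You did not specify a file')

# Read the whole file; the with block closes it afterwards.
with open(sys.argv[1]) as fp:
    content = fp.read()

# Capture the letter on each side of the dot and put both back.
# Note: in a run like a.b.c only the first dot is rewritten, because the
# consumed letter cannot start the next match.
replace = r'\1(dot)\3'
regex = r'([a-z])(\.)([a-z])'
# Flags must be passed by keyword; the fourth positional argument is count.
print(re.sub(regex, replace, content, flags=re.I))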

Another option is to find the URLs first and replace the dots only inside them:

import re

content = "I tried to do this using regex, but my output is http://www.googl.com 192.60.10.10 33.44.55.66\nhttp://ya.ru\n..."

reg = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

all_urls = re.findall(reg, content, re.M | re.I | re.DOTALL)
repl_urls = [u.replace('.', '(dot)') for u in all_urls]

# Swap each URL found in the text for its sanitized version.
for u, r in zip(all_urls, repl_urls):
    content = content.replace(u, r)

print(content)
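The find-then-replace loop can also be collapsed into a single pass by giving re.sub a function as the replacement. This is only a sketch under the same assumptions as the answer above: it reuses that URL pattern unchanged, and the names URL and defang are invented for illustration.

```python
import re

# Same URL pattern as in the answer above.
URL = re.compile(
    r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
)

def defang(content):
    # The callable receives each match object and returns its replacement.
    return URL.sub(lambda m: m.group(0).replace(".", "(dot)"), content)

sample = "see http://www.googl.com 192.60.10.10\nhttp://ya.ru\n..."
print(defang(sample))
```

Only text matched by the pattern is touched, so bare IP addresses and ordinary periods stay as they are. A URL whose host is itself an IP address, or one followed directly by a sentence-ending period, would still be rewritten, since the pattern treats both as part of the URL.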
