Python Regex to Extract Domain from Text
I have the following regex:
r'(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-]{,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}'
When I apply this to a text string such as "this is www.website1.com and this is website2.com", I get:
['www.website1.com']
['website2.com']
How can I modify the regex to exclude the 'www.', so that I get 'website1.com' and 'website2.com'? I'm missing something pretty basic ...
Try this one (thanks @SunDeep for the update):
\s(?:www.)?(\w+.com)
Explanation
\s
matches any whitespace character
(?:www.)?
non-capturing group, matches www. zero or one time
(\w+.com)
capturing group, matches one or more word characters followed by .com
And in action:
import re
s = 'this is www.website1.com and this is website2.com'
matches = re.findall(r'\s(?:www.)?(\w+.com)', s)
print(matches)
Output:
['website1.com', 'website2.com']
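Two details of the pattern above are worth a quick check: an unescaped . matches any character, and the leading \s means a domain at the very start of the string is skipped. The following is only a sketch of one possible tightening, not part of the original answer; the names DOMAIN_COM and extract_com_domains are made up for illustration, and \b is assumed to be an acceptable replacement for \s.

```python
import re

# Hypothetical variant: dots are escaped, and \b (word boundary) replaces \s
# so a domain at the start of the string is also found.
DOMAIN_COM = re.compile(r'\b(?:www\.)?(\w+\.com)\b')

def extract_com_domains(text):
    """Return .com domains found in text, without a leading 'www.'."""
    return DOMAIN_COM.findall(text)

s = 'www.website1.com and this is website2.com'
print(extract_com_domains(s))                            # ['website1.com', 'website2.com']
print(re.findall(r'\s(?:www.)?(\w+.com)', s))            # ['website2.com']

# The unescaped dot lets an ordinary word ending in "com" slip through.
print(re.findall(r'\s(?:www.)?(\w+.com)', ' telecom'))   # ['telecom']
print(extract_com_domains(' telecom'))                   # []
```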
A couple of notes about this. First of all, matching all valid domain names is very difficult to do, so while I chose to use \w+ to capture for this example, I could have chosen something like: [a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9]\.[a-zA-Z]{2,}
This answer has a lot of helpful info about matching domains: What is a regular expression which will match a valid domain name without a subdomain?
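As a rough illustration of the stricter, label-based approach, the question's original pattern can be kept almost unchanged by putting an optional non-capturing www\. in front and wrapping the rest in one capturing group. This is a sketch under that assumption; LABEL and STRICT_DOMAIN are names invented here, and the pattern still does not validate every legal domain.

```python
import re

# One DNS-style label: starts and ends with an alphanumeric, hyphens allowed inside.
LABEL = r'[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?'

# Optional "www." is consumed but not captured; the single capturing group
# holds the remaining labels plus a 2-6 letter top-level domain.
STRICT_DOMAIN = re.compile(r'(?:www\.)?((?:' + LABEL + r'\.)+[a-zA-Z]{2,6})')

s = 'this is www.website1.com and this is website2.com'
print(STRICT_DOMAIN.findall(s))  # ['website1.com', 'website2.com']
```

Only a leading www. is dropped; other subdomains such as mail.example.com are returned whole.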
Next, I only look for .com domains; you could adjust my regular expression to something like:
\s(?:www.)?(\w+.(com|org|net))
to match whichever types of domains you are looking for.
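One thing to watch with the multi-TLD pattern: (com|org|net) is a second capturing group, so re.findall returns tuples rather than plain strings. A small sketch of the difference, using a made-up sample string and a non-capturing (?:...) group as one possible fix:

```python
import re

s = 'see www.example.org, site.net and www.foo.com'  # sample text, for illustration only

# Two capturing groups: findall returns one tuple per match.
with_tuples = re.findall(r'\s(?:www.)?(\w+.(com|org|net))', s)
print(with_tuples)   # [('example.org', 'org'), ('site.net', 'net'), ('foo.com', 'com')]

# Making the TLD group non-capturing (and escaping the dots) yields plain strings.
plain = re.findall(r'\s(?:www\.)?(\w+\.(?:com|org|net))', s)
print(plain)         # ['example.org', 'site.net', 'foo.com']
```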
Here is a try:
import re
s = "www.website1.com"
# Group 1: optional "www." prefix; group 2: everything up to the end of the string.
k = re.findall(r'(www\.)?(.*?)$', s, re.DOTALL)[0][1]
print(k)
Output:
'website1.com'
If s = "website1.com", the output is the same:
'website1.com'
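When the input is already a single hostname, as in this answer, an anchored substitution may be a simpler way to get the same result. This is just an alternative sketch, not the answerer's method; strip_www is a hypothetical helper name.

```python
import re

def strip_www(host):
    """Remove one leading 'www.' from a single hostname string, if present."""
    return re.sub(r'^www\.', '', host)

print(strip_www('www.website1.com'))  # website1.com
print(strip_www('website1.com'))      # website1.com
```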