Python Regex to Extract Domain from Text

I have the following regex:

r'(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-]{,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}'

When I apply this to a text string such as "this is www.website1.com and this is website2.com", I get:

['www.website1.com']

['website2.com']

How can I modify the regex to exclude the 'www.' prefix, so that I get 'website1.com' and 'website2.com'? I'm missing something pretty basic ...

Try this one (thanks @SunDeep for the update):

\s(?:www\.)?(\w+\.com)

Explanation

\s matches any whitespace character

(?:www\.)? is a non-capturing group that matches www. zero or one time

(\w+\.com) captures one or more word characters followed by .com

And in action:

import re

s = 'this is www.website1.com and this is website2.com'

matches = re.findall(r'\s(?:www\.)?(\w+\.com)', s)
print(matches)

Output:

['website1.com', 'website2.com']
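
One thing to watch with the leading \s: it needs a whitespace character before the domain, so a domain at the very start of the text is skipped. The sketch below is a possible variation, not part of the original answer. It swaps \s for the zero-width word boundary \b, and the sample text is made up for illustration.

```python
import re

# Hypothetical sample text: the first domain sits at the very start of the string.
text = 'www.website1.com and this is website2.com'

# \s consumes a whitespace character, so nothing can match at position 0.
with_space = re.findall(r'\s(?:www\.)?(\w+\.com)', text)

# \b is a zero-width word boundary, so it also matches at the start of the string.
with_boundary = re.findall(r'\b(?:www\.)?(\w+\.com)', text)

print(with_space)     # ['website2.com']
print(with_boundary)  # ['website1.com', 'website2.com']
```

Note that \b also matches after a dot, so with this variant a name like sub.website1.com would come back as website1.com, which may or may not be what you want.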

A couple of notes about this. First, matching all valid domain names is very difficult, so while I chose \w+ for this example, I could have used something like: [a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9]\.[a-zA-Z]{2,}
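
As a rough sketch of how that stricter pattern could be dropped in place of \w+\.com, assuming the same sample string and the same optional www. prefix (the names DOMAIN and pattern are only for illustration):

```python
import re

s = 'this is www.website1.com and this is website2.com'

# Stricter label pattern from the note above: an alphanumeric first and last
# character, hyphens allowed in between, then a dot and an alphabetic TLD.
# As written, the label must be at least three characters long.
DOMAIN = r'[a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9]\.[a-zA-Z]{2,}'

# Same structure as before: whitespace, optional "www.", captured domain.
pattern = re.compile(r'\s(?:www\.)?(' + DOMAIN + r')')

print(pattern.findall(s))  # ['website1.com', 'website2.com']
```

This still does not validate every legal domain name; it only tightens what counts as a label.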

This answer has a lot of helpful info about matching domains: What is a regular expression which will match a valid domain name without a subdomain? 这个答案有很多有关匹配域的有用信息: 什么是正则表达式,它将匹配没有子域的有效域名?

Next, I only look for .com domains. You could adjust my regular expression to something like:

\s(?:www\.)?(\w+\.(?:com|org|net))

This matches whichever types of domains you are looking for.
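
Here is a small sketch of the multi-TLD version with re.findall. The sample text is made up. One detail worth knowing: when a pattern has more than one capturing group, findall returns tuples, so the sketch keeps the TLD alternation non-capturing with (?:...) and shows the capturing form for comparison.

```python
import re

# Hypothetical sample text with several TLDs.
s = 'see www.website1.com, website2.org and www.website3.net for details'

# Non-capturing TLD group: findall returns plain strings.
matches = re.findall(r'\s(?:www\.)?(\w+\.(?:com|org|net))', s)
print(matches)  # ['website1.com', 'website2.org', 'website3.net']

# Capturing TLD group: findall returns (domain, tld) tuples instead.
tuples = re.findall(r'\s(?:www\.)?(\w+\.(com|org|net))', s)
print(tuples)  # [('website1.com', 'com'), ('website2.org', 'org'), ('website3.net', 'net')]
```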

Here is an attempt:

import re

s = "www.website1.com"
# Optionally match a leading "www.", then capture the rest of the string.
k = re.findall(r'(www\.)?(.*?)$', s, re.DOTALL)[0][1]
print(k)

Output:

'website1.com'

If s = "website1.com", the output is the same:

'website1.com'
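
This approach assumes the whole string is a single host name. Under that same assumption, a simpler alternative (a different technique, not the one above) is to strip the prefix with re.sub. The helper name strip_www is hypothetical.

```python
import re

def strip_www(host):
    """Remove a single leading 'www.' from a bare host name (hypothetical helper)."""
    # ^ anchors the match to the start, so only a leading "www." is removed.
    return re.sub(r'^www\.', '', host)

print(strip_www('www.website1.com'))  # website1.com
print(strip_www('website1.com'))      # website1.com
```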
