Parse multiple hostnames from string

I am trying to parse multiple hostnames from a string using a Regex in C#.

Example string: abc.google.com another example here abc.microsoft.com and another example abc.bbc.co.uk

The code I have been trying is below:

string input = "abc.google.com another example here abc.microsoft.com and another example abc.bbc.co.uk";
string FQDN_Pat = @"^([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]{0,61}[a-zA-Z0-9])(\.([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]{0,61}[a-zA-Z0-9]))*$";

Regex r = new Regex(FQDN_Pat);
Match m = r.Match(input);         
while (m.Success)
{
    txtBoxOut.Text += "Match: " + m.Value + " ";
    m = m.NextMatch();
}

The code works only if the whole string fits the pattern exactly, e.g. abc.google.com.

How can I change the Regex so that it matches each hostname inside the example string, giving this output:

Match: abc.google.com
Match: abc.microsoft.com
Match: abc.bbc.co.uk

Apologies in advance if this is something very simple as my knowledge of regular expressions is not great! :) Thanks!

UPDATE:

Updating the Regex to the following (removing the ^ and $ ):

string FQDN_Pat = @"([a-zA-Z0-9]([a-zA-Z0-9\-]{0,61}[a-zA-Z0-9])?)(\.([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]{0,61}[a-zA-Z0-9]))";

Results in the following output:

Match 1: abc.g
Match 2: oogle.c
Match 3: abc.m
Match 4: icrosoft.c
Match 5: abc.b
Match 6: bc.c
Match 7: o.u

The regexp is fairly complicated, so I simplified it a bit. Here is what I did:

  1. Remove ^ and $ so that the regexp can match anywhere in the input.
  2. Simplify the characters you match: instead of ([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]{0,61}[a-zA-Z0-9]) I'm using ([a-zA-Z0-9])+ , which matches any alphanumeric sequence of one or more characters (the + sign means the preceding group appears once or more). Let's call it X . Note that this drops the hyphen and the 63-character label limit; if the rules for names in an FQDN are more complex, modify X accordingly.
  3. The expression for finding an FQDN is then X(\.X)+ . This reads as a sequence of characters followed by one or more further sequences, each preceded by a dot ( . ). Substituting X gives the full expression:

     string FQDN_Pat = @"([a-zA-Z0-9]+)(\.([a-zA-Z0-9])+)+";

Inside a verbatim string ( @"..." ) the dot is escaped with a single backslash, so write \. rather than \\. (the latter would match a literal backslash followed by any character). This pattern matches all three hostnames in your example, but I suggest reading the C# regex documentation for further reference in case domain names have rules it doesn't cover.
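As a rough illustration of the simplified X(\.X)+ idea, here is a minimal console sketch. The class and method names (SimpleHostnameDemo, FindHostnames) are made up for this example, and the pattern is the simplified one, so it assumes labels are purely alphanumeric (no hyphens, no length limit). Note the single backslash before the dot inside the verbatim string.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of the original question.
public static class SimpleHostnameDemo
{
    // Simplified pattern: one alphanumeric label, then one or more ".label" parts.
    // Assumption: labels contain only letters and digits.
    private const string Pattern = @"[a-zA-Z0-9]+(\.[a-zA-Z0-9]+)+";

    public static string[] FindHostnames(string input)
    {
        // Regex.Matches returns every non-overlapping match, left to right.
        return Regex.Matches(input, Pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        string input = "abc.google.com another example here abc.microsoft.com and another example abc.bbc.co.uk";

        foreach (string host in FindHostnames(input))
        {
            Console.WriteLine("Match: " + host);
        }
        // Prints:
        // Match: abc.google.com
        // Match: abc.microsoft.com
        // Match: abc.bbc.co.uk
    }
}
```

In a WinForms app, the same loop could append to txtBoxOut.Text with Environment.NewLine instead of writing to the console.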

You get this behavior because your pattern only matches a string that contains nothing but a single hostname. You are using ^ (start of the string) and $ (end of the string). If you want to match your pattern anywhere in the input string, remove those characters from the pattern.
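Removing the anchors on its own may not be enough with the original pattern: as far as I can tell, each alternation tries its single-character branch first and the trailing * allows zero repetitions, so unanchored matches come out as fragments. Below is one possible sketch that keeps the original label rules (hyphens allowed inside a label, at most 63 characters per label) by writing each label with an optional group, requiring at least one dot, and adding \b word boundaries. The names StrictHostnameDemo and FindHostnames are hypothetical, and the pattern is my adaptation rather than something given in the question.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class StrictHostnameDemo
{
    // One label: starts and ends with an alphanumeric, hyphens allowed in between,
    // 1 to 63 characters in total.
    private const string Label = @"[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?";

    // A hostname here is a label followed by one or more ".label" parts,
    // delimited by word boundaries instead of ^ and $.
    private static readonly Regex HostnameRegex =
        new Regex(@"\b" + Label + @"(?:\." + Label + @")+\b");

    public static string[] FindHostnames(string input)
    {
        return HostnameRegex.Matches(input)
                            .Cast<Match>()
                            .Select(m => m.Value)
                            .ToArray();
    }

    public static void Main()
    {
        string input = "abc.google.com another example here abc.microsoft.com and another example abc.bbc.co.uk";

        foreach (string host in FindHostnames(input))
        {
            Console.WriteLine("Match: " + host);
        }
    }
}
```

The optional group lets the greedy quantifier consume the whole label before the dot is tried, which is why this form avoids the fragmentary matches. It still accepts things that are not real hostnames (for example version numbers such as 1.2.3), so treat it as a starting point.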
