After reading the answers to this question: "C# regex pattern to extract urls from given string - not full html urls but bare links as well", I want to know which is the fastest way to extract URLs from a document: regex matching or the string split method.
So, you have a string containing an HTML document and want to extract the URLs.
The regex way would be:
Regex linkParser = new Regex(@"\b(?:https?://|www\.)\S+\b", RegexOptions.Compiled | RegexOptions.IgnoreCase);
string rawString = "house home go www.monstermmorpg.com nice hospital http://www.monstermmorpg.com this is incorrect url http://www.monstermmorpg.commerged continue";
foreach (Match m in linkParser.Matches(rawString))
    MessageBox.Show(m.Value);
And the string split method:
string rawString = "house home go www.monstermmorpg.com nice hospital http://www.monstermmorpg.com this is incorrect url http://www.monstermmorpg.commerged continue";
var links = rawString.Split("\t\n ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries).Where(s => s.StartsWith("http://") || s.StartsWith("www.") || s.StartsWith("https://"));
foreach (string s in links)
    MessageBox.Show(s);
Which of the two is the more performant way to do it?
Split is faster. Here is some code that you can test with (the original answer linked to it on dotnetfiddle):
using System;
using System.Diagnostics;
using System.Linq;
using System.Text.RegularExpressions;

public class Program
{
    public static void Main()
    {
        const int iterations = 500;
        string rawString = "house home go www.monstermmorpg.com nice hospital http://www.monstermmorpg.com this is incorrect url http://www.monstermmorpg.commerged continue";

        // Build the regex once so that its construction cost is not timed in the loop.
        Regex linkParser = new Regex(@"\b(?:https?://|www\.)\S+\b", RegexOptions.Compiled | RegexOptions.IgnoreCase);
        char[] separators = "\t\n ".ToCharArray();
        int found = 0;

        Stopwatch sw = new Stopwatch();
        sw.Start();
        for (int i = 0; i < iterations; i++)
        {
            // Count forces the lazily evaluated MatchCollection to run the regex.
            found += linkParser.Matches(rawString).Count;
        }
        sw.Stop();
        var test1Time = sw.ElapsedMilliseconds;

        sw.Reset();
        sw.Start();
        for (int i = 0; i < iterations; i++)
        {
            // Count() enumerates the deferred Where query so the filter actually executes.
            found += rawString.Split(separators, StringSplitOptions.RemoveEmptyEntries).Where(s => s.StartsWith("http://") || s.StartsWith("www.") || s.StartsWith("https://")).Count();
        }
        sw.Stop();
        var test2Time = sw.ElapsedMilliseconds;

        Console.WriteLine("Regex Test: " + test1Time.ToString());
        Console.WriteLine("Split Test: " + test2Time.ToString());
        Console.WriteLine("Links found in total: " + found.ToString());
    }
}
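Speed aside, it may be worth checking that both approaches return the same links. The regex is case-insensitive and its trailing \b drops punctuation at the end of a match, while the split version does neither. Below is a minimal sketch of such a check. The sample string and the names LinkExtractors, ExtractWithRegex and ExtractWithSplit are made up for illustration and are not from the question:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class; the names are illustrative only.
public static class LinkExtractors
{
    // Same pattern and options as in the question.
    private static readonly Regex LinkParser = new Regex(
        @"\b(?:https?://|www\.)\S+\b",
        RegexOptions.Compiled | RegexOptions.IgnoreCase);

    private static readonly char[] Separators = { '\t', '\n', ' ' };

    // Regex approach: collect the value of every match.
    public static List<string> ExtractWithRegex(string text)
    {
        return LinkParser.Matches(text).Cast<Match>().Select(m => m.Value).ToList();
    }

    // Split approach: keep whitespace-separated tokens that start with a known prefix.
    public static List<string> ExtractWithSplit(string text)
    {
        return text.Split(Separators, StringSplitOptions.RemoveEmptyEntries)
            .Where(s => s.StartsWith("http://") || s.StartsWith("www.") || s.StartsWith("https://"))
            .ToList();
    }

    public static void Main()
    {
        // Assumed sample input: upper-case scheme, trailing comma, parentheses.
        string sample = "See HTTP://Example.com, or (www.example.org).";
        Console.WriteLine("Regex: " + string.Join(" | ", ExtractWithRegex(sample)));
        Console.WriteLine("Split: " + string.Join(" | ", ExtractWithSplit(sample)));
    }
}
```

On this input the regex version should report HTTP://Example.com and www.example.org, while the split version reports nothing, because neither token starts with one of the lower-case prefixes. If the two methods have to agree, the split filter would need case-insensitive prefix checks and some trimming of punctuation, which adds work to the faster path.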