Regex to ignore trailing dot if there is one
I have the following regex (which roughly matches things that look like URLs):
(https?://\S*)
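However, this is for extracting URLs from sentences, so a trailing dot is probably the end of the sentence rather than a legitimate part of the URL.
What is the magic incantation to make the capture group ignore trailing full stops, commas, colons, semicolons, and so on?
(I know that matching URLs is a nightmare; this only needs to support matching them loosely, hence the very simple regex.)
Here is my test string: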
lorem http://www.example.com lorem https://example.com lorem
http://www.example.com.
lorem https://example.com.
This should match all the example.com instances.
(I am using Expresso and .NET to test it.)
Test result with a trailing dot and a newline:
Expected string length 62 but was 64. Strings differ at index 31.
Expected: "<a href="http://www.example.com">http://www.example.com</a>.\n\r"
But was: "<a href="http://www.example.com.\n">http://www.example.com.\n</a>\r"
------------------------------------------^
Sample code
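using System.Text.RegularExpressions;

public class HyperlinkParser
{
    // Loosely matches http/https URLs; the final [^\.] is meant to drop a trailing dot.
    private readonly Regex _regex =
        new Regex(
            @"(https?://\S*[^\.])");

    // Wraps every match in an anchor tag that links to the matched text.
    public string Parse(string original)
    {
        var parsed = _regex.Replace(original, "<a href=\"$1\">$1</a>");
        return parsed;
    }
}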
Test example
using NUnit.Framework;

[TestFixture]
public class HyperlinkParserTests
{
    private readonly HyperlinkParser _parser = new HyperlinkParser();

    private const string NO_HYPERLINKS = "dummy-text";
    private const string FULL_URL = "http://www.example.com";
    private const string FULL_URL_PARSED = "<a href=\"" + FULL_URL + "\">" + FULL_URL + "</a>";
    private const string FULL_URL_TRAILING_DOT = FULL_URL + ".";
    private const string FULL_URL_TRAILING_DOT_PARSED = "<a href=\"" + FULL_URL + "\">" + FULL_URL + "</a>.";
    private const string TRAILING_DOT_AND_NEW_LINE = FULL_URL_TRAILING_DOT + "\n\r";
    private const string TRAILING_DOT_AND_NEW_LINE_PARSED = FULL_URL_TRAILING_DOT_PARSED + "\n\r";
    private const string COMPLEX_TEXT = "Leading stuff http://www.example.com. Other stuff.";
    private const string COMPLEX_TEXT_PARSED = "Leading stuff <a href=\"http://www.example.com\">http://www.example.com</a>. Other stuff.";

    // Each case pairs an input string with the markup the parser is expected to produce.
    [TestCase(NO_HYPERLINKS, NO_HYPERLINKS)]
    [TestCase(FULL_URL, FULL_URL_PARSED)]
    [TestCase(FULL_URL_TRAILING_DOT, FULL_URL_TRAILING_DOT_PARSED)]
    [TestCase(TRAILING_DOT_AND_NEW_LINE, TRAILING_DOT_AND_NEW_LINE_PARSED)]
    [TestCase(COMPLEX_TEXT, COMPLEX_TEXT_PARSED)]
    public void Parsing(string original, string expected)
    {
        var actual = _parser.Parse(original);
        Assert.That(actual, Is.EqualTo(expected));
    }
}
Try this; it disallows a dot as the last character:
(https?://\S*[^.])
For example, under cygwin, using egrep:
$ cat ~/tmp.txt
lorem http://www.example.com lorem https://example.com lorem
http://www.example.com.
lorem https://example.com.
$ cat ~/tmp.txt | egrep -o 'https?://\S*[^.]'
http://www.example.com
https://example.com
http://www.example.com
https://example.com
(The -o option tells egrep to print only the matching parts.)
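One caveat: [^.] accepts any character that is not a dot, including whitespace, so the match can end on a trailing space or newline. That is consistent with the ".\n" captured in the failing test in the question. Below is a minimal sketch of a stricter variant, assuming it is acceptable for loose matching to forbid whitespace and the punctuation . , : ; as the final character. The extract_urls helper name and the sample input are made up for illustration, and grep is assumed to support -o (GNU and BSD grep both do).

```shell
# Hypothetical helper: print every URL-like token read from stdin, one per line.
# The last character of a match must be neither whitespace nor one of . , : ;
# so sentence punctuation after a URL is left out of the match.
extract_urls() {
  grep -oE 'https?://[^[:space:]]*[^[:space:].,:;]'
}

# Sample input (illustrative only): URLs followed by a space, a dot, and a comma.
printf '%s\n' \
  'lorem http://www.example.com lorem https://example.com lorem' \
  'http://www.example.com.' \
  'lorem https://example.com, lorem' | extract_urls
```

In .NET the equivalent final character class would be [^\s.,:;], giving (https?://\S*[^\s.,:;]). Treat that as a suggestion to check against the tests in the question rather than a verified fix.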