Regex: distinguishing periods at the end of a sentence from periods in URLs

Question

I am building an app that searches text for URLs and adds them to a list box. What I have works, but it cannot pick up URLs that end a sentence (example: this is www.google.com.). Thanks in advance.

Here is my code:

private void btnExtract_Click(object sender, EventArgs e)
{
    StringBuilder taintedStr = new StringBuilder(txtInputText.Text);
    string cleanStr;

    taintedStr.Replace(",", "");
    taintedStr.Replace("!", "");
    taintedStr.Replace("(", "");
    taintedStr.Replace(")", "");
    taintedStr.Replace("[", "");
    taintedStr.Replace("]", "");
    taintedStr.Replace("http://", "");
    cleanStr = taintedStr.ToString();
    string[] wordlist = Regex.Split(cleanStr, @"\s");

    for (int i = 0; i < wordlist.Length; i++)
    {
        bool test = Regex.Match(wordlist[i], @"^[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$").Success;
        if (test == true)
        {
            lstWebsites.Items.Add("http://" + wordlist[i]);
        }
    }
}

Answer 1

Why not tweak your code by adding a line that removes trailing punctuation from each word? For example:

for (int i = 0; i < wordlist.Length; i++)
{
  // Strip any run of trailing sentence punctuation in one call; chained
  // single-character TrimEnd calls would miss endings such as ".?".
  wordlist[i] = wordlist[i].Trim().TrimEnd('.', '!', '?');
  bool test = Regex.IsMatch(wordlist[i], @"^[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$");
  if (test)
  {
    lstWebsites.Items.Add("http://" + wordlist[i]);
  }
}

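Alternatively, the following regex should catch the website. Note that it only makes the test succeed: the word still carries its trailing punctuation, so it needs trimming before it is added to the list.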

^[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}[.!?]?(/\S*)?$
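
As a rough sketch of the trimming idea above, the loop can be pulled into a standalone helper that is easy to try outside the form. The names UrlTrimSketch and ExtractUrls are made up for illustration, and the sketch assumes any "http://" prefix has already been stripped from the text, as the question's code does.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UrlTrimSketch
{
    // Same pattern as in the question: host, 2-3 letter TLD, optional path.
    private static readonly Regex UrlWord =
        new Regex(@"^[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$");

    // Hypothetical helper: returns the words of `text` that look like URLs,
    // prefixed with "http://" as in the question's code.
    public static List<string> ExtractUrls(string text)
    {
        var urls = new List<string>();
        foreach (string raw in Regex.Split(text, @"\s+"))
        {
            // Strip sentence punctuation from the end only; inner dots stay.
            string word = raw.TrimEnd('.', '!', '?');
            if (UrlWord.IsMatch(word))
            {
                urls.Add("http://" + word);
            }
        }
        return urls;
    }
}
```

Calling UrlTrimSketch.ExtractUrls("this is www.google.com.") should yield a single entry for www.google.com. The trade-off is that a URL whose path really ends in '.', '!' or '?' loses that character as well.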

Answer 2

You'll have to decide how to work this into your code, but basically you just want to add a special case for this. The pattern ".[a-zA-Z]{2,3}(/\\S*)?$.\\b" is meant to match a word that ends in a URL followed by a period (as written, the $ would have to move to the end of the pattern, after the final period, for it to match). If that is the case, then do:

 myString = myString.TrimEnd('.'); // remove trailing periods

\b matches at a word boundary, that is, the position between a word character and a non-word character such as a line break, a space, or the end of the input.
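
One way to read the special case described above is sketched below. It is only an illustration: the class TrailingPeriodSketch, the method StripSentencePeriod, and the exact pattern (with the end anchor placed after the final period) are assumptions, not the answerer's code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TrailingPeriodSketch
{
    // Assumed pattern: a dot, a 2-3 letter TLD, an optional path,
    // then one extra period at the very end of the word.
    private static readonly Regex EndsWithExtraPeriod =
        new Regex(@"\.[a-zA-Z]{2,3}(/\S*)?\.$");

    // Hypothetical helper: trims trailing periods only when the word
    // looks like a URL that happens to end a sentence.
    public static string StripSentencePeriod(string word)
    {
        return EndsWithExtraPeriod.IsMatch(word) ? word.TrimEnd('.') : word;
    }
}
```

Because the trim only happens when the pattern matches, an ordinary word such as "Hello." is returned unchanged, while "www.google.com." loses its final period.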

Answer 3

Periods at the end of a sentence are generally followed by whitespace in normal English. But if the period is at the end of a representation of English, it may be followed by other characters such as an EOF character, a "<", a quotation mark, etc.

The way to approach this problem is to recognize when the period is followed by a valid URL character.
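
The idea above could be expressed with a lookahead, roughly as follows. This is a sketch under assumptions: the set of "valid URL characters" is narrowed to letters, digits, '-' and '/', the second pattern is reused from the question, and the names LookaheadUrlSketch and FindUrls are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LookaheadUrlSketch
{
    // A run of URL characters in which a period is accepted only when the
    // next character is itself a URL character (checked by the lookahead).
    private static readonly Regex Candidate =
        new Regex(@"(?:[a-zA-Z0-9\-/]|\.(?=[a-zA-Z0-9\-/]))+");

    // The question's pattern, used to keep only candidates shaped like URLs.
    private static readonly Regex UrlShape =
        new Regex(@"^[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$");

    // Hypothetical helper: returns URL-like substrings of `text`
    // without any sentence-ending period.
    public static List<string> FindUrls(string text)
    {
        var urls = new List<string>();
        foreach (Match m in Candidate.Matches(text))
        {
            if (UrlShape.IsMatch(m.Value))
            {
                urls.Add(m.Value);
            }
        }
        return urls;
    }
}
```

Since the period is judged by what follows it, this also copes with a URL followed by a quotation mark or the end of the input, without splitting on whitespace first.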

Answer 1
Score: 2 (accepted), 2013-04-10 03:12:22

Answer 2
Score: 0, 2013-04-10 03:12:47

Answer 3
Score: 0, 2013-04-10 03:13:15