Regex: Differentiating periods at end of sentence and in URLs

I am building an app that searches text for URLs and adds them to a list box. I have something working, but it cannot pick up a URL that ends a sentence (example: "this is www.google.com."). Thanks in advance.

Here is my code:

private void btnExtract_Click(object sender, EventArgs e)
{
    StringBuilder taintedStr = new StringBuilder(txtInputText.Text);
    string cleanStr;

    taintedStr.Replace(",", "");
    taintedStr.Replace("!", "");
    taintedStr.Replace("(", "");
    taintedStr.Replace(")", "");
    taintedStr.Replace("[", "");
    taintedStr.Replace("]", "");
    taintedStr.Replace("http://", "");
    cleanStr = taintedStr.ToString();
    string[] wordlist = Regex.Split(cleanStr, @"\s");

    for (int i = 0; i < wordlist.Length; i++)
    {
        bool test = Regex.Match(wordlist[i], @"^[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$").Success;
        if (test == true)
        {
            lstWebsites.Items.Add("http://" + wordlist[i]);
        }
    }
}

Why not tweak your code by adding a line that removes trailing punctuation from each word? For example:

for (int i = 0; i < wordlist.Length; i++)
{
  // Strip surrounding whitespace, then any run of trailing '.', '!' or '?'
  // in a single call, so the order of the punctuation does not matter.
  wordlist[i] = wordlist[i].Trim().TrimEnd('.', '!', '?');
  bool test = Regex.Match(wordlist[i], @"^[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$").Success;
  if (test)
  {
    lstWebsites.Items.Add("http://" + wordlist[i]);
  }
}

Alternatively, the following regex should catch the website:

^[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}[.!?]?(/\S*)?$
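One caveat with this pattern: the word now passes the test, but the matched string still carries the trailing punctuation, so "www.google.com." would reach the list box with its period. A minimal sketch of one way around that is to capture the URL separately from the optional sentence punctuation. The class name `UrlWord`, the method `TryGetUrl`, and the group name `url` are all chosen here for illustration and are not part of the original code:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original code.
public static class UrlWord
{
    // Same host, TLD and path rules as the question's pattern. The URL itself is
    // captured in the named group "url"; trailing '.', '!' or '?' is allowed after it.
    // The path is matched lazily so trailing punctuation is left out of the capture.
    private static readonly Regex Pattern = new Regex(
        @"^(?<url>[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*?)?)[.!?]*$");

    // Returns true when word looks like a URL; url then holds it without
    // the trailing sentence punctuation.
    public static bool TryGetUrl(string word, out string url)
    {
        Match m = Pattern.Match(word);
        url = m.Success ? m.Groups["url"].Value : null;
        return m.Success;
    }
}
```

Inside the existing loop this could be used as `if (UrlWord.TryGetUrl(wordlist[i], out string url)) lstWebsites.Items.Add("http://" + url);`. The trade-off is that a URL which genuinely ends in "?" or "." loses that last character.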

You'll have to decide how to work this into your code, but basically you just want to add a special case for this. ".[a-zA-Z]{2,3}(/\\S*)?$.\\b" will match .*. If that is the case, then do:

 myString = myString.TrimEnd('.'); // remove any trailing periods

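\b matches at a word boundary. It is zero-width: it matches the position between a word character and a non-word character (a line break, a space, punctuation, etc.) or the start or end of the input.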

In normal English, a period at the end of a sentence is generally followed by whitespace. But if the period is at the end of a representation of English text, it may be followed by other characters such as an EOF character, a "<", a quotation mark, etc.

The way to approach this problem is to recognize when the period is followed by a valid URL character: if it is, the period belongs to the URL; if not, it ends the sentence.
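As a rough sketch of that idea, the pattern below scans the whole text and, inside a path, consumes a "." or "?" only when a lookahead sees another URL character after it. The class `UrlScanner`, the method `FindUrls`, and the set of characters treated as "valid URL characters" are assumptions made for this example; the host and 2-3 letter TLD rules are carried over from the question's pattern.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original code.
public static class UrlScanner
{
    // Assumed set of characters allowed inside a URL path (illustrative, not exhaustive).
    private const string UrlChar = @"[a-zA-Z0-9/_~%&=#+\-]";

    // Host: one or more dot-terminated labels followed by a 2-3 letter TLD.
    // Path: '/' followed by URL characters; a '.' or '?' is consumed only when
    // the lookahead finds another URL character directly after it.
    private static readonly Regex Pattern = new Regex(
        @"\b(?:[a-zA-Z0-9\-]+\.)+[a-zA-Z]{2,3}\b" +
        @"(?:/(?:" + UrlChar + @"|[.?](?=" + UrlChar + @"))*)?");

    // Returns every URL-like match in text, without sentence-ending punctuation.
    public static List<string> FindUrls(string text)
    {
        var urls = new List<string>();
        foreach (Match m in Pattern.Matches(text))
        {
            urls.Add(m.Value);
        }
        return urls;
    }
}
```

With this approach there is no need to strip commas, brackets, or "http://" beforehand, because the match stops at the first character that cannot continue the URL, whether that is whitespace, a quotation mark, a "<", or the end of the text. Each result could then be added with `lstWebsites.Items.Add("http://" + url)`.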
