Parsing text with regular expressions

Question

I need to parse a text and split it very precisely. I chose regular expressions for the job, but I have run into a problem with a more advanced use of them in C#. I would appreciate any help finding a solution, even if it means using something other than regexes.

Here are my criteria:

The text needs to be split when there is a : ; ! ? \r
We can also split it on a dot "." followed by a white-space
If there are white-spaces after a separator, they need to be kept with that piece.
If there is a URL, we do not split on its ":"
If there are suspension dots "...", they need to be kept at the end of the piece they follow

And here is a sample text to make this clearer:

---Lorem ipsum dolor sit amet, consectetur adipiscing elit. Mauris euismod : tristiquetellus non egestas; Pellentesque fermentum lectus orci ! A dictum nunc placerat sed ? Quisque eget felis in lacus \rcursus posuere\r\r Aliquam venenatis\r nisi vitae dictum pharetra. ---Vivamus semper dolor quam, pellent.esque hendrerit sapien blandit ut. \r\r\r\rCras sem massa, tempor sit amet nunc id, condimentum facilisis augue... \rhttps://www.google.com dictum nunc placerat sed

And finally, the desired result:

 ---Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
 Mauris euismod  : 
 tristiquetellus non egestas; 
 Pellentesque fermentum lectus orci ! 
 A dictum nunc placerat sed ? 
 Quisque eget felis in lacus \r
 cursus posuere\r\r 
 Aliquam venenatis\r
 nisi vitae dictum pharetra.     \r
 ---Vivamus semper dolor quam, pellent.esque hendrerit sapien blandit ut.  \r\r\r\r
 Cras sem massa, tempor sit amet nunc id, condimentum facilisis augue...  \r
 https://www.google.com dictum nunc placerat sed

I am really far from that result, which is why I am posting here. Right now I am trying to get step 1) working. Here is my current code:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

namespace ConsoleApplication58
{
    class Program
    {
        static void Main(string[] args)
        {            
            Regex r = new Regex(@"(\S*\w+\s*\p{P}*)+[:;!?]+\s*");

            string lorem = "---Lorem ipsum dolor sit amet, consectetur adipiscing elit. " +
                "Mauris euismod : " +
                "tristiquetellus non egestas; " +
                "Pellentesque fermentum lectus orci ! " +
                "A dictum nunc placerat sed ? " +
                "Quisque eget felis in lacus \r" +
                "cursus posuere\r\r " +
                "Aliquam venenatis\r " +
                "nisi vitae dictum pharetra. " +
                "---Vivamus semper dolor quam, pellent.esque hendrerit sapien blandit ut. \r\r\r\r" +
                "Cras sem massa, tempor sit amet nunc id, condimentum facilisis augue... \r" +
                "https://www.google.com dictum nunc placerat sed";

            MatchCollection m2 = r.Matches(lorem);

            foreach (Match match in m2)
            {
                string txt = match.Value;
                Console.WriteLine("*{0}*", txt);
            }
        }
    }
}

Thank you very much for reading this and trying to help me. This is kind of urgent, and I cannot figure out the right pattern to use with the Matches() method. Do not hesitate to ask me for more details if necessary.

Answer 1

Since you still haven't been really clear about whether \r is supposed to be a carriage return or a literal \r, I'll put both:

Literal:

(.+?)((?:\.{3} |[:;!?](?!/)|\. )(?:\\r)*\s*|(?:\\r)+\s*|$)

ideone demo.

Carriage return:

(.+?)((?:\.{3} |[:;!?](?!/)|\. )(?:\r)*\s*|(?:\r)+\s*|$)

ideone demo.
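As a rough sketch of how the carriage-return pattern above could be plugged into the code from the question: each match covers the text (group 1) plus its separator and trailing whitespace (group 2), so taking `Match.Value` gives one piece per match. The class and method names (`SentenceSplitter`, `SplitText`) are made up for this illustration and are not from the question or the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: the names are illustrative only.
public static class SentenceSplitter
{
    // Carriage-return variant of the pattern, copied unchanged from above.
    // Group 1 is the text, group 2 is the separator plus trailing whitespace.
    private static readonly Regex Pattern = new Regex(
        @"(.+?)((?:\.{3} |[:;!?](?!/)|\. )(?:\r)*\s*|(?:\r)+\s*|$)");

    // Returns the pieces in order; each piece keeps its separator.
    public static List<string> SplitText(string input)
    {
        var pieces = new List<string>();
        foreach (Match m in Pattern.Matches(input))
        {
            pieces.Add(m.Value);
        }
        return pieces;
    }
}
```

Calling `SentenceSplitter.SplitText(lorem)` from the question's `Main` and printing each piece with `Console.WriteLine("*{0}*", piece)` should show where the pieces begin and end. Note that `.` does not match `\n`, so this sketch assumes the input uses `\r` alone as its line break, as in the sample text.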

Answer 2

I see you have the Regex, just split the string using the Regex instance like this: ... string[] splitStringValues = r.Split(lorem); or

        char u = ':'; // just initializing

        switch (u) 
        {
            case ':':
                //do split work here
                break;
            default:
                //do split work here
                break;
        }
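One thing to keep in mind with `Regex.Split` is that the matched text is dropped from the result unless it is captured. A possible way around that, sketched below, is to split on a zero-width position instead, using a lookbehind and a lookahead so that nothing is consumed. This is only an illustration under simplifying assumptions: it handles just `;`, `!` and `?` (not `:`, dots, or URLs), and the names `KeepSeparatorSplit` and `SplitAfterSeparators` are invented here.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: the names are illustrative only.
public static class KeepSeparatorSplit
{
    // Zero-width boundary: a position preceded by ; ! or ? (plus any
    // whitespace after it) and followed by a non-whitespace character.
    private static readonly Regex Boundary = new Regex(@"(?<=[;!?]\s*)(?=\S)");

    // Splits the input at each boundary; separators and the whitespace
    // that follows them stay attached to the preceding piece.
    public static string[] SplitAfterSeparators(string input)
    {
        return Boundary.Split(input);
    }
}
```

For example, `KeepSeparatorSplit.SplitAfterSeparators("a; b! c")` yields the three pieces `"a; "`, `"b! "` and `"c"`.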

Answer 3

Check out (((http(s*))\://){1}\S+)|((\S*\w+\s*\p{P}*)+[:;!?]+\s*)|(\...)

3 answers

Answer 1
1 accepted 2013-09-09 16:26:02

Answer 2
0 2013-09-09 14:54:46

Answer 3
0 2013-09-09 15:49:56
