
Parsing a text with regular expressions

I need to parse a text and split it very precisely. I chose regular expressions for the job, but I have run into a problem with a more advanced use of them in C#. I would appreciate any help in finding a solution, even if it means using something other than regexes.

Here are my criteria:

  • The text needs to be split wherever there is a : ; ! ? or \r
  • It can also be split on a dot "." followed by a white-space
  • If white-spaces follow a separator, they must be kept with it
  • In a URL, the ":" must not be split
  • Suspension dots "..." must stay together and be attached to the text before them

And here is a sample text to make this clearer:

---Lorem ipsum dolor sit amet, consectetur adipiscing elit. Mauris euismod : tristiquetellus non egestas; Pellentesque fermentum lectus orci ! A dictum nunc placerat sed ? Quisque eget felis in lacus \rcursus posuere\r\r Aliquam venenatis\r nisi vitae dictum pharetra. ---Vivamus semper dolor quam, pellent.esque hendrerit sapien blandit ut. \r\r\r\rCras sem massa, tempor sit amet nunc id, condimentum facilisis augue... \rhttps://www.google.com dictum nunc placerat sed

And finally, the result I want:

 ---Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
 Mauris euismod  : 
 tristiquetellus non egestas; 
 Pellentesque fermentum lectus orci ! 
 A dictum nunc placerat sed ? 
 Quisque eget felis in lacus \r
 cursus posuere\r\r 
 Aliquam venenatis\r
 nisi vitae dictum pharetra.     \r
 ---Vivamus semper dolor quam, pellent.esque hendrerit sapien blandit ut.  \r\r\r\r
 Cras sem massa, tempor sit amet nunc id, condimentum facilisis augue...  \r
 https://www.google.com dictum nunc placerat sed

I am still far from that result, which is why I am posting here. At the moment I am trying to get the first criterion working. Here is my current code:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

namespace ConsoleApplication58
{
    class Program
    {
        static void Main(string[] args)
        {            
            Regex r = new Regex(@"(\S*\w+\s*\p{P}*)+[:;!?]+\s*");

            string lorem = "---Lorem ipsum dolor sit amet, consectetur adipiscing elit. " +
                "Mauris euismod : " +
                "tristiquetellus non egestas; " +
                "Pellentesque fermentum lectus orci ! " +
                "A dictum nunc placerat sed ? " +
                "Quisque eget felis in lacus \r" +
                "cursus posuere\r\r " +
                "Aliquam venenatis\r " +
                "nisi vitae dictum pharetra. " +
                "---Vivamus semper dolor quam, pellent.esque hendrerit sapien blandit ut. \r\r\r\r" +
                "Cras sem massa, tempor sit amet nunc id, condimentum facilisis augue... \r" +
                "https://www.google.com dictum nunc placerat sed";

            MatchCollection m2 = r.Matches(lorem);

            foreach (Match match in m2)
            {
                string txt = match.Value;
                Console.WriteLine("*{0}*", txt);
            }
        }
    }
}

Thank you very much for reading this and trying to help me. This is somewhat urgent, and I cannot work out the right pattern to use with the Matches() method. Do not hesitate to ask me for more details if necessary.

Since you still haven't been really clear about whether \r is supposed to be a carriage return or a literal \r, I'll give both:

Literal:

(.+?)((?:\.{3} |[:;!?](?!/)|\. )(?:\\r)*\s*|(?:\\r)+\s*|$)

ideone demo.

Carriage return:

(.+?)((?:\.{3} |[:;!?](?!/)|\. )(?:\r)*\s*|(?:\r)+\s*|$)

ideone demo.
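For reference, here is a minimal sketch of how the carriage-return pattern above could be used with Matches() in C#. The class name SentenceSplitter, the method SplitKeepingSeparators and the short sample string are hypothetical and only there for illustration; the regex itself is copied unchanged from above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SentenceSplitter
{
    // Carriage-return variant of the pattern, copied from the answer.
    // Group 1 is the text, group 2 is the separator plus trailing white-space.
    private static readonly Regex Pattern = new Regex(
        @"(.+?)((?:\.{3} |[:;!?](?!/)|\. )(?:\r)*\s*|(?:\r)+\s*|$)");

    // Hypothetical helper: returns each piece with its separator attached.
    // Note that '.' does not match '\n' by default, so this sketch assumes
    // the input only uses '\r' as a line break.
    public static List<string> SplitKeepingSeparators(string text)
    {
        var pieces = new List<string>();
        foreach (Match match in Pattern.Matches(text))
        {
            pieces.Add(match.Value);
        }
        return pieces;
    }

    public static void Main()
    {
        // Made-up sample, shorter than the text in the question.
        string sample = "One : two; three ! See https://www.google.com now. Wait... \rEnd";

        foreach (string piece in SplitKeepingSeparators(sample))
        {
            // Show carriage returns as a visible \r in the console output.
            Console.WriteLine("*{0}*", piece.Replace("\r", "\\r"));
        }
    }
}
```

Because the separator is part of each match, nothing has to be glued back together afterwards. If the text and the separator are needed separately, match.Groups[1] and match.Groups[2] hold them.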

I see you already have the Regex; just split the string using the Regex instance, like this: ... string[] splitStringValues = r.Split(lorem); or

char u = ':'; // just initializing

        switch (u) 
        {
            case ':':
                //do split work here
                break;
            default:
                //do split work here
                break;
        }
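One thing to keep in mind with the Split approach: Regex.Split drops the separators unless the pattern puts them in a capturing group, in which case they are returned as extra array elements. Below is a rough sketch of that idea. The simplified pattern, the class name SplitDemo and the helper SplitAndKeep are my own assumptions, and the pattern only covers the : ; ! ? rule plus the URL guard, not the dot or \r rules.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SplitDemo
{
    // Hypothetical helper: splits on : ; ! ? (not followed by '/')
    // and keeps each separator attached to the text before it.
    public static List<string> SplitAndKeep(string text)
    {
        // The capturing group makes Regex.Split return the separators too,
        // so the array alternates: text, separator, text, separator, ...
        string[] parts = Regex.Split(text, @"([:;!?](?!/)\s*)");

        var pieces = new List<string>();
        for (int i = 0; i < parts.Length; i += 2)
        {
            string separator = i + 1 < parts.Length ? parts[i + 1] : "";
            string piece = parts[i] + separator;
            if (piece.Length > 0)
            {
                pieces.Add(piece);
            }
        }
        return pieces;
    }

    public static void Main()
    {
        // Made-up sample text for illustration.
        foreach (string piece in SplitAndKeep("a : b; see http://x.y ! end"))
        {
            Console.WriteLine("*{0}*", piece);
        }
    }
}
```

Without the parentheses around the separator, the same call would return only the text between the separators, which does not meet the requirement of keeping them.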

Check out (((http(s*))\://){1}\S+)|((\S*\w+\s*\p{P}*)+[:;!?]+\s*)|(\...)
