Extracting data from text using templates

I'm building a web service which receives emails from a number of CRM systems. Emails typically contain a text status, e.g. "Received" or "Completed", as well as a free-text comment.

The formats of the incoming emails differ, e.g. some systems call the status "Status: ZZZZZ" and some "Action: ZZZZZ". The free text sometimes appears before the status and sometimes after. Status codes will be mapped to my system's interpretation, and the comment is required too.

Moreover, I'd expect the formats to change over time, so a configurable solution would be ideal, possibly one where customers provide their own templates through a web interface.

The service is built using .NET C# MVC 3, but I'd be interested in general strategies as well as any specific libraries/tools/approaches.

I've never quite got my head around RegExp. I'll make a new effort in case it is indeed the way to go. :)

I would go with regex:

First example, if you had only "Status: ZZZZZ"-like messages:

// Requires: using System.Text.RegularExpressions;
// emailBody is assumed to hold the text of the incoming email.
string status = Regex.Match(emailBody, @"(?<=Status: ).*").Value;
// Explanation of "(?<=Status: ).*" :
// (?<=       Start of the positive look-behind group: it means that the
//            following text is required but won't appear in the returned string
// Status:    The text defining the email string format
// )          End of the positive look-behind group
// .*         Matches the rest of the line (any characters except '\n')

Second example, if you had only "Status: ZZZZZ" and "Action: ZZZZZ"-like messages:

// emailBody is assumed to hold the text of the incoming email.
string status = Regex.Match(emailBody, @"(?<=(Status|Action): ).*").Value;
// We added (Status|Action) that allows the positive look-behind text to be
// either 'Status: ', or 'Action: '
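
Since the question also asks for the extracted status to be mapped to the receiving system's own interpretation, here is a minimal sketch of how the pattern above could feed such a mapping. The `StatusExtractor` class, the `InternalStatus` enum and the contents of the mapping table are assumptions made for illustration; they do not come from the original post.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class StatusExtractor
{
    // Hypothetical internal status values; not taken from the original post.
    public enum InternalStatus { Unknown, Received, Completed }

    // Assumed mapping from the text found in emails to the internal values.
    private static readonly Dictionary<string, InternalStatus> StatusMap =
        new Dictionary<string, InternalStatus>(StringComparer.OrdinalIgnoreCase)
        {
            { "Received", InternalStatus.Received },
            { "Completed", InternalStatus.Completed }
        };

    // Same look-behind pattern as in the second example.
    private static readonly Regex StatusPattern =
        new Regex(@"(?<=(Status|Action): ).*");

    public static InternalStatus Extract(string emailBody)
    {
        Match match = StatusPattern.Match(emailBody);
        if (!match.Success)
        {
            return InternalStatus.Unknown;
        }

        // '.' stops at '\n' but not at '\r', so Trim() removes a trailing '\r'.
        InternalStatus status;
        return StatusMap.TryGetValue(match.Value.Trim(), out status)
            ? status
            : InternalStatus.Unknown;
    }
}
```

Calling `StatusExtractor.Extract(emailBody)` returns `InternalStatus.Unknown` both when no status line is found and when the status text is not in the table, which lets the caller decide how to handle unrecognised emails.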

Now, if you want to let users provide their own format, you could come up with something like:

string userEntry = GetUserEntry(); // Get the text submitted by the user
string userFormatText = Regex.Escape(userEntry);
// emailBody is assumed to hold the text of the incoming email.
string status = Regex.Match(emailBody, "(?<=" + userFormatText + ").*").Value;

That would allow users to submit their own format, like "Status: ", or "Action: ", or "This is my friggin format, now please read the status --> " ...

The Regex.Escape(userEntry) part is important to ensure that the user doesn't break your regex by submitting special characters like \, ?, * ...
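
To make the effect of escaping concrete, here is a small sketch. The `UserFormatPattern` class and its method names are hypothetical; only `Regex.Escape` and `Regex.Match` are real .NET calls. It assumes the status follows the user's text on the same line.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UserFormatPattern
{
    // Builds the look-behind pattern from raw user text,
    // escaping regex metacharacters first so they are matched literally.
    public static string Build(string userEntry)
    {
        return "(?<=" + Regex.Escape(userEntry) + ").*";
    }

    // Returns the text following the user's format string, or null if it is absent.
    public static string ExtractStatus(string emailBody, string userEntry)
    {
        Match match = Regex.Match(emailBody, Build(userEntry));
        return match.Success ? match.Value.Trim() : null;
    }
}
```

With a user entry such as "What is the status? ", the unescaped text would treat `s?` as an optional "s" and never match the literal question mark, and an entry such as "*** " would not even be a valid pattern. After escaping, both are matched character for character.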


To know whether the status value comes before or after the format text, you have several solutions:

  • You could ask the user where the status value is, and then build your regex accordingly:

     if (statusValueIsAfter)
     {
         // Example: "Status: Closed"
         regexPattern = @"(?<=Status: ).*";
     }
     else
     {
         // Example: "Closed:Status"
         regexPattern = @".*(?=:Status)"; // We use here a positive look-AHEAD
     }
  • Or you could be smarter and introduce a system of tags for the user entry. For instance, the user submits "Status: <value>" or "<value>=The status", and you build the regex by replacing the tag string.
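
The tag-based option could be sketched roughly as follows. The `TemplateRegexBuilder` class is a hypothetical helper, and two design choices here are assumptions rather than part of the answer: the template must contain exactly the `<value>` tag, and the template is expected to cover a whole line of the email.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TemplateRegexBuilder
{
    // The tag name follows the examples above; the helper itself is a sketch.
    private const string ValueTag = "<value>";

    public static Regex Build(string template)
    {
        int tagIndex = template.IndexOf(ValueTag, StringComparison.Ordinal);
        if (tagIndex < 0)
        {
            throw new ArgumentException("Template must contain " + ValueTag, "template");
        }

        // Escape the literal text on each side of the tag separately.
        string before = Regex.Escape(template.Substring(0, tagIndex));
        string after = Regex.Escape(template.Substring(tagIndex + ValueTag.Length));

        // A lazy named group captures the status. With RegexOptions.Multiline,
        // '^' and '$' match at line boundaries; '\r?' tolerates Windows line endings.
        return new Regex(
            "^" + before + "(?<value>.+?)" + after + @"\r?$",
            RegexOptions.Multiline);
    }

    // Returns the captured status, or null when no line fits the template.
    public static string Extract(string emailBody, string template)
    {
        Match match = Build(template).Match(emailBody);
        return match.Success ? match.Groups["value"].Value : null;
    }
}
```

Because the position of the tag inside the template decides whether the literal text ends up before or after the captured group, the same code handles both "Status: <value>" and "<value>=The status" without asking the user where the value sits.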


 