
Extracting content from text files with generic rules

I have a lot of text data with different structures. I need to extract parts of these texts based on some text-based rules. I would use regular expressions, but unfortunately the people who use the application have never heard of them.

Basically, the app does the following:

  1. Load the data into a textbox
  2. Type the structure of the output as a simple set of rules into another textbox
  3. Receive the results in a third textbox

Examples of data structures (I have megabytes of this data):

Label1: value1, measurement
Label2; value2; something else
Nr, value3 (comment)
...

I need some other approach that I could use instead of regular expressions. It can be extremely simple, because all I need is one value from every row.

From the example above I have to obtain the following structure:

"value1, value2, value3"

Is there a simpler alternative to regex? Has someone already implemented something like this?

I can also imagine that I am approaching the problem from the wrong angle, for example by forcing a non-technical user to write data extraction rules. In that case the question becomes something more generic, like "How can I build an application that lets a non-technical user extract data from separate texts?"

Edit: I have implemented the simplest matching I could for them:

File content:

"Strain at break Ax2";"Unknown"
"Strain at break Ax1";"Unknown"
"Strain at break";"Unknown"
"Yield point strain";"Unknown"
"Uniform elongation";25.4087;"%"
"Tensile strength";261.323;"MPa"
"End test phase Yield point";1;"%"
"Maximum tensile force";5.22647;"kN"

Pattern:

"Tensile strength";(?<value>[^;\n]*);
"Maximum tensile force";(?<value>[^;\n]*);
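For illustration, applying these two patterns to the file content above might look like the following Python sketch. It is only an assumption about how the matching could be wired up, not the application's actual code: the function name `extract_values` is hypothetical, and Python spells named groups `(?P<value>...)` where the patterns above use `(?<value>...)`.

```python
import re

# File content copied from the post above.
FILE_CONTENT = '''"Strain at break Ax2";"Unknown"
"Strain at break Ax1";"Unknown"
"Strain at break";"Unknown"
"Yield point strain";"Unknown"
"Uniform elongation";25.4087;"%"
"Tensile strength";261.323;"MPa"
"End test phase Yield point";1;"%"
"Maximum tensile force";5.22647;"kN"
'''

# The two patterns from the post, with the named group rewritten in
# Python's (?P<name>...) syntax.
PATTERNS = [
    r'"Tensile strength";(?P<value>[^;\n]*);',
    r'"Maximum tensile force";(?P<value>[^;\n]*);',
]


def extract_values(text, patterns):
    """Return the first 'value' capture of each pattern that matches."""
    values = []
    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            values.append(match.group("value"))
    return values


print(", ".join(extract_values(FILE_CONTENT, PATTERNS)))
```

Patterns that match nothing are skipped silently here; a real tool would probably want to report them to the user.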

Still too complex. The problem is that if I start replacing the ugly part with another string, to obtain for example:

"Tensile strength", [First value after]

I lose all the generic nature of the extraction, because every file looks different from this one.

Take a look at the FileHelpers library. It allows runtime generation of file layouts, and I think the class that would help in your example is the DelimitedClassBuilder.

In your case, I'd probably use FileHelpers to parse the record definitions into the DelimitedClassBuilder and then use the result to parse your records.
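FileHelpers is a .NET library, so the sketch below does not use it. It only illustrates the same delimiter-driven idea with Python's standard `csv` module, assuming the semicolon-separated, double-quoted layout from the question's sample. The helper name `read_records` is hypothetical.

```python
import csv
import io

# A few rows in the same layout as the question's sample file.
SAMPLE = '''"Uniform elongation";25.4087;"%"
"Tensile strength";261.323;"MPa"
"Maximum tensile force";5.22647;"kN"
'''


def read_records(text, delimiter=";"):
    """Map the first field of each row (the label) to the second (the value)."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, quotechar='"')
    return {row[0]: row[1] for row in reader if len(row) >= 2}


records = read_records(SAMPLE)
wanted = ["Tensile strength", "Maximum tensile force"]
print(", ".join(records[name] for name in wanted))
```

With this approach the user only has to supply the delimiter and the labels of interest, rather than a pattern.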

I have solved the issue by defining the rules as regular expressions. After the rules were defined, I defined a wrapper rule set that was easier for the users to read.

E.g., to extract a value from a line

Maximum amount of Sheet Drawing Force= 35.659695[kN]

I defined the regular expression

{0}=\s*(?<value>[^[\n\r]*)

then let the user define the name of the field. The {0} placeholder was then replaced with the name of the field and the regular expression applied.
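A minimal sketch of that substitution step in Python, under a few assumptions: the helper name `extract_field` is hypothetical, the named group is written in Python's `(?P<value>...)` form, the `[` inside the character class is escaped for clarity, and the field name is passed through `re.escape` so that users can type it literally.

```python
import re

# Rule template from the answer, adapted to Python's regex syntax.
RULE_TEMPLATE = r"{0}=\s*(?P<value>[^\[\n\r]*)"


def extract_field(text, field_name):
    """Fill the {0} placeholder with the field name and return the captured value."""
    # str.replace is used instead of str.format so that any braces in a
    # regex template are left untouched.
    pattern = RULE_TEMPLATE.replace("{0}", re.escape(field_name))
    match = re.search(pattern, text)
    return match.group("value").strip() if match else None


line = "Maximum amount of Sheet Drawing Force= 35.659695[kN]"
print(extract_field(line, "Maximum amount of Sheet Drawing Force"))
```

The user only ever sees and types the field name; the regular expression stays hidden behind the template.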
