Tool to normalize source text and rebuild the original source from the normalized one
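Does anyone know of a Java tool or project that can normalize text (and store a normalization log) and then rebuild the original source text?
Any approach is appreciated.
Problem: to process the input data, we need to normalize it.
The processing engine receives the normalized text and returns the positions of the matches.
After this step, we need to recover the equivalent spans in the original source from the normalized positions.
Example: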
Source:
Lorem ipsum ad his scripta blandit partiendo, eum fastidii accumsan euripidis in, eum liber hendrerit an ... ütf Wórd èxämplé
Normalized text (approx):
lorem ipsum scripta blandit partiendo, fastidi accumsan euripidis, liber hendrerit utf word example
Engine output:
lorem ipsum scripta begin 0 end 19
euripidis begin 56 end 65
Original source equivalent:
Lorem ipsum ad his scripta begin 0 end 26
euripidis begin 68 end 77
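To make the "normalization log" idea concrete, here is a minimal Java sketch, not an existing tool. The class `NormalizationLog` and its methods are hypothetical names. It assumes whitespace-separated tokens, precomposed accented letters, and a plain stopword set; it does no stemming, so it keeps `fastidii` rather than `fastidi`, and it drops a stopword together with its trailing punctuation. For each kept token it records where the token sits in the normalized text and in the source, so a normalized span can be mapped back.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: normalized text plus a per-token log of source offsets. */
final class NormalizationLog {
    final String normalized;
    // One entry per kept token: {normStart, normEnd, srcStart, srcEnd}, ends exclusive.
    private final List<int[]> entries;

    private NormalizationLog(String normalized, List<int[]> entries) {
        this.normalized = normalized;
        this.entries = entries;
    }

    /** Lowercases and strips diacritics (NFD decomposition, then removal of combining marks). */
    static String fold(String token) {
        String noMarks = Normalizer.normalize(token, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
        return noMarks.toLowerCase(Locale.ROOT);
    }

    static NormalizationLog normalize(String source, Set<String> stopwords) {
        StringBuilder out = new StringBuilder();
        List<int[]> log = new ArrayList<>();
        Matcher m = Pattern.compile("\\S+").matcher(source);
        while (m.find()) {
            String folded = fold(m.group());
            // Compare against the stopword list without leading or trailing punctuation.
            String word = folded.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
            if (word.isEmpty() || stopwords.contains(word)) {
                continue; // dropped tokens leave no entry in the log
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            int normStart = out.length();
            out.append(folded);
            log.add(new int[] {normStart, out.length(), m.start(), m.end()});
        }
        return new NormalizationLog(out.toString(), log);
    }

    /** Maps [begin, end) in the normalized text to [begin, end) in the source, or null. */
    int[] toSource(int begin, int end) {
        int srcStart = -1;
        int srcEnd = -1;
        for (int[] e : entries) {
            boolean overlaps = e[1] > begin && e[0] < end;
            if (overlaps) {
                if (srcStart < 0) {
                    srcStart = e[2];
                }
                srcEnd = e[3];
            }
        }
        return srcStart < 0 ? null : new int[] {srcStart, srcEnd};
    }

    public static void main(String[] args) {
        // Accented letters are written as Unicode escapes to avoid file-encoding issues.
        String source = "Lorem ipsum ad his scripta blandit partiendo, eum fastidii accumsan "
                + "euripidis in, eum liber hendrerit an ... \u00fctf W\u00f3rd \u00e8x\u00e4mpl\u00e9";
        Set<String> stopwords = new HashSet<>(Arrays.asList("ad", "his", "eum", "in", "an"));
        NormalizationLog log = normalize(source, stopwords);
        System.out.println(log.normalized);
        System.out.println(Arrays.toString(log.toSource(0, 19)));
    }
}
```

Because the log works per token, the mapping still holds if a stemmer shortens a token, as long as the entry keeps the source offsets of the unstemmed word. For the source string as printed above, with zero-based offsets and exclusive ends, the span 0 to 19 maps back to 0 to 26, and `euripidis` maps back to 68 to 77.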
Thanks for the help.
The best way to solve this problem is to use a regex:
// Given
Source:
Lorem ipsum ad his scripta blandit partiendo, eum fastidii accumsan euripidis in, eum liber hendrerit an ... ütf Wórd èxämplé
Stopwords:
ad, his, eum, in, an
ASCII text:
Lorem ipsum ad his scripta blandit partiendo, eum fastidii accumsan euripidis in, eum liber hendrerit an ... utf Word example
Normalized text (approx):
lorem ipsum scripta blandit partiendo, fastidi accumsan euripidis, liber hendrerit utf word example
// Then
Engine output:
lorem ipsum scripta begin 0 end 19
euripidis begin 56 end 65
To get the original text back from the normalized matches, run these regexes against the ASCII text:
lorem( (ad|his|eum|in|an))* ipsum( (ad|his|eum|in|an))* scripta
euripidis
// Verify
Original source equivalent:
Lorem ipsum ad his scripta begin 0 end 26
euripidis begin 68 end 77
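The Given / Then / Verify steps above could be automated roughly as follows. This is only a sketch under assumptions: the class `OriginalSpanFinder` and its method names are made up for illustration, the accented letters are assumed to be precomposed (so stripping diacritics does not shift offsets), and matching is case-insensitive against the ASCII text. The pattern is built from the normalized match and the stopword list in the same shape as the regex shown above.

```java
import java.text.Normalizer;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Hypothetical helper: locates a normalized match in the source with a stopword-tolerant regex. */
final class OriginalSpanFinder {

    /** Strips diacritics. Assumes precomposed letters, so offsets stay aligned with the source. */
    static String foldToAscii(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
    }

    /** Builds the equivalent of: lorem( (ad|his|...))* ipsum( (ad|his|...))* scripta */
    static Pattern buildPattern(String normalizedMatch, List<String> stopwords) {
        String alternatives = stopwords.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        // Between two kept words: any number of " stopword" groups, then one space.
        String gap = "(?: (?:" + alternatives + "))* ";
        String body = Arrays.stream(normalizedMatch.trim().split("\\s+"))
                .map(Pattern::quote)
                .collect(Collectors.joining(gap));
        return Pattern.compile(body, Pattern.CASE_INSENSITIVE);
    }

    /** Returns {begin, end} of the first hit in the source (end exclusive), or null if none. */
    static int[] find(String source, String normalizedMatch, List<String> stopwords) {
        Matcher m = buildPattern(normalizedMatch, stopwords).matcher(foldToAscii(source));
        return m.find() ? new int[] {m.start(), m.end()} : null;
    }

    public static void main(String[] args) {
        // Accented letters are written as Unicode escapes to avoid file-encoding issues.
        String source = "Lorem ipsum ad his scripta blandit partiendo, eum fastidii accumsan "
                + "euripidis in, eum liber hendrerit an ... \u00fctf W\u00f3rd \u00e8x\u00e4mpl\u00e9";
        List<String> stopwords = Arrays.asList("ad", "his", "eum", "in", "an");
        System.out.println(Arrays.toString(find(source, "lorem ipsum scripta", stopwords)));
        System.out.println(Arrays.toString(find(source, "euripidis", stopwords)));
    }
}
```

For the source string as printed here, counting from zero with an exclusive end, the first call yields 0 and 26 and the second yields 68 and 77. The sketch returns only the first hit, and a match that contains a stemmed token (such as `fastidi` for `fastidii`) would need a looser pattern, which it does not attempt.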