
How to find a matching string to a string from one text file in another text file?

I have two text files. Both have the same content, but the formatting differs. One file has extra spaces between words or letters, and the line breaks differ as well. For example:

File1:

The annotation framework we presented is 
embedded in the Knowledge Management and 
Acquisition Platform Semantic Turkey (Pazienza, et 
al., 2012), and comes out-the-box with a few 
annotation families which differ in the underlying 
annotation model and, notably, in the tasks they 
support. The default handlers take into consideration 
the annotation of atomic ontological resources, and 
complex activities that are provided as macros, e.g. 
the creation of new instances, the definition of new 
subclasses in OWL, or of narrower concepts in 
SKOS. 

File2:

Theannotationframework we presented is 
embedded in th e K n o w l e d ge Management and 
Acquisition Platform Semantic Turkey (Pazienza, et 
al., 2012), and comes out-the-
box with a few 
annotation families which differ in the underlying 
annotation model and, notably, in the tasks they 
support. The default handlers take into consideration 
the a n n o t a t i o n  o f a t o m i c ontological resources, and 
complex activities that are provided as macros, e.g. 
the creation of new instances, the definition of new 
subclasses in OWL, or of narrower concepts in 
SKOS.

Suppose I select the string "the Knowledge Management" from File1 and want to match it with the string "th e K n o w l e d ge Management" in File2.

How can I achieve this? The distortions in the second file follow no fixed pattern. The only certainty is that the characters appear in the same order in both files; they may be separated by extra spaces, or the space between them may be missing.

I thought of applying Sellers' algorithm or the Viterbi algorithm, but I am not sure about either. Approximate string matching could be expensive as well.

Any lead would be helpful. Thanks a lot!

You should realize that you don't have two texts but virtually a single one, with all characters at the same position!

By what magic? It suffices to strip off all whitespace and separators or, better, to skip them as you move forward from one character to the next.

You can easily traverse both texts in parallel, staying synchronized, and no search is necessary!
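The parallel traversal can be sketched in Python as below. This is only an illustration, not code from the answer: the helper names `non_space_chars` and `first_mismatch` are made up here, and the sketch assumes that "whitespace and separators" means exactly the characters for which `str.isspace()` is true.

```python
from itertools import zip_longest


def non_space_chars(text):
    """Yield (raw_index, char) for every non-whitespace character of text."""
    for i, ch in enumerate(text):
        if not ch.isspace():
            yield i, ch


def first_mismatch(text1, text2):
    """Walk both texts in lockstep, skipping whitespace.

    Returns None when the texts agree character for character, otherwise
    the whitespace-free position at which they first differ (or at which
    one of them ends early).
    """
    pairs = zip_longest(non_space_chars(text1), non_space_chars(text2))
    for pos, (a, b) in enumerate(pairs):
        if a is None or b is None or a[1] != b[1]:
            return pos
    return None


if __name__ == "__main__":
    print(first_mismatch("the Knowledge Management",
                         "th e K n o w l e d ge Management"))
```

A result of `None` confirms that the two texts are "virtually a single one" in the sense described above; any other value points at the first place where the assumption breaks.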

For example, both "the Knowledge Management" and "th e K n o w l e d ge Management" run from position 45 to 67 (counting only non-whitespace characters, starting from zero, end exclusive).


If you don't know the starting position of the search string in the first text, then perform an ordinary search in the first text (with or without spaces, that's up to you), and traverse the second text to the same position.

The annotation framework we presented is
0          1          2           3
0122345678901223456789011233456789012234
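A minimal sketch of "search in the first text, then traverse the second text to the same position" follows. The function names (`to_compact_offset`, `raw_span`, `locate`) are hypothetical, and the sketch assumes the search string appears in the first text exactly as typed, so that an ordinary `str.find` is enough there.

```python
def to_compact_offset(text, raw_index):
    """Count the non-whitespace characters that precede raw_index."""
    return sum(1 for ch in text[:raw_index] if not ch.isspace())


def raw_span(text, start, end):
    """Map the non-empty whitespace-free range [start, end) to a raw slice."""
    count = 0
    raw_start = None
    for i, ch in enumerate(text):
        if ch.isspace():
            continue  # whitespace does not advance the position
        if count == start:
            raw_start = i
        count += 1
        if count == end:
            return raw_start, i + 1
    raise ValueError("range extends past the end of the text")


def locate(needle, text1, text2):
    """Find needle in text1 and return the corresponding slice of text2."""
    raw = text1.find(needle)  # ordinary search in the first text
    if raw < 0:
        return None
    start = to_compact_offset(text1, raw)
    length = sum(1 for ch in needle if not ch.isspace())
    s, e = raw_span(text2, start, start + length)
    return text2[s:e]


if __name__ == "__main__":
    file1 = "The annotation framework we presented is \nembedded in the Knowledge Management and \n"
    file2 = "Theannotationframework we presented is \nembedded in th e K n o w l e d ge Management and \n"
    print(repr(locate("the Knowledge Management", file1, file2)))
```

Each call scans both texts from the beginning, which is fine for a handful of lookups but not for many.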

If you need to locate many strings in a text, traversing from the beginning every time becomes costly. In that case you can build an index table that maps each whitespace-free position to its ordinary position, and perform binary searches when necessary.
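One possible shape for such an index table is sketched below, as an assumption about what the answer has in mind rather than its actual code. The table is a sorted list holding the raw position of every non-whitespace character, so mapping a whitespace-free position to a raw one is a direct lookup, and the reverse direction is a binary search with `bisect_left`. The names `build_index` and `map_span` are invented for this example.

```python
from bisect import bisect_left


def build_index(text):
    """index[k] is the raw position of the k-th non-whitespace character."""
    return [i for i, ch in enumerate(text) if not ch.isspace()]


def map_span(index1, index2, raw_start, raw_end):
    """Map the raw span [raw_start, raw_end) of text 1 to a raw span of text 2.

    bisect_left counts the table entries below a raw position, which is
    the whitespace-free position of that point in text 1.
    """
    start = bisect_left(index1, raw_start)
    end = bisect_left(index1, raw_end)
    if start == end:
        return None  # the span holds no non-whitespace characters
    return index2[start], index2[end - 1] + 1


if __name__ == "__main__":
    t1 = "embedded in the Knowledge Management and"
    t2 = "embedded in th e K n o w l e d ge Management and"
    i1, i2 = build_index(t1), build_index(t2)  # built once, reused per query
    needle = "the Knowledge Management"
    raw = t1.find(needle)
    s, e = map_span(i1, i2, raw, raw + len(needle))
    print(repr(t2[s:e]))
```

Building the tables costs one pass over each text; after that every query needs only two binary searches and two lookups.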

You could read the files in as strings and remove all the whitespace from both. It should then be a straightforward string-matching exercise.

If you also need the start index of the matching pattern, get the index of the starting point in the collapsed string and run a for loop over the spaced-out version, counting only non-whitespace characters.
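As a rough sketch of this approach (the helper names `collapse` and `find_in_spaced` are made up for illustration), the search runs on the collapsed strings and a for loop then converts the collapsed index back into an index in the spaced-out text:

```python
def collapse(text):
    """Remove every whitespace character from text."""
    return "".join(ch for ch in text if not ch.isspace())


def find_in_spaced(needle, spaced_text):
    """Return the raw start index of needle in spaced_text, or -1.

    Whitespace is ignored on both sides of the comparison.
    """
    target = collapse(needle)
    if not target:
        return -1
    pos = collapse(spaced_text).find(target)  # index in the collapsed string
    if pos < 0:
        return -1
    seen = 0
    for i, ch in enumerate(spaced_text):
        if ch.isspace():
            continue  # count only non-whitespace characters
        if seen == pos:
            return i
        seen += 1
    return -1


if __name__ == "__main__":
    text = "embedded in th e K n o w l e d ge Management and"
    print(find_in_spaced("the Knowledge Management", text))
```

Because word boundaries disappear in the collapsed strings, a match may straddle two words; whether that matters depends on the text.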
