
How can I find the string in one text file that matches a string from another text file?

I have two text files. Both of them have the same content but the formatting of each is different. In one file there are extra spaces between words or letters. There are different line breaks as well. For example:

File1:

The annotation framework we presented is 
embedded in the Knowledge Management and 
Acquisition Platform Semantic Turkey (Pazienza, et 
al., 2012), and comes out-the-box with a few 
annotation families which differ in the underlying 
annotation model and, notably, in the tasks they 
support. The default handlers take into consideration 
the annotation of atomic ontological resources, and 
complex activities that are provided as macros, e.g. 
the creation of new instances, the definition of new 
subclasses in OWL, or of narrower concepts in 
SKOS. 

File2:

Theannotationframework we presented is 
embedded in th e K n o w l e d ge Management and 
Acquisition Platform Semantic Turkey (Pazienza, et 
al., 2012), and comes out-the-
box with a few 
annotation families which differ in the underlying 
annotation model and, notably, in the tasks they 
support. The default handlers take into consideration 
the a n n o t a t i o n  o f a t o m i c ontological resources, and 
complex activities that are provided as macros, e.g. 
the creation of new instances, the definition of new 
subclasses in OWL, or of narrower concepts in 
SKOS.

Suppose I select the string "the Knowledge Management" from File1 and want to match it with the string "th e K nowled ge Management" in File2.

How can I achieve this? The distortions in the second file follow no fixed pattern. The only guarantee is that the characters appear in the same order in both files; they may be separated by extra spaces, or the space between them may be missing.

I thought of applying Sellers' algorithm or the Viterbi algorithm, but I am not sure about either. Approximate string matching could also be expensive.

Any lead would be helpful. Thanks a lot!

You should realize that you don't have two texts but virtually a single one, with every character at the same position!

By what magic? Well, it suffices to strip off all whitespace and separators or, better, to skip them as you move forward from one character to the next.

You can easily traverse both texts in parallel, staying synchronized, and no search is necessary!

For example, both "the Knowledge Management" and "th e K nowled ge Management" run from position 45 to 67 (counting non-whitespace characters from 0, with 67 being the position just past the last character).
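The synchronized walk can be sketched in Python as follows. This is a minimal illustration rather than a definitive implementation: `significant` is a hypothetical helper name, whitespace is taken to be whatever `str.isspace` reports, and the two string literals are just the first two lines of File1 and File2 from the question (trailing spaces dropped).

```python
from typing import Iterator, Tuple

def significant(text: str) -> Iterator[Tuple[int, str]]:
    """Yield (original_offset, char) for each non-whitespace character."""
    for offset, ch in enumerate(text):
        if not ch.isspace():
            yield offset, ch

# First two lines of each file, copied from the question.
file1 = "The annotation framework we presented is\nembedded in the Knowledge Management and\n"
file2 = "Theannotationframework we presented is\nembedded in th e K n o w l e d ge Management and\n"

# zip advances both iterators one significant character at a time,
# so the k-th pair holds the k-th non-whitespace character of each text.
pairs = list(zip(significant(file1), significant(file2)))
in_sync = all(c1 == c2 for (_, c1), (_, c2) in pairs)

# Whitespace-free position 45 is where "the Knowledge Management" starts.
(off1, ch1), (off2, ch2) = pairs[45]
print(in_sync, ch1, ch2, off1, off2)
```

Here `off1` and `off2` are the ordinary offsets of that character in each text. They differ between the files, while the whitespace-free position is the same.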


If you don't know the starting position of the search string in the first text, then perform an ordinary search in the first text (with or without spaces, that's up to you), and traverse the second text to the same position.

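Whitespace-less positions for the first line of File1 (tens digit above, units digit below; a space repeats the preceding digit):
The annotation framework we presented is
0          1          2           3
0122345678901223456789011233456789012234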

If you need to locate many strings in a text, traversing from the beginning every time becomes costly. In that case you can build an index table that maps each whitespace-less position to its ordinary position, and use binary search when necessary.
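One way to read this suggestion is sketched below in Python. The names `build_index` and `transfer_span` are hypothetical, and the table here is dense (one entry per non-whitespace character), which is an assumption: it turns the whitespace-less to ordinary lookup into a plain list access, so binary search (`bisect_left`) is only needed for the reverse direction. The sketch also assumes both texts contain the same non-whitespace characters.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple

def build_index(text: str) -> List[int]:
    """index[k] is the ordinary offset of the k-th non-whitespace character."""
    return [offset for offset, ch in enumerate(text) if not ch.isspace()]

def transfer_span(begin: int, end: int,
                  src_index: List[int],
                  dst_index: List[int]) -> Optional[Tuple[int, int]]:
    """Map the half-open span [begin, end) of the source text to the destination text."""
    first = bisect_left(src_index, begin)  # whitespace-less position of the first selected char
    last = bisect_left(src_index, end)     # whitespace-less position one past the last selected char
    if first == last:
        return None  # the selection holds only whitespace
    return dst_index[first], dst_index[last - 1] + 1

file1 = "embedded in the Knowledge Management and"
file2 = "embedded in th e K n o w l e d ge Management and"
index1, index2 = build_index(file1), build_index(file2)  # built once, reused per query

selection = "the Knowledge Management"
begin = file1.find(selection)
span = transfer_span(begin, begin + len(selection), index1, index2)
matched = file2[span[0]:span[1]]
print(repr(matched))
```

Each query then costs two binary searches and two list lookups instead of a scan from the start of the text.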

You could read both files into strings and remove all the whitespace from each. It then becomes a straightforward string-matching task.

If you also need the start index of the matching pattern in the original text, get the index of the starting point in the collapsed string and run a for loop over the spaced-out version, counting only non-whitespace characters, until you reach that index.
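A minimal Python sketch of this approach follows. The helper names `collapse` and `find_spaced` are made up for illustration, whitespace is assumed to be anything `str.isspace` accepts, and only the first match is reported.

```python
from typing import Optional, Tuple

def collapse(text: str) -> str:
    """Return text with every whitespace character removed."""
    return "".join(ch for ch in text if not ch.isspace())

def find_spaced(haystack: str, needle: str) -> Optional[Tuple[int, int]]:
    """Find needle in haystack ignoring whitespace.

    Returns the half-open (begin, end) offsets in the original haystack,
    or None when there is no match.
    """
    target = collapse(needle)
    if not target:
        return None
    start = collapse(haystack).find(target)  # index in the collapsed string
    if start < 0:
        return None
    end = start + len(target)

    seen = 0      # non-whitespace characters consumed so far
    begin = None  # ordinary offset of the first matched character
    for offset, ch in enumerate(haystack):
        if ch.isspace():
            continue
        if seen == start:
            begin = offset
        seen += 1
        if seen == end:
            return begin, offset + 1
    return None

file2 = "embedded in th e K n o w l e d ge Management and\nAcquisition Platform"
b, e = find_spaced(file2, "the Knowledge Management")
print(repr(file2[b:e]))
```

Because line breaks count as whitespace, a selection that spans a line break in one file is matched the same way.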
