C# comparing similar strings

I have a generic list containing some filenames (LIST1) and another, larger generic list containing the full set of names (LIST2). I need to match names from LIST1 to similar ones in LIST2. For example:

LIST1
- **MAIZE_SLIP_QUANTITY_3_9.1.aif**

LIST2
1- TUTORIAL_FAILURE_CLINCH_4.1.aif
2- **MAIZE_SLIP_QUANTITY_3_5.1.aif**
3- **MAIZE_SLIP_QUANTITY_3_9.2.aif**
4- TUTORIAL_FAILURE_CLINCH_5.1.aif
5- TUTORIAL_FAILURE_CLINCH_6.1.aif
6- TUTORIAL_FAILURE_CLINCH_7.1.aif
7- TUTORIAL_FAILURE_CLINCH_8.1.aif
8- TUTORIAL_FAILURE_CLINCH_9.1.aif
9- TUTORIAL_FAILURE_PUSH_4.1.aif

I've read about Levenshtein distance and used the implementation in a framework (SignumFramework Utilities). It returns distance = 1 for both lines 2 and 3, but in my case line 3 is a better match than line 2.

Is there a better method for comparing similar strings? Something more flexible?

When compared as strings, "9.2" is not a better match than "5.1" for "9.1": each differs from it by exactly one character. If you want the version numbers to be evaluated numerically, you have to parse the strings so that you can compare the string parts and the numerical parts separately.
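One way to compare the parts separately is sketched below. It is only an illustration under assumptions that are not in the question: the class and method names (`ChunkedCompare`, `Tokenize`, `NumericDistance`) are made up, every run of ASCII digits is treated as a number small enough for `long`, and two names count as comparable only when their text parts match exactly (ignoring case).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any framework mentioned in the question.
public static class ChunkedCompare
{
    // Splits "A_3_9.1.aif" into alternating runs of non-digits and digits:
    // "A_", "3", "_", "9", ".", "1", ".aif".
    public static List<string> Tokenize(string s) =>
        Regex.Matches(s, "[0-9]+|[^0-9]+").Cast<Match>().Select(m => m.Value).ToList();

    private static bool IsNumber(string token) => token[0] >= '0' && token[0] <= '9';

    // Returns null when the text parts differ (names are not comparable).
    // Otherwise returns the summed absolute difference of the numeric parts,
    // so a smaller value means a closer match.
    public static long? NumericDistance(string a, string b)
    {
        List<string> ta = Tokenize(a);
        List<string> tb = Tokenize(b);
        if (ta.Count != tb.Count) return null;

        long total = 0;
        for (int i = 0; i < ta.Count; i++)
        {
            bool aIsNumber = IsNumber(ta[i]);
            if (aIsNumber != IsNumber(tb[i])) return null;

            if (aIsNumber)
                total += Math.Abs(long.Parse(ta[i]) - long.Parse(tb[i]));
            else if (!string.Equals(ta[i], tb[i], StringComparison.OrdinalIgnoreCase))
                return null;
        }
        return total;
    }
}
```

With the names from the question, `NumericDistance("MAIZE_SLIP_QUANTITY_3_9.1.aif", "MAIZE_SLIP_QUANTITY_3_9.2.aif")` gives 1 and the same call against `MAIZE_SLIP_QUANTITY_3_5.1.aif` gives 4, so line 3 ranks ahead of line 2. Summing the differences weights every numeric field equally; a real implementation might weight earlier fields more heavily.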

There is a similar question here; maybe some of the answers there are relevant?

Your similarity criterion could be a combination of several others. One could be the Levenshtein distance; others might be, e.g., the longest common substring or the longest common prefix/suffix.
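A minimal sketch of such a combination: rank candidates by Levenshtein distance first and break ties by the length of the shared prefix. The class and method names are hypothetical, and the choice of prefix length as the tie-breaker is just one possible option.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper illustrating a two-level ranking.
public static class CombinedSimilarity
{
    // Classic dynamic-programming Levenshtein distance using two rows.
    public static int Levenshtein(string a, string b)
    {
        int[] prev = new int[b.Length + 1];
        int[] curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.Length];
    }

    // Number of leading characters the two strings share.
    public static int CommonPrefixLength(string a, string b)
    {
        int n = Math.Min(a.Length, b.Length);
        int i = 0;
        while (i < n && a[i] == b[i]) i++;
        return i;
    }

    // Lowest edit distance wins; among equal distances, the longest
    // shared prefix wins. Throws if candidates is empty.
    public static string BestMatch(string query, IEnumerable<string> candidates) =>
        candidates
            .OrderBy(c => Levenshtein(query, c))
            .ThenByDescending(c => CommonPrefixLength(query, c))
            .First();
}
```

For the question's data, lines 2 and 3 both have distance 1 from `MAIZE_SLIP_QUANTITY_3_9.1.aif`, but line 3 shares a 24-character prefix with it versus 22 for line 2, so `BestMatch` picks line 3.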

The longest common subsequence problem (the non-contiguous relative of the longest common substring) is actually a special case of edit distance, in which substitutions are forbidden and the only allowed edit operations are exact character match, insert, and delete (see here).
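The restricted edit distance described above (insert and delete only) can be computed from the length of the longest common subsequence: every character outside the common subsequence must be either deleted from one string or inserted from the other. A short sketch, with hypothetical class and method names:

```csharp
using System;

// Hypothetical helper: edit distance with insert/delete only.
public static class IndelDistance
{
    // Length of the longest common subsequence (characters in order,
    // not necessarily contiguous), via the standard DP table.
    public static int LongestCommonSubsequence(string a, string b)
    {
        int[,] table = new int[a.Length + 1, b.Length + 1];
        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                table[i, j] = a[i - 1] == b[j - 1]
                    ? table[i - 1, j - 1] + 1
                    : Math.Max(table[i - 1, j], table[i, j - 1]);
            }
        }
        return table[a.Length, b.Length];
    }

    // Characters not in the common subsequence are deleted from a
    // or inserted from b, one operation each.
    public static int Distance(string a, string b) =>
        a.Length + b.Length - 2 * LongestCommonSubsequence(a, b);
}
```

Under this metric a single changed character costs 2 (one delete plus one insert), so `Distance("9.1", "9.2")` is 2. Note that it still scores lines 2 and 3 of the question identically, so on its own it does not separate them.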

Further metrics for string similarity are described here.

A regular expression could be used to find the items that match the name. The version number could be captured in a regex group and parsed into a .NET object (e.g. decimal), which you could then use to determine which one is closest.
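A rough sketch of that idea follows. The pattern is an assumption about the naming scheme (`<name>_<major>.<minor>.aif`, inferred from the sample names only), and `VersionMatcher`, `TryParse`, and `Closest` are made-up names.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper; the file-name pattern is an assumption.
public static class VersionMatcher
{
    // "name" is everything before the last "_<digits>.<digits>.aif".
    private static readonly Regex Pattern = new Regex(
        @"^(?<name>.+)_(?<version>[0-9]+\.[0-9]+)\.aif$",
        RegexOptions.IgnoreCase);

    public static bool TryParse(string fileName, out string name, out decimal version)
    {
        name = null;
        version = 0m;
        Match m = Pattern.Match(fileName);
        if (!m.Success) return false;

        name = m.Groups["name"].Value;
        // InvariantCulture so "." is always the decimal separator.
        version = decimal.Parse(m.Groups["version"].Value, CultureInfo.InvariantCulture);
        return true;
    }

    // Returns the candidate with the same name part whose version is
    // numerically closest to the query's, or null if none qualifies.
    public static string Closest(string query, IEnumerable<string> candidates)
    {
        if (!TryParse(query, out string qName, out decimal qVersion)) return null;

        string best = null;
        decimal bestGap = decimal.MaxValue;
        foreach (string candidate in candidates)
        {
            if (!TryParse(candidate, out string cName, out decimal cVersion)) continue;
            if (!string.Equals(cName, qName, StringComparison.OrdinalIgnoreCase)) continue;

            decimal gap = Math.Abs(cVersion - qVersion);
            if (gap < bestGap) { bestGap = gap; best = candidate; }
        }
        return best;
    }
}
```

For `MAIZE_SLIP_QUANTITY_3_9.1.aif`, the candidate ending in `9.2` is 0.1 away and the one ending in `5.1` is 4.0 away, so `Closest` returns line 3. Treating the version as a decimal means "9.10" equals "9.1"; if the parts are really separate major and minor numbers, `System.Version` may be a better fit.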

There's a fairly exhaustive set of answers to this SO question. At the bottom is a link I put up to C# implementations of Soundex, Double Metaphone, PHP similarity, and Levenshtein.
