
Need string matching algorithm

Data:

x.txt: a plain text file (around 1 MB). y.txt: a dictionary file (around 1 lakh, i.e. 100,000, words).

I need to find whether any of the words in y.txt are present in x.txt.

I need an algorithm that takes little time to execute, and a recommendation for the language best suited to it.

PS: Please suggest any algorithm other than the brute-force method.

I need pattern (substring) matching rather than exact string matching.

For instance:

x.txt: "The Old Buzzards were disestablished on 27 April"

y.txt: "establish"

The output should be: Found establish in x.txt : Line 1

Thank you.

It is not clear to me whether you need this to get a job done or whether it is homework. If you need it to get a job done, then:

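#!/usr/bin/env bash
# Join the patterns in y.txt (one per line) into a single alternation: a|b|c
Y=$(tr '\n' '|' < y.txt)
# Strip the trailing '|' and show the resulting regex
echo "${Y%|}"
# grep prints every matching line of x.txt and exits with 0 if there is at least one
if grep -E "${Y%|}" x.txt
then
    echo "found"
else
    echo "no luck"
fi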

This is hard to beat: you slurp in all the patterns from a file, construct a regular expression (the echo shows the regex), and hand it to grep, which constructs a finite state automaton for you. That is going to fly, as it compares every character in the text at most once. If it is homework, then I suggest you consult Cormen et al., "Introduction to Algorithms", or the first few chapters of the Dragon Book, which also explain what I just said.

Forgot to add: y.txt should contain your patterns one per line, but as a nice side effect your patterns do not have to be single words.
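For readers who would rather stay in one language, the same idea (one big alternation built from the pattern file) can be sketched in Python. This is only an illustration under a few assumptions: the helper names `build_pattern` and `find_matches` are made up for this example, the patterns are treated as literal strings and therefore escaped, and matching is case-sensitive. Note that Python's `re` module is a backtracking engine, so it does not give the same "each character at most once" behaviour described above for grep.

```python
import re

def build_pattern(words):
    """Join dictionary entries into one alternation, escaping regex metacharacters."""
    # Longest entries first, so that at a given position 'daydream' wins over 'day'.
    ordered = sorted(set(w for w in words if w), key=len, reverse=True)
    return re.compile("|".join(re.escape(w) for w in ordered))

def find_matches(pattern, lines):
    """Yield (line_number, matched_text) for each non-overlapping match."""
    for line_number, line in enumerate(lines, start=1):
        for match in pattern.finditer(line):
            yield line_number, match.group(0)

# Sample data taken from the question; in practice the lists would be read
# from y.txt and x.txt.
words = ["establish", "apple"]
text = ["The Old Buzzards were disestablished on 27 April"]

pattern = build_pattern(words)
for line_number, word in find_matches(pattern, text):
    print(f"Found {word} in x.txt : Line {line_number}")
```

Unlike the shell version, this reports which pattern matched and on which line, which is what the question asks for.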

Suppose you have a Set implementation in your standard library; here is a Python-like sketch:

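dictionary = set()

def populate_dict(dict_path="y.txt"):
    # Load the dictionary, one word per line, into the set.
    with open(dict_path) as dict_file:
        for line in dict_file:
            word = line.strip()
            if word:
                dictionary.add(word)

def validate_text(text_path="x.txt"):
    with open(text_path) as text_file:
        for line in text_file:              # O(|text_file|) overall
            for word in line.split():
                if word in dictionary:      # O(log |dictionary|) for a tree-based set
                    print(word)             # report the word

populate_dict()   # call again every now and then if y.txt changes
validate_text()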

That would give you O(t * log d) instead of the brute-force O(t * d), where t and d are the numbers of words in the input text file and in the dictionary, respectively. With a tree-based set I don't think anything faster is possible, since you can't read the file faster than O(t) and can't search such a set faster than O(log d). A hash-based set would bring each lookup down to O(1) on average.
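To make the O(log d) lookup concrete, here is a minimal sketch that uses binary search over a sorted word list in place of a tree-based set. The function names `contains` and `words_found` are hypothetical, and the sample data is invented for the example. Like the set-based approach, it only finds whole words, so it would not find "establish" inside "disestablished".

```python
from bisect import bisect_left

def contains(sorted_words, word):
    """Binary search: O(log d) comparisons for d dictionary words."""
    i = bisect_left(sorted_words, word)
    return i < len(sorted_words) and sorted_words[i] == word

def words_found(text_words, dictionary_words):
    """Return the text words that appear in the dictionary, in text order."""
    sorted_words = sorted(set(dictionary_words))   # O(d log d), done once
    return [w for w in text_words if contains(sorted_words, w)]  # O(t log d)

print(words_found("my dog ate an apple today".split(), ["apple", "day", "dog"]))
# prints ['dog', 'apple']
```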

This is a search algorithm I have had in mind for a while. Basically, the algorithm has two steps.

In the first step, all the words from y.txt are inserted into a tree (this structure is commonly called a trie). Every path in the tree from the root to a leaf is a word. The leaf is empty.

For example, the tree for the words dog and day is the following:

<root>--<d>-<a>-<y>-<>
          \-<o>-<g>-<>

The second step of the algorithm is a search down the tree. When you reach an empty leaf, you have found a word.

The implementation in Groovy; if more comments are needed, just ask:

//create a tree to store the words in a compact and fast to search way
//each path of the tree from root to an empty leaf is a word
def tree = [:]
new File('y.txt').eachLine{ word->
    def t=tree
    word.each{ c ->
        if(t[c]==null){
            t[c]=[:]
        }
        t=t[c]
    }
    t[0]=0//word terminator (the leaf)
}
println tree//for debug purpose
//search for the words in x.txt
new File('x.txt').eachLine{ str, line->
    for(int i=0; i<str.length(); i++){
        def t=tree
        def res=''
        def longest=null//longest word found so far that starts at col i
        for(int j=i; j<str.length(); j++){
            t=t[str[j]]
            if(t==null) break//no word in the tree continues with this character
            res+=str[j]
            if(t.containsKey(0)) longest=res//reached a word terminator
        }
        if(longest!=null) println "Found $longest at line $line, col $i"
    }
}

This is my y.txt:

dog
day
apple
daydream

and x.txt:

This is a beautiful day and I'm walking with my dog while eating an apple.
Today it's sunny.
It's a daydream

The output is the following:

$ groovy search.groovy
[d:[o:[g:[0:0]], a:[y:[0:0, d:[r:[e:[a:[m:[0:0]]]]]]]], a:[p:[p:[l:[e:[0:0]]]]]]
Found day at line 1, col 20
Found dog at line 1, col 48
Found apple at line 1, col 68
Found day at line 2, col 2
Found daydream at line 3, col 7

This algorithm should be fast because the depth of the tree doesn't depend on the number of words in y.txt. The depth is equal to the length of the longest word in y.txt, so the search from each position in x.txt takes at most that many steps.
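The two steps described in this answer can also be sketched in Python with nested dictionaries. This is a rough sketch rather than a port of the Groovy script: the names `END`, `build_trie`, `longest_match` and `search` are invented for the example, and, as an assumption, it reports only the longest dictionary word starting at each column (1-based line numbers, 0-based columns).

```python
END = object()  # hypothetical terminator key marking the end of a word

def build_trie(words):
    """Step 1: insert every word into a tree of nested dicts."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def longest_match(trie, line, start):
    """Return the longest dictionary word beginning at line[start], or None."""
    node, longest = trie, None
    for end in range(start, len(line)):
        node = node.get(line[end])
        if node is None:          # no word continues with this character
            break
        if END in node:           # a word ends here; keep going for longer ones
            longest = line[start:end + 1]
    return longest

def search(trie, lines):
    """Step 2: walk down the tree from every column of every line."""
    for line_number, line in enumerate(lines, start=1):
        for col in range(len(line)):
            word = longest_match(trie, line, col)
            if word is not None:
                yield word, line_number, col

trie = build_trie(["dog", "day", "apple", "daydream"])
for word, line_number, col in search(trie, ["Today it's sunny.", "It's a daydream"]):
    print(f"Found {word} at line {line_number}, col {col}")
# prints:
# Found day at line 1, col 2
# Found daydream at line 2, col 7
```

The inner loop in `longest_match` stops after at most as many characters as the longest dictionary word, so the total work is roughly the text length times that word length, regardless of how many words the dictionary holds.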
