Suppose I have a 32-character string like this:

GCAAAGCTTGGCACACGTCAAGAGTTGACTTT

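My goal is to count every occurrence of certain substrings, for example 'AA', 'ATT', 'CGG' and so on. For this purpose, characters 3 through 5 above ("AAA") contain 2 occurrences of "AA", so overlapping matches count. There are 8 such substrings in total: 6 of length 3 and 2 of length 2. I want a count for each of the 8.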

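What is the most efficient way to do this in Java? My ideas so far:

  1. Scan character by character, checking for and tallying every substring as I go. This seems intensive and inefficient.
  2. Find an existing function that does the job (I am not sure how efficient it would be; String.contains returns a boolean, not a count).
  3. Scan the string several times, checking for a different substring on each pass.

Option 3 is simple to implement, but option 1 may bring some extra hassle, and the code would not be very clean.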

===============>>#1 Votes: 0 Accepted

I think this should answer your question.

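The naive approach (checking for the substring at every possible index) runs in O(nk), where n is the length of the string and k is the length of the substring. It can be implemented with a for loop that tests, for example, haystack.substring(i).startsWith(needle) at each index i.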
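A minimal sketch of the naive loop described above. It uses String.startsWith(String, int) instead of substring(i).startsWith(needle), which performs the same comparison without allocating a new string at each index. The class name NaiveCount and the decision to return 0 for an empty needle are assumptions made for illustration.

```java
final class NaiveCount {
    // Counts occurrences of needle in haystack, overlapping ones included.
    static int count(String haystack, String needle) {
        if (needle.isEmpty()) {
            return 0; // assumption: an empty needle is treated as "no matches"
        }
        int count = 0;
        for (int i = 0; i + needle.length() <= haystack.length(); i++) {
            // startsWith(prefix, offset) compares in place at index i
            if (haystack.startsWith(needle, i)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String haystack = "GCAAAGCTTGGCACACGTCAAGAGTTGACTTT";
        for (String needle : new String[] { "AA", "ATT", "CGG" }) {
            System.out.println(needle + ": " + count(haystack, needle));
        }
    }
}
```

Calling count once per substring corresponds to option 3 in the question: one pass over the string for each of the 8 substrings.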

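More efficient algorithms exist, though. You may want to look at the Knuth-Morris-Pratt algorithm or the Aho-Corasick algorithm. Unlike the naive approach, both also perform well on inputs such as "find a substring of 100 'X's in a string of 10000 'X's".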
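As a rough illustration of the Knuth-Morris-Pratt idea for a single substring, the sketch below precomputes a failure table so that the scan never moves backwards in the haystack, giving O(n + k) time. The class and method names (KmpCount, failureTable, repeat) are hypothetical, and an empty needle is assumed to yield 0.

```java
import java.util.Arrays;

final class KmpCount {
    // fail[i] = length of the longest proper prefix of needle[0..i]
    // that is also a suffix of needle[0..i].
    static int[] failureTable(String needle) {
        int[] fail = new int[needle.length()];
        int k = 0;
        for (int i = 1; i < needle.length(); i++) {
            while (k > 0 && needle.charAt(i) != needle.charAt(k)) {
                k = fail[k - 1];
            }
            if (needle.charAt(i) == needle.charAt(k)) {
                k++;
            }
            fail[i] = k;
        }
        return fail;
    }

    // Counts occurrences of needle in haystack, overlapping ones included.
    static int count(String haystack, String needle) {
        if (needle.isEmpty()) {
            return 0; // assumption: an empty needle is treated as "no matches"
        }
        int[] fail = failureTable(needle);
        int count = 0;
        int k = 0; // number of needle characters matched so far
        for (int i = 0; i < haystack.length(); i++) {
            char c = haystack.charAt(i);
            while (k > 0 && c != needle.charAt(k)) {
                k = fail[k - 1];
            }
            if (c == needle.charAt(k)) {
                k++;
            }
            if (k == needle.length()) {
                count++;
                k = fail[k - 1]; // fall back so overlapping matches are still found
            }
        }
        return count;
    }

    // Helper for building the worst-case input mentioned in the answer.
    static String repeat(char c, int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    public static void main(String[] args) {
        // 100 'X's inside 10000 'X's: prints 9901
        System.out.println(count(repeat('X', 10000), repeat('X', 100)));
    }
}
```

For only 8 short substrings on a 32-character string the naive loop is likely fast enough; the failure table pays off mainly on long, repetitive inputs like the one in main.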

Taken from stackoverflow.com/questions/4121875/count-of-substrings-in-string

===============>>#2 Votes: 0

One approach is essentially to encode an NFA ( http://en.wikipedia.org/wiki/Nondeterministic_finite_automaton ) and then run your input through it.

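Here is my attempt at encoding the NFA. You may want to convert it to a DFA before running it, so that you do not have to manage a large number of branches. With branches it runs basically as slowly as O(nk), whereas after conversion to a DFA it runs in O(n).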

import java.util.*;

public class Test
{
    public static void main (String[] args)
    {
        new Test();
    }

    private static final String input = "TAAATGGAGGTAATAGAGGAGGTGTAT";
    private static final String[] substrings = new String[] { "AA", "AG", "GG", "GAG", "TA" };
    private static final int[] occurrences = new int[substrings.length];

    public Test()
    {
        ArrayList<Branch> branches = new ArrayList<Branch>();

        //  For each character, read it, create branches for each substring, and pass the current character
        //  to each active branch
        for (int i = 0; i < input.length(); i++)
        {
            char c = input.charAt(i);

            //  Make a new branch, one for each substring that we are searching for
            for (int j = 0; j < substrings.length; j++)
                branches.add(new Branch(substrings[j], j, branches));

            //  Pass the current input character to each branch that is still alive
            //  Iterate in reverse order because the nextCharacter method may
            //  cause the branch to be removed from the ArrayList
            for (int j = branches.size()-1; j >= 0; j--)
                branches.get(j).nextCharacter(c);
        }

        for (int i = 0; i < occurrences.length; i++)
            System.out.println(substrings[i]+": "+occurrences[i]);
    }

    private static class Branch
    {
        private String searchFor;
        private int position, index;
        private ArrayList<Branch> parent;

        public Branch(String searchFor, int searchForIndex, ArrayList<Branch> parent)
        {
            this.parent = parent;
            this.searchFor = searchFor;
            this.position = 0;
            this.index = searchForIndex;
        }

        public void nextCharacter(char c)
        {
            //  If the current character matches the ith character of the string we are searching for,
            //  Then this branch will stay alive
            if (c == searchFor.charAt(position))
                position++;
            //  Otherwise the substring didn't match, so this branch dies
            else
                suicide();

            //  Reached the end of the substring, so the substring was found.
            if (position == searchFor.length())
            {
                occurrences[index] += 1;
                suicide();
            }
        }

        private void suicide()
        {
            parent.remove(this);
        }
    }
}

The output of this example is AA: 3, AG: 4, GG: 4, GAG: 3, TA: 4 (printed one per line).
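The DFA variant mentioned above is not shown in this answer. One way to get a comparable single-pass scan is an Aho-Corasick automaton with its transition table fully filled in, sketched below. This is a swapped-in technique rather than the answerer's code: the class name AhoCorasickCount, the fixed "ACGT" alphabet, and the skipping of empty patterns are all assumptions. The scan takes one table lookup per input character, plus one increment per reported match.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

final class AhoCorasickCount {
    private static final String ALPHABET = "ACGT"; // assumption: DNA input only

    // Returns counts[i] = number of (overlapping) occurrences of patterns[i] in text.
    static int[] count(String text, String[] patterns) {
        int maxNodes = 1; // root plus at most one node per pattern character
        for (String p : patterns) maxNodes += p.length();
        int[][] next = new int[maxNodes][ALPHABET.length()];
        for (int[] row : next) Arrays.fill(row, -1);
        int[] fail = new int[maxNodes];
        List<List<Integer>> out = new ArrayList<>(); // pattern indices ending at each state
        out.add(new ArrayList<>());
        int nodes = 1;

        // 1. Build a trie of the patterns.
        for (int p = 0; p < patterns.length; p++) {
            if (patterns[p].isEmpty()) continue; // assumption: empty patterns are ignored
            int state = 0;
            for (char ch : patterns[p].toCharArray()) {
                int c = ALPHABET.indexOf(ch);
                if (c < 0) throw new IllegalArgumentException("unexpected character: " + ch);
                if (next[state][c] < 0) {
                    next[state][c] = nodes++;
                    out.add(new ArrayList<>());
                }
                state = next[state][c];
            }
            out.get(state).add(p);
        }

        // 2. Breadth-first pass: compute failure links and fill in missing
        //    transitions, so every (state, character) pair has a target.
        Queue<Integer> queue = new ArrayDeque<>();
        for (int c = 0; c < ALPHABET.length(); c++) {
            if (next[0][c] < 0) next[0][c] = 0;
            else queue.add(next[0][c]); // fail[] already defaults to the root (0)
        }
        while (!queue.isEmpty()) {
            int u = queue.remove();
            out.get(u).addAll(out.get(fail[u])); // inherit matches that end at a suffix
            for (int c = 0; c < ALPHABET.length(); c++) {
                int v = next[u][c];
                if (v < 0) {
                    next[u][c] = next[fail[u]][c];
                } else {
                    fail[v] = next[fail[u]][c];
                    queue.add(v);
                }
            }
        }

        // 3. Single pass over the text.
        int[] counts = new int[patterns.length];
        int state = 0;
        for (int i = 0; i < text.length(); i++) {
            int c = ALPHABET.indexOf(text.charAt(i));
            if (c < 0) { state = 0; continue; } // no pattern can span this character
            state = next[state][c];
            for (int p : out.get(state)) counts[p]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] substrings = { "AA", "AG", "GG", "GAG", "TA" };
        int[] counts = count("TAAATGGAGGTAATAGAGGAGGTGTAT", substrings);
        for (int i = 0; i < substrings.length; i++) {
            System.out.println(substrings[i] + ": " + counts[i]);
        }
    }
}
```

On the same input and substrings as the NFA example, this sketch should report the same counts.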

===============>>#3 Votes: 0

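Do you want to find every possible substring longer than 1 character? In that case, one approach is to use HashMaps. The code below reports only the substrings that occur more than once.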

This example outputs: {AA=3, TT=4, AC=3, CTT=2, CAA=2, GCA=2, CAC=2, AG=3, TTG=2, AAG=2, GT=2, CT=2, TG=2, GA=2, GC=3, CA=4}

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Test {
    public static void main(String[] args) {
        String str = "GCAAAGCTTGGCACACGTCAAGAGTTGACTTT";
        HashMap<String, Integer> map = countMatches(str);
        System.out.println(map);
    }

    private static HashMap<String, List<Integer>> findOneLetterMatches(String str) {
        ArrayList<Integer> list = new ArrayList<>();
        for(int i = 0; i < str.length(); i++) list.add(i);
        return extendMatches(str, list, 1);
    }

    private static HashMap<String, List<Integer>> extendMatches(String str, List<Integer> indices, int targetLength) {
        HashMap<String, List<Integer>> map = new HashMap<>();
        for(int index: indices) {
            if(index+targetLength <= str.length()) {
                String s = str.substring(index, index + targetLength);
                List<Integer> list = map.get(s);
                if(list == null) {
                    list = new ArrayList<>();
                    map.put(s, list);
                }
                list.add(index);
            }
        }
        return map;
    }

    private static void addIfListLongerThanOne(HashMap<String, List<Integer>> source,
                                               HashMap<String, List<Integer>> target) {
        for(Map.Entry<String, List<Integer>> e: source.entrySet()) {
            String s = e.getKey();
            List<Integer> l = e.getValue();
            if(l.size() > 1) target.put(s, l);
        }
    }

    private static HashMap<String, List<Integer>> extendAllMatches(String str, HashMap<String, List<Integer>> map, int targetLength) {
        HashMap<String, List<Integer>> result = new HashMap<>();
        for(List<Integer> list: map.values()) {
            HashMap<String, List<Integer>> m = extendMatches(str, list, targetLength);
            addIfListLongerThanOne(m, result);
        }
        return result;
    }

    private static HashMap<String, Integer> countMatches(String str) {
        HashMap<String, Integer> result = new HashMap<>();
        HashMap<String, List<Integer>> matches = findOneLetterMatches(str);
        for(int targetLength = 2; !matches.isEmpty(); targetLength++) {
            HashMap<String, List<Integer>> m = extendAllMatches(str, matches, targetLength);
            for(Map.Entry<String, List<Integer>> e: m.entrySet()) {
                String s = e.getKey();
                List<Integer> l = e.getValue();
                result.put(s, l.size());
            }
            matches = m;
        }
        return result;
    }
}
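If only a few fixed lengths matter, as in the question (lengths 2 and 3), a simpler variant of the HashMap idea is to count every window of one length in a single pass and then look up the substrings of interest. This is a sketch under that assumption; the class name KmerCount and the method countAll are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class KmerCount {
    // Counts every substring of exactly the given length, overlaps included.
    static Map<String, Integer> countAll(String str, int length) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + length <= str.length(); i++) {
            // merge() inserts 1 for a new key, otherwise adds 1 to the old value
            counts.merge(str.substring(i, i + length), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String str = "GCAAAGCTTGGCACACGTCAAGAGTTGACTTT";
        Map<String, Integer> pairs = countAll(str, 2);
        Map<String, Integer> triples = countAll(str, 3);
        // Absent keys mean zero occurrences.
        System.out.println("AA: " + pairs.getOrDefault("AA", 0));
        System.out.println("ATT: " + triples.getOrDefault("ATT", 0));
    }
}
```

This makes one pass per distinct length rather than one per substring, at the cost of also counting substrings nobody asked about.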

  Asked by Marshall Tigerus; translated from Stack Overflow
