Finding specific elements of a string read from a .txt file in Java

Question

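I'm a beginner in Java and would like to know how to read specific elements from a DNA string in a .txt file. For example, suppose the text file contains the following: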

TAGAAAAGGGAAAGATAGT

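I'd like to know the best way to iterate through the string and find a particular set of characters in order. One example would be counting how many times "TAG" appears in the string that was read in. Here is what I have so far: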

import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public class DNA {

public static void main(String args[]) {

    String fileName = args[0];
    Scanner s = null;

    try {
        s = new Scanner(new File(fileName));
    } catch (FileNotFoundException e) {
        e.printStackTrace();
        s.close();
    }

    String dna = "";

    while(s.hasNext()) {
        dna += s.next().trim();
    }
    s.close();

    String subsequence = "TAG";


    int count = 0;

    for (int i = 0; i < dna.length(); i++){
        if (dna.charAt(i) == subsequence.charAt(i)){

            count = count + 1;
            i++;
        }

    }
    while (dna.charAt() == subsequence.charAt()){
        count++;

    }


    System.out.println(subsequence + " appears " + count + " times");

}

}

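It's messy; after hours of searching, I tried to use logic I found in other examples. Please let me know how I can be more efficient and use better logic! I enjoy learning this stuff and am open to any corrections.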

Answer 1

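You can do this with substrings. Since TAG is 3 characters long, on each iteration of the loop you can take the substring from i to i + 3 and compare it to "TAG".

For the example AGAAAAGGGAAAGATAGT, the loop would iterate as follows: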

"AGA".equals("TAG")

"GAA".equals("TAG")

"AAA".equals("TAG")

"AAA".equals("TAG")

"AAG".equals("TAG")

"AGG".equals("TAG")

"GGG".equals("TAG")

and so on.

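If you are not familiar with substrings, the Java documentation for String.substring covers them. If this doesn't fully make sense, I can try to explain more and provide pseudocode.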
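The sliding-window idea described above can be sketched roughly as follows. This is only a minimal illustration, not a complete program; the class name WindowCount and the method countWindows are made up for the example, and the pattern is assumed to be non-empty.

```java
class WindowCount {
    // Hypothetical helper: counts how many windows of dna, each as long as
    // pattern, are equal to pattern. Overlapping matches are counted.
    // Assumes pattern is non-empty.
    static int countWindows(String dna, String pattern) {
        int count = 0;
        int len = pattern.length();
        // Stop at the last index where a full window still fits.
        for (int i = 0; i + len <= dna.length(); i++) {
            if (dna.substring(i, i + len).equals(pattern)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // The string from the question; TAG occurs at index 0 and index 15.
        System.out.println(countWindows("TAGAAAAGGGAAAGATAGT", "TAG")); // prints 2
    }
}
```

Writing the loop condition as i + len <= dna.length() means substring never runs past the end of the string, so no separate bounds check is needed inside the loop.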

Answer 2

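In your loop you are counting occurrences of individual characters rather than occurrences of the subsequence. What you can do instead is compare the subsequence against:

the substring of dna that is 3 characters long, starting at i

I say 3 characters because your subsequence is "TAG". You can generalize this by storing the subsequence length in a variable.

You also need to check that i + subsequence length stays within the bounds of the string. Otherwise you will get an IndexOutOfBoundsException.

Code: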

// the current index i plus sublen cannot exceed the dna length

// the portion of dna starting at i and spanning sublen characters has to equal subsequence

// static so that it can be called directly from main
static int countSubstring(String subsequence, String dna) {
    int count = 0;
    int sublen = subsequence.length();    // length of the subsequence
    for (int i = 0; i < dna.length(); i++){
        // <= (not <) so that a match ending exactly at the end of dna is counted
        if ((i + sublen) <= dna.length() &&
            dna.substring(i, i + sublen).equals(subsequence)){
            count = count + 1;
        }

    }
    return count;
}

Take a look at Rosetta Code for some example methods:

The "remove and count the difference" method:

public int countSubstring(String subStr, String str){
    return (str.length() - str.replace(subStr, "").length()) / subStr.length();
}

The "split and count" method:

public int countSubstring(String subStr, String str){
    // requires: import java.util.regex.Pattern;
    // the result of split() will contain one more element than the number of delimiters found
    // the "-1" second argument makes it not discard trailing empty strings
    return str.split(Pattern.quote(subStr), -1).length - 1;
}

Manual looping (similar to the code I showed at the top):

public int countSubstring(String subStr, String str){
    int count = 0;
    for (int loc = str.indexOf(subStr); loc != -1;
         loc = str.indexOf(subStr, loc + subStr.length()))
        count++;
    return count;
}
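One difference between these approaches is worth keeping in mind: the substring loop at the top of this answer advances one character at a time, so it counts overlapping matches, whereas the three methods above skip past each match they find. For a pattern such as TAG, which cannot overlap with itself, the results agree; for a pattern such as AA they do not. A small sketch of the difference (the class and method names are hypothetical, and the pattern is assumed to be non-empty):

```java
class OverlapDemo {
    // Counts every starting position where sub occurs (overlaps included).
    // Assumes sub is non-empty.
    static int countOverlapping(String sub, String str) {
        int count = 0;
        for (int loc = str.indexOf(sub); loc != -1; loc = str.indexOf(sub, loc + 1)) {
            count++;
        }
        return count;
    }

    // Counts matches left to right, skipping past each one (no overlaps).
    // Assumes sub is non-empty.
    static int countNonOverlapping(String sub, String str) {
        int count = 0;
        for (int loc = str.indexOf(sub); loc != -1;
             loc = str.indexOf(sub, loc + sub.length())) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countOverlapping("AA", "AAAA"));    // prints 3
        System.out.println(countNonOverlapping("AA", "AAAA")); // prints 2
        System.out.println(countOverlapping("TAG", "TAGAAAAGGGAAAGATAGT"));    // prints 2
        System.out.println(countNonOverlapping("TAG", "TAGAAAAGGGAAAGATAGT")); // prints 2
    }
}
```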

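For this particular program, as far as reading from the file goes, you should put all of the read operations inside the try block and then close the resource in a finally block. If you want to learn more, look up Java I/O and the finally block in the Java documentation. There are many ways to read information from a file; the one I show here requires the fewest changes to your code.

You can add any of the countSubstring methods to your code (declared static, so that main can call it), for example: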

public static void main(String args[]) {

    String fileName = args[0];
    Scanner s = null;
    String subsequence = "TAG";
    String dna = "";
    int count = 0;

    try {
        s = new Scanner(new File(fileName));
        while(s.hasNext()) {
            dna += s.next().trim();
        }
        count = countSubstring(subsequence, dna); // any of the above methods
        System.out.println(subsequence + " appears " + count + " times");
    } catch (FileNotFoundException e) {
        e.printStackTrace();
        // s.close(); Don't put s.close() here, use finally
    } finally {
        if(s != null) {
            s.close();
        }
    }
}
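On Java 7 and later, try-with-resources can close the Scanner automatically, which removes the need for the explicit finally block. The following is a sketch under that assumption; the class name DnaReader and the readDna helper are illustrative and not part of the original code.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

class DnaReader {
    // Hypothetical helper: reads all whitespace-separated tokens from the
    // file and joins them into one string.
    static String readDna(File file) throws FileNotFoundException {
        StringBuilder dna = new StringBuilder();
        // The Scanner is closed automatically when the try block exits,
        // whether normally or because of an exception.
        try (Scanner s = new Scanner(file)) {
            while (s.hasNext()) {
                dna.append(s.next()); // next() already strips surrounding whitespace
            }
        }
        return dna.toString();
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.err.println("usage: java DnaReader <file>");
            return;
        }
        try {
            System.out.println(readDna(new File(args[0])));
        } catch (FileNotFoundException e) {
            System.err.println("File not found: " + args[0]);
        }
    }
}
```

Using a StringBuilder instead of += avoids building a new String on every token, which starts to matter for long sequences.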

Answer 3

Then, once you have the dna string and the subsequence string:

int count = (dna.length() - dna.replace(subsequence, "").length())/subsequence.length();

Answer 4

To search a string for different character patterns, the Pattern and Matcher classes are a good solution.

Here is some code that may help you solve your problem:

int count = 0;
String line = "T A G A A A A G G G A A A G A T A G T A G";
Pattern pattern = Pattern.compile("T A G");
Matcher matcher = pattern.matcher(line);
while (matcher.find()) 
    count++;
System.out.println(count);

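The expression compiled by Pattern.compile(String s) is called a regex (regular expression). In this case it simply looks for occurrences of "T A G" (the spaced form of TAG used in this sample line) in the string. With the while loop, you count how many times it occurs.

If you want to do something more complex, look up more information about regular expressions.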
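Applied to the unspaced DNA string from the question, the same idea might look like the sketch below. Pattern.quote is used so that the search text is treated literally rather than as regex syntax; the class and method names are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexCount {
    // Counts non-overlapping occurrences of the literal text sub in dna.
    static int count(String sub, String dna) {
        Matcher matcher = Pattern.compile(Pattern.quote(sub)).matcher(dna);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    // Counts non-overlapping matches of an arbitrary regex in dna.
    static int countRegex(String regex, String dna) {
        Matcher matcher = Pattern.compile(regex).matcher(dna);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String dna = "TAGAAAAGGGAAAGATAGT";
        System.out.println(count("TAG", dna));          // prints 2
        // A regex can also express alternatives, e.g. TAG or TAA:
        System.out.println(countRegex("TA[GA]", dna));  // prints 2 (no TAA in this string)
    }
}
```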

Answer 5

Let's try counting several codons at once, rather than just counting instances of TAG.

public static final void main( String[] args )
{
    String input = "TACACTAGATCGCACTGCTAGTATC";
    if (args.length > 0) {
            input = args[0].trim();
    }
    System.out.println(input);

    HashMap<Character, Node> searchPatterns = createCodons();
    findCounts(input, searchPatterns);
    printCounts(searchPatterns);
}

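This solution uses a tree to store the character sequences we are interested in. Each path from root to leaf in the tree represents one possible sequence. We'll create four trees, for codons starting with T, A, C, and G. We store these trees in a HashMap so that they are easy to retrieve by their starting character.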

/**
   Create a set of sequences we are interested in finding (subset of
  possible codons). We could specify any pattern we want here.
*/
public static final HashMap<Character, Node> createCodons()
{
    HashMap<Character, Node> codons = new HashMap<Character,Node>();

    Node sequencesOfT = new Node('T');         //   T
    Node nodeA = sequencesOfT.addChild('A');  //   /
    nodeA.addChild('C');                     //   A
    nodeA.addChild('G');                    //   / \
    codons.put('T', sequencesOfT);         //   C   G

    Node sequencesOfA = new Node('A');         //   A
    Node nodeT = sequencesOfA.addChild('T');  //   /
    nodeT.addChild('C');                     //   T
    nodeT.addChild('G');                    //   / \
    codons.put('A', sequencesOfA);         //   C   G

    Node sequencesOfC = new Node('C');         //   C
    Node nodeG = sequencesOfC.addChild('G');  //   /
    nodeG.addChild('T');                     //   G
    nodeG.addChild('A');                    //   / \
    codons.put('C', sequencesOfC);         //   T   A

    Node sequencesOfG = new Node('G');         //   G
    Node nodeC = sequencesOfG.addChild('C');  //   /
    nodeC.addChild('T');                     //   C
    nodeC.addChild('A');                    //   / \
    codons.put('G', sequencesOfG);         //   T   A

    return codons;
}

Here is what our Node class looks like.

public class Node
{
    public char data;            // the name of the node; A,C,G,T
    public int count = 0;        // we'll keep a count of occurrences here
    public Node parent = null;
    public List<Node> children;

    public Node( char data )
    {
        this.data = data;
        children = new ArrayList<Node>();
    }

    public Node addChild( char data )
    {
        Node node = new Node(data);
        node.parent = this;
        return (children.add(node) ? node : null);
    }

    public Node getChild( int index )
    {
        return children.get(index);
    }

    public int hasChild( char data )
    {
        int index = -1;
        int numChildren = children.size();
        for (int i=0; i<numChildren; i++)
        {
            Node child = children.get(i);
            if (child.data == data)
            {
                index = i;
                break;
            }
        }
        return index;
    }
}

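To count occurrences, we iterate over every character of the input and, on each iteration, retrieve the tree we are interested in (A, G, C, or T). We then try to walk down the tree (from root to leaf) using the subsequent characters of the input, and we stop traversing when we cannot find the next input character among the current node's children. At that point we increment the count on that node, to indicate that we found a character sequence ending at that node.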

public static final void findCounts(String input, HashMap<Character,Node> sequences)
{
    int n = input.length();
    for (int i=0; i<n; i++)
    {
        char root = input.charAt(i);
        Node sequence = sequences.get(root);
        if (sequence == null) continue;   // no tree for this character (not A, C, G, or T)

        int j = -1;
        int c = 1;
        while (((i+c) < n) && 
               ((j = sequence.hasChild(input.charAt(i+c))) != -1))
        {  
            sequence = sequence.getChild(j);
            c++;
        }
        sequence.count++;
    }
}

To print the results, we walk each tree from root to leaf, appending nodes as we encounter them and printing the count when we reach a leaf.

public static final void printCounts( HashMap<Character,Node> sequences )
{
    for (Node sequence : sequences.values()) 
    {
        printCounts(sequence, "");
    }
}

public static final void printCounts( Node sequence, String output )
{
    output = output + sequence.data;
    if (sequence.children.isEmpty()) 
    {
        System.out.println(output + ": " + sequence.count);
        return;
    }
    for (int i=0; i<sequence.children.size(); i++) 
    {
        printCounts( sequence.children.get(i), output );
    }
}

Here is some sample output:

TAGAAAAGGGAAAGATAGT
TAC: 0
TAG: 2
GCT: 0
GCA: 0
ATC: 0
ATG: 0
CGT: 0
CGA: 0

TAGCGTATC
TAC: 0
TAG: 1
GCT: 0
GCA: 0
ATC: 1
ATG: 0
CGT: 1
CGA: 0

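From here we could easily extend the solution to keep a list of the positions where each sequence was found, or to record other information about the input. This implementation is a bit rough, but hopefully it gives you some insight into other ways of approaching the problem.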
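As a rough idea of what recording positions could look like, here is a sketch. To keep it short it swaps the tree for a plain map keyed by the codon string, so it is not the tree-based code above; the class name CodonPositions and the method findPositions are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CodonPositions {
    // Returns, for each codon, the start indices at which it occurs in input.
    // Assumes the codons in the list are distinct and non-empty.
    static Map<String, List<Integer>> findPositions(String input, List<String> codons) {
        // LinkedHashMap keeps the codons in the order they were given.
        Map<String, List<Integer>> positions = new LinkedHashMap<>();
        for (String codon : codons) {
            positions.put(codon, new ArrayList<>());
        }
        for (int i = 0; i < input.length(); i++) {
            for (String codon : codons) {
                // startsWith(prefix, offset) tests for the codon at index i
                // without creating a substring.
                if (input.startsWith(codon, i)) {
                    positions.get(codon).add(i);
                }
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        List<String> codons = Arrays.asList("TAC", "TAG", "ATC", "CGT");
        // prints {TAC=[], TAG=[0], ATC=[6], CGT=[3]}
        System.out.println(findPositions("TAGCGTATC", codons));
    }
}
```

This checks every codon at every index, which is fine for a handful of patterns; the tree above avoids that repeated work when there are many patterns sharing prefixes.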

5 answers

Answer 1
0 2014-09-13 18:52:26

Answer 2
0 Accepted 2014-09-13 18:59:25

Answer 3
0 2014-09-13 19:04:01

Answer 4
0 2014-09-13 23:18:07

Answer 5
0 2014-09-14 00:40:06
