Finding specific elements of a string read in from a .txt file in Java

Question

I am a beginner in Java and am wondering how to read in specific elements from a string of DNA in a .txt file. For example, let's say that the text file contains the following:

TAGAAAAGGGAAAGATAGT

I would like to know how to best iterate through the string and find particular sets of characters in order. An example would be to find how many times "TAG" appears in the read-in string. Here's what I have so far:

import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public class DNA {

public static void main(String args[]) {

    String fileName = args[0];
    Scanner s = null;

    try {
        s = new Scanner(new File(fileName));
    } catch (FileNotFoundException e) {
        e.printStackTrace();
        s.close();
    }

    String dna = "";

    while(s.hasNext()) {
        dna += s.next().trim();
    }
    s.close();

    String subsequence = "TAG";


    int count = 0;

    for (int i = 0; i < dna.length(); i++){
        if (dna.charAt(i) == subsequence.charAt(i)){

            count = count + 1;
            i++;
        }

    }
    while (dna.charAt() == subsequence.charAt()){
        count++;

    }


    System.out.println(subsequence + " appears " + count + " times");

}

}

It's messy, and I'm attempting to use logic that I've found in other examples after many hours of searching. Please let me know how I can be more effective and use better logic! I love learning this stuff and am open to any corrections.

Answer 1

You could do this by using substring. Since TAG is 3 characters, you could take a substring from i -> i+3 each iteration of your loop and compare to "TAG".

In an example of AGAAAAGGGAAAGATAGT, the loop would iterate as follows:

"AGA".equals("TAG")

"GAA".equals("TAG")

"AAA".equals("TAG")

"AAG".equals("TAG")

"AGG".equals("TAG")

"GGG".equals("TAG")

etc.

See the String documentation for substring if you're unfamiliar with it. If this doesn't totally make sense, I can try to explain more and provide pseudocode.
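The loop described above can be sketched as follows. This is a minimal illustration rather than the answerer's own code; the class name `SlidingWindowCount` and the method name `countOccurrences` are made up for the example. The loop bound `i + len <= dna.length()` keeps `substring` from running past the end of the string.

```java
final class SlidingWindowCount {

    // Counts every position where `target` starts in `dna`, overlaps included.
    static int countOccurrences(String dna, String target) {
        int len = target.length();
        if (len == 0) {
            return 0; // assumption: an empty target counts as zero matches
        }
        int count = 0;
        // Keep going only while a full window of `len` characters still fits.
        for (int i = 0; i + len <= dna.length(); i++) {
            if (dna.substring(i, i + len).equals(target)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Uses the sample string from the question.
        System.out.println(countOccurrences("TAGAAAAGGGAAAGATAGT", "TAG")); // prints 2
    }
}
```

For the question's sample string the matches start at index 0 and index 15.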

Answer 2

In your loop, you are counting the occurrences of each character instead of the occurrence of your subsequence. What you can do is compare your subsequence versus:

Substring of dna of length 3 characters starting from i

I say 3 characters because your subsequence is "TAG". You can generalize that by storing the subsequence length in a variable.

You also need to check that i + subsequence length is within the bounds of your string. Otherwise you will get an IndexOutOfBoundsException.

Code:

//current index i + sublen cannot exceed dna length

//portion of dna starting from i and going sublen characters has to equal subsequence

int countSubstring(String subsequence, String dna) {
    int count = 0;
    int sublen = subsequence.length();    // length of the subsequence
    // i + sublen may equal dna.length(): that is a match ending at the last character
    for (int i = 0; i + sublen <= dna.length(); i++) {
        if (dna.substring(i, i + sublen).equals(subsequence)) {
            count = count + 1;
        }
    }
    return count;
}

Try looking at Rosetta Code for some example methods:

The "remove and count the difference" method:

public int countSubstring(String subStr, String str){
    return (str.length() - str.replace(subStr, "").length()) / subStr.length();
}

The "split and count" method:

// requires: import java.util.regex.Pattern;
public int countSubstring(String subStr, String str){
    // the result of split() will contain one more element than there are delimiters
    // the "-1" second argument makes it not discard trailing empty strings
    return str.split(Pattern.quote(subStr), -1).length - 1;
}

Manual looping (similar to the code I showed you at the top):

public int countSubstring(String subStr, String str){
    int count = 0;
    for (int loc = str.indexOf(subStr); loc != -1;
         loc = str.indexOf(subStr, loc + subStr.length()))
        count++;
    return count;
}
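One difference from the substring loop at the top: `replace`, `split`, and this `indexOf` loop (which advances by `subStr.length()`) count non-overlapping matches, while stepping one character at a time counts overlapping ones. For "TAG" the two agree, because "TAG" cannot overlap itself, but for a pattern such as "AAA" they differ. A small sketch of the difference, with hypothetical class and method names:

```java
final class OverlapDemo {

    // Non-overlapping: after a match, resume searching past the whole match.
    static int countNonOverlapping(String subStr, String str) {
        if (subStr.isEmpty()) {
            return 0; // assumption: guard against an endless loop on an empty pattern
        }
        int count = 0;
        for (int loc = str.indexOf(subStr); loc != -1;
             loc = str.indexOf(subStr, loc + subStr.length())) {
            count++;
        }
        return count;
    }

    // Overlapping: after a match, resume just one character later.
    static int countOverlapping(String subStr, String str) {
        if (subStr.isEmpty()) {
            return 0; // same guard as above
        }
        int count = 0;
        for (int loc = str.indexOf(subStr); loc != -1;
             loc = str.indexOf(subStr, loc + 1)) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countNonOverlapping("AAA", "AAAAAA")); // prints 2
        System.out.println(countOverlapping("AAA", "AAAAAA"));    // prints 4
        System.out.println(countNonOverlapping("TAG", "TAGTAG")); // prints 2
    }
}
```

Which count is wanted depends on the problem; for a single codon like "TAG" either approach gives the same answer.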

For your specific program, as far as reading from a file goes, you should put all reading operations inside the try block and then close your resources in a finally block. If you want to read more, look up the Java I/O documentation and the finally block. There are many ways to read information from a file; I just showed you one here that required the least amount of change to your code.

You can add any of the countSubstring methods to your code (declared static, so that main can call them directly) like:

public static void main(String args[]) {

    String fileName = args[0];
    Scanner s = null;
    String subsequence = "TAG";
    String dna = "";
    int count = 0;

    try {
        s = new Scanner(new File(fileName));
        while(s.hasNext()) {
            dna += s.next().trim();
        }
        count = countSubstring(subsequence, dna); // any of the above methods
        System.out.println(subsequence + " appears " + count + " times");
    } catch (FileNotFoundException e) {
        e.printStackTrace();
        // s.close(); Don't put s.close() here, use finally
    } finally {
        if(s != null) {
            s.close();
        }
    }
}
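On Java 7 and later, try-with-resources is an alternative to the explicit finally block: the Scanner is closed automatically when the try block exits, whether normally or through an exception. The sketch below shows one way the reading step might be restructured; the class name `DnaReader` and the method `readDna` are hypothetical, and a StringBuilder replaces the repeated `+=` on a String.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

final class DnaReader {

    // Reads all whitespace-separated tokens from the file and joins them.
    static String readDna(File file) throws FileNotFoundException {
        StringBuilder dna = new StringBuilder();
        // The Scanner declared here is closed automatically on exit.
        try (Scanner s = new Scanner(file)) {
            while (s.hasNext()) {
                dna.append(s.next()); // tokens never contain whitespace, so no trim()
            }
        }
        return dna.toString();
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.err.println("usage: java DnaReader <file>");
            return;
        }
        try {
            String dna = readDna(new File(args[0]));
            System.out.println("Read " + dna.length() + " characters");
        } catch (FileNotFoundException e) {
            System.err.println("File not found: " + args[0]);
        }
    }
}
```

The returned string can then be passed to any of the countSubstring methods above.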

Answer 3

Then, once you have the dna string and the subsequence string:

int count = (dna.length() - dna.replace(subsequence, "").length()) / subsequence.length();

Answer 4

To search a string for a distinct pattern of characters, the "Pattern" and "Matcher" classes are a good solution.

Here is some code which can help to solve your problem:

int count = 0;
String line = "T A G A A A A G G G A A A G A T A G T A G";
Pattern pattern = Pattern.compile("T A G");
Matcher matcher = pattern.matcher(line);
while (matcher.find()) 
    count++;
System.out.println(count);

The expression compiled by Pattern.compile(String s) is called a regex (regular expression). In this case it simply looks for occurrences of "T A G" in the string; the sample line separates the letters with spaces, so for a string without spaces you would compile "TAG" instead. The while loop counts the occurrences.

Look for more information about regex if you want to do more complicated things.
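As one example of a slightly more complicated search, a regex alternation can match several sequences in one pass, and Matcher.start() reports where each match begins. The sketch below is only an illustration: the class name `RegexCodons`, the method `findAll`, and the choice of "TAG|GGG" as the pattern are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexCodons {

    // Returns "MATCH@index" entries in the order the matches are found.
    static List<String> findAll(String dna, String regex) {
        List<String> hits = new ArrayList<String>();
        Matcher matcher = Pattern.compile(regex).matcher(dna);
        // find() resumes after the end of the previous match,
        // so the reported matches never overlap.
        while (matcher.find()) {
            hits.add(matcher.group() + "@" + matcher.start());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Sample string from the question, without spaces.
        System.out.println(findAll("TAGAAAAGGGAAAGATAGT", "TAG|GGG"));
        // prints [TAG@0, GGG@7, TAG@15]
    }
}
```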

Answer 5

Instead of just counting instances of TAG, let's try to count multiple codons at once.

public static final void main( String[] args )
{
    String input = "TACACTAGATCGCACTGCTAGTATC";
    if (args.length > 0) {
            input = args[0].trim();
    }
    System.out.println(input);

    HashMap<Character, Node> searchPatterns = createCodons();
    findCounts(input, searchPatterns);
    printCounts(searchPatterns);
}

This solution uses a tree to store the character sequences we are interested in counting. Each path from root to leaf in the tree represents a possible sequence. We'll create four trees: codons starting with T, with A, with C, and with G. We'll store these trees in a HashMap for convenient retrieval by their starting character.

/**
   Create a set of sequences we are interested in finding (subset of 
  possible codons). We could specify any pattern we want here.
*/
public static final HashMap<Character, Node> createCodons()
{
    HashMap<Character, Node> codons = new HashMap<Character,Node>();

    Node sequencesOfT = new Node('T');         //   T
    Node nodeA = sequencesOfT.addChild('A');  //   /
    nodeA.addChild('C');                     //   A
    nodeA.addChild('G');                    //   / \
    codons.put('T', sequencesOfT);         //   C   G

    Node sequencesOfA = new Node('A');         //   A
    Node nodeT = sequencesOfA.addChild('T');  //   /
    nodeT.addChild('C');                     //   T
    nodeT.addChild('G');                    //   / \
    codons.put('A', sequencesOfA);         //   C   G

    Node sequencesOfC = new Node('C');         //   C
    Node nodeG = sequencesOfC.addChild('G');  //   /
    nodeG.addChild('T');                     //   G
    nodeG.addChild('A');                    //   / \
    codons.put('C', sequencesOfC);         //   T   A

    Node sequencesOfG = new Node('G');         //   G
    Node nodeC = sequencesOfG.addChild('C');  //   /
    nodeC.addChild('T');                     //   C
    nodeC.addChild('A');                    //   / \
    codons.put('G', sequencesOfG);         //   T   A

    return codons;
}

Here's what our Node class looks like.

public class Node
{
    public char data;            // the name of the node; A,C,G,T
    public int count = 0;        // we'll keep a count of occurrences here
    public Node parent = null;
    public List<Node> children;

    public Node( char data )
    {
        this.data = data;
        children = new ArrayList<Node>();
    }

    public Node addChild( char data )
    {
        Node node = new Node(data);
        node.parent = this;
        return (children.add(node) ? node : null);
    }

    public Node getChild( int index )
    {
        return children.get(index);
    }

    public int hasChild( char data )
    {
        int index = -1;
        int numChildren = children.size();
        for (int i=0; i<numChildren; i++)
        {
            Node child = children.get(i);
            if (child.data == data)
            {
                index = i;
                break;
            }
        }
        return index;
    }
}

To count the occurrences we'll iterate over each character of input, and for each iteration retrieve the tree (A, G, C, or T) that we are interested in. We then try to walk down the tree (from root to leaf) using the subsequent characters of input - we stop traversing when we're unable to find the next character of input in the node's list of children. At this point we increment the count on that node to indicate a sequence of characters was found ending at that node.

public static final void findCounts(String input, HashMap<Character,Node> sequences)
{
    int n = input.length();
    for (int i=0; i<n; i++)
    {
        char root = input.charAt(i);
        Node sequence = sequences.get(root);
        if (sequence == null) continue;  // no tree for this character (not A, C, G, or T)

        int j = -1;
        int c = 1;
        while (((i+c) < n) && 
               ((j = sequence.hasChild(input.charAt(i+c))) != -1))
        {  
            sequence = sequence.getChild(j);
            c++;
        }
        sequence.count++;
    }
}

To print the results we'll walk each of the trees from root to leaf, printing the nodes as we encounter them, and printing the count upon reaching the leaf.

public static final void printCounts( HashMap<Character,Node> sequences )
{
    for (Node sequence : sequences.values()) 
    {
        printCounts(sequence, "");
    }
}

public static final void printCounts( Node sequence, String output )
{
    output = output + sequence.data;
    if (sequence.children.isEmpty()) 
    {
        System.out.println(output + ": " + sequence.count);
        return;
    }
    for (int i=0; i<sequence.children.size(); i++) 
    {
        printCounts( sequence.children.get(i), output );
    }
}

Here's some sample output:

TAGAAAAGGGAAAGATAGT
TAC: 0
TAG: 2
GCT: 0
GCA: 0
ATC: 0
ATG: 0
CGT: 0
CGA: 0

TAGCGTATC
TAC: 0
TAG: 1
GCT: 0
GCA: 0
ATC: 1
ATG: 0
CGT: 1
CGA: 0

From here we could easily extend the solution to keep a list of positions where each sequence was found, or record other information with respect to the input. This implementation is kind of rough, but hopefully this provides some insight into other ways you might approach your problem.
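As a rough illustration of the first extension, recording positions rather than counts, here is a minimal sketch. It deliberately swaps the tree for a plain map from each sequence to a list of start indices, and it assumes every sequence of interest is exactly three characters long; the class name `CodonPositions` and the method `findPositions` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CodonPositions {

    // Maps each requested codon to the indices where it starts in `input`.
    // Assumption: all codons are exactly 3 characters long.
    static Map<String, List<Integer>> findPositions(String input, List<String> codons) {
        Map<String, List<Integer>> positions = new LinkedHashMap<String, List<Integer>>();
        for (String codon : codons) {
            positions.put(codon, new ArrayList<Integer>());
        }
        // Slide a 3-character window over the input, one character at a time.
        for (int i = 0; i + 3 <= input.length(); i++) {
            List<Integer> hits = positions.get(input.substring(i, i + 3));
            if (hits != null) {
                hits.add(i); // this window is one of the codons we track
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        List<String> codons = Arrays.asList("TAG", "CGT", "ATC", "ATG");
        System.out.println(findPositions("TAGCGTATC", codons));
        // prints {TAG=[0], CGT=[3], ATC=[6], ATG=[]}
    }
}
```

The same idea could be grafted onto the Node class by replacing the count field with a list of indices, at the cost of a little more bookkeeping in findCounts.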

5 answers

solution1
0 2014-09-13 18:52:26

solution2
0 ACCEPTED 2014-09-13 18:59:25

solution3
0 2014-09-13 19:04:01

solution4
0 2014-09-13 23:18:07

solution5
0 2014-09-14 00:40:06
