
Finding specific elements of a string read in from a .txt file in Java

I am a beginner in Java and am wondering how to read in specific elements from a string of DNA in a .txt file. For example, let's say that the text file contains the following:

TAGAAAAGGGAAAGATAGT

I would like to know how best to iterate through the string and find particular sets of characters in order. An example would be to find how many times "TAG" appears in the read-in string. Here's what I have so far:

import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public class DNA {

public static void main(String args[]) {

    String fileName = args[0];
    Scanner s = null;

    try {
        s = new Scanner(new File(fileName));
    } catch (FileNotFoundException e) {
        e.printStackTrace();
        s.close();
    }

    String dna = "";

    while(s.hasNext()) {
        dna += s.next().trim();
    }
    s.close();

    String subsequence = "TAG";


    int count = 0;

    for (int i = 0; i < dna.length(); i++){
        if (dna.charAt(i) == subsequence.charAt(i)){

            count = count + 1;
            i++;
        }

    }
    while (dna.charAt() == subsequence.charAt()){
        count++;

    }


    System.out.println(subsequence + " appears " + count + " times");

}

}

It's messy, and I'm attempting to use logic that I've found in other examples after many hours of searching. Please let me know how I can be more effective and use better logic! I love learning this stuff and am open to any corrections.

You could do this by using substring. Since TAG is 3 characters, you could take a substring from i to i+3 on each iteration of your loop and compare it to "TAG".

For the example AGAAAAGGGAAAGATAGT, the loop would iterate as follows:

"AGA".equals("TAG")

"GAA".equals("TAG")

"AAA".equals("TAG")

"AAA".equals("TAG")

"AAG".equals("TAG")

"AGG".equals("TAG")

"GGG".equals("TAG")

etc.

If you're unfamiliar with substring, the Java documentation for String.substring covers it. If this doesn't totally make sense I can try to explain more and provide pseudocode.
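The window-by-window comparison described above can be sketched roughly as follows. This is only an illustration of the idea, not the answerer's own code; the class name `WindowCount` and the method name `countWindows` are made up for this example.

```java
// Hypothetical helper illustrating the substring-window idea.
final class WindowCount {

    // Counts how many windows of dna, each target.length() characters wide,
    // are equal to target. Overlapping matches are counted separately.
    static int countWindows(String dna, String target) {
        int k = target.length();
        if (k == 0) {
            return 0; // assumption: an empty target is treated as "no matches"
        }
        int count = 0;
        // The condition i + k <= dna.length() keeps substring(i, i + k) in bounds.
        for (int i = 0; i + k <= dna.length(); i++) {
            if (dna.substring(i, i + k).equals(target)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Matches at index 0 and index 15, so this prints 2.
        System.out.println(countWindows("TAGAAAAGGGAAAGATAGT", "TAG"));
    }
}
```

Putting the bounds check in the loop condition, rather than inside the loop body, means the loop simply stops once a full window no longer fits.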

In your loop, you are counting occurrences of individual characters instead of occurrences of your subsequence. What you can do is compare your subsequence against:

Substring of dna of length 3 characters starting from i

I say 3 characters because your subsequence is "TAG". You can generalize that by storing the subsequence length in a variable.

You also need to check that i + subsequence length is within the bounds of your string. Otherwise you will get an IndexOutOfBoundsException.

Code:

// i + sublen may not exceed dna.length(), so the window stays in bounds;
// the portion of dna starting at i and spanning sublen characters must equal subsequence
int countSubstring(String subsequence, String dna) {
    int count = 0;
    int sublen = subsequence.length();    // length of the subsequence
    // use <= so that a match ending exactly at the end of dna is not missed
    for (int i = 0; i + sublen <= dna.length(); i++) {
        if (dna.substring(i, i + sublen).equals(subsequence)) {
            count = count + 1;
        }
    }
    return count;
}

Try looking at Rosetta Code for some example methods:

The "remove and count the difference" method:

public int countSubstring(String subStr, String str){
    return (str.length() - str.replace(subStr, "").length()) / subStr.length();
}

The "split and count" method:

// requires: import java.util.regex.Pattern;
public int countSubstring(String subStr, String str){
    // the result of split() will contain one more element than the number of delimiter occurrences
    // the "-1" second argument makes it not discard trailing empty strings
    return str.split(Pattern.quote(subStr), -1).length - 1;
}

Manual looping (similar to the code I showed you at the top):

public int countSubstring(String subStr, String str){
    int count = 0;
    for (int loc = str.indexOf(subStr); loc != -1;
         loc = str.indexOf(subStr, loc + subStr.length()))
        count++;
    return count;
}
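One detail worth knowing about the three methods above: they count non-overlapping occurrences, because each one moves past the whole match before searching again. For "TAG" this makes no difference, since "TAG" cannot overlap with itself, but for a pattern such as "AA" it does. The sketch below contrasts the two behaviours; the class and method names are hypothetical and exist only for this illustration.

```java
// Hypothetical demo contrasting non-overlapping and overlapping counts.
final class OverlapDemo {

    // Non-overlapping: after a match, resume searching past the whole match.
    static int countNonOverlapping(String subStr, String str) {
        if (subStr.isEmpty()) {
            return 0; // guard: an empty pattern would otherwise loop forever
        }
        int count = 0;
        for (int loc = str.indexOf(subStr); loc != -1;
             loc = str.indexOf(subStr, loc + subStr.length())) {
            count++;
        }
        return count;
    }

    // Overlapping: resume searching one character after the start of the match.
    static int countOverlapping(String subStr, String str) {
        if (subStr.isEmpty()) {
            return 0;
        }
        int count = 0;
        for (int loc = str.indexOf(subStr); loc != -1;
             loc = str.indexOf(subStr, loc + 1)) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countNonOverlapping("AA", "AAAA")); // 2
        System.out.println(countOverlapping("AA", "AAAA"));    // 3
        System.out.println(countOverlapping("TAG", "TAGAAAAGGGAAAGATAGT")); // 2
    }
}
```

Which behaviour is wanted depends on the question being asked of the DNA string, so it is worth deciding before picking a method.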

For your specific program, as far as reading from a file goes, you should put all reading operations inside the try block and then close your resources in a finally block. If you want to read more, look up the documentation on Java I/O and on the finally block. There are many ways to read information from a file; the one shown here requires the least amount of change to your code.

You can add any of the countSubstring methods to your code like this:

public static void main(String args[]) {

    String fileName = args[0];
    Scanner s = null;
    String subsequence = "TAG";
    String dna = "";
    int count = 0;

    try {
        s = new Scanner(new File(fileName));
        while(s.hasNext()) {
            dna += s.next().trim();
        }
        count = countSubstring(subsequence, dna); // any of the above methods
        System.out.println(subsequence + " appears " + count + " times");
    } catch (FileNotFoundException e) {
        e.printStackTrace();
        // s.close(); Don't put s.close() here, use finally
    } finally {
        if(s != null) {
            s.close();
        }
    }
}
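As an alternative to the explicit finally block, Java 7 and later offer try-with-resources, which closes the Scanner automatically. This is a different technique from the one in the answer above, shown here only as a sketch; the class name `DnaReader` and method name `readDna` are invented for the example.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

// Hypothetical helper: reads a DNA string using try-with-resources.
final class DnaReader {

    // Reads all whitespace-separated tokens from the file and joins them into one string.
    static String readDna(File file) throws FileNotFoundException {
        StringBuilder dna = new StringBuilder();
        // The Scanner is closed automatically when the try block exits,
        // whether normally or because of an exception.
        try (Scanner s = new Scanner(file)) {
            while (s.hasNext()) {
                dna.append(s.next());
            }
        }
        return dna.toString();
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.err.println("usage: java DnaReader <file>");
            return;
        }
        try {
            System.out.println(readDna(new File(args[0])));
        } catch (FileNotFoundException e) {
            System.err.println("File not found: " + args[0]);
        }
    }
}
```

Using a StringBuilder instead of `dna += ...` avoids creating a new String on every token, which matters once the input file gets large.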

Then, once you have the dna string and the subsequence string:

int count = (dna.length() - dna.replace(subsequence, "").length()) / subsequence.length();

To search a string for a distinct pattern of characters, the "Pattern" and "Matcher" classes are a good solution.

Here is some code which can help to solve your problem:

int count = 0;
String line = "T A G A A A A G G G A A A G A T A G T A G";
Pattern pattern = Pattern.compile("T A G");
Matcher matcher = pattern.matcher(line);
while (matcher.find()) 
    count++;
System.out.println(count);  

The expression compiled by Pattern.compile(String s) is called a regular expression (regex). In this case it simply looks for occurrences of the literal sequence, written "T A G" here because the sample line separates the letters with spaces. With the while loop you can count the occurrences.

Look for more information about regex if you want to do more complicated things.
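As one example of a slightly more complicated search, a character class lets a single pattern match several codons at once. The sketch below assumes the DNA string has no spaces between letters; the class name `RegexCount` and method name `countMatches` are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: counts regex matches in a DNA string.
final class RegexCount {

    // Counts non-overlapping matches of regex in text.
    static int countMatches(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Literal pattern: "TAG" occurs at index 0 and index 15.
        System.out.println(countMatches("TAG", "TAGAAAAGGGAAAGATAGT")); // 2
        // Character class: matches "TAG" or "TAC" (indices 0, 5 and 18).
        System.out.println(countMatches("TA[GC]", "TACACTAGATCGCACTGCTAGTATC")); // 3
    }
}
```

Note that Matcher.find() resumes after the end of the previous match, so overlapping matches are not counted with this approach.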

Instead of just counting instances of TAG, let's try to count multiple codons at once.

public static final void main( String[] args )
{
    String input = "TACACTAGATCGCACTGCTAGTATC";
    if (args.length > 0) {
            input = args[0].trim();
    }
    System.out.println(input);

    HashMap<Character, Node> searchPatterns = createCodons();
    findCounts(input, searchPatterns);
    printCounts(searchPatterns);
}

This solution uses a tree to store the character sequences we are interested in counting. Each path from root to leaf in the tree represents a possible sequence. We'll create four trees: codons starting with T, with A, with C, and with G. We'll store these trees in a HashMap for convenient retrieval by their starting character.

/**
   Create a set of sequences we are interested in finding (a subset of
  possible codons). We could specify any pattern we want here.
*/
public static final HashMap<Character, Node> createCodons()
{
    HashMap<Character, Node> codons = new HashMap<Character,Node>();

    Node sequencesOfT = new Node('T');         //   T
    Node nodeA = sequencesOfT.addChild('A');  //   /
    nodeA.addChild('C');                     //   A
    nodeA.addChild('G');                    //   / \
    codons.put('T', sequencesOfT);         //   C   G

    Node sequencesOfA = new Node('A');         //   A
    Node nodeT = sequencesOfA.addChild('T');  //   /
    nodeT.addChild('C');                     //   T
    nodeT.addChild('G');                    //   / \
    codons.put('A', sequencesOfA);         //   C   G

    Node sequencesOfC = new Node('C');         //   C
    Node nodeG = sequencesOfC.addChild('G');  //   /
    nodeG.addChild('T');                     //   G
    nodeG.addChild('A');                    //   / \
    codons.put('C', sequencesOfC);         //   T   A

    Node sequencesOfG = new Node('G');         //   G
    Node nodeC = sequencesOfG.addChild('C');  //   /
    nodeC.addChild('T');                     //   C
    nodeC.addChild('A');                    //   / \
    codons.put('G', sequencesOfG);         //   T   A

    return codons;
}

Here's what our Node class looks like.

public class Node
{
    public char data;            // the name of the node; A,C,G,T
    public int count = 0;        // we'll keep a count of occurrences here
    public Node parent = null;
    public List<Node> children;

    public Node( char data )
    {
        this.data = data;
        children = new ArrayList<Node>();
    }

    public Node addChild( char data )
    {
        Node node = new Node(data);
        node.parent = this;
        return (children.add(node) ? node : null);
    }

    public Node getChild( int index )
    {
        return children.get(index);
    }

    public int hasChild( char data )
    {
        int index = -1;
        int numChildren = children.size();
        for (int i=0; i<numChildren; i++)
        {
            Node child = children.get(i);
            if (child.data == data)
            {
                index = i;
                break;
            }
        }
        return index;
    }
}

To count the occurrences we'll iterate over each character of input, and for each iteration retrieve the tree (A, G, C, or T) that we are interested in. We then try to walk down the tree (from root to leaf) using the subsequent characters of input. We stop traversing when we're unable to find the next character of input in the node's list of children. At this point we increment the count on that node to indicate that a sequence of characters was found ending at that node.

public static final void findCounts(String input, HashMap<Character,Node> sequences)
{
    int n = input.length();
    for (int i=0; i<n; i++)
    {
        char root = input.charAt(i);
        Node sequence = sequences.get(root);

        int j = -1;
        int c = 1;
        while (((i+c) < n) && 
               ((j = sequence.hasChild(input.charAt(i+c))) != -1))
        {  
            sequence = sequence.getChild(j);
            c++;
        }
        sequence.count++;
    }
}

To print the results we'll walk each of the trees from root to leaf, printing the nodes as we encounter them, and printing the count upon reaching the leaf.

public static final void printCounts( HashMap<Character,Node> sequences )
{
    for (Node sequence : sequences.values()) 
    {
        printCounts(sequence, "");
    }
}

public static final void printCounts( Node sequence, String output )
{
    output = output + sequence.data;
    if (sequence.children.isEmpty()) 
    {
        System.out.println(output + ": " + sequence.count);
        return;
    }
    for (int i=0; i<sequence.children.size(); i++) 
    {
        printCounts( sequence.children.get(i), output );
    }
}

Here's some sample output:

TAGAAAAGGGAAAGATAGT
TAC: 0
TAG: 2
GCT: 0
GCA: 0
ATC: 0
ATG: 0
CGT: 0
CGA: 0

TAGCGTATC
TAC: 0
TAG: 1
GCT: 0
GCA: 0
ATC: 1
ATG: 0
CGT: 1
CGA: 0

From here we could easily extend the solution to keep a list of positions where each sequence was found, or record other information with respect to the input. This implementation is kind of rough, but hopefully it provides some insight into other ways you might approach your problem.
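To give a feel for the "list of positions" extension, here is a minimal sketch. It deliberately uses a plain map and String.startsWith rather than the tree structure above, so it illustrates the idea and is not an extension of the Node class itself; the class name `CodonPositions` and method name `findPositions` are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: records where each codon of interest starts in the input.
final class CodonPositions {

    // Maps each codon to the list of start indices at which it occurs in input.
    static Map<String, List<Integer>> findPositions(String input, List<String> codons) {
        Map<String, List<Integer>> positions = new LinkedHashMap<>();
        for (String codon : codons) {
            positions.put(codon, new ArrayList<>());
        }
        for (int i = 0; i < input.length(); i++) {
            for (String codon : codons) {
                // startsWith(prefix, offset) is false if the codon would run past the end
                if (input.startsWith(codon, i)) {
                    positions.get(codon).add(i);
                }
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        List<String> codons = Arrays.asList("TAC", "TAG", "ATC", "ATG", "CGT", "CGA");
        // For this input: TAG starts at 0, CGT at 3, ATC at 6; the rest are empty.
        System.out.println(findPositions("TAGCGTATC", codons));
    }
}
```

This checks every codon at every index, which is fine for a handful of patterns; the tree-based approach above avoids that repeated work when there are many patterns sharing prefixes.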
