
Get the length of each line in a file with C and write it to an output file

I am a biology student trying to learn Perl, Python and C, and I use the scripts in my work. I have a file that looks like this:

>sequence1
ATCGATCGATCG
>sequence2
AAAATTTT
>sequence3
CCCCGGGG  

The output should look like this: the name of each sequence followed by the number of characters in its line, with the total number of sequences printed at the end of the file.

sequence1 12
sequence2 8
sequence3 8
Total number of sequences = 3

I got the Perl and Python scripts to work. Here is the Python script as an example:

#!/usr/bin/python

import sys

my_file = open(sys.argv[1]) #open the file
my_output = open(sys.argv[2], "w") #open output file

total_sequence_counts = 0

for line in my_file: 
    if line.startswith(">"):
        sequence_name = line.rstrip('\n').replace(">","") 
        total_sequence_counts += 1 
        continue    
    dna_length = len(line.rstrip('\n')) 
    my_output.write(sequence_name + " " + str(dna_length) + '\n')
my_output.write("Total number of sequences = " + str(total_sequence_counts) + '\n')

Now I want to write the same script in C. This is what I have so far:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char *argv[])
{
input = FILE *fopen(const char *filename, "r");
output = FILE *fopen(const char *filename, "w"); 

double total_sequence_counts = 0;
char sequence_name[];

char line [4095]; // set a temporary line length
char buffer = (char *) malloc (sizeof(line) +1); // allocate some memory

while (fgets(line, sizeof(line), filename) != NULL) { // read until new line character is not found in line

    buffer = realloc(*buffer, strlen(line) + strlen(buffer) + 1); // realloc buffer to adjust buffer size
    if (buffer == NULL) { // print error message if memory allocation fails
        printf("\n Memory error");
        return 0;
    }
    if (line[0] == ">") {
        sequence_name = strcpy(sequence_name, &line[1]); 
        total_sequence_counts += 1
        }
        else {
            double length = strlen(line);
            fprintf(output, "%s \t %ld", sequence_name, length);
        }
    fprintf(output, "%s \t %ld", "Total number of sequences = ", total_sequence_counts);
}
    int fclose(FILE *input); // when you are done working with a file, you should close it using this function. 
    return 0;
    int fclose(FILE *output);
    return 0;
}

But this code is, of course, full of mistakes. Despite studying a lot, I still can't properly understand and use memory allocation and pointers, so I know that part in particular is wrong. It would be great if you could comment on my code and show how it can be turned into a program that actually works. By the way, in my actual data the length of each line is not known in advance, so I need to use malloc and realloc for that.

For a simple program like this, where you look at short lines one at a time, you shouldn't worry about dynamic memory allocation. Local buffers of a reasonable size are probably good enough.

Another thing is that C isn't particularly suited to quick-and-dirty string processing. For example, there is no strstrip function in the standard library. You usually end up implementing such behaviour yourself.

An example implementation looks like this:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <ctype.h>



#define MAXLEN 80       /* Maximum line length, including null terminator */

int main(int argc, char *argv[])
{
    FILE *in;
    FILE *out;

    char line[MAXLEN];          /* Current line buffer */
    char ref[MAXLEN] = "";      /* Sequence reference buffer */
    int nseq = 0;               /* Sequence counter */

    if (argc != 3) {
        fprintf(stderr, "Usage: %s infile outfile\n", argv[0]);
        exit(1);
    }

    in = fopen(argv[1], "r");
    if (in == NULL) {
        fprintf(stderr, "Couldn't open %s.\n", argv[1]);
        exit(1);        
    }

    out = fopen(argv[2], "w");
    if (out == NULL) {
        fprintf(stderr, "Couldn't open %s for writing.\n", argv[2]);
        exit(1);
    }

    while (fgets(line, sizeof(line), in)) {
        int len = strlen(line);

        /* Strip whitespace from end */
        while (len > 0 && isspace((unsigned char)line[len - 1])) len--;
        line[len] = '\0';

        if (line[0] == '>') {
            /* First char is '>': copy from second char in line */
            strcpy(ref, line + 1);
        } else {
            /* Other lines are sequences */
            fprintf(out, "%s: %d\n", ref, len);
            nseq++;
        }
    }

    fprintf(out, "Total number of sequences. %d\n", nseq);

    fclose(in);
    fclose(out);

    return 0;
}

A lot of the code is about checking arguments and opening and closing files. (You could cut out a lot of code if you used stdin and stdout with file redirections.)
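As a rough sketch of the redirection idea, the loop can be moved into a function that works on any pair of streams. The name `report_lengths` is made up for this illustration, and the fixed `MAXLEN` limit of the example above is assumed to still apply.

```c
#include <stdio.h>
#include <string.h>
#include <ctype.h>

#define MAXLEN 80       /* Same fixed line limit as the example above */

/* Hypothetical helper: reads records from `in`, writes the report to
   `out` and returns the number of sequence lines seen. */
int report_lengths(FILE *in, FILE *out)
{
    char line[MAXLEN];          /* Current line buffer */
    char ref[MAXLEN] = "";      /* Sequence reference buffer */
    int nseq = 0;               /* Sequence counter */

    while (fgets(line, sizeof(line), in)) {
        int len = (int)strlen(line);

        /* Strip whitespace from end */
        while (len > 0 && isspace((unsigned char)line[len - 1])) len--;
        line[len] = '\0';

        if (line[0] == '>') {
            strcpy(ref, line + 1);
        } else {
            fprintf(out, "%s: %d\n", ref, len);
            nseq++;
        }
    }

    fprintf(out, "Total number of sequences. %d\n", nseq);
    return nseq;
}
```

With this, `main` could shrink to a single call, `report_lengths(stdin, stdout);`, and the shell would supply the files, for example `./seqlen < infile > outfile` (the program name is a placeholder).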

The core is the big while loop. Things to note:

  • fgets returns NULL on error or when the end of the file is reached.
  • The first lines of the loop body determine the length of the line and then remove whitespace from the end.
  • It is not enough to decrement the length; at the end, the stripped string must be terminated with the null character '\0'.
  • When you check the first character in the line, you should check against a char, not a string. In C, single and double quotes are not interchangeable. ">" is a string literal of two characters, '>' and the terminating '\0'.
  • When dealing with countable entities like chars in a string, use integer types, not floating-point numbers. (I've used (signed) int here, but because there can't be a negative number of chars in a line, it might have been better to use an unsigned type.)
  • The notation line + 1 is equivalent to &line[1].
  • The code I've shown doesn't check that there is always one reference per sequence. I'll leave this as an exercise to the reader.
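Several of these points can be shown in isolation with two small helpers. This is only a sketch: `strip_end` and `header_name` are names invented here, not standard library functions, and `size_t` is used as one possible unsigned type for the length.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: removes trailing whitespace in place and
   returns the new length. */
size_t strip_end(char *s)
{
    size_t len = strlen(s);

    /* isspace() expects an unsigned char value (or EOF) */
    while (len > 0 && isspace((unsigned char)s[len - 1])) len--;
    s[len] = '\0';      /* shortening len alone is not enough */
    return len;
}

/* Hypothetical helper: returns the name after '>' for a header line,
   or NULL for any other line. */
const char *header_name(const char *line)
{
    /* Compare against the char '>', not the string ">" */
    return (line[0] == '>') ? line + 1 : NULL;   /* line + 1 is &line[1] */
}
```

Calling `strip_end` on a buffer holding the third sequence line of the sample input, including its trailing spaces and newline, should leave "CCCCGGGG" in the buffer and return 8.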

For a beginner, this can be quite a lot to keep track of. For small text-processing tasks like yours, Python and Perl are definitely better suited.

Edit: The solution above won't work for long sequences; it is restricted to MAXLEN characters. But you don't need dynamic allocation if you only need the length, not the contents of the sequences.

Here's an updated version that doesn't read lines, but reads characters instead. In '>' context, it stores the reference. Otherwise it just keeps a count:

#include <stdlib.h>
#include <stdio.h>
#include <ctype.h>      /* for isspace() */



#define MAXLEN 80       /* Maximum reference name length, including null terminator */

int main(int argc, char *argv[])
{
    FILE *in;
    FILE *out;

    int nseq = 0;               /* Sequence counter */
    char ref[MAXLEN] = "";      /* Reference name (empty until the first '>' line) */

    /* Snip: Argument checking as above */

    in = fopen(argv[1], "r");
    out = fopen(argv[2], "w");

    /* Snip: File checking as above */

    while (1) {
        int c = fgetc(in);

        if (c == EOF) break;

        if (c == '>') {
            int n = 0;

            c = fgetc(in);
            while (c != EOF && c != '\n') {
                if (n < sizeof(ref) - 1) ref[n++] = c;
                c = fgetc(in);
            }
            ref[n] = '\0';
        } else {
            int len = 0;
            int n = 0;

            while (c != EOF && c != '\n') {
                n++;
                if (!isspace(c)) len = n;
                c = fgetc(in);
            }

            fprintf(out, "%s: %d\n", ref, len);
            nseq++;
        }
    }

    fprintf(out, "Total number of sequences. %d\n", nseq);

    fclose(in);
    fclose(out);

    return 0;
}

Notes:

  • fgetc reads a single byte from a file and returns this byte, or EOF when the file has ended. In this implementation, that's the only reading function used.
  • Storing the reference string is implemented via fgetc here, too. You could probably use fgets after skipping the initial angle bracket instead.
  • The counting just reads bytes without storing them. n is the total count, len is the count up to the last non-space. (Your lines probably consist only of ACGT without any trailing space, so you could skip the test for space and use n instead of len.)
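The fgets alternative for the reference name might look roughly like the following. `read_ref` is a hypothetical helper, assumed to be called right after the leading '>' has been consumed and with a buffer of at least two bytes.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: call it right after the leading '>' has been read.
   Stores up to size - 1 characters of the name in ref and skips whatever
   is left of the line. Returns 0 on success, -1 if nothing could be read. */
int read_ref(FILE *in, char *ref, size_t size)
{
    size_t n;

    if (fgets(ref, (int)size, in) == NULL) {
        ref[0] = '\0';
        return -1;
    }

    n = strcspn(ref, "\n");         /* index of the newline, if any */
    if (ref[n] == '\n') {
        ref[n] = '\0';              /* whole name fit: drop the newline */
    } else {
        int c;                      /* name was cut off: discard the tail */

        while ((c = fgetc(in)) != EOF && c != '\n')
            ;
    }
    return 0;
}
```

The explicit skip matters: without it, the tail of an overlong name would be counted as a sequence line by the next pass of the main loop.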
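
#define _POSIX_C_SOURCE 200809L  /* for getline() and strdup() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char *argv[]){
    FILE *my_file, *my_output;
    int total_sequence_counts = 0;
    char *sequence_name = NULL;
    char *line = NULL;
    size_t size = 0;

    if(argc != 3){
        fprintf(stderr, "Usage: %s infile outfile\n", argv[0]);
        return 1;
    }
    my_file = fopen(argv[1], "r");
    my_output = fopen(argv[2], "w");
    if(my_file == NULL || my_output == NULL){
        fprintf(stderr, "Couldn't open the input or output file.\n");
        return 1;
    }

    /* getline() (POSIX) grows the buffer as needed, so lines can have any length */
    while(-1 != getline(&line, &size, my_file)){
        line[strcspn(line, "\n")] = '\0';    /* strip the newline */
        if(line[0] == '>'){
            free(sequence_name);             /* release the previous name */
            sequence_name = strdup(line + 1);
            total_sequence_counts += 1;
            continue;
        }
        fprintf(my_output, "%s %zu\n", sequence_name ? sequence_name : "", strlen(line));
    }
    fprintf(my_output, "Total number of sequences = %d\n", total_sequence_counts);
    fclose(my_file);
    fclose(my_output);
    free(sequence_name);
    free(line);
    return 0;
}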
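
getline is a POSIX function and is not part of standard C. Where it is unavailable, the malloc/realloc approach the question asks about can be sketched with fgets and a buffer that doubles when it fills up. `read_line` and the starting size of 128 are assumptions made for this illustration, and the sketch does not guard against lines too long for the int argument of fgets.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: reads one line of any length into a malloc'ed
   string, without the trailing newline. Returns NULL at end of file or
   if memory runs out; the caller must free() the result. */
char *read_line(FILE *in)
{
    size_t cap = 128;               /* current buffer size (arbitrary start) */
    size_t len = 0;                 /* characters stored so far */
    char *buf = malloc(cap);

    if (buf == NULL) return NULL;

    /* Each fgets call appends to what has been read so far */
    while (fgets(buf + len, (int)(cap - len), in)) {
        len += strlen(buf + len);
        if (len > 0 && buf[len - 1] == '\n') {
            buf[--len] = '\0';      /* complete line: strip the newline */
            return buf;
        }
        if (cap - len <= 1) {       /* buffer is full: double it */
            char *tmp = realloc(buf, cap * 2);

            if (tmp == NULL) {      /* keep the old pointer so it can be freed */
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
    }

    if (len > 0) return buf;        /* last line had no newline */
    free(buf);
    return NULL;
}
```

Assigning the result of realloc to a temporary pointer first is the important detail: if realloc fails it returns NULL and leaves the old block allocated, so overwriting `buf` directly would leak it.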


 