Get the length of each line in a file with C and write it to an output file
I am a biology student and I am trying to learn Perl, Python and C, and to use the scripts in my work. So, I have a file as follows:
>sequence1
ATCGATCGATCG
>sequence2
AAAATTTT
>sequence3
CCCCGGGG
The output should look like this: the name of each sequence followed by the number of characters in its line, with the total number of sequences printed at the end of the file.
sequence1 12
sequence2 8
sequence3 8
Total number of sequences = 3
I could make the Perl and Python scripts work. This is the Python script as an example:
#!/usr/bin/python
import sys

my_file = open(sys.argv[1])  # open the file
my_output = open(sys.argv[2], "w")  # open output file
total_sequence_counts = 0
for line in my_file:
    if line.startswith(">"):
        sequence_name = line.rstrip('\n').replace(">","")
        total_sequence_counts += 1
        continue
    dna_length = len(line.rstrip('\n'))
    my_output.write(sequence_name + " " + str(dna_length) + '\n')
my_output.write("Total number of sequences = " + str(total_sequence_counts) + '\n')
Now I want to write the same script in C. This is what I have achieved so far:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main(int argc, char *argv[])
{
input = FILE *fopen(const char *filename, "r");
output = FILE *fopen(const char *filename, "w");
double total_sequence_counts = 0;
char sequence_name[];
char line [4095]; // set a temporary line length
char buffer = (char *) malloc (sizeof(line) +1); // allocate some memory
while (fgets(line, sizeof(line), filename) != NULL) { // read until new line character is not found in line
buffer = realloc(*buffer, strlen(line) + strlen(buffer) + 1); // realloc buffer to adjust buffer size
if (buffer == NULL) { // print error message if memory allocation fails
printf("\n Memory error");
return 0;
}
if (line[0] == ">") {
sequence_name = strcpy(sequence_name, &line[1]);
total_sequence_counts += 1
}
else {
double length = strlen(line);
fprintf(output, "%s \t %ld", sequence_name, length);
}
fprintf(output, "%s \t %ld", "Total number of sequences = ", total_sequence_counts);
}
int fclose(FILE *input); // when you are done working with a file, you should close it using this function.
return 0;
int fclose(FILE *output);
return 0;
}
But this code is, of course, full of mistakes. My problem is that, despite studying a lot, I still can't properly understand and use memory allocation and pointers, so I know I have mistakes especially in that part. It would be great if you could comment on my code and show how it can be turned into a script that actually works. By the way, in my actual data the length of each line is not defined, so I need to use malloc and realloc for that purpose.
For a simple program like this, where you look at short lines one at a time, you shouldn't worry about dynamic memory allocation. It is probably good enough to use local buffers of a reasonable size.
Another thing is that C isn't particularly suited for quick-and-dirty string processing. For example, there isn't a strstrip function in the standard library. You usually end up implementing such behaviour yourself.
An example implementation looks like this:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <ctype.h>

#define MAXLEN 80   /* Maximum line length, including null terminator */

int main(int argc, char *argv[])
{
    FILE *in;
    FILE *out;
    char line[MAXLEN];      /* Current line buffer */
    char ref[MAXLEN] = "";  /* Sequence reference buffer */
    int nseq = 0;           /* Sequence counter */

    if (argc != 3) {
        fprintf(stderr, "Usage: %s infile outfile\n", argv[0]);
        exit(1);
    }

    in = fopen(argv[1], "r");
    if (in == NULL) {
        fprintf(stderr, "Couldn't open %s.\n", argv[1]);
        exit(1);
    }

    out = fopen(argv[2], "w");
    if (out == NULL) {      /* check the output handle, not the input one */
        fprintf(stderr, "Couldn't open %s for writing.\n", argv[2]);
        exit(1);
    }

    while (fgets(line, sizeof(line), in)) {
        int len = strlen(line);

        /* Strip whitespace from end */
        while (len > 0 && isspace((unsigned char) line[len - 1])) len--;
        line[len] = '\0';

        if (line[0] == '>') {
            /* First char is '>': copy from second char in line */
            strcpy(ref, line + 1);
        } else {
            /* Other lines are sequences */
            fprintf(out, "%s: %d\n", ref, len);
            nseq++;
        }
    }

    fprintf(out, "Total number of sequences. %d\n", nseq);

    fclose(in);
    fclose(out);

    return 0;
}
A lot of the code is about checking arguments and opening and closing files. (You could cut out a lot of code if you used stdin and stdout with file redirections.)
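To illustrate the remark about stdin and stdout, here is a minimal sketch of the same loop written against two already-open streams. The function name summarize is made up for this illustration, and the output format follows the one requested in the question; it keeps the fixed MAXLEN limit, so it is only suitable for short lines.

```c
#include <stdio.h>
#include <string.h>

#define MAXLEN 80   /* same fixed line limit as the example above */

/* Hypothetical helper: reads FASTA-like text from `in`, writes one
   "name length" line per sequence line to `out`, and returns the
   number of sequence lines seen. */
int summarize(FILE *in, FILE *out)
{
    char line[MAXLEN];      /* current line */
    char ref[MAXLEN] = "";  /* most recent sequence name */
    int nseq = 0;

    while (fgets(line, sizeof line, in)) {
        /* Length up to (not including) the line ending */
        size_t len = strcspn(line, "\r\n");
        line[len] = '\0';

        if (line[0] == '>') {
            strcpy(ref, line + 1);   /* fits: ref is as large as line */
        } else {
            fprintf(out, "%s %lu\n", ref, (unsigned long) len);
            nseq++;
        }
    }
    fprintf(out, "Total number of sequences = %d\n", nseq);
    return nseq;
}
```

With this in place, main can shrink to a single call, summarize(stdin, stdout), and the shell does the file handling: ./seqlen < input.txt > output.txt (the program name is just an example).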
The core is the big while loop. Things to note:
- fgets returns NULL on error or when the end of file is reached.
- After the trailing whitespace has been stripped, the string must be terminated again with '\0'.
- ">" is a string literal of two characters, '>' and the terminating '\0'. To test the first character of a line, compare it with the character constant '>'.
- Use integer types, not double, for lengths and counts. (The code uses int here, but because there can't be a negative number of chars in a line, it might have been better to have used an unsigned type.)
- line + 1 is equivalent to &line[1].

For a beginner, this can be quite a lot to keep track of. For small text-processing tasks like yours, Python and Perl are definitely better suited.
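The stripping step from the notes can also be pulled out into a small function of its own. The following is only a sketch; the name rstrip is invented here and is not a standard library function. It returns the length as size_t, an unsigned type, and casts the argument of isspace to unsigned char, since isspace is only defined for values representable as unsigned char (or EOF).

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: removes trailing whitespace from s in place
   and returns the new length of the string. */
size_t rstrip(char *s)
{
    size_t len = strlen(s);

    /* Step back over spaces, tabs, '\r' and '\n' at the end */
    while (len > 0 && isspace((unsigned char) s[len - 1]))
        len--;

    s[len] = '\0';   /* terminate the shortened string */
    return len;
}
```

In the loop above, this would replace the three lines that compute len, step back over whitespace and write the terminator: size_t len = rstrip(line);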
Edit: The solution above won't work for long sequences; it is restricted to MAXLEN characters. But you don't need dynamic allocation if you only need the length, not the contents of the sequences.
Here's an updated version that doesn't read lines, but reads characters instead. In '>' context, it stores the reference. Otherwise it just keeps a count:
#include <stdlib.h>
#include <stdio.h>
#include <ctype.h>      /* for isspace() */

#define MAXLEN 80   /* Maximum reference name length, including null terminator */

int main(int argc, char *argv[])
{
    FILE *in;
    FILE *out;
    int nseq = 0;           /* Sequence counter */
    char ref[MAXLEN] = "";  /* Reference name */

    in = fopen(argv[1], "r");
    out = fopen(argv[2], "w");

    /* Snip: Argument and file checking as above */

    while (1) {
        int c = fgetc(in);

        if (c == EOF) break;

        if (c == '>') {
            /* Header line: store the name, dropping what doesn't fit */
            size_t n = 0;

            c = fgetc(in);
            while (c != EOF && c != '\n') {
                if (n < sizeof(ref) - 1) ref[n++] = c;
                c = fgetc(in);
            }
            ref[n] = '\0';
        } else {
            /* Sequence line: count characters without storing them */
            int len = 0;
            int n = 0;

            while (c != EOF && c != '\n') {
                n++;
                if (!isspace(c)) len = n;
                c = fgetc(in);
            }

            fprintf(out, "%s: %d\n", ref, len);
            nseq++;
        }
    }

    fprintf(out, "Total number of sequences. %d\n", nseq);

    fclose(in);
    fclose(out);

    return 0;
}
Notes:
- fgetc reads a single byte from a file and returns this byte, or EOF when the file has ended. In this implementation, that's the only reading function used.
- The reference name is read with fgetc here too. You could probably use fgets after skipping the initial angle bracket, too.
- n is the total count, len is the count up to the last non-space. (Your lines probably consist only of ACGT without any trailing space, so you could skip the test for space and use n instead of len.)

#include <stdio.h>
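#include <stdlib.h>
#include <string.h>

/* Note: getline() and strdup() are POSIX functions, not part of ISO C. */
int main(int argc, char *argv[]){
    if(argc != 3){
        fprintf(stderr, "Usage: %s infile outfile\n", argv[0]);
        return 1;
    }
    FILE *my_file = fopen(argv[1], "r");
    FILE *my_output = fopen(argv[2], "w");
    if(my_file == NULL || my_output == NULL){
        fprintf(stderr, "Couldn't open input or output file.\n");
        return 1;
    }
    int total_sequence_counts = 0;
    char *sequence_name = NULL;     /* name from the most recent '>' line */
    int dna_length;
    char *line = NULL;              /* buffer managed by getline() */
    size_t size = 0;

    while(-1 != getline(&line, &size, my_file)){
        line[strcspn(line, "\n")] = '\0';   /* cut off the newline, if any */
        if(line[0] == '>'){
            free(sequence_name);            /* release the previous name */
            sequence_name = strdup(line + 1);
            total_sequence_counts += 1;
            continue;
        }
        dna_length = strlen(line);
        fprintf(my_output, "%s %d\n", sequence_name ? sequence_name : "", dna_length);
    }
    fprintf(my_output, "Total number of sequences = %d\n", total_sequence_counts);
    fclose(my_file);
    fclose(my_output);
    free(sequence_name);
    free(line);
    return (0);
}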
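Since the question asks specifically about malloc and realloc, here is a rough sketch of what getline does for you, written with only ISO C functions for systems that lack it. The name read_line and the starting capacity of 128 are arbitrary choices made for this illustration. It is not tuned for huge inputs (the capacity is passed to fgets as an int).

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: reads one line of any length from fp into a
   heap buffer. Returns the buffer (the caller must free it), or NULL
   at end of file or if memory runs out. The trailing '\n' is removed. */
char *read_line(FILE *fp)
{
    size_t cap = 128;   /* current capacity of buf */
    size_t len = 0;     /* characters stored so far */
    char *buf = malloc(cap);

    if (buf == NULL) return NULL;

    /* Keep appending chunks until a newline shows up or input ends */
    while (fgets(buf + len, (int) (cap - len), fp)) {
        len += strlen(buf + len);

        if (len > 0 && buf[len - 1] == '\n') {
            buf[len - 1] = '\0';    /* complete line: drop the newline */
            return buf;
        }

        /* No newline yet: double the buffer and read on. Assign the
           result of realloc to a temporary so buf isn't lost on failure. */
        char *tmp = realloc(buf, cap * 2);
        if (tmp == NULL) {
            free(buf);
            return NULL;
        }
        buf = tmp;
        cap *= 2;
    }

    if (len == 0) {     /* nothing was read: end of file */
        free(buf);
        return NULL;
    }

    buf[len] = '\0';    /* last line of a file without a final newline */
    return buf;
}
```

The two points that went wrong in the question's code are visible here: realloc takes the pointer itself (buf, not *buf), and its return value is checked before the old pointer is overwritten.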