简体   繁体   English

来自mmap迭代的输入文件的2个不同输出,并在C中用块读取

[英]2 different output from an input file iterated from mmap and read with a chunk in C

I want to read a file which has some characters in it and check the percentage of printable characters as well as the percentage of white spaces. 我想读取一个包含一些字符的文件,然后检查可打印字符的百分比以及空格的百分比。 This is my Python code which generates the input file: 这是我的Python代码,它生成输入文件:

import string
import random
array = list()
array = list(string.printable)
print(array)
external = ['\0','\a','\b','\v','\f','\e']
array = array + external
file = open("in.txt" , 'w')
for i in range (1000):
        outputline = array[random.randrange(0,len(array)-1)]
        file.write(outputline)
file.close()

I want my file to have both printable characters and whitespaces and other characters (which are not in these two groups). 我希望我的文件同时具有可打印字符和空格以及其他字符(不在这两个组中)。 I do this in two ways: 我有两种方法:

  1. Read the file, chunk by chunk, with the read system call in C: 在C中使用read系统调用逐块读取文件:

     #include <stdio.h> #include <stdio.h> #include <ctype.h> #include <stdlib.h> #include <string.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <unistd.h> #include <errno.h> int main(int arg, char *argv[]) { char c ; char *data; int numOfWs = 0 ; int numOfPr = 0 ; int numberOfCharacters; int sizeOfBlock ; int nread ; int i=0; int k; float wsP = 0; float prP = 0; sizeOfBlock = atoi(argv[3]) ; data = malloc(sizeOfBlock*sizeof(char)); int fd = open(argv[2], O_RDONLY); while((nread = read(fd, data, sizeOfBlock)) > 0) { numberOfCharacters += nread ; for (i = 0; i < nread; ++i) { c = data[i] ; if(isprint(c)) numOfPr ++ ; else if(isspace (c)) numOfWs ++ ; } } wsP = (numOfWs / (float)numberOfCharacters)*100; prP = (numOfPr / (float)numberOfCharacters)*100 ; printf("%d printable characters out of %d bytes, %.2f%c\\n", numOfPr,numberOfCharacters,prP,'%'); printf("%d whitespace characters out of %d bytes, %.2f%c\\n", numOfWs,numberOfCharacters,wsP, '%'); exit(0); } 
  2. Copy the whole file into memory using mmap() and then start to read it from memory: 使用mmap()将整个文件复制到内存中,然后开始从内存中读取它:

     #include <stdio.h> #include <stdlib.h> #include <fcntl.h> #include <unistd.h> #include <sys/types.h> #include <sys/mman.h> #include <sys/stat.h> #include <errno.h> #include <ctype.h> #include <string.h> int main(int arg, char *argv[]) { char c ; int i, numOfWs = 0, numOfPr = 0, numberOfCharacters = 0; char *data; float wsP = 0; float prP = 0; struct stat s; int fp = open("x.txt", O_RDWR); int status = fstat (fp, &s); int size = s.st_size; data = mmap((caddr_t)0, size, PROT_READ, MAP_SHARED, fp,0); for (i=0; i<size; i++){ char c; c = data[i]; if(isspace(c)) numOfWs ++; else if(isprint(c)) numOfPr ++; numberOfCharacters ++ ; } wsP = (numOfWs / (float)numberOfCharacters)*100; prP = (numOfPr / (float)numberOfCharacters)*100 ; printf("%d printable characters out of %d bytes, %.2f%c\\n", numOfPr,numberOfCharacters,prP,'%'); printf("%d whitespace characters out of %d bytes, %.2f%c\\n", numOfWs,numberOfCharacters,wsP, '%'); close(fp); exit(0); } 

when I check these two ways with the same file (created by the Python code) I get two different answers for the percentages but the number of characters which are read each way are the same (1000). 当我使用同一文件(由Python代码创建)检查这两种方式时,对于百分比,我会得到两个不同的答案,但是每种方式读取的字符数是相同的(1000)。

This is my file generated from the Python code (I don't know what happens when I copied it here it contains lots of control characters): 这是我的文件,它是通过Python代码生成的(我不知道在此处复制包含很多控制字符的文件时会发生什么情况):

el*2mlz_XyjP@%?Sw}Qo~.."tJ^~6,eN8+kq l)*N-1oupE
)coFKoA0\=|W'{Oezx~^p(B5ZJe!AdYb{Gflv&wwCf8}>3"v*>9\pW8PIs;qpX7RSk<9}&8B$u kNaq(mJK$N-!38?E%8-T,I1zC~0O=}FH*
x9x6Q%GT_C0j>7:@EG{N ?Eh$?18;Ncy[3 $'ikKs%:A].?e;i4`x"k!VD]}*pw
?\wE~Vix7^H~[26lsN?_GO$vz3M464S`+h=A(5@]q<&<+ hjmehAb-_3*3F8&#iM3p)6T`S9Q\yZwm$U`OHG}02{A)WcVzBR1h}H?qhF:P^-j5AQ1<7FD60j#B#}9Z=}2QReaYy|{Wv<^!yOC/7P}n*ZEPV2@8cU),=*5]]d a:3J;Y(?D?31$pcrquc#&PB;A[9lV+gJ%WZ6K~A|%^E_\3dM/?"y)BtUtG"3hf}W4,3DrXxTyl\UbWwCbMufqCNWx |hiJ\>43S6tCCS)rEo0.cz5PjgK0_AKN|8'g]byLp9AlrZDuK1OX,Csa}nu&i_p,#
Wyc{Q
LA\4:!WSq"ln|Pv.B;+N'h%O;tu(CgIh~OYIXCl+6~nSxBuybP nH:j;t'\vk&p}
,;3Ny#`Ug!rVbqExY|  %BVCD^D~z:L(j8L!    @   X4a!KBNCQ4z&3^9[O<fkM-qrOq5F/M*]yyU+-VLdZRtUu
a"=a b%c~GI|tcC/
P'/`t|hZ/2iHd94l"%;4-{)VUw%%3"e%IQ{RAX]NeMcsh&@LziT0)_T"2XADH&NYqa<6,$wdSp@LIMGA&,Gx1mj|t't?7=YtT77r<qi8;|tzi kOAi'dq%+g2   5hY?XTj{F)18.Vd!!$Q{D)}$7XxO)Vi%29*,P"cXkC,M|&brd&-DGF>V4 %N)a"VM+TQ$FI;YiL-0 YSxXgC@i~,o6/a7U2c"eGr\N7^B:'dytlOOS(iy\lhC7vnW,f o;vKUNa
Hg#u}W4N wUM

this is the result from mmap() : 这是mmap()的结果:

961 printable characters out of 1000 bytes, 96.10%
39 whitespace characters out of 1000 bytes, 3.90%

and this is the result from chunk reading : 这是chunk reading的结果:

974 printable characters out of 1000 bytes, 97.40%
26 whitespace characters out of 1000 bytes, 2.60%

Why is the number of printable characters different, but the file is the same in the two methods? 为什么printable characters不同,但是两种方法中的文件相同?

I think isspace() doesn't accept 我认为isspace()不接受 as a white space in chunked mode and instead counts it as a printable character . 在分块模式下为空白,而是将其视为printable character

You didn't initialize numberOfCharacters to 0 in the first program. 您没有在第一个程序中将numberOfCharacters初始化为0

That means that the value of numberOfCharacters is undetermined before 这意味着numberOfCharacters的值在之前未确定

numberOfCharacters += nread ;

is executed, this is a good reason to separate declaration from definition. 被执行,这是将声明与定义分开的一个很好的理由。

the problem was that 问题是 spaces are printable in the isprint() i change the order of isspace() and isprint() and use if()...else if() it becomes true. isprint()中可打印空格。我更改isspace()isprint()的顺序,并使用if()...else if()变为true。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM