MATLAB: plot a histogram showing the number of occurrences of each character in a file

I have 400 files, each containing about 500,000 characters, and those characters are drawn from only about 20 distinct letters. I want to plot a histogram showing the 10 most frequently used letters (x-axis) and the number of times each letter is used (y-axis). How can I do this?

Since you have an array of uchar , you know that your elements will always be in the range 0:255 . After seeing Tamás Szabó's answer here I realized that the null character is exceedingly unlikely in a text file, so I will just ignore it and use the range 1:255 . If you expect to have null characters, you'll have to adjust the range.

To find the 10 most frequently used letters, we'll first calculate the histogram counts, then sort them in descending order so that the first 10 entries are the ones we want:

counts = histc(uint8(part), [1:255]);
[topCounts, topIndices] = sort(counts, 'descend');

Now we take the first 10 and rearrange their counts and indices to put the letters back in alphabetical order:

[sortedChars, shortIndices] = sort(topIndices(1:10));
sortedCounts = topCounts(shortIndices);
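
As a quick check of the steps so far, here is a minimal sketch on a toy string. The input 'abracadabra' and the variable k (3 here, 10 in the question) are made up for illustration, and double() is used instead of uint8() only to keep the data and the bin edges the same type:

```matlab
% Toy input; in practice 'part' would hold the characters read from a file.
part = 'abracadabra';
k = 3;                                    % number of letters to keep (hypothetical; 10 in the question)

counts = histc(double(part), 1:255);      % counts(c) = occurrences of character code c
[topCounts, topIndices] = sort(counts, 'descend');

% Put the k most frequent character codes back in alphabetical order.
[sortedChars, shortIndices] = sort(topIndices(1:k));
sortedCounts = topCounts(shortIndices);

disp(char(sortedChars));                  % the letters that were kept
disp(sortedCounts);                       % their counts, in the same order
```

Because the bins are 1:255, the index of a bin is also its character code, which is why topIndices can be converted straight back to letters with char .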

Now we can plot the histogram using bar :

bar(sortedCounts);

(You can add the 'hist' option if you want the bars in the graph to touch, as they do in a normal hist plot.)

To change the x-axis tick labels from numeric values to characters, convert sortedChars to characters and use them as the 'XTickLabel' :

labelChars = cellstr(char(sortedChars(:))).'; % sortedChars holds numeric codes, so convert to char first
set(gca, 'XTickLabel', labelChars);
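
The question mentions 400 files. One possible way to handle them, sketched here under assumptions, is to accumulate the counts over all files before sorting. The two temporary files and their contents below are invented so the sketch runs on its own, and reading with 'uint8' assumes single-byte text:

```matlab
% Hypothetical setup: create two small files so the sketch is self-contained.
files = {[tempname '.txt'], [tempname '.txt']};
contents = {'aab', 'abc'};
for i = 1:numel(files)
    fid = fopen(files{i}, 'w');
    fwrite(fid, double(contents{i}), 'uint8');
    fclose(fid);
end

% Accumulate per-character counts across all files.
total = zeros(1, 255);
for i = 1:numel(files)
    fid = fopen(files{i}, 'r');
    part = fread(fid, Inf, 'uint8').';    % row vector of byte values
    fclose(fid);
    total = total + histc(part, 1:255);
    delete(files{i});                     % clean up the temporary file
end

disp(total(double('abc')));               % combined counts of 'a', 'b' and 'c'
```

The sorting, plotting and labelling steps above can then be applied to total in place of counts .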

Note : This answers the original version of the question (the data consists of 10 letters only; a histogram is wanted). The question was later changed completely (the data consists of about 20 letters, and a histogram of the 10 most used letters is wanted).


If the ten letters are arbitrary and not known in advance, you can't use hist(..., 10) . Consider the following example with three arbitrary "letters":

h = hist([1 2 2 10], 3);

The result is not [1 2 1] as you might expect, but [3 0 1] . The problem is that hist chooses equal-width bins, so it counts values falling in each range rather than occurrences of each distinct letter.
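
To see where the bins end up, hist can also return the bin centers. This is only an illustrative sketch of the example above; the variable name centers is an arbitrary choice:

```matlab
% Ask hist for the bin centers as well as the counts.
[h, centers] = hist([1 2 2 10], 3);

disp(h);        % counts per equal-width bin
disp(centers);  % centers of the three bins spanning the range 1 to 10
```

The data range from 1 to 10 is split into three bins of width 3, so the values 1, 2 and 2 all land in the first bin and 10 lands in the last.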

Here are three approaches to do what you want:

  1. You can find the letters with unique and then count the occurrences of each one with bsxfun :

     letters = unique(part(:)).'; % these are the letters in your file
     h = sum(bsxfun(@eq, part(:), letters)); % count occurrences of each letter

  2. The second line of the above approach could be replaced by histc , specifying the letters as the bin edges:

     letters = unique(part(:)).';
     h = histc(part, letters);

  3. Or you could use sparse to do the accumulation:

     t = sparse(1, part, 1);
     [~, letters, h] = find(t);

As an example, for part = [1 2 2 10] any of the above gives the expected result:

letters =
     1     2    10
h =
     1     2     1
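
For the updated question (about 20 letters, of which the 10 most used are wanted), the per-letter counts can be sorted and truncated. The following is a sketch under assumptions rather than part of the approaches above: it swaps in accumarray for the counting step, and the toy string and k (3 here, 10 in the question) are made up for illustration:

```matlab
part = 'mississippi';                     % toy data standing in for the file contents
k = 3;                                    % number of letters to keep (hypothetical)

[letters, ~, idx] = unique(part(:));      % idx maps each character to its entry in letters
h = accumarray(idx, 1);                   % occurrences of each distinct letter

[~, order] = sort(h, 'descend');          % most frequent first
keep = sort(order(1:min(k, numel(h))));   % top k, restored to alphabetical order

topLetters = letters(keep).';
topCounts = h(keep).';

disp(topLetters);
disp(topCounts);
```

topCounts can then be passed to bar , with cellstr(topLetters(:)) as the tick labels.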
