Computing bigram frequencies in MATLAB

Question

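I am trying to compute and plot the distribution of bigram frequencies. First, I generated all possible bigrams over the 36 characters 0-9 and a-z, which gives 36 x 36 = 1296 bigrams.

Then I extract the bigrams from a given file and save them as words1.

My question is: how do I compute the frequency of each of these 1296 bigrams for the file a.txt? If a bigram does not occur in the file at all, its frequency should be zero.

a.txt can be any text file.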

clear
clc
%************create bigrams 1296 ***************************************
chars ='1234567890abcdefghijklmonpqrstuvwxyz';
chars1 ='1234567890abcdefghijklmonpqrstuvwxyz';
bigram='';
for i=1:36
for j=1:36

bigram = sprintf('%s%s%s',bigram,chars(i),chars1(j));

end
end
temp1 = regexp(bigram, sprintf('\\w{1,%d}', 1), 'match');
temp2 = cellfun(@(x,y) [x '' y],temp1(1:end-1)', temp1(2:end)','un',0);
bigrams = temp2;
bigrams = unique(bigrams);
bigrams =  rot90(bigrams);
bigram = char(bigrams(1:end));
all_bigrams_len = length(bigrams);
clear temp temp1 temp2 i j chars1 chars;

%****** 1. Cleaning Data ******************************
collection = fileread('e:\a.txt');
collection = regexprep(collection,'<.*?>','');
collection = lower(collection);
collection = regexprep(collection,'\W','');
collection = strtrim(regexprep(collection,'\s*',''));

%*******************************************************

temp = regexp(collection, sprintf('\\w{1,%d}', 1), 'match');
temp2 = cellfun(@(x,y) [x '' y],temp(1:end-1)', temp(2:end)','un',0);
words1 = rot90(temp2);

%*******************************************************
words1_len = length(words1);
vocab1 = unique(words1);
vocab_len1 = length(vocab1);
[vocab1,void1,index1] = unique(words1);
frequencies1 = hist(index1,vocab_len1);

Answer 1

Hey, similar to Dennis's solution, you can just use histc():

string1 = 'ASHRAFF'
histc(string1,'ABCDEFGHIJKLMNOPQRSTUVWXYZ')

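This counts the number of entries that fall into the bins defined by the string 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', which is hopefully the whole alphabet (I typed it quickly, so no guarantees). The result is: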

  Columns 1 through 21

     2     0     0     0     0     2     0     1     0     0     0     0     0     0     0     0     0     1     1     0     0

  Columns 22 through 26

     0     0     0     0     0

Just a small modification to my solution:

string1 = 'ASHRAFF'
alphabet1='A':'Z'; %%// as stated by Oleg Komarov
data=histc(string1,alphabet1);
results=cell(2,26);
for k=1:26
    results{1,k}= alphabet1(k);
    results{2,k}= data(k);
end

If you now look at results, you can easily check that it works :D

Answer 2

I. Character counts for a string

A bsxfun-based solution for counting characters:

counts = sum(bsxfun(@eq,[string1-0]',65:90))

Output:

counts =

    2     0     0     0     0     2     0     1     0     0 ....

If you would like the counts listed next to each letter:

out = [cellstr(['A':'Z']') num2cell(counts)']

Output:

out = 
    'A'    [2]
    'B'    [0]
    'C'    [0]
    'D'    [0]
    'E'    [0]
    'F'    [2]
    'G'    [0]
    'H'    [1]
    'I'    [0]

....

Note that this is a case-sensitive count of upper-case letters only.

For a count of lower-case letters, edit the earlier code like this:

counts = sum(bsxfun(@eq,[string1-0]',97:122))

For a case-insensitive count, use this:

counts = sum(bsxfun(@eq,[upper(string1)-0]',65:90))
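To make the bsxfun step easier to follow, here is a small sketch of what the comparison builds. It is an illustration rather than part of the answer: the intermediate names codes and matches are introduced here, and the explicit dimension argument to sum is an addition so that a one-character string still yields 26 counts instead of a single number.

```matlab
string1 = 'ASHRAFF';

% Column vector of character codes, one row per character (7x1).
codes = double(string1(:));

% Compare every character against every code in 65:90 ('A' to 'Z').
% bsxfun expands the 7x1 column and the 1x26 row to a 7x26 logical matrix
% in which matches(r,c) is true when character r equals letter c.
matches = bsxfun(@eq, codes, 65:90);

% Sum down the rows (dimension 1) to get one count per letter (1x26).
counts = sum(matches, 1);
```

Here counts(1) is the number of A's and counts(6) the number of F's.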

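II. Bigram counting case

Let's suppose that all possible bigrams are saved in a 1D cell array bigrams1 and that the incoming bigrams from the file are saved in another cell array words1. Let's also assume some values for them for demonstration: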

bigrams1 = {
    'ar';
    'de';
    'c3';
    'd1';
    'ry';
    't1';
    'p1'}

words1 = {
    'de';
    'c3';
    'd1';
    'r9';
    'yy';
    'de';
    'ry';
    'de';
    'dd';
    'd1'}

Now you can count how often each bigram listed in bigrams1 occurs in words1 with this code:

[~,~,ind] = unique(vertcat(bigrams1,words1));
bigrams_lb = ind(1:numel(bigrams1)); %// label bigrams1
words1_lb = ind(numel(bigrams1)+1:end);  %// label words1
counts = sum(bsxfun(@eq,bigrams_lb,words1_lb'),2)
out = [bigrams1 num2cell(counts)]

The output of the code run is:

out = 
    'ar'    [0]
    'de'    [3]
    'c3'    [1]
    'd1'    [2]
    'ry'    [1]
    't1'    [0]
    'p1'    [0]

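The result shows that the first element of the list of all possible bigrams, ar, does not occur in words1 at all; the second element, de, occurs 3 times in words1; and so on.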
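As an alternative sketch (not the method used in the answer), the same counts can be obtained with ismember and accumarray, which avoids building the numel(bigrams1)-by-numel(words1) comparison matrix. The variable names tf and loc are illustrative; the data is the demo data from above.

```matlab
bigrams1 = {'ar';'de';'c3';'d1';'ry';'t1';'p1'};
words1   = {'de';'c3';'d1';'r9';'yy';'de';'ry';'de';'dd';'d1'};

% For every entry of words1, find its position in bigrams1.
% tf is false (and loc is 0) for bigrams that are not in the list.
[tf, loc] = ismember(words1, bigrams1);

% Add 1 to the slot of each matched bigram; unmatched slots stay at 0.
counts = accumarray(loc(tf), 1, [numel(bigrams1) 1]);

out = [bigrams1 num2cell(counts)];
```

Entries of words1 that are not in bigrams1 (r9, yy, dd) are simply ignored.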

Answer 3

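This answer creates all the bigrams, loads the file with some cleaning, and then uses a combination of unique and histc to count the rows.

Generate all bigrams

Note that the order matters here: unique sorts its input, so building the array in sorted order up front means the counts line up with allBigrams.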

[y,x] = ndgrid(['0':'9','a':'z']);
allBigrams = [x(:),y(:)];

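Read the file

This converts the text to lower case, keeps only the characters 0-9 and a-z, and then creates a column vector of those characters.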

fileText = lower(fileread('d:\loremipsum.txt'));
cleanText = regexp(fileText,'([a-z0-9])','tokens');
cleanText = cell2mat(vertcat(cleanText{:}));

Create the bigrams from the file by shifting by one character and concatenating

fileBigrams = [cleanText(1:end-1),cleanText(2:end)];

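Get the counts

The full set of bigrams is appended to the bigrams from the file, so that every possible bigram occurs at least once. The third output of unique then assigns a value in {1, 2, ..., 1296} to each unique row. histc creates the counts, with one bin per value returned by unique, and 1 is subtracted from every bin to remove the full set of bigrams that was added.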

[~,~,c] = unique([fileBigrams;allBigrams],'rows');
counts = histc(c,1:1296)-1;
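As an alternative sketch (not part of the answer above), each bigram can be mapped directly to its row number in the sorted 36 x 36 list, so nothing has to be appended and subtracted again. The names alphabet, idx and pairIdx are illustrative, and the short cleanText is made-up demo input standing in for the cleaned file contents.

```matlab
alphabet  = ['0':'9','a':'z'];        % 36 characters, in sorted order
cleanText = 'abab1'.';                % demo input: column vector of characters

% Position of every character in the alphabet (1 to 36).
[~, idx] = ismember(cleanText, alphabet);

% Row of each bigram in the sorted list: the first character selects the
% block of 36 rows, the second character selects the row inside the block.
pairIdx = (idx(1:end-1) - 1) * 36 + idx(2:end);

% One count per possible bigram; bigrams that never occur stay at zero.
counts = accumarray(pairIdx, 1, [1296 1]);
```

With this demo input the bigrams are ab, ba, ab and b1, so counts has three non-zero entries and sums to 4.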

Display

To see the counts next to the bigrams:

[allBigrams, counts+'0']

Or something perhaps more useful, sorted by count:

[sortedCounts,sortInd] = sort(counts,'descend');
[allBigrams(sortInd,:), sortedCounts+'0']


ans =

or9
at8
re8
in7
ol7
te7
do6 ...
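One caveat: adding '0' to the counts only produces readable digits for counts from 0 to 9. A minimal sketch of a display that also works for larger counts is given below; the three-row allBigrams and counts are made-up demo values, and the variable lines is introduced here for illustration.

```matlab
allBigrams = ['ab'; 'cd'; 'ef'];      % demo data: three bigrams
counts     = [3; 12; 0];              % demo data: one count per bigram

[sortedCounts, sortInd] = sort(counts, 'descend');

% Build one text line per bigram, formatting the count as a decimal number.
lines = cell(numel(sortedCounts), 1);
for k = 1:numel(sortedCounts)
    lines{k} = sprintf('%s %d', allBigrams(sortInd(k), :), sortedCounts(k));
end

fprintf('%s\n', lines{:});
```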

Answer 4

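I haven't studied the whole code snippet, but judging from the example at the top of the question, I think you are looking for a histogram: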

string1 = 'ASHRAFF'
nr = histc(string1,'A':'Z')

This will give you:

 2     0     0     0     0     2     0     1     0     0     0     0     0     0     0     0     0     1     1     0     0     0     0     0     0     0

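(I originally had a solution using hist, but as Minion shows, histc is easier to use here.)

Note that this solution only handles upper-case letters.

If you want lower-case letters to be counted in the matching upper-case bins, you may want to do this: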

string1 = 'ASHRAFF'
nr = histc(upper(string1),'A':'Z')

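Or, if you want them counted separately:

string1 = 'ASHRaFf'
% no upper() here, and the bin edges must be non-decreasing, so 'A':'Z' comes first
nr = histc(string1,['A':'Z' 'a':'z'])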
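In newer MATLAB releases histc is marked as not recommended in favour of histcounts. A minimal sketch of the upper-case count with histcounts follows; the half-integer edges are a choice made here so that each letter code sits in the middle of its own bin, and the variable name edges is illustrative.

```matlab
string1 = 'ASHRAFF';

% 27 edges from 64.5 to 90.5 define 26 bins, one centred on each of the
% character codes 65 ('A') to 90 ('Z').
edges = 64.5:1:90.5;

% histcounts expects numeric input, so convert the characters to codes.
nr = histcounts(double(upper(string1)), edges);
```

nr is a 1x26 row in which nr(1) is the count for A and nr(26) the count for Z.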

Answer 5

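% Map the frequencies of the bigrams found in the file (vocab1) onto the
% full list of 1296 bigrams; bigrams that never occur keep frequency zero.
bi_freq1 = zeros(1,all_bigrams_len);
for k=1:vocab_len1
 for i=1:all_bigrams_len
  if strcmp(vocab1{k}, bigrams{i})
       bi_freq1(i) = frequencies1(k);
  end
 end
end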
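The double loop above can also be written without loops. The following is a sketch using ismember, with small made-up stand-ins for bigrams, vocab1 and frequencies1 so that it runs on its own; tf and loc are illustrative names.

```matlab
bigrams      = {'aa','ab','ba','bb'};   % demo stand-in for the 1296 bigrams
vocab1       = {'ab','bb'};             % demo stand-in for bigrams found in the file
frequencies1 = [3 1];                   % demo counts, one per entry of vocab1

bi_freq1 = zeros(1, numel(bigrams));

% For every possible bigram, look up its position in vocab1 (0 if absent).
[tf, loc] = ismember(bigrams, vocab1);

% Copy the frequency of each bigram that was found; the rest stay at zero.
bi_freq1(tf) = frequencies1(loc(tf));
```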

Answer summary (5 answers)

Answer 1: score 2, 2014-09-12 11:22:43
Answer 2: score 2, accepted, 2014-09-12 12:25:58
Answer 3: score 1, 2014-09-12 14:39:43
Answer 4: score 0, 2014-09-12 11:14:29
Answer 5: score 0, 2014-09-12 14:13:02
