Create set of shingles from a text file (Octave)

I'm implementing MinHash and LSH in Octave/Matlab. I'm trying to get a set (cell array or array) of shingles of size k from a given document, and I don't know how to do it.

What I have right now is this simple code:

doc = fopen(document);
i = 1;
while (! feof(doc) )
  txt{i} = strread(fgetl(doc), '%s');
  i++;
endwhile
fclose(doc);

This creates a cell array containing the words from each line of the document, where document is an argument of the function that I'm trying to write.
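The loop above stores one nested cell array per line. If a single flat list of words is more convenient for shingling, one possible sketch is to read the whole file at once and split it on whitespace. The temporary file and the variable names `fname` and `words` are assumptions made only so that the example is self-contained:

```octave
% Write a small sample file so the sketch can run on its own
% (fname is a hypothetical stand-in for the real document path).
fname = tempname();
fid = fopen(fname, 'w');
fprintf(fid, 'the quick brown\nfox jumps\n');
fclose(fid);

% Read the entire file as one string, then keep every run of
% non-whitespace characters as one word.
words = regexp(fileread(fname), '\S+', 'match');

delete(fname);   % clean up the sample file
disp(words);
```

Here `words` is a flat cell array of strings, with line breaks treated like any other whitespace.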

This code may do the trick. It reads from a cell array of words and creates shingles (n-grams) of the specified size.

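function S = shingles(txt, shingle_size)
  % Build word-level shingles (n-grams) from a flat cell array of words.
  n = numel(txt) - shingle_size + 1;   % number of shingles
  S = cell(1, max(n, 0));              % empty when the text is too short
  for i = 1:n
    % strjoin keeps the space separator; strcat would strip the trailing
    % whitespace of a char argument and glue the words together.
    S{i} = strjoin(txt(i:(i + shingle_size - 1)), ' ');
  end
end

You can test the code with the following example:

txt = {'a', 'b', 'c'};
S = shingles(txt, 2)
S =
{
  [1,1] = a b
  [1,2] = b c
}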
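Since MinHash works on a set, repeated shingles usually need to be collapsed. A minimal sketch of that step, assuming word-level shingles joined by a single space, is shown below; the names `k` and `shingle_set` are illustrative and not part of the original code:

```octave
% Sample input with a repeated word pair, so one shingle occurs twice.
txt = {'a', 'b', 'a', 'b'};
k = 2;                          % shingle size (assumed name)

% Build all k-word shingles, including duplicates.
n = numel(txt) - k + 1;
S = arrayfun(@(i) strjoin(txt(i:(i + k - 1)), ' '), 1:n, ...
             'UniformOutput', false);

% unique drops the duplicates and returns the shingles sorted.
shingle_set = unique(S);
disp(shingle_set);
```

In this example the three shingles 'a b', 'b a', 'a b' reduce to the two distinct shingles 'a b' and 'b a'.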
