How do people use n-grams for sentiment analysis, considering that as n increases, the memory requirement also increases rapidly?

I am trying to do Sentiment Analysis on Tweets using Python.

To begin with, I've implemented an n-gram model. So, let's say our training data is

I am a good kid

He is a good kid, but he didn't get along with his sister much

Unigrams:

<i, am, a, good, kid, he, but, didnt, get, along, with, his, sister, much>

Bigrams:

<(i am), (am a), (a good), (good kid), (he is), (is a), (kid but), (but he), (he didnt), (didnt get), (get along), (along with), (with his), (his sister), (sister much)>

Trigrams:

<(i am a), (am a good), (a good kid), .........>

Final feature vector:

<i, am, a, good, kid, he, but, didnt, get, along, with, his, sister, much, (i am), (am a), (a good), (good kid), (he is), (is a), (kid but), (but he), (he didnt), (didnt get), (get along), (along with), (with his), (his sister), (sister much), (i am a), (am a good), (a good kid), .........>

When we do this for a large training set of 8000 or so entries, the dimensionality of the feature vector becomes huge, and as a result my computer (RAM = 16GB) crashes.

So, when people mention using "n-grams" as features in the hundreds of papers out there, what are they talking about? Am I doing something wrong?

Do people always do some feature selection for "n-grams"? If so, what kind of feature selection should I look into?

I am using scikit-learn to do this.
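
For reference, here is a minimal sketch of that setup, assuming scikit-learn's CountVectorizer with ngram_range=(1, 3) (the toy tweets and the exact parameters are illustrative, not necessarily the pipeline in question):

    from sklearn.feature_extraction.text import CountVectorizer

    tweets = [
        "I am a good kid",
        "He is a good kid, but he didnt get along with his sister much",
    ]

    # ngram_range=(1, 3) builds the combined unigram/bigram/trigram vocabulary;
    # token_pattern keeps one-letter tokens such as "i" and "a"
    vectorizer = CountVectorizer(ngram_range=(1, 3), token_pattern=r"(?u)\b\w+\b")
    X = vectorizer.fit_transform(tweets)        # a scipy sparse matrix

    print(len(vectorizer.vocabulary_))          # total number of n-gram features
    # X.toarray() would be the dense feature vectors listed above, which is
    # what explodes in memory for a large corpus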

If you store your final feature vector exactly as you wrote it, I think I can suggest some improvements.

The memory issue is due to the fact that the features (the texts) are repeated so many times, and so are the tokens. Consider this process:

First of all, all the distinct features are stored (and given an index).

For example,

1--feature1--(i am)

2--feature2--(am a)

...

This generates a so-called feature space.

There might be thousands of features in total, or even more, but that should be normal. Then each entry can be rewritten as a series of numbers, such as:

Entry1----- <1,1,1,0,....,a_n>, where the first 1 means feature1 (i am) occurs once in this entry, and a_n is the number of occurrences of feature n.
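
A minimal sketch of that indexing step (pure Python, hypothetical helper names, not taken from the original post):

    entries = [
        ["i am", "am a", "a good", "good kid"],
        ["he is", "is a", "a good", "good kid"],
    ]

    feature_index = {}                      # feature -> id, i.e. the feature space
    for entry in entries:
        for gram in entry:
            feature_index.setdefault(gram, len(feature_index) + 1)

    def to_count_vector(entry, feature_index):
        # one slot per feature, holding its occurrence count in this entry
        counts = [0] * len(feature_index)
        for gram in entry:
            counts[feature_index[gram] - 1] += 1
        return counts

    print(to_count_vector(entries[0], feature_index))   # [1, 1, 1, 1, 0, 0]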

Let's assume there are many features and the entries are short, which means each vector contains far too many zeros. We can rewrite the previous vector as follows:

Entry1----{1:1, 2:1, 3:1}, which means that the values of features 1, 2 and 3 of Entry1 are 1, and the values of all the other features are zero. Shorter, isn't it?

In the end each entry is represented as a short vector, and you get a big matrix for your corpus. Your corpus might look like this now:

{1:1, 2:1, 3:1}

{2:1, 29:1, 1029:1, 20345:1}

...
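
A sketch of that sparse encoding, keeping only the non-zero counts per entry (the small feature_index here is a stand-in for the full feature space):

    from collections import Counter

    feature_index = {"i am": 1, "am a": 2, "a good": 3}

    def to_sparse_vector(ngrams, feature_index):
        # map each n-gram to its feature id and count only what is present
        return dict(Counter(feature_index[g] for g in ngrams if g in feature_index))

    print(to_sparse_vector(["i am", "am a", "a good"], feature_index))
    # {1: 1, 2: 1, 3: 1}  -- matches Entry1 above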

16 GB of RAM is sufficient for 8000 entries. You can use much less.


Furthermore, you may end up with too many distinct tokens (which means too many features). When constructing the feature space, one thing you can do is remove features whose frequencies are lower than a threshold, say 3 occurrences. The size of the feature space could be reduced by half, or even more.
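
In scikit-learn, this kind of thresholding is roughly what the min_df parameter of CountVectorizer does; note that min_df counts the number of documents a feature appears in rather than its total occurrences, but it serves the same purpose (the toy corpus below is only illustrative):

    from sklearn.feature_extraction.text import CountVectorizer

    tweets = [
        "i am a good kid",
        "he is a good kid",
        "what a good kid",
        "she was never a good kid",
    ]

    # n-grams appearing in fewer than 3 documents are dropped from the
    # feature space before the document-term matrix is built
    vectorizer = CountVectorizer(ngram_range=(1, 3), min_df=3,
                                 token_pattern=r"(?u)\b\w+\b")
    X = vectorizer.fit_transform(tweets)

    print(sorted(vectorizer.vocabulary_))   # only the frequent n-grams survive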

As inspectorG4dget said in the comments, you rarely go to high n-grams, e.g. n=5 or n=6, because you will not have enough training data to make it worthwhile. In other words, almost all of your 6-grams will have an occurrence count of 1. Also, to quote inspectorG4dget's comment:

When these papers talk about n-grams, they're not talking about a scalable n - they're USUALLY talking about a specific n (whose value might be revealed in the results or experiments section)

So, usually memory is not the biggest concern. With a really large corpus you would split it across a cluster, then combine results at the end. You might split based on how much memory each node in the cluster has, or, if processing a stream, you might stop and upload results (to the central node) each time you fill memory.

There are some optimizations you can do. If the corpus is being held in memory, then each n-gram only needs to be an index to its first occurrence in the corpus; the string does not need to be repeated.
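
A small sketch of that idea (hypothetical data structures): the tokenized corpus is stored once, and each n-gram is identified by the (start, n) position of its first occurrence rather than by a repeated string:

    tokens = "he is a good kid but he is a clever kid".split()

    canonical = {}   # n-gram text -> (start, n); only needed while building
    counts = {}      # (start, n) -> occurrence count; the text lives only in `tokens`

    for n in (1, 2, 3):
        for start in range(len(tokens) - n + 1):
            gram = " ".join(tokens[start:start + n])
            key = canonical.setdefault(gram, (start, n))
            counts[key] = counts.get(key, 0) + 1

    # any feature's text can be rebuilt from the corpus on demand
    start, n = canonical["he is"]
    print(" ".join(tokens[start:start + n]), counts[(start, n)])   # he is 2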

A second optimization, if you don't mind multiple passes, is to use the (n-1)-gram results to skip over sentence parts below your threshold. E.g. if you are only interested in n-grams that occur 3+ times, and "He is a clever" only had a count of 2 in the 4-gram analysis, then when you discover the 5-gram "He is a clever dog" you can throw it away, as you know it can only occur once or twice. This is a memory optimization at the expense of extra CPU.
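
A sketch of that pruning pass (hypothetical helper names): when counting n-grams, any candidate whose leading (n-1)-gram already fell below the threshold is skipped, since the longer n-gram cannot occur more often than its prefix; a final filter on the counts is still needed afterwards:

    from collections import Counter

    def ngrams(tokens, n):
        return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def count_with_pruning(corpus, n, prev_counts, min_count=3):
        counts = Counter()
        for tokens in corpus:
            for gram in ngrams(tokens, n):
                prefix = " ".join(gram.split()[:-1])
                # skip candidates whose (n-1)-gram prefix is already too rare
                if prev_counts.get(prefix, 0) >= min_count:
                    counts[gram] += 1
        return counts

    corpus = [t.split() for t in [
        "he is a good kid",
        "he is a clever kid",
        "he is a good dog",
    ]]
    unigram_counts = Counter(g for t in corpus for g in ngrams(t, 1))
    bigram_counts = count_with_pruning(corpus, 2, unigram_counts)
    trigram_counts = count_with_pruning(corpus, 3, bigram_counts)
    print(trigram_counts)   # only trigrams whose prefix bigram occurs 3+ times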
