
How do you determine how many variables is too many for a CCA?

Original post: 2017-01-12 03:43:15 · tag: vegan

I am running a CCA of some ecological data with ~50 sites and several hundred species. I know that you have to be careful when the number of explanatory variables approaches the number of samples. I have 23 explanatory variables, so this isn't a problem for me, but I have also heard that using too many explanatory variables can start to "un-constrain" the CCA.

Are there any guidelines about how many explanatory variables are appropriate? So far, I have just plotted them all and then removed the ones that appear to be redundant (leaving me with 8). Can I use the inertia values to help inform/justify this?

Thanks

1 Answer

This is the same question as asking "how many variables are too many for regression analysis?" Not "almost the same", but exactly the same: CCA is an ordination of the fitted values of a linear regression. In the most severe cases you can over-fit. In CCA this is evident when the first eigenvalues of the CCA and the (unconstrained) CA are almost identical and the ordinations look similar in the first dimensions (you can use Procrustes analysis to check this). The extreme case would be that the residual variation disappears, but in ordination you focus on the first dimensions, and there the constraints can get lost much earlier than in the later constrained axes or in the residuals.

More importantly: you must see CCA as a kind of regression analysis and have the same attitude to constraints as to explanatory (independent) variables in regression. If you have no prior hypothesis to study, you have all the model selection problems of regression analysis plus the problems of multivariate ordination, but these are non-technical problems that should be handled somewhere other than Stack Overflow.
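
To put a rough number on the regression analogy, here is a minimal sketch in plain Python. It assumes the standard result for ordinary least squares: q pure-noise predictors fitted to n samples give an expected R² of q / (n - 1), and Ezekiel's adjusted R² corrects for that. CCA uses a weighted regression, so treat the figures as a rough guide rather than exact values for a CCA. The function names and the observed share of 0.60 are hypothetical and chosen only for illustration.

```python
def expected_null_r2(n_sites: int, n_constraints: int) -> float:
    """Expected R^2 when all constraints are pure noise (OLS result)."""
    if not 0 <= n_constraints <= n_sites - 1:
        raise ValueError("need 0 <= n_constraints <= n_sites - 1")
    return n_constraints / (n_sites - 1)


def adjusted_r2(r2: float, n_sites: int, n_constraints: int) -> float:
    """Ezekiel's adjusted R^2, which removes the share expected by chance."""
    residual_df = n_sites - n_constraints - 1
    if residual_df <= 0:
        raise ValueError("no residual degrees of freedom left")
    return 1.0 - (1.0 - r2) * (n_sites - 1) / residual_df


if __name__ == "__main__":
    n_sites = 50          # number of sites, as in the question
    observed_share = 0.60  # hypothetical constrained share of total inertia
    for n_constraints in (8, 23, 40):
        chance = expected_null_r2(n_sites, n_constraints)
        adjusted = adjusted_r2(observed_share, n_sites, n_constraints)
        print(f"q={n_constraints:2d}  chance share={chance:.3f}  "
              f"adjusted share={adjusted:.3f}")
```

With 50 sites, 23 random constraints would be expected to "explain" 23/49 of the variation by chance alone, against 8/49 for 8 constraints. That is one way to see why the constrained share of inertia on its own says little about whether the constraints are doing real work.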