
How to correctly account for country effects in logistic regression?

I use a database with firm-level entries from 12 countries in 2008. I am trying to model innovation (0/1) using a few firm-level variables. I also want to see whether, and how much, innovation is due to country-level effects, so I want to control for country. If I include i.Country in my logistic regression, I get negative z values for nearly every country. I feel this is not right because, when I look at the data, only one country has 0 for innovation more frequently than 1.

Country takes values such as 52, 54, 55, ... and 92. Below is a split of innovation responses by firm country. I tried two things: one is to put i.Country in the regression, and the other is to use dummies. I created dummies for the countries and introduced them all in the regression. Which is correct, and how do I interpret the results?

. tabulate Country INNOV

           | NEW PROD LAST 3 yr?

   Country |         0          1 |     Total
-----------+----------------------+----------
        52 |         4         28 |        32 
        54 |        25         48 |        73 
        55 |        40         48 |        88 
        58 |        40         96 |       136 
        59 |         4         40 |        44 
        60 |        14         29 |        43 
        61 |        39         55 |        94 
        62 |        35         47 |        82 
        75 |        10         54 |        64 
        78 |        28         51 |        79 
        90 |        29        138 |       167 
        92 |       105         69 |       174 
-----------+----------------------+----------
     Total |       373        703 |     1,076 

Here I look at a single country with no independent variables. The odds of innovation when the country is 90 (Germany) are greater than 1, so z is positive. If I repeat this country by country, only 92 gets a negative z.


. logistic INNOV if Country==90

Logistic regression                               Number of obs   =        167
                                                  LR chi2(0)      =      -0.00
                                                  Prob > chi2     =          .
Log likelihood = -77.092379                       Pseudo R2       =    -0.0000

------------------------------------------------------------------------------
       INNOV | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |   4.758621   .9720773     7.64   0.000       3.1886    7.101696
------------------------------------------------------------------------------
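
As a rough check on how to read this output: in an intercept-only logistic model the constant is simply the observed odds, so it can be reproduced by hand from the tabulate counts for Country 90 (29 zeros, 138 ones). The sketch below assumes the usual Wald standard error on the log-odds scale and a delta-method conversion to the odds scale; the variable names are illustrative only.

```python
import math

# Counts for Country 90, taken from the tabulate output above.
n_zero, n_one = 29, 138

odds = n_one / n_zero                             # baseline odds of INNOV = 1
se_log_odds = math.sqrt(1 / n_one + 1 / n_zero)   # Wald SE on the log-odds scale
se_odds = odds * se_log_odds                      # delta-method SE on the odds scale
z = math.log(odds) / se_log_odds                  # z tests log(odds) = 0, i.e. odds = 1

print(round(odds, 6), round(se_odds, 4), round(z, 2))
```

These values line up with the _cons row above, which suggests z is positive whenever ones outnumber zeros in the group and negative otherwise.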

Here I run the regression with one independent variable while controlling (??) for country effects. The z values for the countries are negative (why?).


. logistic INNOV i.Country Mang_MNEexperience 

Logistic regression                               Number of obs   =        481
                                                  LR chi2(12)     =      61.89
                                                  Prob > chi2     =     0.0000
Log likelihood = -283.25686                       Pseudo R2       =     0.0985

------------------------------------------------------------------------------------
             INNOV | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------------+----------------------------------------------------------------
           Country |
               54  |   .3689431   .3062865    -1.20   0.230     .0724962    1.877602
               55  |   .2295013   .1909388    -1.77   0.077     .0449375    1.172091
               58  |   .4457037   .3627449    -0.99   0.321      .090423    2.196917
               59  |   1.689363   1.597736     0.55   0.579     .2646602    10.78344
               60  |   .7459045   .7228328    -0.30   0.762     .1116376    4.983748
               61  |   .1580636   .1313537    -2.22   0.026     .0310076    .8057415
               62  |   .3256028   .2674703    -1.37   0.172     .0650816    1.628988
               75  |   .9975062    1.10341    -0.00   0.998     .1141151    8.719431
               78  |   .6885038   .6454499    -0.40   0.691     .1096308    4.323944
               90  |   .5391077   .4787809    -0.70   0.487     .0945637    3.073453
               92  |   .0549765   .0542165    -2.94   0.003     .0079569    .3798489
                   |
Mang_MNEexperience |   1.083192   .0309218     2.80   0.005      1.02425    1.145525
             _cons |   4.357274   3.409211     1.88   0.060     .9401977    20.19345
------------------------------------------------------------------------------------

Here I use dummies to control for countries.

. logistic INNOV countrydummy1 countrydummy2 countrydummy3 countrydummy4 countrydummy5 countrydummy6 
> countrydummy7 countrydummy8 countrydummy9 countrydummy10 countrydummy11 countrydummy12 Mang_MNEexpe
> rience 
note: countrydummy12 omitted because of collinearity

Logistic regression                               Number of obs   =        481
                                                  LR chi2(12)     =      61.89
                                                  Prob > chi2     =     0.0000
Log likelihood = -283.25686                       Pseudo R2       =     0.0985

------------------------------------------------------------------------------------
             INNOV | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------------+----------------------------------------------------------------
     countrydummy1 |   18.18959   17.93813     2.94   0.003     2.632625    125.6773
     countrydummy2 |   6.710925   4.467707     2.86   0.004     1.820148    24.74333
     countrydummy3 |   4.174535    2.79421     2.13   0.033     1.124241     15.5009
     countrydummy4 |   8.107169   5.216314     3.25   0.001     2.297149    28.61207
     countrydummy5 |   30.72882   24.17314     4.35   0.000     6.575663    143.5993
     countrydummy6 |    13.5677   11.30987     3.13   0.002     2.648225    69.51165
     countrydummy7 |   2.875114   1.907619     1.59   0.111     .7832285    10.55411
     countrydummy8 |   5.922583   3.865446     2.73   0.006     1.648026    21.28425
     countrydummy9 |   18.14423   17.87654     2.94   0.003     2.630846    125.1358
    countrydummy10 |   12.52361   9.977246     3.17   0.002     2.627836    59.68436
    countrydummy11 |    9.80615   6.794078     3.30   0.001     2.522048    38.12797
    countrydummy12 |          1  (omitted)
Mang_MNEexperience |   1.083192   .0309218     2.80   0.005      1.02425    1.145525
             _cons |   .2395476   .1453217    -2.36   0.018     .0729475    .7866357
------------------------------------------------------------------------------------

  1. Which way is the correct one?
  2. Why is z negative when using i.Country and positive when using dummies?
  3. How do I interpret country effects?
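
A quick arithmetic check on the two outputs above may help with questions 1 and 2. Both fits report the same log likelihood (-283.25686), which hints that they are the same model with a different reference country. The sketch below assumes that countrydummy1 codes Country 52, countrydummy11 codes Country 90, and the omitted countrydummy12 codes Country 92; those mappings are not stated in the post.

```python
# Odds ratios copied from the two regression outputs above.
or_92_vs_52 = 0.0549765     # i.Country row "92" (reference category: 52)
or_90_vs_52 = 0.5391077     # i.Country row "90" (reference category: 52)
cons_ref_52 = 4.357274      # _cons when Country 52 is the reference

or_52_vs_92 = 18.18959      # countrydummy1  (assumed: Country 52, reference 92)
or_90_vs_92 = 9.80615       # countrydummy11 (assumed: Country 90, reference 92)
cons_ref_92 = 0.2395476     # _cons when countrydummy12 (assumed: 92) is omitted

# Switching the reference category inverts the odds ratio between the two countries.
print(1 / or_92_vs_52, or_52_vs_92)
# Odds ratios against the new reference are ratios of the old ones.
print(or_90_vs_52 / or_92_vs_52, or_90_vs_92)
# The constant moves to the baseline odds of the new reference country.
print(cons_ref_52 * or_92_vs_52, cons_ref_92)
```

If the assumed mappings hold, each printed pair agrees to rounding, so the sign of z only reflects which country serves as the baseline: against 52 (the highest share of innovators) most countries look worse, and against 92 (the lowest share) every country looks better.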

I assume there is another outer array( at the top that is not included in your example code. Something like this?

function array_set_depth($array, $depth = -1)
{
  $subdepth = $depth + 1;
  $temp = array();
  if ($depth < 0) {
    // Top level: a plain list of nodes, so annotate each node and return the list.
    foreach ($array as $key => $subarray) {
      $temp[$key] = array_set_depth($subarray, $subdepth);
    }
    return $temp;
  }
  if (!empty($array['hasChildren']) && isset($array['children'])) {
    // Recurse into the children one level deeper.
    foreach ($array['children'] as $key => $subarray) {
      $temp[$key] = array_set_depth($subarray, $subdepth);
    }
    $array['children'] = $temp;
  }
  $array['depth'] = $depth;
  return $array;
}

Example usage, with your array assigned to $a:

$b = array_set_depth($a);
print_r($b);

Edit:

To set the depth before the children for nicer printing, you can do this:

function array_set_depth($array, $depth = -1)
{
  $subdepth = $depth + 1;
  $temp = array();
  if ($depth < 0) {
    // Top level: a plain list of nodes, so annotate each node and return the list.
    foreach ($array as $key => $subarray) {
      $temp[$key] = array_set_depth($subarray, $subdepth);
    }
    return $temp;
  }
  $array['depth'] = $depth;
  if (!empty($array['hasChildren']) && isset($array['children'])) {
    foreach ($array['children'] as $key => $subarray) {
      $temp[$key] = array_set_depth($subarray, $subdepth);
    }
    // Unset and re-add 'children' so that it is printed after 'depth'.
    unset($array['children']);
    $array['children'] = $temp;
  }
  return $array;
}

A recursive function like this should do it:

function setDepth(&$a, $depth)
{
    $a['depth']=$depth;
    foreach($a as $key=>$value)
    {
        if (is_array($value))
           setDepth($a[$key], $depth+1);
    }

}

The thing to note is that the array is passed by reference so that we can modify it. Note that we also use this reference in the recursive call to setDepth. Although I used foreach for convenience, the $value variable is a copy, and passing that to setDepth would only make short-lived changes within the scope of the foreach loop.

Modified Paul's code to work with this example.

function setDepth(&$a, $depth = -1)
{
    if (($depth > -1) && !($depth % 2))
      $a['depth']= $depth / 2;
    foreach($a as $key=>$value)
    {
        if (is_array($value))
           setDepth($a[$key], $depth+1);
    }

}
setDepth($a);
print_r($a);

Country is a categorical variable, so you should definitely use dummy encoding. I do not know Stata, so I am not sure what Stata did in the other case, but it is probably wrong. Looking at the odds ratios (and their standard errors), it certainly looks like Country is an important variable, but note that with logistic regression the asymptotic standard errors (based on a normal approximation to the log-likelihood function) can be terrible, so don't trust them as they are. This is the Hauck-Donner phenomenon. What you should do is test the variable Country as a whole, which you can do by fitting a model without that variable, then one with it (otherwise identical), and comparing the log-likelihoods. There could be a direct way of doing this in Stata.
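
The comparison of log-likelihoods described here is a likelihood-ratio test. A minimal sketch follows, using only the Python standard library; chi2_sf and lr_test are hypothetical helper names, and the chi-square tail uses the standard closed forms for integer degrees of freedom. In Stata itself, testparm i.Country (a Wald test) or lrtest after estimates store are the usual routes, though the post does not show either.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: finite Poisson-type sum.
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= (x / 2) / j
            total += term
        return math.exp(-x / 2) * total
    # Odd df: normal tail plus a finite series.
    term, total = 1.0, 0.0
    for j in range(1, (df - 1) // 2 + 1):
        if j > 1:
            term *= x / (2 * j - 1)
        total += term
    tail = math.erfc(math.sqrt(x / 2))
    return tail + math.sqrt(2 * x / math.pi) * math.exp(-x / 2) * total

def lr_test(ll_restricted, ll_full, df):
    """Likelihood-ratio statistic and p-value for two nested fits."""
    stat = 2 * (ll_full - ll_restricted)
    return stat, chi2_sf(stat, df)

# Sanity check against the question's own output: LR chi2(12) = 61.89
# is reported with Prob > chi2 = 0.0000.
print(chi2_sf(61.89, 12))
```

To test Country as a whole, lr_test would need the log likelihood from a fit with Mang_MNEexperience only (not shown in the post) together with -283.25686 from the full model, using 11 degrees of freedom for the 11 country indicators.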

But there are also other problems with binomial GLMs (generalized linear models), so you should do some serious study before using them! One example is overdispersion. You should check for that as well, as a matter of routine.

You could also have a look at "Principled way of collapsing categorical variables with many levels?".

Something like this should do the trick:

function setdepth(&$arr, $depth = 0)
{
    // $arr is taken by reference so the depth values persist in the caller's array.
    foreach ($arr as $key => $val)
    {
        $arr[$key]['depth'] = $depth;
        if (!empty($arr[$key]['hasChildren']))
        {
            setdepth($arr[$key]['children'], $depth+1);
        }
    }
}

It would be easier if your array started with an index rather than with values, so example usage could look like this:

$arr[0] = $website;
setdepth($arr, 0);

where $website is the array from your example.

This might be helpful:

function extend( $arr, $myArr=array() ) {

    foreach( $arr as $key => $value ) {
        if( is_array( $value ) ) {
            // Recurse into nested arrays and keep the values collected so far.
            $myArr = extend( $value, $myArr );
        } else {
            $myArr[ $key ] = $value;
        }
    }

    return $myArr;
}

The function is called "extend" because it not only copies an array into a new one, it can also extend existing arrays.

To extend an array, pass it as the second parameter; otherwise pass an empty array. The function loops through the array's elements and checks whether each one is an array. If it is, the function is invoked again; otherwise it copies the value into the other array, which is returned at the end.
