简体   繁体   English

无法在启用 ECC 的 NVIDIA 设备上创建上下文

[英]Cannot create context on NVIDIA device with ECC enabled

On a node with 4 NVIDIA GPUs I enabled on device 0 the ECC memory protection (all other have ECC disabled).在具有 4 个 NVIDIA GPU 的节点上,我在设备 0 上启用了 ECC 内存保护(所有其他都禁用了 ECC)。 Since I enabled ECC on device 0 my application (CUDA, using just one device) hangs when it tries to create the context on this device 0 (driver API).由于我在设备 0 上启用了 ECC,我的应用程序(CUDA,仅使用一个设备)在尝试在此设备 0(驱动程序 API)上创建上下文时挂起。 I don't know why it hangs at that point.我不知道为什么它在那一刻挂起。 If I use a different device setting CUDA_VISIBLE_DEVICE accordingly to another device it works fine.如果我使用不同的设备设置 CUDA_VISIBLE_DEVICE 相应地另一个设备它工作正常。 It must have to do with enabling ECC.它必须与启用 ECC 有关。 Any thoughts?有什么想法吗? Here the output of nvidia-smi : (Why does it report 99% volatile GPU utilization, nothing is running there?)这里是nvidia-smi的输出:(为什么它报告 99% 不稳定的 GPU 利用率,那里什么都没有运行?)

+------------------------------------------------------+                       
| NVIDIA-SMI 4.304.54   Driver Version: 304.54         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name                     | Bus-Id        Disp.  | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap| Memory-Usage         | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20m               | 0000:02:00.0     Off |                    1 |
| N/A   29C    P0    49W / 225W |   0%   12MB / 4799MB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20m               | 0000:03:00.0     Off |                    0 |
| N/A   22C    P8    15W / 225W |   0%   12MB / 4799MB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K20m               | 0000:83:00.0     Off |                    0 |
| N/A   22C    P8    24W / 225W |   0%   11MB / 4799MB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K20m               | 0000:84:00.0     Off |                    0 |
| N/A   23C    P8    25W / 225W |   0%   11MB / 4799MB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|  No running compute processes found                                         |
+-----------------------------------------------------------------------------+

EDIT: nvidia-smi -a reports ECC enabled on all devices.编辑: nvidia-smi -a报告在所有设备上启用 ECC。 Strange!奇怪的!

==============NVSMI LOG==============

Timestamp                       : Fri Apr 26 10:18:14 2013
Driver Version                  : 304.54

Attached GPUs                   : 4
GPU 0000:02:00.0
    Product Name                : Tesla K20m
    Display Mode                : Disabled
    Persistence Mode            : Enabled
    Driver Model
        Current                 : N/A
        Pending                 : N/A
    Serial Number               : 0324512044699
    VBIOS Version               : 80.10.11.00.0B
    Inforom Version
        Image Version           : 2081.0208.01.07
        OEM Object              : 1.1
        ECC Object              : 3.0
        Power Management Object : N/A
    GPU Operation Mode
        Current                 : Compute
        Pending                 : Compute
    PCI
        Bus                     : 0x02
        Device                  : 0x00
        Domain                  : 0x0000
        Device Id               : 0x102810DE
        Bus Id                  : 0000:02:00.0
        Sub System Id           : 0x101510DE
        GPU Link Info
            PCIe Generation
                Max             : 2
                Current         : 2
            Link Width
                Max             : 16x
                Current         : 16x
    Fan Speed                   : N/A
    Performance State           : P0
    Clocks Throttle Reasons
        Idle                    : Not Active
        User Defined Clocks     : Not Active
        SW Power Cap            : Not Active
        HW Slowdown             : Not Active
        Unknown                 : Not Active
    Memory Usage
        Total                   : 4799 MB
        Used                    : 12 MB
        Free                    : 4787 MB
    Compute Mode                : Default
    Utilization
        Gpu                     : 99 %
        Memory                  : 0 %
    Ecc Mode
        Current                 : Enabled
        Pending                 : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Texture Memory  : 0
                Total           : 0
            Double Bit            
                Device Memory   : 1
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Texture Memory  : 0
                Total           : 1
        Aggregate
            Single Bit            
                Device Memory   : 1
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Texture Memory  : 0
                Total           : 1
            Double Bit            
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Texture Memory  : 0
                Total           : 0
    Temperature
        Gpu                     : 29 C
    Power Readings
        Power Management        : Supported
        Power Draw              : 49.51 W
        Power Limit             : 225.00 W
        Default Power Limit     : 225.00 W
        Min Power Limit         : 150.00 W
        Max Power Limit         : 225.00 W
    Clocks
        Graphics                : 758 MHz
        SM                      : 758 MHz
        Memory                  : 2600 MHz
    Applications Clocks
        Graphics                : 705 MHz
        Memory                  : 2600 MHz
    Max Clocks
        Graphics                : 758 MHz
        SM                      : 758 MHz
        Memory                  : 2600 MHz
    Compute Processes           : None

GPU 0000:03:00.0
    Product Name                : Tesla K20m
    Display Mode                : Disabled
    Persistence Mode            : Enabled
    Driver Model
        Current                 : N/A
        Pending                 : N/A
    Serial Number               : 0324512044821
    VBIOS Version               : 80.10.11.00.0B
    Inforom Version
        Image Version           : 2081.0208.01.07
        OEM Object              : 1.1
        ECC Object              : 3.0
        Power Management Object : N/A
    GPU Operation Mode
        Current                 : Compute
        Pending                 : Compute
    PCI
        Bus                     : 0x03
        Device                  : 0x00
        Domain                  : 0x0000
        Device Id               : 0x102810DE
        Bus Id                  : 0000:03:00.0
        Sub System Id           : 0x101510DE
        GPU Link Info
            PCIe Generation
                Max             : 2
                Current         : 1
            Link Width
                Max             : 16x
                Current         : 16x
    Fan Speed                   : N/A
    Performance State           : P8
    Clocks Throttle Reasons
        Idle                    : Active
        User Defined Clocks     : Not Active
        SW Power Cap            : Not Active
        HW Slowdown             : Not Active
        Unknown                 : Not Active
    Memory Usage
        Total                   : 4799 MB
        Used                    : 12 MB
        Free                    : 4787 MB
    Compute Mode                : Default
    Utilization
        Gpu                     : 0 %
        Memory                  : 0 %
    Ecc Mode
        Current                 : Enabled
        Pending                 : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Texture Memory  : 0
                Total           : 0
            Double Bit            
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Texture Memory  : 0
                Total           : 0
        Aggregate
            Single Bit            
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Texture Memory  : 0
                Total           : 0
            Double Bit            
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Texture Memory  : 0
                Total           : 0
    Temperature
        Gpu                     : 19 C
    Power Readings
        Power Management        : Supported
        Power Draw              : 15.22 W
        Power Limit             : 225.00 W
        Default Power Limit     : 225.00 W
        Min Power Limit         : 150.00 W
        Max Power Limit         : 225.00 W
    Clocks
        Graphics                : 324 MHz
        SM                      : 324 MHz
        Memory                  : 324 MHz
    Applications Clocks
        Graphics                : 705 MHz
        Memory                  : 2600 MHz
    Max Clocks
        Graphics                : 758 MHz
        SM                      : 758 MHz
        Memory                  : 2600 MHz
    Compute Processes           : None

GPU 0000:83:00.0
    Product Name                : Tesla K20m
    Display Mode                : Disabled
    Persistence Mode            : Enabled
    Driver Model
        Current                 : N/A
        Pending                 : N/A
    Serial Number               : 0324512044783
    VBIOS Version               : 80.10.11.00.0B
    Inforom Version
        Image Version           : 2081.0208.01.07
        OEM Object              : 1.1
        ECC Object              : 3.0
        Power Management Object : N/A
    GPU Operation Mode
        Current                 : Compute
        Pending                 : Compute
    PCI
        Bus                     : 0x83
        Device                  : 0x00
        Domain                  : 0x0000
        Device Id               : 0x102810DE
        Bus Id                  : 0000:83:00.0
        Sub System Id           : 0x101510DE
        GPU Link Info
            PCIe Generation
                Max             : 2
                Current         : 1
            Link Width
                Max             : 16x
                Current         : 16x
    Fan Speed                   : N/A
    Performance State           : P8
    Clocks Throttle Reasons
        Idle                    : Active
        User Defined Clocks     : Not Active
        SW Power Cap            : Not Active
        HW Slowdown             : Not Active
        Unknown                 : Not Active
    Memory Usage
        Total                   : 4799 MB
        Used                    : 11 MB
        Free                    : 4788 MB
    Compute Mode                : Default
    Utilization
        Gpu                     : 0 %
        Memory                  : 0 %
    Ecc Mode
        Current                 : Enabled
        Pending                 : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Texture Memory  : 0
                Total           : 0
            Double Bit            
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Texture Memory  : 0
                Total           : 0
        Aggregate
            Single Bit            
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Texture Memory  : 0
                Total           : 0
            Double Bit            
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Texture Memory  : 0
                Total           : 0
    Temperature
        Gpu                     : 22 C
    Power Readings
        Power Management        : Supported
        Power Draw              : 24.74 W
        Power Limit             : 225.00 W
        Default Power Limit     : 225.00 W
        Min Power Limit         : 150.00 W
        Max Power Limit         : 225.00 W
    Clocks
        Graphics                : 324 MHz
        SM                      : 324 MHz
        Memory                  : 324 MHz
    Applications Clocks
        Graphics                : 705 MHz
        Memory                  : 2600 MHz
    Max Clocks
        Graphics                : 758 MHz
        SM                      : 758 MHz
        Memory                  : 2600 MHz
    Compute Processes           : None

GPU 0000:84:00.0
    Product Name                : Tesla K20m
    Display Mode                : Disabled
    Persistence Mode            : Enabled
    Driver Model
        Current                 : N/A
        Pending                 : N/A
    Serial Number               : 0324512044628
    VBIOS Version               : 80.10.11.00.0B
    Inforom Version
        Image Version           : 2081.0208.01.07
        OEM Object              : 1.1
        ECC Object              : 3.0
        Power Management Object : N/A
    GPU Operation Mode
        Current                 : Compute
        Pending                 : Compute
    PCI
        Bus                     : 0x84
        Device                  : 0x00
        Domain                  : 0x0000
        Device Id               : 0x102810DE
        Bus Id                  : 0000:84:00.0
        Sub System Id           : 0x101510DE
        GPU Link Info
            PCIe Generation
                Max             : 2
                Current         : 1
            Link Width
                Max             : 16x
                Current         : 16x
    Fan Speed                   : N/A
    Performance State           : P8
    Clocks Throttle Reasons
        Idle                    : Active
        User Defined Clocks     : Not Active
        SW Power Cap            : Not Active
        HW Slowdown             : Not Active
        Unknown                 : Not Active
    Memory Usage
        Total                   : 4799 MB
        Used                    : 11 MB
        Free                    : 4788 MB
    Compute Mode                : Default
    Utilization
        Gpu                     : 0 %
        Memory                  : 0 %
    Ecc Mode
        Current                 : Enabled
        Pending                 : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Texture Memory  : 0
                Total           : 0
            Double Bit            
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Texture Memory  : 0
                Total           : 0
        Aggregate
            Single Bit            
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Texture Memory  : 0
                Total           : 0
            Double Bit            
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Texture Memory  : 0
                Total           : 0
    Temperature
        Gpu                     : 23 C
    Power Readings
        Power Management        : Supported
        Power Draw              : 25.47 W
        Power Limit             : 225.00 W
        Default Power Limit     : 225.00 W
        Min Power Limit         : 150.00 W
        Max Power Limit         : 225.00 W
    Clocks
        Graphics                : 324 MHz
        SM                      : 324 MHz
        Memory                  : 324 MHz
    Applications Clocks
        Graphics                : 705 MHz
        Memory                  : 2600 MHz
    Max Clocks
        Graphics                : 758 MHz
        SM                      : 758 MHz
        Memory                  : 2600 MHz
    Compute Processes           : None

The nvidia-smi output shows an uncorrectable ECC error on the device. nvidia-smi 输出显示设备上存在不可纠正的 ECC 错误。 You can reset the error using nvidia-smi --reset-ecc-errors=0 -g 0 and retry.您可以使用nvidia-smi --reset-ecc-errors=0 -g 0重置错误并重试。 The 0 in the reset indicates to reset the volatile counter only, the aggregate counter will still indicate that an error has happened in the past.重置中的0表示仅重置易失性计数器,聚合计数器仍将表示过去发生过错误。

If you see further errors from the device then it would be worth investigating the cause further.如果您从设备中看到更多错误,则值得进一步调查原因。

Note that in the summary view the ECC field you are looking at is actually "Volatile Uncorr. ECC", ie it's the error count not the ECC enabled/disabled flag.请注意,在摘要视图中,您正在查看的 ECC 字段实际上是“Volatile Uncorr. ECC”,即它是错误计数,而不是 ECC 启用/禁用标志。 If ECC is disabled it will say "N/A".如果 ECC 被禁用,它将显示“N/A”。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM