
Is it a kernel freeze?

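We are seeing missed interrupts on an embedded Linux system with a multicore processor running at 1.25 GHz.

Background:

  • Kernel version: 2.6.32.27
  • We have user-space processes that need real-time performance.
  • They operate on a 1 ms boundary.
    • That is, within each 1 ms they are expected to complete a set of tasks, which may take up to about 800 µs.
  • An external FPGA provides 1 ms and 10 ms interrupts to the multicore processor through GPIO pins configured as edge-triggered interrupts.
  • These interrupts are handled in a kernel driver.

The software architecture is such that a user process, after completing its work, issues an ioctl to the GPIO driver.

In this ioctl, the driver puts the process to sleep with wait_event_interruptible(). Whenever the next 1 ms interrupt is received, the ISR wakes the process up. This cycle repeats.

Both the 1 ms and 10 ms interrupts are routed to a single core of the processor using smp_affinity.

Problem:

  • Sometimes we find that some interrupts are missed.
    • (That is, the ISR itself is not invoked.)
  • After 12 to 20 minutes, the ISR is hit normally again.
  • We can tell this by profiling the time between consecutive ISR calls and by incrementing a counter as the first thing in the ISR.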
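
The gap analysis described in the last bullet can be sketched offline as follows. This is a minimal illustration, not the driver's actual profiling code: the function name count_missed_ticks and the constant CYCLES_PER_MS are hypothetical, and it assumes the cycle counter ticks at the 1.25 GHz core clock and that the timestamps have already been copied out of the ring buffer in chronological order.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: the cycle counter runs at the 1.25 GHz core clock,
 * so one nominal 1 ms period is 1,250,000 cycles. */
#define CYCLES_PER_MS 1250000ULL

/*
 * Estimate how many periodic ticks are missing from a series of ISR
 * entry timestamps (hypothetical helper, for offline analysis).
 *
 * ts     - cycle counts recorded at ISR entry, oldest first
 * n      - number of entries in ts
 * period - nominal distance between two ticks, in cycles
 *
 * Each gap is rounded to the nearest whole number of periods, so
 * ordinary jitter of less than half a period is not counted as a miss.
 */
uint64_t count_missed_ticks(const uint64_t *ts, size_t n, uint64_t period)
{
    uint64_t missed = 0;
    size_t i;

    if (period == 0)
        return 0;

    for (i = 1; i < n; i++) {
        /* Unsigned subtraction stays correct across a counter wrap. */
        uint64_t gap = ts[i] - ts[i - 1];
        uint64_t periods = (gap + period / 2) / period;

        if (periods > 1)
            missed += periods - 1;
    }
    return missed;
}
```

Feeding it the contents of Gpio1msTimeStamp with period set to CYCLES_PER_MS would give a rough count of ticks for which the ISR never ran, provided the assumed counter frequency matches the hardware.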

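This mostly happens under high process-level system load; it is random and hard to reproduce.

I have attached the skeleton code.

First, I have to determine whether it is a hardware or a software problem. Since the interrupts come from an FPGA, we do not have much doubt about the hardware.

Is this a kernel freeze? That seems the most likely case, since the CPU cycle counter keeps incrementing.

Could it be a CPU freeze caused by a thermal condition? If so, the CPU cycle counter would not have incremented in the first place.

Any pointers for debugging or isolating the root cause would be very helpful, bearing in mind the kernel version we are on and the profiling/debugging tools available for it.

Skeleton code: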

/* Build time Configuration */

/* Macros */
DECLARE_WAIT_QUEUE_HEAD(wait);

/** Structure Definitions */
/** Global Variables */
gpio_dev_t gpio1msDev, gpio10msDev;
GpioIntProfileSectorData_t GpioSigProfileData[MAX_GPIO_INT_CONSUMERS];
GpioIntProfileSectorData_t *ProfilePtrSector;
GpioIntProfileData_t GpioProfileData;
GpioIntProfileData_t *GpioIntProfilePtr;
CurrentTickProfile_t TimeStamp;
uint64_t ModuleInitDone = 0, FirstTimePIDWrite = 0;
uint64_t PrevCycle = 0, NowCycle = 0;
volatile uint64_t TenMsFlag, OneMsFlag;
uint64_t OneMsIsrTime, TenMsIsrTime;
uint64_t OneMsCounter, OneMsTime, TenMsTime, SyncStarted;
uint64_t Prev = 0, Now = 0, DiffTen = 0, DiffOne, SesSyncHappened;
static DEFINE_SPINLOCK(GpioSyncLock);
static DEFINE_SPINLOCK(IoctlSyncLock);
uint64_t EventPresent[MAX_GPIO_INT_CONSUMERS];

GpioEvent_t CurrentEvent = KERN_NO_EVENT;
TickSyncSes_t *SyncSesPtr = NULL;


/** Function Declarations */

ssize_t write_pid(struct file *filep, const char __user * buf, size_t count, loff_t * ppos);
long Gpio_compat_ioctl(struct file *filep, unsigned int cmd, unsigned long arg);

static const struct file_operations my_fops = {
    .write        = write_pid,
    .compat_ioctl = Gpio_compat_ioctl,
};




/**
 * IOCTL function for GPIO interrupt module
 *
 * @return
 */
long Gpio_compat_ioctl(struct file *filep, unsigned int cmd, unsigned long arg)
{
    int len = 1, status = 0;
    uint8_t Instance;
    uint64_t *EventPtr;
    GpioIntProfileSectorData_t *SectorProfilePtr, *DebugProfilePtr;
    GpioEvent_t EventToGive = KERN_NO_EVENT;
    pid_t CurrentPid = current->pid;

    spin_lock(&IoctlSyncLock);  // Take the spinlock
    Instance = GetSector(CurrentPid);
    SectorProfilePtr = &GpioSigProfileData[Instance];
    EventPtr = &EventPresent[Instance];
    spin_unlock(&IoctlSyncLock);

    if (Instance <= MAX_GPIO_INT_CONSUMERS)
    {
        switch (cmd)
        {
        case IOCTL_WAIT_ON_EVENT:
            if (*EventPtr)
            {
                /* Dont block here since this is a case where interrupt has happened
                 * before process calling the polling API */
                *EventPtr = 0;
                /* some profiling code */
            }
            else
            {
                status = wait_event_interruptible(wait, (*EventPtr == 1));
                *EventPtr = 0;
            }

            /* profiling code */

            TimeStamp.CurrentEvent = EventToGive;
            len = copy_to_user((char *)arg, (char *)&TimeStamp, sizeof(CurrentTickProfile_t));
            break;
        default:
            break;
        }
    }
    else
    {
        return -EINVAL;
    }

    return 0;
}

/**
 * Send signals to registered PID's.
 *
 * @return
 */
static void WakeupWaitQueue(GpioEvent_t Event)
{
    int i;

    /* some profile code */

    CurrentEvent = Event;

    // we dont wake up debug app hence "< MAX_GPIO_INT_CONSUMERS" is used in for loop
    for (i = 0; i < MAX_GPIO_INT_CONSUMERS; i++)
    {
        EventPresent[i] = 1;
    }
    wake_up_interruptible(&wait);
}

/**
 * 1ms Interrupt handler
 *
 * @return
 */
static irqreturn_t gpio_int_handler_1ms(int irq, void *irq_arg)
{
    uint64_t reg_read, my_core_num;
    unsigned long flags;
    GpioEvent_t event = KERN_NO_EVENT;

    /* code to clear the interrupt registers */


    /************ profiling start************/
    NowCycle = get_cpu_cycle();
    GpioIntProfilePtr->TotalOneMsInterrupts++;

    /* Check the max diff between consecutive interrupts */
    if (PrevCycle)
    {
        DiffOne = NowCycle - PrevCycle;
        if (DiffOne > GpioIntProfilePtr->OneMsMaxDiff)
            GpioIntProfilePtr->OneMsMaxDiff = DiffOne;
    }
    PrevCycle = NowCycle;

    TimeStamp.OneMsCount++; /* increment the counter */

    /* Store the timestamp */

    GpioIntProfilePtr->Gpio1msTimeStamp[GpioIntProfilePtr->IndexOne] = NowCycle;
    TimeStamp.OneMsTimeStampAtIsr = NowCycle;
    GpioIntProfilePtr->IndexOne++;
    if (GpioIntProfilePtr->IndexOne == GPIO_PROFILE_ARRAY_SIZE)
        GpioIntProfilePtr->IndexOne = 0;
    /************ profiling end************/

    /*
     * Whenever 10ms Interrupt happens we send only one event to the upper layers.
     * Hence it is necessary to sync between 1 & 10ms interrupts.
     * There is a chance that sometimes 1ms can happen at first and sometimes 10ms.
     *
     */
    /******** Sync mechanism ***********/
    spin_lock_irqsave(&GpioSyncLock, flags);    // Take the spinlock
    OneMsCounter++;
    OneMsTime = NowCycle;
    DiffOne = OneMsTime - TenMsTime;

    if (DiffOne < MAX_OFFSET_BETWEEN_1_AND_10MS)    //ten ms has happened first
    {
        if (OneMsCounter == 10)
        {
            event = KERN_BOTH_EVENT;
            SyncStarted = 1;
        }
        else
        {
            if (SyncStarted)
            {
                if (OneMsCounter < 10)
                {
                    GpioIntProfilePtr->TickSyncErrAt1msLess++;
                }
                else if (OneMsCounter > 10)
                {
                    GpioIntProfilePtr->TickSyncErrAt1msMore++;
                }
            }
        }
        OneMsCounter = 0;
    }
    else
    {
        if (OneMsCounter < 10)
        {
            if (SyncStarted)
            {
                event = KERN_ONE_MS_EVENT;
            }
        }
        else if (OneMsCounter > 10)
        {
            OneMsCounter = 0;
            if (SyncStarted)
            {
                GpioIntProfilePtr->TickSyncErrAt1msMore++;
            }
        }
    }
    TimeStamp.SFN = OneMsCounter;
    spin_unlock_irqrestore(&GpioSyncLock, flags);
    /******** Sync mechanism ***********/

    if(event != KERN_NO_EVENT)
        WakeupWaitQueue(event);

    OneMsIsrTime = get_cpu_cycle() - NowCycle;
    if (GpioIntProfilePtr->Max1msIsrTime < OneMsIsrTime)
        GpioIntProfilePtr->Max1msIsrTime = OneMsIsrTime;
    return IRQ_HANDLED;
}

/**
 * 10ms Interrupt handler
 *
 * @return
 */
static irqreturn_t gpio_int_handler_10ms(int irq, void *irq_arg)
{
    uint64_t reg_read, my_core_num;
    unsigned long flags;
    GpioEvent_t event = KERN_NO_EVENT;

    /* clear the interrupt */

    /************ profiling start************/
    GpioIntProfilePtr->TotalTenMsInterrupts++;
    Now = get_cpu_cycle();
    if (Prev)
    {
        DiffTen = Now - Prev;
        if (DiffTen > GpioIntProfilePtr->TenMsMaxDiff)
            GpioIntProfilePtr->TenMsMaxDiff = DiffTen;
    }
    Prev = Now;
    TimeStamp.OneMsCount++; /* increment the counter */
    TimeStamp.TenMsCount++;
    GpioIntProfilePtr->Gpio10msTimeStamp[GpioIntProfilePtr->IndexTen] = Now;
    TimeStamp.TenMsTimeStampAtIsr = Now;
    //do_gettimeofday(&TimeOfDayAtIsr.TimeAt10MsIsr);
    GpioIntProfilePtr->IndexTen++;
    if (GpioIntProfilePtr->IndexTen == GPIO_PROFILE_ARRAY_SIZE)
        GpioIntProfilePtr->IndexTen = 0;
    /************ profiling end************/

    /******** Sync mechanism ***********/
    spin_lock_irqsave(&GpioSyncLock, flags);
    TenMsTime = Now;
    DiffTen = TenMsTime - OneMsTime;

    if (DiffTen < MAX_OFFSET_BETWEEN_1_AND_10MS)    //one ms has happened first
    {
        if (OneMsCounter == 10)
        {
            TimeStamp.OneMsTimeStampAtIsr = Now;
            event = KERN_BOTH_EVENT;
            SyncStarted = 1;
        }
        OneMsCounter = 0;
    }
    else
    {
        if (SyncStarted)
        {
            if (OneMsCounter < 9)
            {
                GpioIntProfilePtr->TickSyncErrAt10msLess++;
                OneMsCounter = 0;
            }
            else if (OneMsCounter > 9)
            {
                GpioIntProfilePtr->TickSyncErrAt10msMore++;
                OneMsCounter = 0;
            }
        }
        else
        {
            if (OneMsCounter != 9)
                OneMsCounter = 0;
        }
    }
    TimeStamp.SFN = OneMsCounter;
    spin_unlock_irqrestore(&GpioSyncLock, flags);
    /******** Sync mechanism ***********/

    if(event != KERN_NO_EVENT)
        WakeupWaitQueue(event);

    TenMsIsrTime = get_cpu_cycle() - Now;
    if (GpioIntProfilePtr->Max10msIsrTime < TenMsIsrTime)
        GpioIntProfilePtr->Max10msIsrTime = TenMsIsrTime;

    return IRQ_HANDLED;
}

Resetting EventPresent after waiting for the event in wait_event_interruptible()

EventPtr = &EventPresent[Instance];
...
status = wait_event_interruptible(wait, (*EventPtr == 1));
*EventPtr = 0;

looks suspicious.

If WakeupWaitQueue() runs concurrently, between the wake-up and the reset, the event it sets

for (i = 0; i < MAX_GPIO_INT_CONSUMERS; i++)
{
    EventPresent[i] = 1;
}
wake_up_interruptible(&wait);

will be lost.

It is better to have two independent counters, one for raised events and one for processed events:

uint64_t EventPresent[MAX_GPIO_INT_CONSUMERS]; // Number of raised events
uint64_t EventProcessed[MAX_GPIO_INT_CONSUMERS]; // Number of processed events

In that case, the wait condition can be a comparison of these counters:

Gpio_compat_ioctl()
{
    ...
    EventPresentPtr = &EventPresent[Instance];
    EventProcessedPtr = &EventProcessed[Instance];
    ...
    status = wait_event_interruptible(wait, (*EventPresentPtr != *EventProcessedPtr));
    (*EventProcessedPtr)++;
    ...
}

WakeupWaitQueue()
{
    ...
    for (i = 0; i < MAX_GPIO_INT_CONSUMERS; i++)
    {
        EventPresent[i]++;
    }
    wake_up_interruptible(&wait);
}
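
To make the difference between the two schemes concrete, here is a small single-threaded user-space model of the interleaving described above. It is only a sketch: the type and function names (flag_event_t, counted_event_t, replay_with_flag, replay_with_counters) are made up for illustration, and it replays one fixed ordering instead of exercising real concurrency. In an actual driver the counters would still need to be accessed safely from both the ISR and the ioctl path.

```c
#include <stdint.h>

/* Flag scheme from the question: a single "event present" flag. */
typedef struct { uint64_t present; } flag_event_t;

/* Counter scheme from this answer: raised vs. processed counts. */
typedef struct { uint64_t raised; uint64_t processed; } counted_event_t;

static void flag_raise(flag_event_t *e)         { e->present = 1; }
static int  flag_pending(const flag_event_t *e) { return e->present == 1; }
static void flag_consume(flag_event_t *e)       { e->present = 0; }

static void counted_raise(counted_event_t *e)   { e->raised++; }
static int  counted_pending(const counted_event_t *e)
{
    return e->raised != e->processed;
}
static void counted_consume(counted_event_t *e) { e->processed++; }

/*
 * Replay: first tick, the waiter sees its condition become true, a
 * second tick lands before the waiter acknowledges, then the waiter
 * acknowledges and keeps polling until nothing is pending.
 * Returns how many of the two raised events the waiter handled.
 */
unsigned replay_with_flag(void)
{
    flag_event_t e = { 0 };
    unsigned handled = 0;

    flag_raise(&e);              /* ISR: first tick */
    if (flag_pending(&e)) {      /* wait condition is true */
        flag_raise(&e);          /* ISR: second tick, before the reset */
        flag_consume(&e);        /* the reset wipes out both ticks */
        handled++;
    }
    while (flag_pending(&e)) {   /* later ioctl calls */
        flag_consume(&e);
        handled++;
    }
    return handled;
}

unsigned replay_with_counters(void)
{
    counted_event_t e = { 0, 0 };
    unsigned handled = 0;

    counted_raise(&e);           /* ISR: first tick */
    if (counted_pending(&e)) {   /* wait condition is true */
        counted_raise(&e);       /* ISR: second tick, before the ack */
        counted_consume(&e);     /* acknowledges exactly one tick */
        handled++;
    }
    while (counted_pending(&e)) {/* the second tick is still visible */
        counted_consume(&e);
        handled++;
    }
    return handled;
}
```

In this replay the flag version handles one of the two raised events, while the counter version handles both, which is the loss the answer is pointing at.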

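It was not a kernel freeze. We have a free core in the system that runs bare metal. We routed the 1 ms interrupt to this bare-metal core as well. When the problem occurs, we compare against the bare-metal core's profile information. On the bare-metal core, the ISR is hit properly, in line with the elapsed time. With this we ruled out hardware and thermal problems.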
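
A comparison of this kind could be done offline along the following lines. This is a hedged sketch, not the code we actually used: count_unmatched_ticks is a hypothetical helper, and it assumes the ISR timestamps from the bare-metal core and from the Linux core are sorted in ascending order, are expressed in the same timebase, and do not wrap within the window being compared.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Count reference ticks (bare-metal core) for which no observed tick
 * (Linux core) lies within +/- tol cycles.
 *
 * ref, obs - ISR entry timestamps, sorted ascending, same timebase
 *            (assumption; counter wrap-around is not handled here)
 * tol      - largest skew, in cycles, still accepted as the same tick
 */
size_t count_unmatched_ticks(const uint64_t *ref, size_t n_ref,
                             const uint64_t *obs, size_t n_obs,
                             uint64_t tol)
{
    size_t i, j = 0, unmatched = 0;

    for (i = 0; i < n_ref; i++) {
        /* Skip observed ticks that are too early to belong to ref[i]. */
        while (j < n_obs && obs[j] + tol < ref[i])
            j++;

        if (j < n_obs && obs[j] <= ref[i] + tol)
            j++;            /* matched: consume this observed tick */
        else
            unmatched++;    /* the Linux ISR never ran for this tick */
    }
    return unmatched;
}
```

A non-zero result while the reference series itself is evenly spaced would point at the Linux side and away from the FPGA or the board.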

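Next, on a closer look at the code, we began to suspect that the spinlock was causing the missed interrupts. As an experiment, we changed the logic to run the ISRs without the spinlock. Now we see no missed interrupts.

So the problem seems to be solved, yet with the spinlock present the system also worked properly under normal load. The problem arises only under very high CPU load. This is something I do not have an answer for: why, only under high load, does taking the spinlock cause the other interrupt to be missed?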
