Is it possible to apply a Machine Learning algorithm to predict failures in large HPC systems, based on years of systematic data collection?

The columns of the provided CSV dataset look like the following:

DATE | Hardware Identifier | What Failed | Description of Failure | Action Taken

The complete data can be downloaded from the Dropbox service using this link: data.csv. The data is very systematic; the input is consistent and nicely structured. It comes from a Computer Failure Data Repository; additional details can be found at this link at USENIX: PNNL.

About the data: There are a little over 2,800 entries of single failure events, collected over 4 years. Each event is described by the exact date and time it took place, which node in the system failed, and which hardware component of that node failed.

About the system: It consists of 980 nodes performing heavy calculations for the Molecular Science Computing Facility. Each node is designated by its own unique ID.
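For reference, here is how I am loading the event log with pandas. The column names are my own guesses based on the schema above and may need adjusting to the actual CSV headers.

```python
import pandas as pd

# Assumed column names; adjust to match the real headers in data.csv.
cols = ["date", "hardware_id", "what_failed", "description", "action_taken"]
df = pd.read_csv("data.csv", names=cols, header=0, parse_dates=["date"])
df = df.sort_values("date").reset_index(drop=True)

print(f"{len(df)} failure events across {df['hardware_id'].nunique()} nodes")
```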

My question: Is it possible to perform any meaningful Machine Learning technique on such a dataset that would, in the end, be capable of predicting future failures in the system? For example, would it be possible to train an ML algorithm on the provided dataset to predict either:

  • What node might fail soon (based on the Hardware Identifier field)
  • What (node, piece of hardware) combination might fail soon (based on the Hardware Identifier and either the What Failed or the Description of Failure field)
  • What kind of failure might occur next anywhere in the system (based on the What Failed field)

To me, this sounds like a huge classification problem. For example, in the case of (node, piece of hardware that failed), there are several thousand different possibilities (classes). Bearing in mind that there are only a little over 2,800 single failure events in the table, I don't feel this would work.

I am also confused about how I should feed the data into the algorithm. Should the only input to the algorithm be the DATE field (converted to a numeric, linearly growing time value)? That doesn't seem right. Is it possible to somehow feed the algorithm the time variable combined with some history of recent failure events? Should I restructure the data into time variable + failure history (limited, for example, to the last 30 days, or covering the whole failure history of the system)? A sketch of what I mean by such a restructuring follows.
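To make the "time + recent failure history" idea concrete, here is a rough sketch of turning the raw event log into one feature row per event. It assumes the `df` loaded above; the window length and the feature names (`node_failures_30d`, etc.) are illustrative choices, not anything canonical.

```python
import pandas as pd

def build_features(df, window_days=30):
    """Summarize each event's recent failure history (assumed schema)."""
    rows = []
    for _, event in df.iterrows():
        t, node = event["date"], event["hardware_id"]
        history = df[df["date"] < t]  # everything strictly before this event
        node_history = history[history["hardware_id"] == node]
        window_start = t - pd.Timedelta(days=window_days)
        rows.append({
            # failures on this node within the window
            "node_failures_30d": int((node_history["date"] >= window_start).sum()),
            # failures anywhere in the system within the same window
            "system_failures_30d": int((history["date"] >= window_start).sum()),
            # recency of this node's previous failure (-1 if none)
            "days_since_node_failure": (
                (t - node_history["date"].max()).days if not node_history.empty else -1
            ),
            # one possible classification target: which component failed
            "target_what_failed": event["what_failed"],
        })
    return pd.DataFrame(rows)

features = build_features(df)
```

With roughly 2,800 events, the quadratic scan above is fast enough; a real pipeline would use rolling windows instead.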

May I hear your opinion? Is it possible to train an algorithm on this dataset that could predict any of the above-mentioned failure events (e.g., what node will fail next), given some input about the system? (I can only think of time as an input for now, but that sounds wrong.)

Since I am just starting to get involved with ML algorithms, my thinking on the topic is probably quite narrow and limited, so please feel free to suggest a completely different approach if you think I should take one.

Before we go on, remember that these failures are generally considered fairly random, so any results you get will likely be fairly unreliable.

The main problem to consider is that you have very little data compared to the number of nodes: slightly fewer than 3 failures per node on average. This means you would have to use incredibly simple models, which would not give you much advantage over a random guess, for you to have any certainty in your variables (a per-node mean time between failures would not have a determinable error, if it is even calculable). For this I would probably treat each node as a separate data point and then train a tree-based algorithm to predict when the last failure in a node's sequence of failures occurs, but that also means it would only be applicable to the subset of the database with more than one failure per node. This might be able to vaguely predict whether a node will fail in the near future and what type of failure it would most likely be, but the prediction will likely be fairly close to the estimates of mean time to failure and most common failure across all nodes. A sketch of this framing follows.
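A minimal sketch of that per-node, tree-based framing, under the schema assumed in the question (and the `df` loading sketch there): features come from all of a node's failures except the last, and the target is the gap to that last failure. The two features and the random-forest choice are illustrative assumptions, not the only options.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# One row per node with >= 2 failures; single-failure nodes are excluded,
# which is the "only applicable to a subset" caveat above.
per_node = []
for node, g in df.groupby("hardware_id"):
    if len(g) < 2:
        continue
    g = g.sort_values("date")
    history, last = g.iloc[:-1], g.iloc[-1]
    gaps = history["date"].diff().dt.days.dropna()
    per_node.append({
        "n_failures": len(history),
        "mean_gap_days": float(gaps.mean()) if not gaps.empty else 0.0,
        # target: days from the node's second-to-last failure to its last one
        "target_days_to_last": (last["date"] - history["date"].iloc[-1]).days,
    })

data = pd.DataFrame(per_node)
X, y = data[["n_failures", "mean_gap_days"]], data["target_days_to_last"]
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
```

Expect the predictions to cluster near the global mean time between failures, for the reasons given above.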

If you want meaningful results, you will need some attributes of the nodes to do the machine learning on, such as their hardware components and when those were installed, and then use these as inputs to the classification. Since the problem will likely behave fairly randomly, you would also get more information from solving a regression problem instead of the classification problem: you can still get good precision from a probabilistic model, even though the classification itself would be highly uncertain.
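As one sketch of that regression framing, you could model per-node monthly failure counts with a Poisson regression; the monthly aggregation and the single lag feature are assumptions made purely for illustration.

```python
import pandas as pd
from sklearn.linear_model import PoissonRegressor

# Aggregate the event log (df from the question) into per-node monthly counts.
monthly = (df.assign(month=df["date"].dt.to_period("M"))
             .groupby(["hardware_id", "month"]).size()
             .rename("failures").reset_index())

# Note: months in which a node had zero failures are absent here; a fuller
# pipeline would reindex each node onto the full month range.
monthly["prev_failures"] = (monthly.sort_values("month")
                                   .groupby("hardware_id")["failures"]
                                   .shift(1))
train = monthly.dropna(subset=["prev_failures"]).copy()

# Expected failure count next month, given last month's count: a continuous
# risk score rather than a hard "this node will fail" label.
reg = PoissonRegressor().fit(train[["prev_failures"]], train["failures"])
train["expected_failures"] = reg.predict(train[["prev_failures"]])
```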
