Fault diagnosis and health management of bearings in rotating equipment based on vibration analysis – a review

There is an ever-increasing need to optimise bearing lifetime and maintenance cost through detecting faults at earlier stages. This can be achieved through improving diagnosis and prognosis of bearing faults to better determine bearing remaining useful life (RUL). Until now there has been limited research into the prognosis of bearing life in rotating machines. Towards the development of improved approaches to prognosis of bearing faults a review of fault diagnosis and health management systems research is presented. Traditional time and frequency domain extraction techniques together with machine learning algorithms, both traditional and deep learning, are considered as novel approaches for the development of new prognosis techniques. Different approaches make use of the advantages of each technique while overcoming the disadvantages towards the development of intelligent systems to determine the RUL of bearings. The review shows that while there are numerous approaches to diagnosis and prognosis, they are suitable for certain cases or are domain specific and cannot be generalised.


Introduction
In recent years, condition monitoring, fault diagnosis and prognosis of equipment have become of increasing concern to industries using rotating machines. Early fault detection in rotating machines can avoid the risk of damage and thus save expensive emergency repair costs. When operating as expected, all mechanical and electrical systems create a characteristic signal. If the operating conditions of a machine change, this will lead to variance in that signal. In fact, deviations from the normal signal can be considered an indication of an incipient fault. However, these changes may be so small that they are masked by the ambient noise produced by the system's normal operation [1].
Machine Condition Monitoring (CM) is the procedure of monitoring, whilst a rotating machine is in operation, several parameters that indicate its mechanical condition, such as vibration and temperature. Most new Condition Monitoring Systems comprise sensors and a data acquisition system, integrated with software for signal analysis.
A reliable online machinery CM system permits maintenance or corrective actions to be scheduled to prevent degradation of the machine's performance, malfunctions, or even catastrophic failure [2]. The key purpose of a CM strategy is to enable immediate detection of any new damage in rotating machinery, such as bearings or gears. After the initial detection, the CM should determine the location of the fault and its severity and predict the RUL of the component. CM offers the following benefits [3,4]:
1) Avoid catastrophic failure, unscheduled maintenance and loss of production.
2) Reduce maintenance costs by minimising unnecessary interventions and overhauls.
include inner and outer raceway defects, ball defects and bearing elements defects. State of the art signal processing includes techniques capable of denoising and processing vibration signals to detect faults [1]. Abboud et al. [1] identify two effective bearing diagnostics techniques. The first involves pre-processing the random part of the vibration signal after deterministic components have been removed; this technique uses the minimum entropy deconvolution (MED) method and spectral kurtosis to analyse the signal envelope. The second technique uses cyclostationarity to model the bearing signal; it uses a bi-variable map to identify fault components in the distribution [1].

Time-domain techniques
The time-domain signal shows the history of the energy content of the signal. Time-domain signal processing is based on extracting statistically significant behaviour from the waveform of the time-domain signals and has been applied successfully to numerous complex problems [14]. Using the time-domain signal, a defect can be detected and its magnitude assessed using statistical indicators such as the energy content (root mean square, RMS), crest factor (CF), kurtosis (KU) or energy index (EI). Several of these indicators, including KU and CF, are more sensitive if the defect size is larger; however, their values could reduce to the healthy-state level when the damage is clearly developed [15].
Time-domain techniques use vibration data as a function of time, where the amplitude of the vibration signal is simply plotted against time. This approach reveals whether the vibration signal is random, repetitive, sinusoidal or transient; however, it produces a significant amount of data to be processed [16]. The main statistical techniques in the time-domain approach also include peak value, impulse factor, shape factor and K-factor [16].

Mean and standard deviation
Obtained signals can be described using parameters such as the standard deviation ($\sigma$) or the mean ($\mu$). Numerically, the mean value is the sum of the events divided by the number of events [17]. The standard deviation is a tool used for measuring the dispersion in a certain signal. Mathematically, it is the square root of the variance, see Eqs. (1) and (2) [17]. For an event ($x_i$) and size of sample ($N$), $\mu$ and $\sigma$ are calculated using the following equations respectively:
$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i \tag{1}$$
$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2} \tag{2}$$
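As a minimal sketch of the mean and standard deviation described above, the following Python functions compute both for a list of samples. The divisor N (population form) and the sample values are assumptions made for illustration; a divisor of N - 1 is also common.

```python
import math

def mean(x):
    """Arithmetic mean: sum of the samples divided by their count."""
    return sum(x) / len(x)

def std(x):
    """Standard deviation: square root of the variance (divisor N assumed)."""
    mu = mean(x)
    return math.sqrt(sum((xi - mu) ** 2 for xi in x) / len(x))

# Hypothetical sample amplitudes, chosen so the result is easy to check by hand
samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(mean(samples), std(samples))  # 5.0 2.0
```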

RMS
RMS is concerned with the energy of the vibration signal and is suitable for the detection of deteriorating bearings. However, the initial peaks in the signal at the beginning stages of a fault cannot be detected using RMS, and RMS is not suitable for identifying the location of the fault. RMS is, however, suitable for measuring the severity of the fault. It is also worth noting that RMS is not sensitive to transient changes that last for only milliseconds.
Detection with RMS considers the increase in the value of the vibration signal against a normal-operation signal. RMS is calculated as follows:
$$\mathrm{RMS} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} x_i^2} \tag{3}$$
where $x_i$ represents the sample amplitudes of the vibration signal and $N$ is the number of samples. Seryasat et al. [18] presented a method for diagnosing bearing faults using RMS and FFT. The RMS of the vibration signal changes in different frequency bands when a fault occurs; the method was tested using faulty bearings at various loads and speeds, and the RMS indicated the type of the fault.
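The behaviour of RMS described above can be illustrated with a short sketch. The sampling rate, the 50 Hz test tone and the spike amplitude are hypothetical values chosen for illustration, not data from the cited studies. The sketch shows that RMS tracks an overall rise in vibration level but barely responds to a single short transient.

```python
import math

def rms(x):
    """Root mean square of a sequence of sample amplitudes."""
    return math.sqrt(sum(v * v for v in x) / len(x))

fs = 1000  # assumed sampling rate in Hz
baseline = [math.sin(2 * math.pi * 50 * n / fs) for n in range(fs)]

worn = [2.0 * v for v in baseline]  # overall vibration level doubled
spiky = list(baseline)
spiky[0] = 5.0                      # one short transient, five times the normal peak

ratio_worn = rms(worn) / rms(baseline)    # close to 2: RMS follows the overall level
ratio_spiky = rms(spiky) / rms(baseline)  # only slightly above 1 despite the spike
print(ratio_worn, ratio_spiky)
```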

Kurtosis
The Kurtosis technique is sensitive to impulsiveness and can detect vibration signals associated with faults in their early stages [12]. Kurtosis is calculated as follows:
$$\mathrm{KU} = \frac{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^4}{\sigma^4} \tag{4}$$
where $x_i$ denotes the instant amplitude, $\mu$ represents the mean, $\sigma$ is the standard deviation of the data and $N$ is the length of the sample.
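The sensitivity of kurtosis to impulsiveness can be sketched as follows. The function uses the fourth central moment divided by the squared variance (divisor N assumed), and the test signals are hypothetical: a pure sine, and the same sine with a periodic impulse added.

```python
import math

def kurtosis(x):
    """Fourth central moment divided by the squared variance (divisor N assumed)."""
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n
    m4 = sum((v - mu) ** 4 for v in x) / n
    return m4 / var ** 2

fs = 1000  # assumed sampling rate in Hz
smooth = [math.sin(2 * math.pi * 50 * n / fs) for n in range(fs)]
# Add a hypothetical impact every 100 samples to mimic a localised defect
impulsive = [v + (5.0 if n % 100 == 0 else 0.0) for n, v in enumerate(smooth)]

print(kurtosis(smooth))     # about 1.5 for a pure sine
print(kurtosis(impulsive))  # markedly higher once impulses are present
```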
Spectral kurtosis analysis is used to identify the energy content of decomposed signals [19]. The main disadvantage of kurtosis is that it cannot detect the fault at the later stages. Tian et al. [20] presented a method for detection of bearing faults using spectral kurtosis with cross-correlation. They extracted features that represent faults and combined them into an index using principal component analysis (PCA) and k-nearest neighbour (KNN). Their method successfully detected incipient faults and identified their location; importantly, it was able to provide a health index to track the degradation of faults. Saidi et al. [21] use a spectral kurtosis data-driven approach for health prognosis of shaft bearings. The method was validated using monotonicity and trendability and real data from a wind turbine drivetrain, and it was shown that the method could detect early failure and improve the estimation of degradation. Liao et al. [7] extract repetitive transients by using frequency-domain multipoint kurtosis to diagnose bearing faults; computational accuracy was improved by redefining the kurtosis in the frequency domain.

Crest factor
The CF is defined as the ratio of the peak signal value to the RMS level; this indicator is frequently used to characterise vibration data. Values of CF for healthy bearings are typically in the range of 2.5 to 3.5, increasing when a defect is present [22]. Crest factor can be calculated using Eq. (5):
$$\mathrm{CF} = \frac{x_{\mathrm{peak}}}{\mathrm{RMS}} \tag{5}$$
where $x_{\mathrm{peak}}$ is the peak value of the signal. Heng and Nor [23] performed a comparative study using statistical parameters, including CF and KU, applied to the sound and vibration signals from bearings to assess their relative abilities to detect defects [23]. Their results confirmed that statistical methods may be employed to identify the type of defect in the bearings. Moreover, their results showed that there is no significant advantage in using more advanced beta function parameters, rather than KU or CF, applied to the vibration signals to identify faults in rolling element bearings.
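The crest factor definition above (peak value divided by RMS) can be sketched in a few lines. Taking the peak as the largest absolute sample is an assumption, and the sine test signal is hypothetical.

```python
import math

def rms(x):
    """Root mean square of the samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def crest_factor(x):
    """Peak absolute amplitude divided by the RMS level."""
    peak = max(abs(v) for v in x)
    return peak / rms(x)

fs = 1000  # assumed sampling rate in Hz
sine = [math.sin(2 * math.pi * 50 * n / fs) for n in range(fs)]
print(crest_factor(sine))  # about 1.414 (square root of 2) for a pure sine
```

A pure sine sits well below the 2.5 to 3.5 range quoted for healthy bearings because real bearing vibration is noise-like rather than sinusoidal.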

Peak value
The peak value is the maximum value of the waveform, measured from the average to the highest point. It is a simple measurement that shows when impacts happen and is useful for low-level faults. Specifically, the peak values are observed over discrete sequential time intervals and then analysed; the analysis involves the peak values, the spectra from the peak value time waveform and the autocorrelation coefficient [3]. Wang et al. [24] use peak-based multiscale decomposition for fault detection in rolling element bearings. Their novel peak-based approach combines envelope demodulation with multiscale decomposition; it was validated using rolling element bearing faults and was found to enhance fault features and detection.

Energy index
As mentioned above, as the magnitude of the defect or fault develops, the amplitudes of the CF and KU reduce back to more normal values. The Energy Index (EI) is a technique that is used to overcome this inadequacy. Al-Balushi et al. [25] have defined the EI as: "a ratio of the root mean square of the segment of the signal (RMS segment) to the overall root mean square (RMS overall) of the same signal" [25]. They successfully applied the EI technique to detect the presence of a defect in both simulated and experimental data for a bearing. EI can be calculated using Eq. (6).
We have explored various time-domain techniques to extract signal features and make condition assessments. Several of the presented techniques are available off-the-shelf. The simplicity of these techniques means they can often be computed in real time and can identify the moment at which a change to the time series takes place.
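As one illustration of how simply these indicators can be computed, the Energy Index can be sketched segment by segment. This sketch assumes the plain ratio of segment RMS to overall RMS, as worded in the quoted definition; the segment length and the test signal are hypothetical.

```python
import math

def rms(x):
    """Root mean square of the samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def energy_index(x, seg_len):
    """Ratio of each segment's RMS to the overall RMS (plain ratio assumed)."""
    overall = rms(x)
    return [rms(x[i:i + seg_len]) / overall
            for i in range(0, len(x) - seg_len + 1, seg_len)]

# Hypothetical signal: low-level vibration with a short burst in the third segment
signal = [0.1 * (-1) ** n for n in range(400)]
for n in range(200, 210):
    signal[n] = 1.0 * (-1) ** n

ei = energy_index(signal, seg_len=100)
burst_segment = max(range(len(ei)), key=ei.__getitem__)
print(ei, burst_segment)  # the segment holding the burst has the largest EI
```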

Frequency domain techniques
Frequency domain analysis involves extraction of features that can be found in a particular frequency band [26]. Zhou et al. [27] showed that frequency domain analysis techniques can give an intuitive and distinct indication of defects in bearings.
Bearing defects are categorised as either localised or distributed [28]. This research concentrates on local faults, or defects, that develop on bearing raceways. Single-point defects have the useful feature that they produce a characteristic frequency determined by bearing geometry and rotational speed, and which can be easily calculated.
Defects such as deformations occurring during manufacture or installation, or prolonged wear, tend to be referred to as generalised roughness. Defects such as these usually generate a broad spectrum of machine vibrations, and the raw data cannot locate the defect. Thus, specialised processing techniques are needed to detect their location [29].
Each element of a rolling element bearing has a frequency uniquely corresponding to its dynamic behaviour. There will be a characteristic frequency for each of: the outer race (BPFO); the inner race (BPFI); the roller/ball (BSF); and the cage/train (FTF). It is useful to calculate the characteristic frequencies of a bearing because they may indicate the location of a fault. The defect in the bearing will generate vibration impulses at regular intervals. These impulses contain energy over a wide band of frequencies and will excite the fundamental or resonant frequencies of each of the elements; these are the fundamental vibration frequencies. For rolling element bearings in which the outer race is stationary and the inner race rotates, there are four characteristic defect frequencies, see Eqs. (7-10) [28].
Train/Cage fundamental frequency (FTF):
$$\mathrm{FTF} = \frac{f_s}{2}\left(1 - \frac{d}{D}\cos\phi\right) \tag{7}$$
Ball pass frequency, inner race (BPFI):
$$\mathrm{BPFI} = \frac{N_b f_s}{2}\left(1 + \frac{d}{D}\cos\phi\right) \tag{8}$$
Ball pass frequency, outer race (BPFO):
$$\mathrm{BPFO} = \frac{N_b f_s}{2}\left(1 - \frac{d}{D}\cos\phi\right) \tag{9}$$
Ball circular/Spin frequency (BSF):
$$\mathrm{BSF} = \frac{D f_s}{2d}\left(1 - \left(\frac{d}{D}\cos\phi\right)^2\right) \tag{10}$$
where $f_s$ is the speed of the shaft in Hz, the rolling element and bearing pitch diameters are $d$ and $D$ respectively, $N_b$ represents the number of rolling elements, and $\phi$ represents the contact angle. For a single defect on one bearing component, an impulsive force is generated each time a rolling element meets the defect as the inner race rotates. The impulsive force has a frequency of occurrence which corresponds with the fundamental frequency of the faulty component.
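The four characteristic defect frequencies can be computed directly from the bearing geometry. The sketch below uses the standard kinematic relations for a stationary outer race and a rotating inner race; the function name, argument names and the bearing dimensions are hypothetical and chosen only for illustration.

```python
import math

def bearing_frequencies(shaft_hz, n_elements, d_element, d_pitch, contact_deg=0.0):
    """Characteristic defect frequencies (Hz), outer race stationary.

    shaft_hz: shaft speed in Hz; d_element and d_pitch share the same length unit.
    """
    ratio = (d_element / d_pitch) * math.cos(math.radians(contact_deg))
    return {
        "FTF": 0.5 * shaft_hz * (1 - ratio),
        "BPFI": 0.5 * n_elements * shaft_hz * (1 + ratio),
        "BPFO": 0.5 * n_elements * shaft_hz * (1 - ratio),
        "BSF": 0.5 * (d_pitch / d_element) * shaft_hz * (1 - ratio ** 2),
    }

# Hypothetical bearing: 8 rolling elements, 10 mm elements on a 50 mm pitch circle, 30 Hz shaft
freqs = bearing_frequencies(shaft_hz=30.0, n_elements=8, d_element=10.0, d_pitch=50.0)
for name, value in freqs.items():
    print(f"{name}: {value:.1f} Hz")
```

Two quick sanity checks follow from the formulas: BPFO equals the number of rolling elements times FTF, and BPFI plus BPFO equals the number of rolling elements times the shaft speed.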

Fourier transform (FT)
Fourier transform is used for mapping the time-domain function into the frequency-domain [30,31]. Today, the Fast Fourier transform (FFT) is a commonly used technique to obtain the frequency-domain signal from the time-domain. However, time information is lost in the process, so that the Fourier transform (FT) cannot indicate when a particular event occurred. Such a loss can be vital when exploring the growth of faults, because transient events can be the most important element in the signal.
In an attempt to make good this deficiency, Gabor [32] partitioned the signal into contiguous sections (known as windows), so that the FT analysed a small section of the signal each time. The time-domain signal within each window was then transformed into the frequency-domain. Note that the time window acts as a multiplying function and must be tailored mathematically so as not to introduce bogus results [33].
An important characteristic of these windows is that their final and initial values are zero, or close to zero, to avoid the time signal appearing as a sequence of rectangular step functions. Today, a wide range of such windows is readily available. The Short-Time Fourier Transform (STFT) reveals a spectrum for each window. The time interval for each of the windows is known, so the individual spectra can be combined into a 3-D map with co-ordinates: frequency, time and amplitude.
This method has two shortcomings: (i) it is not possible to have high resolution in both the frequency and time domains simultaneously because of the uncertainty principle; (ii) the duration of the window is fixed and cannot be changed once chosen, whereas in many signals the frequency content changes with time, requiring greater precision at some times than others [34]. If the signal is of brief duration, a short window will clearly be chosen; however, the shorter the window, the wider the corresponding frequency band and the poorer the frequency resolution.
Wavelet analysis has attracted extensive interest as a diagnostic tool for condition monitoring of bearings. The FT is limited to presenting the time-domain signal as sine and cosine functions; wavelet analysis, however, can describe a signal for the entire spectrum of interest by using wavelet functions (WFs) of different, or variable, scales.
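The windowed transform and its resolution trade-off can be sketched with a small, deliberately naive implementation. It uses a Hann window (which tapers to zero at both ends) and a direct O(N²) DFT instead of an FFT library; the sampling rate, window length and test signal are hypothetical.

```python
import cmath
import math

def hann(n_win):
    """Hann window: tapers to zero at both ends of the section."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n_win - 1)) for i in range(n_win)]

def stft_mag(x, n_win, hop):
    """Magnitude spectrum of each windowed section (naive DFT, for illustration)."""
    w = hann(n_win)
    frames = []
    for start in range(0, len(x) - n_win + 1, hop):
        seg = [x[start + i] * w[i] for i in range(n_win)]
        frames.append([
            abs(sum(seg[i] * cmath.exp(-2j * math.pi * k * i / n_win) for i in range(n_win)))
            for k in range(n_win // 2 + 1)
        ])
    return frames

fs, n_win = 1000, 64  # assumed sampling rate (Hz) and window length (samples)
# Hypothetical signal whose frequency jumps from 50 Hz to 200 Hz halfway through
x = [math.sin(2 * math.pi * (50 if n < 256 else 200) * n / fs) for n in range(512)]

frames = stft_mag(x, n_win=n_win, hop=n_win)
peak_hz = [max(range(len(m)), key=m.__getitem__) * fs / n_win for m in frames]
print(peak_hz)     # dominant frequency per window: near 50 Hz early, near 200 Hz late
print(fs / n_win)  # bin spacing in Hz: a shorter window gives coarser frequency resolution
```

With a 64-sample window the bin spacing is 15.625 Hz, so the 50 Hz tone is reported at the nearest bin rather than at its exact frequency, which is the resolution limit described above.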

Envelope analysis
With envelope analysis, a high-pass filter is employed to remove the lower frequencies, retaining the high-frequency part of the spectrum [35]. As a result, envelope analysis can detect peaks that would otherwise be lost in the noise carpet or noise floor [35]. Envelope analysis is a well-known algorithm for bearing fault analysis, and Feng et al. [36] study several envelope techniques, including spectral correlation, the Hilbert transform and the band-pass squared rectifier, which showed similar levels of accuracy. In the case of a fault such as a spall, each time a rolling element passes over the spall there is a small click; the number of clicks and the rpm give the clicks per minute on the FFT (fast Fourier transform). The click would normally be visible there; however, the low-frequency part of the spectrum is crowded, so it is not easy to identify smaller peaks. The envelope technique overcomes this by a process of demodulation, which removes the high-frequency content excited by the clicks and leaves a lower-frequency trace from which the defective bearing is identified [35]. Thus, by employing envelope analysis it is possible to amplify the fault frequencies through demodulation, which reveals the envelope signal [37].
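The demodulation step can be sketched with a squared rectifier, one of the envelope techniques mentioned above. The synthetic signal is an assumption: a 1000 Hz resonance whose amplitude is modulated at a 50 Hz fault rate. Because this signal contains only the resonance band, the filtering stage is omitted here, and a naive DFT is used instead of an FFT library.

```python
import cmath
import math

def dft_mag(x, k):
    """Normalised magnitude of DFT bin k (naive, for illustration)."""
    n = len(x)
    return abs(sum(x[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))) / n

fs, n = 8000, 800  # assumed sampling rate and record length: 10 Hz per bin
# Hypothetical fault signal: 1000 Hz resonance, amplitude-modulated at 50 Hz
sig = [(1 + 0.8 * math.cos(2 * math.pi * 50 * i / fs)) * math.sin(2 * math.pi * 1000 * i / fs)
       for i in range(n)]

# Squared rectifier, then remove the mean so that the DC term does not dominate
rectified = [s * s for s in sig]
mean_r = sum(rectified) / n
env = [r - mean_r for r in rectified]

low_bins = range(1, 31)  # 10 Hz to 300 Hz
raw_low = [dft_mag(sig, k) for k in low_bins]
env_low = [dft_mag(env, k) for k in low_bins]

fault_hz = (max(range(len(env_low)), key=env_low.__getitem__) + 1) * fs / n
print(max(raw_low))  # essentially zero: the raw spectrum has no line at the fault rate
print(fault_hz)      # 50.0: the envelope spectrum exposes the modulation frequency
```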

Limitations of traditional diagnosis techniques
The justification for developing the proposed system rests on the limitations of the individual techniques. It should be noted, however, that the advantages of these techniques still need to be integrated with other techniques in a comprehensive system that draws on the advantages of all adopted techniques and resolves their disadvantages.
One disadvantage of the FFT is that the spectrum is not clear enough for the identification of fault peaks. Furthermore, it cannot identify non-stationary signals because it is based on a peak signal [38]. Because of this limitation, the time-frequency technique is more suitable for non-stationary signals.

Machine learning algorithms
Artificial intelligence methods have been applied to pattern recognition in machine diagnostics. However, training data and knowledge about the faults are needed to train the models [39]. A lack of efficient procedures for gaining this data has made the application of appropriate AI techniques more difficult. Widely used AI techniques for machine diagnostics include artificial neural networks (ANNs), fuzzy logic systems (FLSs), expert systems (ESs), and evolutionary algorithms (EAs).
An increasingly popular approach to defect detection and diagnosis is the use of ANNs for modelling engineering systems. The feed-forward neural network (FFNN) is extensively used for diagnosis of machine faults [40,41]. The multi-layer perceptron is a particular class of FFNN which, when trained with the back-propagation (BP) algorithm, is very widely used for pattern recognition, classification and machine defect diagnosis [42,43].
For machine learning algorithms there are two main approaches, classical methods and deep learning-based approaches, both having various advantages for the diagnosis and prognosis of RUL in bearing elements.

Artificial neural network
An artificial neural network (ANN) is essentially a model with a set of interconnected relationships between inputs and desired outputs. ANNs are based on the behaviour of the neural networks that are found in the human brain. As such an ANN can be considered as a machine learning system that contains neurons that form interconnected links between inputs and outputs for processing information. It is possible to train these connections.
More specifically, the ANN involves functions which include multiplication, summation and transfer. Multiplication is carried out by neurons by assigning a weighting to the inputs; the weighted inputs are then added together, and the sum is the input to the transfer function. Weights are changed automatically to improve the fit of the model to the data [4]. Data-driven methods for machine learning in prognostics employ numerical algorithms, among which neural networks are a popular choice [44].
However, ANN models are not suitable for the constant and rapid fluctuations found in a system [38]. To improve prognosis models, it is therefore important to consider the physics of the wear evolution process, where model-based approaches offer better results compared to data-driven approaches [38].
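The multiplication, summation and transfer steps, together with automatic weight adjustment, can be sketched for a single neuron. The sigmoid transfer function, the squared-error gradient update, the learning rate and the toy feature values are all assumptions chosen for illustration; they are not taken from the cited works.

```python
import math

def neuron(inputs, weights, bias):
    """Multiply inputs by weights, sum them, then apply a sigmoid transfer function."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, labels, lr=0.5, epochs=2000):
    """Adjust weights by gradient descent on the squared error (illustrative only)."""
    weights, bias = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            out = neuron(x, weights, bias)
            grad = (out - y) * out * (1.0 - out)  # derivative of 0.5*(out - y)^2 w.r.t. z
            weights = [w - lr * grad * xi for w, xi in zip(weights, x)]
            bias -= lr * grad
    return weights, bias

# Hypothetical normalised features (e.g. RMS, kurtosis); label 1 marks a faulty bearing
features = [[0.2, 0.1], [0.3, 0.2], [0.8, 0.9], [0.9, 0.7]]
labels = [0, 0, 1, 1]

weights, bias = train(features, labels)
predictions = [round(neuron(x, weights, bias)) for x in features]
print(predictions)
```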
Dharmawan et al. [45] developed a fault diagnosis system for rotating machines using a combination of ANN and continuous wavelet transform (CWT). Feature extraction for the CWT involved putting the data into different types, which included root mean square, kurtosis and power spectrum density, as inputs for the ANN; their method returned an accuracy for damage detection of 99.72 % [45]. Gomez et al. [46] developed an automatic condition monitoring system for detecting cracks in rotating machines. They combined ANN with the wavelet packet transform and a radial basis function, applied to vibration signals; these additions to the ANN optimised the success rate, returning close to 100 % probability of detection with 1.77 % false alarms. Beretta et al. [47] validate a method for predicting bearing faults based on an ensemble of ANNs.

Principal component analysis (PCA)
PCA analyses data using a multivariate technique to derive observations that are described by inter-correlated dependent variables [2]. PCA is essentially an algorithm that shows the data's internal structure in a way that makes the variance in the data clear and explainable.
The use of PCA increases the accuracy of fault diagnosis: when PCA-identified features are used instead of the normal 13 features, accuracy increases from 88 % to 98 %. This efficiency is achieved with a limited number of input features in comparison to using the original features.
Where a dataset is multivariate and is seen as a set of coordinates that are found in a high-dimensional data space, the PCA will give the user a projection that has a lower dimension.
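The projection onto a lower dimension can be sketched for the first principal component. This minimal version finds the leading eigenvector of the covariance matrix by power iteration; the function name, the iteration count and the two-feature data set are hypothetical.

```python
import math

def first_principal_component(data, iters=200):
    """Return the leading principal direction and the projected scores."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centred = [[row[j] - means[j] for j in range(d)] for row in data]
    # Sample covariance matrix of the centred data
    cov = [[sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(d)] for a in range(d)]
    # Power iteration converges to the eigenvector with the largest eigenvalue
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        v = [wi / norm for wi in w]
    scores = [sum(r[j] * v[j] for j in range(d)) for r in centred]
    return v, scores

# Hypothetical two-feature data lying close to the line y = 2x
data = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]
direction, scores = first_principal_component(data)
print(direction)  # roughly proportional to (1, 2)
print(scores)     # one coordinate per sample instead of two
```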
The features of bearing defects are characteristically sensitive and may change under different conditions. PCA is an effective method for feature selection, and it provides a way of manually choosing representative features for the purposes of classification. De Moura et al. [48] employed PCA and neural networks to investigate pre-processed signals from Detrended Fluctuation Analysis (DFA) and Rescaled-Range Analysis (RSA) in the detection of the severity of bearing faults. PCA and ANN were used for pattern recognition; however, it was determined in their study that pattern recognition from vibration analysis using PCA was inferior to that of ANN [48]. PCA models have been improved by incorporating two statistical process monitoring properties, namely, static and dynamic [49].
Mohanty and Raju [19], with the aim of averting bearing failure, studied the vibration acoustics of ball bearings using a wavelet-based multi-scale principal component analysis with FFT. The algorithm derives the frequency range from the ball bearing operation, which helps to determine the frequency of the vibration without the perplexing frequency components [19]. The main advantage that they gained from using this approach was that it allowed feature segmentation from the channels that were independent of the direction of the propagation of the bearing fault; essentially, the PCA simultaneously auto-correlates and cross-correlates the signal [19]. Wang et al. [50] demonstrated a method for reliability assessment of rolling bearings using kernel principal component analysis (KPCA). In this approach, feature extraction is achieved using the time, frequency, and time-frequency domains, and it was found that KPCA accurately reflected the performance degradation process.

K-nearest neighbours (k-NN)
k-NN is a non-parametric algorithm for classification or regression. Specifically, in this method, the output is the class of an object, which is decided by a majority vote of its k nearest neighbours. The k-NN algorithm was used early on for data mining; furthermore, k-NN has also been used for distance analysis of each data sample to find out whether it belongs to a certain fault class. Baraldi et al. [51] present a diagnostic system for detecting the beginning stages of fault degradation through isolation of the bearing and then classification of the defect. This system was based on a hierarchy of k-NN classifiers and was found to be satisfactory in diagnostic performance.
Wang et al. [52] proposed a real-time fault diagnosis system for predictive maintenance of rolling bearings using a k-NN algorithm. Their system used a pre-processed signal and feature parameter extraction, followed by training and optimisation of the fault diagnosis model. It was found that a diagnosis model that uses a k-NN algorithm was more effective than diagnosis based on other algorithms such as C4.5 and CART and is therefore suitable for predictive maintenance of rolling bearings [52].
Weighted nearest neighbour (WKNN) is a new methodology within k-NN developed by Sharma et al. [53]. It is a squared inverse feature weighting technique that improves the performance of the k-NN classifier and can optimise the computational complexity and classification accuracy [53].
Yan et al. [54] presented a hybrid intelligent fault diagnostic model for rolling bearings that combines a k-NN classifier with a stacked sparse auto-encoding network (SSAE). Their model exploited the ability of the k-NN algorithm to deal with multi-classification problems in order to improve accuracy [54]. Overall, in comparison to traditional methods, using deep neural networks for feature extraction avoids being overly dependent on professional knowledge and improves the accuracy of the fault classification [54].
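The majority vote among nearest neighbours can be sketched in a few lines. The Euclidean distance, the choice of k = 3, the feature values and the class names are hypothetical and serve only to illustrate the voting step.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples."""
    by_distance = sorted((math.dist(x, query), label) for x, label in zip(train_x, train_y))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical feature vectors (RMS, kurtosis) with known bearing condition
train_x = [[0.50, 3.0], [0.60, 3.1], [0.55, 2.9],
           [1.20, 6.0], [1.30, 6.5], [1.10, 5.8]]
train_y = ["healthy", "healthy", "healthy",
           "outer_race", "outer_race", "outer_race"]

print(knn_predict(train_x, train_y, [1.0, 5.5]))  # outer_race
print(knn_predict(train_x, train_y, [0.5, 3.2]))  # healthy
```

In practice the features would usually be scaled first, since an unscaled feature with a large numeric range dominates the distance.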

Ensemble learning
Zhang et al. [55] predict RUL of rolling element bearings using ensemble learning which is considered to be a typical machine learning approach and has been promising in pattern recognition. However, this approach had been very rarely used for RUL and Zhang et al. [55] propose to achieve this through merging multi-piece information and then updating it dynamically.
In response to the problem of strong ambient noise interfering with the collection of bearing signals and making it difficult to identify faults accurately, Liang et al. [56] presented an improved ensemble method using a deep belief network (DBN); their method was shown to significantly improve fault diagnosis.
To improve rolling bearing fault diagnosis Li et al. [57] presented an enhanced selective ensemble deep learning method. The ensemble learning was implemented using enhanced weighted voting together with class-specific thresholds and the results showed that their method was more accurate and more robust in recognising the different types of faults in comparison to other ensemble learning methods [57]. In response to the issue of there being a broad application of technology for fault diagnosis and the associated limitation of the application of a single deep model, Ma and Chu [58] proposed an ensemble deep learning diagnosis method using multi-objective optimisation which was shown to be more adaptable in comparison to other ensemble and single deep methods. Furthermore, Xu et al. [59] recognised that most fault diagnosis methods find it difficult to learn representative features from raw data. In response to this issue, Xu et al. [59] proposed that deep learning with its ability to perform automatic feature extraction should be combined with ensemble learning, in this case, random forest (RF) ensemble learning, due to its ability to improve generalisation performance and accuracy of classifiers.

Deep learning-based approaches
Predicting remaining useful life (RUL) for rotating equipment is increasingly important for condition-based maintenance, and deep learning prognosis methods are showing promise for bearings and gears. Deep belief networks and associated deep learning methods are a popular approach to the processing and analysis of big data, and they can extract important features from the data that can be used for prediction of RUL [60]. Furthermore, because of their multiple-layered structure, these approaches can mine hidden information [60].
Deep learning (DL) is an area within machine learning and is based on algorithms that are inspired by neural networks of the human brain. Deep learning is about improving the learning algorithm and making it easier to use [5]. A benefit of deep learning is that it can carry out automatic feature extraction from raw data through the ability of algorithms to learn representations using feature learning through exploiting the unknown structure of the input in order to reveal good representations [5]. It is important to note that DL approaches are essentially large neural networks that use large amounts of data and require large computers [5].
Furthermore, they are promising in the prediction of RUL for rotating equipment [61]. Concerning this, the deep learning approach proposed by Deutsch [61] was designed to overcome the limitations associated with signal processing and feature extraction requiring specific modelling and expertise. Specifically, their method which used vibration and acoustic emissions together with state transition modelling and a data-driven particle filter was validated using real bearing and gear run-to-failure test data [61].
It is important to make a distinction between the idea of deep learning and that of an artificial neural network. The 'deep' refers to the idea that there are many hidden layers in the network. The transition from the classic 'shallow' machine learning algorithms to deep learning is the result of several factors. Firstly, there has been an explosion in the amount of available data, which means that large-scale datasets have become available in some domains. For numerous applications, including the diagnosis of bearing faults, these large datasets are not easily accessible; they are difficult and time-consuming to acquire.
With smaller datasets, classical machine learning algorithms can equal or even outperform DL networks. As the amount of data increases, deep learning can outperform classic machine learning algorithms. Secondly, there has been an evolution in algorithms: techniques for controlling the training process of deeper models have matured, achieving greater speed, improved convergence and better generalisation. Thirdly, there has been an evolution in hardware. Training deep networks requires extensive computation, and performing this on a GPU accelerates the training process. GPUs allow the computations of deep neural networks to run in parallel, which makes them invaluable in training deep learning algorithms; furthermore, more powerful GPUs allow for quicker setup times.
These aforementioned factors allow the application of deep learning algorithms to a number of applications that are data related. There are a number of advantages associated with using deep learning algorithms and they include the following:

Convolutional neural network (CNN)
Convolutional neural networks are inspired by animal visual cortices and were first used for the detection of image patterns hierarchically including simple and complex features. The lower layers will have lower level features, and higher levels will detect features that are higher level, built on the lower level features.
In the CNN architecture, the one-dimensional temporal raw data taken from the accelerometers are stacked into a two-dimensional, image-like representation before a convolutional layer conducts feature extraction; thereafter, down-sampling takes place in the pooling layer. This combination of convolution and pooling is repeated numerous times to make the network deeper. The output from the hidden layers is passed to fully connected layers, and the result is then passed to a top classifier based on Sigmoid or Softmax for bearing fault detection.
Xu et al. [59] propose a fault diagnosis method based on CNN and random forest ensemble learning, which achieved a high level of accuracy in bearing fault diagnosis and was an improvement on standard deep learning methods and traditional methods. CNN can learn features automatically from the input data and has the potential to outperform traditional methods [62].
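The convolution, pooling, fully connected and Softmax stages can be sketched as a single forward pass. For brevity this sketch works directly on a one-dimensional signal rather than a stacked two-dimensional representation, and the kernel, the fully connected weights and the input values are hypothetical, untrained numbers.

```python
import math

def conv1d(x, kernel):
    """Valid cross-correlation, as used in convolutional layers."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k)) for i in range(len(x) - k + 1)]

def relu(x):
    return [max(0.0, v) for v in x]

def max_pool(x, size):
    """Down-sample by keeping the maximum of each non-overlapping block."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

def forward(x, kernel, fc_weights):
    """Convolution -> ReLU -> pooling -> fully connected layer -> Softmax."""
    features = max_pool(relu(conv1d(x, kernel)), size=2)
    logits = [sum(w * f for w, f in zip(row, features)) for row in fc_weights]
    return softmax(logits)

# Hypothetical vibration snippet with an impulse, and untrained illustrative weights
signal = [0.0, 0.1, 0.0, 2.0, -1.5, 0.2, 0.0, 0.1]
kernel = [1.0, 0.0, -1.0]                          # responds to abrupt changes
fc_weights = [[-1.0, -1.0, -1.0], [1.0, 1.0, 1.0]]  # rows: "healthy", "faulty"

probs = forward(signal, kernel, fc_weights)
print(probs)  # two class probabilities that sum to 1
```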
Belmiloud et al. [5] use CNN as part of their method for determining RUL of rolling element bearings. Specifically, extracted features are fed into a deep CNN to construct a health indicator. Hoang and Kang [62] propose a method for bearing fault diagnosis using CNN where vibration signals are used directly as input and therefore, there is no need for feature extraction. Their method was found to be highly accurate even in noisy environments [62]. Specifically, the method transformed 1-D vibration signals into 2-D images taking advantage of CNN effectiveness in image classification [62]. The issue of noise was also addressed by Zhao et al. [39] who said that diagnosis is difficult for planetary gearboxes due to planetary noise. They propose a diagnosis method using synchrosqueezing transform (SST) and deep CNN together with envelope time-frequency representations, their method automatically recognised the planet bearing fault type and also removed interference from the time-frequency spectrum effectively avoiding misdiagnosis [39].
Traditional time or frequency domain analysis cannot extract features effectively. Zhang et al. [63] address this problem with an enhanced CNN method using short-time Fourier transform, scaled exponential linear unit and hierarchical regularisation. Their experimental results showed that the method achieved higher fault diagnosis accuracy than other deep learning methods [63].
Liu et al. [64] address the loss of information in the fusion process through an ensemble CNN model for diagnosing bearing faults. Specifically, the model uses one multichannel fusion convolutional neural network branch and two one-dimensional CNN branches. The former extracts features from sensory data and the latter extract the inherent features, which reduces the loss of information. Overall, the model was found to be more effective and robust than other models [64].

Auto-encoders
Autoencoders have their origins in pre-training methods used for artificial neural networks. After years of development this approach has become popular as a method of feature learning. It has also been described as a greedy layer-wise method for pre-training.
An ANN is used to train the autoencoder, which comprises an encoder and a decoder, whereby the encoder's output is the input to the decoder. The ANN takes the mean square error between input and output as the loss function, so that the output is generated through imitation of the input. Once the ANN has been trained, the decoder is discarded and the encoder is retained. The output of the encoder is therefore the feature representation, and it is this that is passed to the classifier in the next stage.
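As a hedged illustration of this training scheme, the sketch below fits a deliberately tiny linear autoencoder (two inputs, one hidden feature) by gradient descent on the mean square reconstruction error, then keeps only the encoder. The data, initial weights, learning rate and function names are all assumptions made for the example; practical autoencoders use nonlinear layers and many more units.

```python
def encode(w, x):
    """Encoder: project a 2-D input onto a single feature."""
    return w[0] * x[0] + w[1] * x[1]

def decode(v, z):
    """Decoder: reconstruct the 2-D input from the single feature."""
    return [v[0] * z, v[1] * z]

def mse(w, v, data):
    """Mean square error between inputs and their reconstructions."""
    total = 0.0
    for x in data:
        x_hat = decode(v, encode(w, x))
        total += (x_hat[0] - x[0]) ** 2 + (x_hat[1] - x[1]) ** 2
    return total / len(data)

def train(data, steps=2000, lr=0.05):
    """Plain gradient descent on the reconstruction MSE."""
    w, v = [0.1, 0.1], [0.1, 0.1]   # hypothetical initial weights
    n = len(data)
    for _ in range(steps):
        gw, gv = [0.0, 0.0], [0.0, 0.0]
        for x in data:
            z = encode(w, x)
            err = [v[0] * z - x[0], v[1] * z - x[1]]
            back = err[0] * v[0] + err[1] * v[1]   # error propagated to z
            for k in range(2):
                gv[k] += 2 * err[k] * z / n
                gw[k] += 2 * back * x[k] / n
        w = [w[k] - lr * gw[k] for k in range(2)]
        v = [v[k] - lr * gv[k] for k in range(2)]
    return w, v

# Synthetic inputs lying on a line, so one feature can describe them.
data = [[t / 10, 2 * t / 10] for t in range(-10, 11)]
initial_loss = mse([0.1, 0.1], [0.1, 0.1], data)
w, v = train(data)
final_loss = mse(w, v, data)
# The decoder weights v are discarded; the encoder output is the feature.
features = [encode(w, x) for x in data]
```

The list `features` is what would be handed to the downstream classifier.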
However, although autoencoders were an early approach, more modern, state-of-the-art approaches focus on deep neural networks with many layers that employ a backpropagation algorithm [5]. Haidong et al. [65] presented a novel method for intelligent bearing fault diagnosis using ensemble learning, which analyses experimental vibration signals together with a combination strategy to achieve an accurate diagnosis. The results showed that the method removes the dependence on manual feature extraction and overcomes limitations associated with deep learning models [65].

Deep belief network (DBN)
In deep learning, a deep belief network (DBN) is considered a stack of unsupervised networks, for example restricted Boltzmann machines or auto-encoders, whereby the hidden layer of each sub-network acts as the visible layer of the next when the DBN is trained. Furthermore, for SAE the fused features are inputs into the DBN for the classification of faults.
Deutsch [61] validated deep learning prognosis methods for predicting bearing RUL using big data collected from bearing test rigs. One of the methods used by Deutsch [61] was the Deep Belief Network. Specifically, the DBN was trained using FFT features, and upon completion of training a fine-tuning layer was added to the DBN. The results were promising, as the approach added robustness to the architectures and reduced the probability of poor results [61].
Shao et al. [66] propose a novel continuous deep belief network optimised with a genetic algorithm in order to adapt to the characteristics of signals. The proposed method was tested using bearings and was found to be superior to traditional methods in terms of accuracy and stability. Shen et al. [67] recognise the limitations of diagnosis mechanisms that use manual feature extraction; as a solution, deep learning can learn representative features in the data without the need for much prior knowledge. Shen et al. [67] presented a new method for bearing fault analysis, named hierarchical adaptive DBN, optimised using Nesterov momentum. Their model was validated using vibration signals from bearings, and the method was found to perform better than the conventional DBN and the support vector machine [67].

Recurrent neural network (RNN)
In a recurrent neural network (RNN), data is processed recurrently rather than in a purely feed-forward manner. The output of the hidden layer is fed back into the hidden layer, so that the network forms a chain over time steps when it is sequentially unrolled. Because it is a sequential model, it can capture and model any sequential relationship that can be found in time series or sequential data [68].
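The recurrence can be made concrete with a minimal single-unit sketch, assuming the common update h_t = tanh(w_in * x_t + w_rec * h_(t-1) + bias). The function name and weight values are hypothetical, and the LSTM units used in the works cited below are considerably richer than this.

```python
import math

def rnn_forward(inputs, w_in, w_rec, bias, h0=0.0):
    """Unroll a single-unit recurrent layer over a sequence.

    Each hidden state depends on the current input and on the
    previous hidden state: h_t = tanh(w_in*x_t + w_rec*h_{t-1} + bias).
    """
    h = h0
    states = []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h + bias)
        states.append(h)
    return states

# An impulse followed by zeros: later states stay non-zero because the
# hidden state carries information forward in time.
states = rnn_forward([1.0, 0.0, 0.0], w_in=0.5, w_rec=0.8, bias=0.0)
# With the recurrent weight set to zero the unit has no memory.
memoryless = rnn_forward([1.0, 0.0, 0.0], w_in=0.5, w_rec=0.0, bias=0.0)
```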
A recent approach using a deep recurrent neural network (DRNN) was proposed by Jiang et al. [69]. It used stacked recurrent hidden layers as well as LSTM units, together with an adaptive learning rate to improve training performance. This approach returned high accuracy results of 94.75 % and 96.53 % [69]. Wu et al. [70] proposed a novel approach for fault prognosis in which a recurrent neural network, again with LSTM units, was applied to the degradation sequence of equipment, and it showed significant performance in RUL prediction. Xie and Zhang [71] also recognised the potential of deep neural network algorithms, LSTM and recurrent neural networks, and proposed a novel approach for fault prognosis using LSTM based on the vibration signal of rotating equipment. The outcomes were successful in improving machine condition monitoring and health management.

Generative adversarial network (GAN)
Goodfellow et al. [72] first proposed the GAN in 2014. It comprises two parts, the generator and the discriminator, which compete with each other: the former tries to confuse the latter, while the latter tries to distinguish samples generated by the former from real data samples. Through this competition the generator iteratively improves its ability to imitate the original data samples and the discriminator its ability to discriminate between them. Lin et al. [73] [...]

Deep Belief Network (DBN)
- Built with RBMs
- Each hidden layer is the visible layer for the next
- Undirected connection at the top two layers
- Training is supervised or unsupervised
Zhang et al. [63]
Pros:
- Can use a layer-by-layer learning strategy to initialise the network
- The likelihood is maximised through tractable inferences
Zhang et al. [63]
Cons:
- Training can be computationally intensive and expensive due to initialisation and sampling
Zhang et al. [68]

Recurrent Neural Network (RNN)
- Analyses 1-D temporal or sequential data
- Used for applications where the output depends on previous computations
Zhang et al. [63]
Pros:
- Can memorise sequential events
- Can model time dependencies
- Can receive inputs of variable lengths
Zhang et al. [63], Zhang et al. [68]

Generative Adversarial Network (GAN)
- Uses a generator and a discriminator to generate images which imitate real photos
- Augments data where labelled data is scarce
Zhang et al. [63]
Pros:
- Does not need modification when moving to new applications
- Does not have a deterministic bias
Cons:
- Training is unstable because it requires finding a Nash equilibrium
- Difficult to learn how to create discrete data
Zhang et al. [63]

[...] and is important for management of safety, scheduling and lowering costs [74]. Prognostics focuses on using automated methods for the detection, diagnosis, and analysis of system degradation and to estimate remaining useful life (RUL) within accepted operating parameters before failure occurs or performance degrades to intolerable levels.
The success of a condition monitoring management strategy is dependent upon such automated procedures, which send out notices related to impending failure of equipment [75] to provide maintenance personnel with a lead time.
The International Organization for Standardization (ISO) standard 13381-1 [76] defines prognostics as "the estimation of the Time to Failure (ETTF) and the risk of existence or later appearance of one or more failure modes".
In the development of a prognostic method, the required outcome is the prediction of failure time [77,78]. Predictions require that the system and associated condition processes are understood, in addition to historic conditions that could affect future behavior [79]. Because predictions are concerned with an event that is uncertain, approaches to prognostics rely on basic assumptions about degradation characteristics. Prognosis is based on the following four notions [80,81]:
1) All systems degrade due to time and environmental factors.
2) Ageing and damage are monotonic processes that reveal themselves both physically and chemically.
3) Symptoms of ageing are detectable prior to failure.
4) Symptoms of ageing can be correlated with a model of ageing and, therefore, the RUL of individual systems can be estimated.
At the initial stages of a system's lifetime the components are working properly. Each operational function has a specific initial level of health, which is mostly stable in the early stages and remains so until an early incipient fault takes place. As operation continues over time, system failure becomes increasingly likely, which can lead to system damage and ultimately a catastrophic failure. It is important to note that system failure and catastrophic failure take place at different times. The early detection of such failures is critical in the estimation of the RUL. In order to detect fault characteristics it is necessary to have interactions between diagnostics and prognostics. The overall objective is to increase awareness of the system state so that detection takes place close to the point of the first incipient fault [82].
Degradation resulting from an initial fault in a system continues to increase until it reaches a critical state that leads to system failure. The system begins with initial health and a variation that is acceptable and considered normal. The diagnostic representation here relates to the task of in-depth exploration of a failure that is a direct result of an initial leading symptom. Based on the location of this symptom, prognostics is about taking a multi-step approach before prediction [83].
Prognostic prediction is practiced between the initial failure detection and actual failure, where diagnostics are practiced [83]. Consequently, the goals of diagnostics and prognostics are somewhat different but carried out in the same field. Since both lifetime estimation methods are applied for condition monitoring, they both include stages for data acquisition and signal processing.
A variety of prognostic techniques with numerous tools and methods have been mentioned in the literature [84]. Current prognostic methods can be categorised into three general classes according to prediction and forecasting approaches: physics-based, data-based and hybrid approaches. Each approach has its particular disadvantages and advantages [85,86].

Physics-based models
Physics-based models (PbM) describe the physics of the equipment and the failure mechanism [87]. Because the physics-based approach defines the evolution of the degradation, it is considered a degradation model [88]. Table 3 shows methods within the PbM prognostics approach. The mathematical models of degradation are usually used in applications that are associated with health levels. PbMs combine formulas for fault growth with knowledge of the principles of damage mechanics. These models assume that, where the mathematical model for component degradation is accurate, it can provide sufficient knowledge for prognostic outputs.
Table 3 (partial): Uses cumulative damage with consideration of operating conditions. Different physics parameters are necessary to apply the model.
A well-known approach in PbM is crack growth modelling. The Paris and Erdogan law [89] is employed in a number of applications to relate the crack growth rate to the stress intensity factor range within the fatigue stress regime. The defect growth rate of rolling element bearings has been evaluated with a variation of Paris' law, which states that defect growth is correlated with defect area. Predicted and actual defect sizes are compared, and a recursive least-squares scheme is then applied to derive an adaptive prognostic model for defect growth [91,92]. However, a slight difference in a parameter may lead to a large prediction error. Li and Choi [91], and Li and Lee [92] presented a Paris' law crack growth model using Finite Element Analysis (FEA), whereby estimation of stress is based on the size of the defect, bearing geometry, speed and load. The performance of this approach depends on the accuracy of the crack size calculation from vibration data, and the calculations are computationally intensive so that the probability of an observation can be evaluated.
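For orientation, the classical form of the Paris and Erdogan law is da/dN = C (ΔK)^m, with ΔK = Y Δσ sqrt(π a) for a crack of length a. The sketch below integrates this classical form numerically to estimate the number of load cycles needed for a crack to grow between two sizes. It does not reproduce the bearing-specific variation used in [91,92], and the material constants, stress range and crack sizes are hypothetical values chosen only for illustration.

```python
import math

def delta_k(a, delta_sigma, y=1.0):
    """Stress intensity factor range for crack length a (geometry factor y)."""
    return y * delta_sigma * math.sqrt(math.pi * a)

def cycles_to_grow(a0, a_crit, c, m, delta_sigma, y=1.0, steps=10000):
    """Integrate dN = da / (C * dK^m) from a0 to a_crit (midpoint rule)."""
    da = (a_crit - a0) / steps
    cycles = 0.0
    for i in range(steps):
        a_mid = a0 + (i + 0.5) * da
        growth_rate = c * delta_k(a_mid, delta_sigma, y) ** m   # da/dN
        cycles += da / growth_rate
    return cycles

# Hypothetical values: crack grows from 1 mm to 10 mm under a 100 MPa range.
n_cycles = cycles_to_grow(a0=1e-3, a_crit=1e-2, c=1e-11, m=3.0,
                          delta_sigma=100.0)
```

Because the growth rate depends on C and on the exponent m, small changes in either parameter shift the predicted life considerably, which is consistent with the sensitivity noted above.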
The Forman law of linear elastic fracture mechanics is another PbM model [90]. Oppenheimer and Loparo applied data from condition monitoring together with Forman law crack growth physics to life models. Because identification of the defect area size during operations is often instantaneous, this approach could be impractical for certain situations. Furthermore, assumptions may be oversimplified and there needs to be an examination of the model parameters before application [90].
Orsagh et al. presented a stochastic variation based on the Kotzalas-Harris model for estimating failure progression and time-to-failure, together with the Yu-Harris life equation for determining fatigue spall initiation. The current state of the bearing is estimated by calculating the time-to-spall initiation, followed by prediction of the future bearing health model [94,95]. Marble and Morton [93] presented a PbM method for spall propagation using FEA to estimate spall size, material stress, rolling element speed and load. Their model can predict the number of cycles until failure with consideration of the principles of damage mechanics [93].
While there are numerous application domains and differences between models, the aforementioned models share common features that make them appropriate for specific uses. Generally, PbM approaches are conventional and employ mathematical methods to understand failure modes [96]. In comparison to data-driven approaches, PbM approaches are more accurate. However, PbM approaches may not be effective for estimating RUL in complex systems because they are better suited to specific components than to systems as a whole [86]. Furthermore, it is very difficult to describe the behavior of individual components within complex systems using unique mathematical equations. These approaches require a significant amount of experimentation [97]; therefore, a PbM method developed for one system is not applicable to a different system.

Data-based model
In the data-based (DbP) approach, monitoring data is processed in order to model prognostics instead of building mathematical models of system behaviours [98]. Data-based models estimate precursors to failure and RUL by considering past data and estimating the output using monitoring data. One major advantage of data-based models is their simplicity in terms of calculation: an algorithm processes past degradation patterns to estimate future degradation [78]. Although all data-based approaches are driven by data and, to some degree, use models, they may be categorised as either model-based or data-driven [99].
To provide a RUL prediction, the prognostic model assumes that an accurate mathematical model of damage (or degradation) can be driven by condition-monitoring data from the damage qualification step, which is itself derived from system sensor measurements and the estimation algorithm. The model parameters for the RUL prediction step are obtained from this combined model. The degradation model is expressed as a function of system data and model parameters. Damage classification and data are provided to the model, and the estimation algorithm uses them to identify the damage model parameters that describe the degradation behavior occurring in the system. RUL is then forecast based on the calculated model parameters [100].
Model-based prognostic methods include several techniques that employ dynamic models of the predicted process, such as Kalman and particle filtering methods, autoregressive moving average (ARMA) techniques, and empirical methods [101]. Generally, these models are Bayesian-based, whereby the state of a process can be estimated using minimum prediction covariance derived from measurements. Kordestani et al. [102] proposed a method for fault prognosis based on neural networks and a recursive Bayesian algorithm, resulting in a high level of accuracy. Model-based methods are capable of predicting current and future states of nonlinear systems and of estimating the RUL based on deterioration trends before the asset arrives at the predefined threshold [100]. This reflects the fact that their involvement in the processes of RUL prediction is high. On the other hand, they do not directly learn from data, and they have shortcomings in terms of different operational trajectories.
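As a minimal sketch of the Kalman filtering idea mentioned above, the following scalar filter tracks a degradation level from noisy measurements, assuming a random-walk state model. The process noise q, measurement noise r, initial values and the synthetic measurements are hypothetical and would need tuning for any real bearing data.

```python
def kalman_1d(measurements, q, r, x0=0.0, p0=1.0):
    """Scalar Kalman filter with a random-walk state model.

    q: process noise variance, r: measurement noise variance,
    x0, p0: initial state estimate and its variance.
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        p = p + q              # predict: uncertainty grows between samples
        k = p / (p + r)        # Kalman gain
        x = x + k * (z - x)    # update the estimate with the new measurement
        p = (1.0 - k) * p      # posterior variance
        estimates.append(x)
    return estimates

# Synthetic degradation ramp with alternating measurement disturbance.
noisy = [0.1 * k + (0.2 if k % 2 else -0.2) for k in range(30)]
smoothed = kalman_1d(noisy, q=0.01, r=1.0)
```

A prognostic use would extend the state with a degradation rate and propagate the prediction step forward until a failure threshold is crossed.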
Data-driven approaches to condition monitoring and maintenance derive their predictions from the analysis of condition-monitoring data [74]. Such a prognostic approach is effective because its data discovery is simple and consistent in complex processes [86]. Data-driven models make it simple to integrate innovative approaches, creating an inclusive prognostic approach [103]. A data-driven approach to prognosis that combines principal component analysis with exponential degradation, using kurtosis, was proposed by Anis [104] and was successful for the prognosis of rotating shaft failure.
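To illustrate only the exponential degradation part of such an approach (the principal component analysis and kurtosis steps of [104] are omitted), the sketch below fits y = a exp(b t) to a health indicator by least squares on its logarithm and extrapolates to a failure threshold. The function names, the synthetic indicator and the threshold are assumptions for the example.

```python
import math

def fit_exponential(times, values):
    """Least-squares fit of ln(y) = ln(a) + b*t; returns (a, b)."""
    logs = [math.log(v) for v in values]
    n = len(times)
    t_mean = sum(times) / n
    l_mean = sum(logs) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (l - l_mean) for t, l in zip(times, logs))
    b = sxy / sxx
    a = math.exp(l_mean - b * t_mean)
    return a, b

def remaining_useful_life(times, values, threshold):
    """Time from the last sample until the fitted curve hits the threshold."""
    a, b = fit_exponential(times, values)
    if b <= 0:
        return math.inf          # no upward degradation trend detected
    t_fail = math.log(threshold / a) / b
    return max(0.0, t_fail - times[-1])

# Synthetic, noise-free indicator growing as exp(0.1 * t).
times = list(range(10))
values = [math.exp(0.1 * t) for t in times]
rul = remaining_useful_life(times, values, threshold=math.exp(2.0))
```

With real data the indicator is noisy, so the estimate would normally be reported with an uncertainty band rather than as a single number.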
Common data-driven models in the prognostic field are explained in Table 4. These models provide prognostic applications with the ability to learn without being explicitly programmed. Most of these approaches focus on the development of RUL prediction algorithms that can change when exposed to new but similar data. Conventional data-driven methods consist of simple forecast models, including exponential smoothing, the Gamma process [105,106], and autoregressive models [107].
The main advantage of these techniques is that their implementation is simple and can be carried out on a programmable estimator [117]. On the other hand, these basic projection techniques are based on the assumption that there is an underlying stability in the system being monitored, and they rely on historical performance to predict future degradation. This reliance is risky and can result in inaccurate forecasts when a trend changes or the data ends during a fluctuation. More complex systems such as Bayesian Networks [118,120] and fuzzy logic systems [119,120] have been developed for data-driven prognostic projections. These applications can extract useful knowledge from complex data in various forms, but their prognostic accuracy in multistep-ahead predictions is limited where long projections are expected but test trajectories are short. Artificial Neural Networks (ANNs) are a widely used data-driven approach to prognostics [52,121]. ANNs are computational algorithms that use data-processing neurons to perform machine learning; the network computes output values from the input data through its connected neurons [122,123]. ANNs establish a set of interconnected relationships between inputs and desired outputs, and they can be trained to improve performance [124].
Neural networks can effectively model systems comprising an extensive class of non-linear regression, non-linear dynamic systems, data reduction and discriminant models [125]. In certain applications, such as complex engineering systems, the measured data from the system may be imprecise, and the desired results may not be directly linked to the input data. In such cases, ANNs are suitable for modelling systems where the precise relationship between input and output data is not known [126]. ANNs are therefore applicable to predictive algorithms for complicated systems and can be quicker and easier to use than other predictive methods. Thus, ANNs are a widely used data-driven prognostic method and are widely adopted across different disciplines.
Predictions using ANNs can be difficult where there is insufficient knowledge about the degradation process [127,128]. In network modelling, ANNs use actual sample points from the time series to predict its next value, without the need to feed predictions back as input values [129,130]. Where the prediction horizon is longer and spans multiple steps, the ANN output should be fed back externally to the initial time series for a fixed number of steps; the regression components of these input series, previously formed from sample points of the initial time series, are gradually replaced by values that have already been predicted [128]. These replacements may lead to an imbalance in the predictions, which may imitate the training data [131]. Nevertheless, ANNs provide sound computational mapping between raw data and the outputs required in the network prediction [132].
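The feedback scheme for multi-step prediction can be sketched as follows. A simple linear extrapolation rule stands in for a trained ANN here; it is a hypothetical placeholder, as are the function names, and any one-step predictor taking a window of past values could be substituted.

```python
def recursive_forecast(history, one_step, order, steps):
    """Multi-step forecast by feeding each prediction back as an input.

    history: measured time series, one_step: one-step-ahead predictor,
    order: number of past values the predictor uses, steps: horizon.
    """
    window = list(history[-order:])
    forecasts = []
    for _ in range(steps):
        y_next = one_step(window)
        forecasts.append(y_next)
        # The oldest measured value is dropped and a predicted value
        # takes its place in the input window.
        window = window[1:] + [y_next]
    return forecasts

def linear_trend_step(window):
    """Hypothetical one-step model standing in for a trained ANN."""
    return 2 * window[-1] - window[-2]

forecasts = recursive_forecast([1.0, 2.0, 3.0], linear_trend_step,
                               order=2, steps=3)
```

After `order` steps the input window contains only predicted values, which is why errors can accumulate over long horizons.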

Prognostics performance evaluation
Prognostic metrics can be seen as a standardised method of communication whereby users show their results and compare their findings [133]. Over time, due to numerous prognostic implementations in different disciplines, metrics have been established for assessing forecasting performance, including the work of Saxena, Leao, and Goebel [134,135,136]. These metric sets validate prognostic application performance. Because they are concerned with applications where run-to-failure data is available and the actual RUL is known, they are particularly useful for the model development stage, whereby the metrics can be used for integration of prognostic procedures [133].
These metrics are defined mathematically, and their relationship to prognostics design is presented in the equations that follow:

Error ($\varepsilon$)
Error is the deviation from a desired target [133]:
$$\varepsilon = \hat{y} - y$$
where $\hat{y}$ is the estimated value (ETTF) and $y$ is the actual output value (ATTF). In this definition, the absolute error (AE) is the following:
$$\mathrm{AE} = |\hat{y} - y|$$

Mean absolute error (MAE)
Where there is more than one instance, an average of the absolute error terms is calculated using the mean absolute error [137]. This measures the closeness of estimations to the actual outcomes:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |\hat{y}_i - y_i|$$
where $n$ is the number of instances and $\hat{y}_i$ and $y_i$ are the estimated and actual values for instance $i$.

Mean square error (MSE)
MSE is a risk function for calculating the average of the squared values of the errors [137]. When the vector of predictions has been obtained and the vector of actual remaining useful life is available, the MSE can be calculated by:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$
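The absolute error, MAE and MSE described in this section can be computed with a few lines of Python. The function names and the example values of estimated and actual time to failure are hypothetical.

```python
def absolute_errors(estimated, actual):
    """Absolute error for each pair of estimated and actual values."""
    return [abs(e - a) for e, a in zip(estimated, actual)]

def mean_absolute_error(estimated, actual):
    """Average of the absolute errors over all instances."""
    errors = absolute_errors(estimated, actual)
    return sum(errors) / len(errors)

def mean_square_error(estimated, actual):
    """Average of the squared errors over all instances."""
    squares = [(e - a) ** 2 for e, a in zip(estimated, actual)]
    return sum(squares) / len(squares)

# Hypothetical estimated (ETTF) and actual (ATTF) times to failure.
ettf = [10.0, 12.0, 9.0]
attf = [11.0, 10.0, 9.0]
ae = absolute_errors(ettf, attf)
mae = mean_absolute_error(ettf, attf)
mse = mean_square_error(ettf, attf)
```

MSE penalises large deviations more heavily than MAE because the errors are squared before averaging.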

Challenges of prognostics
Wang [117] noted that successful prognostic applications are still difficult to find for complex engineering systems, despite the numerous algorithms that have been proposed for calculating remaining useful life. There are numerous issues and misconceptions in the development of these algorithms, which present a challenge to prognostic applications used in complex systems. Furthermore, because data characteristics are complex and stochastic and exhibit nonlinear degradation patterns, it is difficult to model systems accurately [138].
The challenges of prognostics and the associated requirements addressed in literature are provided here.

Lack of common data sources
The development of advanced prognostic techniques is an active area of research. For a model to show promise, data needs to be collected throughout the lifetime of a machine. As faults evolve, further estimations by prognostic systems are required to detect them [139].

Uncertainty in predictions
A number of factors influence system degradation and, therefore, the associated noise, uncertainty and errors found in the data. Data-driven prognostics depend on the assumption that historical data can support a model for estimating remaining useful life; however, future operational conditions are unknown and require projection. Moreover, it may not be possible to provide results when the test data is short and long-term projections are required, in which case prognostic accuracy may fail.

Validation issues
Predicting remaining useful life is not the same as predicting future behaviour that can be validated after a whole life cycle has elapsed and a real failure has been reached. If the dataset provides the actual time to failure, the prediction can be validated using the metrics discussed above (Section 2.10). Although metrics developed within forecasting differ from those needed for prognostic applications, they are a widely used method of validation and allow findings to be compared. Furthermore, metrics can be used to assess algorithmic performance in prognostic applications and are useful at the algorithm development stage, whereby metric feedback is employed for fine-tuning the prognostic algorithms [140]. However, where test data is short there is a higher risk of error, and fluctuations resulting from operational conditions can also negatively affect the results.

Conclusions
In this paper, a comprehensive review of roller bearing diagnosis and prognosis has been presented. The review showed that various techniques have been used in combination with vibration analysis for the diagnosis and prognosis of bearing element faults; however, most of these algorithms are valid only for certain cases and cannot be generalised.
Although many researchers have addressed fault detection within roller element bearings there are many challenges facing fault detection. One of the most challenging scenarios is bearing fault detection where the vibration or acoustic signals are strongly masked by noise from more dominant components such as gears and shafts. An example of this scenario is the gearbox of a wind turbine, which presents difficulty for bearing fault detection. Therefore, this research aims to develop a diagnostic and prognostic tool to detect bearing faults and predict the remaining useful life under a strongly masked signal.
Although a large variety of prognostic models have been proposed and well reported in the technical literature, an efficient prognostic methodology with accurate life prediction for real-world application has yet to be developed. For accurate prognostics, it is essential to conduct prior analysis of the system's degradation process and its failure patterns, and to maintain a log of the history and condition of the machine throughout its life. Future research in the area of vibration analysis will address the gap related to prognosis capability through machine learning and propose a way to reduce dependency on training data to establish life prediction.