Different formulations for benchmarking or measuring performance must be understood, as well as the meaning of each measured result. It is equally important to understand the capabilities of the neural network, especially for multivariate analysis, and any limitation that arises from the network or from the data on which the network is trained. This article is intended to address the need for bringing an increased understanding of artificial neural networks to the medical community so that these powerful computing paradigms may be used even more successfully in future medical applications.
In the third section we consider the internal architecture of a neural network, its paradigm for adjustment (training) of internal parameters (weights), and its proven capabilities. In the fourth section we discuss measurements of the performance of neural networks, including the use of verification data sets for benchmarking a trained network, the receiver operating characteristic (ROC) plot, and confidence and prediction intervals.
In the fifth section we discuss the uses of neural networks in medical decision support systems.
The sixth section summarizes our major findings concerning the current role of neural networks in medicine, and Section 7 concludes with a vision of the emerging uses of neural networks in medicine. Neural networks can play a key role in medical decision support because they are effective at multifactorial analysis. More specifically, neural networks can employ multiple factors in resolving medical prediction, classification, pattern recognition, and pattern completion problems. Many medical decisions are made in situations in which multiple factors must be weighed.
For example, there usually is no single laboratory test that can provide decisive information on which to base a medical action decision. Yet it should be clear that, when multiple factors affect an outcome, an adequate decision cannot be based on a single factor alone. Yields are measured and results are plotted Fig. Statistics for experimenters. This temperature is the same as that used for the first series of runs; again, a maximum yield of roughly 75 g is obtained.
Second set of experiments showing yield versus temperature, with the reaction time held fixed at minutes. Figure 3 shows the underlying response surface for this illustrative example. The horizontal and vertical axes show time and temperature. Yield is represented along a third axis, directed upward from the figure, perpendicular to the axes shown.
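The limitation of varying one factor at a time can be sketched numerically. The surface below is a hypothetical one chosen only for illustration (it is not the response surface from the figure), and the names `yield_at`, `grid`, and the starting time of 1.0 are assumptions. The sketch compares a one-factor-at-a-time search with a search that varies both factors together.

```python
# Hypothetical yield surface (an assumption for illustration, not the
# surface from the figure): a narrow diagonal ridge whose peak of 100
# lies at time = 5, temperature = 5 in arbitrary units.
def yield_at(time, temp):
    return 100.0 - 10.0 * (time - temp) ** 2 - 0.5 * (time + temp - 10.0) ** 2

# Candidate settings 0.0, 0.1, ..., 10.0 for both factors.
grid = [i / 10 for i in range(101)]

# One factor at a time: hold time fixed and tune temperature, then hold
# that temperature fixed and tune time.
start_time = 1.0
best_temp = max(grid, key=lambda temp: yield_at(start_time, temp))
best_time = max(grid, key=lambda time: yield_at(time, best_temp))
ofat_yield = yield_at(best_time, best_temp)

# Joint search: vary both factors together over the same grid.
joint_time, joint_temp = max(
    ((t, c) for t in grid for c in grid),
    key=lambda pair: yield_at(*pair),
)
joint_yield = yield_at(joint_time, joint_temp)

print(f"one factor at a time: {ofat_yield:.1f} at ({best_time}, {best_temp})")
print(f"joint search:         {joint_yield:.1f} at ({joint_time}, {joint_temp})")
```

On this made-up surface the one-factor-at-a-time search stalls partway up the ridge, while the joint search reaches the peak. Further alternating rounds would creep along the ridge, but only slowly.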
If you change only one variable at a time and attempt to maximize results, you may never reach the top of the hill. Because predictive analysis, via artificial neural networks and other statistical methods, is treated as a global minimization problem e. We submit the following conclusions from this illustrative example: (1) Searching for a single key predictive factor is not the only way to search.
Varying a single factor at a time, and keeping other factors constant, can limit the search results. Neural network technology is intended to address these three key points. The inputs of a neural network comprise one or many factors, putatively related to the outputs. When multiple factors are used, their values are input simultaneously. The values of any number of these factors then can be changed, and the next set of values is input to the neural network. Any predictions or other results produced by the network are represented by the value(s) of the output node(s), and many factors are weighted and combined, followed by a weighting and recombining of the results.
Thus, the result or prediction does not have to be due to a single key predictive factor but rather is due to a weighting of many factors, combined and recombined nonlinearly. In a medical application, the problem to be solved could be a diagnostic classification based on multiple clinical factors, whereby the error in the diagnosis is minimized by the neural network for a population of patients. Alternatively, the problem could be the classification of images as containing either normal or malignant cells, whereby the classification error is minimized.
Both the diagnostic problem and the image analysis problem are examples of classification problems, in which one of several outcomes, or classes, is chosen or predicted. A third example would be predicting a drug dosage level for individual patients, which illustrates a function fitting application. However, we seldom have that luxury because usually no single factor is sufficiently definitive, and many decisions must be made based on weighing the presence of many factors.
In this case, neural networks can provide appropriate decision support tools because neural networks take into account many factors at the same time by combining and recombining the factors in many different ways including nonlinear relations for classification, prediction, diagnostic tasks, and function fitting. The building of modern medical databases has suggested an emergent role for neural networks in medical decision support.
For the first time in history, because of these growing databases, we have the ability to track large amounts of data regarding substantial and significant patient populations. Equally important is the need for feedback between the analysis of results and the data collection process. Sometimes the analysis of results indicates that the predictive ability of the data is limited, thus suggesting the need for new and different data elements during the collection of the next set of data.
For example, results from a new clinical test may be needed. As each database is analyzed, neural networks and statistical analysis can demonstrate the extent to which disease states and outcomes can be predicted from factors in the current database. The accuracy and performance of these predictions can be measured and, if limited, can stimulate the expansion of data collection to include new factors and expanded patient populations. Databases have been established in the majority of major medical institutions. These databases originally were intended to provide data storage and retrieval for clinical personnel.
However, there now is an additional goal: to provide information suitable for analysis and medical decision support by neural networks and multifactorial statistical analysis. Comparisons of computerized multivariate analysis with human expert opinions have been performed in some studies, and some published comparisons identify areas in which neural network diagnostic capabilities appear to exceed that of the experts.
Traditionally, expert opinions have been developed from the expert's practical clinical experience and mastery of the published literature. Currently we can, in addition, employ neural networks and multivariate analysis to analyze the multitude of relevant factors simultaneously and to learn the trends in the data that occur over a population of patients. The neural network results then can be used by the clinician. Today, each physician treats a particular selection of patients. Because a particular type of patient may or may not visit a particular physician, the physician's clinical experience becomes limited to a particular subset of patients.
A physician then could have access to neural networks trained on a population of patients that is much larger than the subset of patients the physician sees in his or her practice. When a neural network is trained on a compendium of data, it builds a predictive model based on that data. The model reflects a minimization in error when the network's prediction (its output) is compared with a known or expected outcome.
For example, a neural network could be established to predict prostate biopsy study outcomes based on factors such as prostate specific antigen (PSA), free PSA, complex PSA, age, etc. The network then would be trained, validated, and verified with existing data for which the biopsy outcomes are known. Performance measurements would be taken to report the neural network's level of success. These measurements could include the mean squared error (MSE), the full range of sensitivity and specificity values i. The trained neural network then can be used to classify each new individual patient.
The predicted classification could be used to support the clinical decision to perform biopsy or support the decision to not conduct a biopsy. This is a qualitatively different approach (a paradigm shift) compared with previous methods, whereby statistics concerning given patient populations and subpopulations are computed and published and a new individual patient then is referenced to the closest matching patient population for clinical decision support. With this new multivariate approach, we are ushering in a new era in medical decision support, whereby neural networks and multifactorial analysis have the potential to produce a meaningful prediction that is unique to each patient.
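The sensitivity and specificity measurements mentioned for the biopsy example can be illustrated with a short sketch. The scores and labels below are invented for illustration, and `roc_points` and `area_under_curve` are hypothetical helper names. The sketch sweeps a decision threshold over the network outputs and summarizes the resulting ROC plot by its area.

```python
def roc_points(scores, labels):
    """Return (false positive rate, sensitivity) pairs, one per threshold.

    scores: network outputs in [0, 1]; labels: 1 = positive outcome,
    0 = negative outcome. A case is called positive when its score is
    greater than or equal to the threshold. Specificity is 1 minus the
    false positive rate.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under the ROC plot."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Invented example data: eight patients, four with a positive outcome.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
auc = area_under_curve(points)
print(points)
print(auc)
```

An area of 0.5 corresponds to chance-level discrimination and an area of 1.0 to perfect separation of the two classes.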
Neural network for predicting the outcome of a prostate biopsy study. PSA: prostate specific antigen. Artificial neural networks are inspired by models of living neurons and networks of living neurons. Artificial neurons are nodes in an artificial neural network, and these nodes are processing units that perform a nonlinear summing function, as illustrated in Figure 5. Synaptic strengths translate into weighting factors along the interconnections. Illustration of an artificial neural network processing unit. Each unit is a nonlinear summing node. The square unit at the bottom left is a bias unit, with the activation value set at 1.
Other early contributors included Anderson, 22 Amari, 23 Grossberg, 24 Kohonen, 25 Fukushima, 26 and Cooper, 59 to name a few of the outstanding researchers in this field. These contributions were chronicled by Anderson and Rosenfeld.
These additional structures included optical filters, additional neural layers with fixed random weights, or other layers with unchanging weights. Nevertheless, the single layer of trainable weights was limited to solving linear problems, such as linear discrimination: drawing a line, or a hyperplane in n dimensions, to separate two areas (no curves allowed). Werbos 4 then extended the network models beyond the perceptron (a single trainable layer of weights) to models with two layers of weights that were trainable in a general fashion, and that accomplished nonlinear discrimination and nonlinear function approximation.
The multilayer perceptron (MLP) typically is organized as a set of interconnected layers of artificial neurons. Each artificial neuron has an associated output activation level, which changes during the many computations that are performed during training. The most popular squashing function is the sigmoid function, as follows:

$$f(x) = \frac{1}{1 + e^{-x}} \tag{1}$$

in which x is the input to the squashing function and, in the neural network, is equal to S_j (for node j), the sum of the products of the incoming activation levels with their associated weights. This incoming sum for node j is computed as follows:

$$S_j = \sum_{i=1}^{n} w_{ji} a_i \tag{2}$$

in which w_{ji} is the incoming weight from unit i, a_i is the activation value of unit i, and n is the number of units that send connections to unit j.
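The computation performed by a single node can be sketched in a few lines. The weights and activations below are arbitrary illustrative values, and the function names are assumptions rather than part of any particular library. The sketch forms the incoming sum for node j and passes it through the sigmoid squashing function.

```python
import math

def sigmoid(x):
    """Squashing function: maps any real input into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def incoming_sum(weights, activations):
    """S_j: sum of incoming activation levels times their connection weights."""
    return sum(w * a for w, a in zip(weights, activations))

def node_activation(weights, activations):
    """Activation level of node j: the squashed incoming sum."""
    return sigmoid(incoming_sum(weights, activations))

# Illustrative values only: three sending units, the last one a bias unit
# whose activation is fixed at 1.
a = [0.5, 0.2, 1.0]
w_j = [0.4, -0.6, 0.1]
s_j = incoming_sum(w_j, a)   # 0.4*0.5 + (-0.6)*0.2 + 0.1*1.0
a_j = node_activation(w_j, a)
print(s_j, a_j)
```

Because the sigmoid is bounded, the activation stays between 0 and 1 however large the incoming sum becomes.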
Artificial neurons (nodes) typically are organized into layers, and each layer is depicted as a row, or collection, of nodes. Beginning with a layer of input neurons, there is a succession of layers, each interconnected to the next layer. This means that each neuron is connected to all neurons in the next layer, as depicted in Figure 7.
The last layer is the output layer, and activation levels of the neurons in this layer are considered to be the output of the neural network. Weights are adjusted in such a way that each weight adjustment increases the likelihood that the network will compute the desired output at its output layer. Because training can be difficult, a tremendous number of computational options and enhancements have been developed to improve the training process and its results.
Adjustment of weights in training often is performed by a gradient descent computation, although many improvements to the basic gradient descent algorithm have been designed. The amount by which the neural network is in error can be expressed by the MSE, which is calculated as follows:

$$\mathrm{MSE} = \frac{1}{P\,n} \sum_{p=1}^{P} \sum_{i=1}^{n} \left( d_{i,p} - a_{i,p} \right)^2 \tag{4}$$

in which d_{i,p} is the desired output of output unit i for input pattern p, a_{i,p} is the corresponding actual output, P is the total number of patterns in the data set, n is the number of output units, and the sums are taken over all data patterns and all output units.
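As a small worked illustration of the MSE, the sketch below averages the squared differences between desired and actual outputs over all patterns and output units. Dividing by the product of the number of patterns and the number of output units is an assumption about the normalization, and the example values are invented.

```python
def mean_squared_error(desired, actual):
    """MSE over P patterns and n output units.

    desired[p][i] and actual[p][i] hold the desired and actual output of
    output unit i for pattern p. Normalizing by P * n is an assumption.
    """
    num_patterns = len(desired)
    num_outputs = len(desired[0])
    total = 0.0
    for d_row, a_row in zip(desired, actual):
        for d, a in zip(d_row, a_row):
            total += (d - a) ** 2
    return total / (num_patterns * num_outputs)

# Two patterns, two output units (invented values).
desired = [[1.0, 0.0], [0.0, 1.0]]
actual = [[0.8, 0.1], [0.3, 0.6]]
mse = mean_squared_error(desired, actual)
print(mse)
```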
In this mountainous terrain, we are seeking the minimum i. The weights at which this minimum is attained correspond to the x and y values associated with the bottom of the valley. An analogy would be that the x and y values corresponding to the weights would be the longitude and latitude of the bottom of a valley in the mountainous terrain.
A local minimum is the bottom of any valley in the mountainous terrain and a global minimum is the bottom of the lowest valley of the entire mountainous region. The gradient descent computation is intuitively analogous to a scenario in which a skier is dropped from an airplane to a random point on the mountainous terrain. The skier's goal is to find the lowest possible elevation. At each point, the skier checks all degrees of rotational direction, and takes a step in the direction of steepest descent. This will result in finding the bottom of a valley nearby to the original starting point.
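The skier analogy can be turned into a minimal numerical sketch. The error surface below is hypothetical, chosen only because it has two valleys of different depth, and the learning rate and step count are arbitrary assumptions. Two descents from different starting points end in different valleys.

```python
def error(x, y):
    """Hypothetical error surface with two valleys of different depth."""
    return x ** 4 - 3 * x ** 2 + x + y ** 2

def gradient(x, y):
    """Partial derivatives of the error with respect to x and y."""
    return 4 * x ** 3 - 6 * x + 1, 2 * y

def descend(x, y, learning_rate=0.01, steps=2000):
    """Repeatedly take a small step in the direction of steepest descent."""
    for _ in range(steps):
        gx, gy = gradient(x, y)
        x -= learning_rate * gx
        y -= learning_rate * gy
    return x, y

# Two "drop points" on the terrain.
right_start = descend(2.0, 1.0)
left_start = descend(-2.0, 1.0)
print("from the right:", right_start, error(*right_start))
print("from the left: ", left_start, error(*left_start))
```

On this surface the descent from the right settles in the shallower valley (a local minimum), whereas the descent from the left reaches the deeper one, which illustrates why the starting weights matter.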
This bottom certainly will be a local minimum; it may be a global minimum as well. This determines the activation values of the input nodes. Next, forward propagation ensues, in which first the hidden layer updates its activation values followed by updates to the output layer, according to Equation 3. Next, the desired (known) outputs are submitted to the network. A calculation then is performed to assign a value to the amount of error associated with each output node.
The formula for this error value is as follows:

$$\delta_{j,3} = \left( d_j - a_{j,3} \right) f'(S_{j,3}) \tag{5}$$

in which d_j is the desired output for output unit j, a_{j,3} is the actual output for output unit j (layer 3), f(x) is the squashing function (with f' its derivative), and S_{j,3} is the incoming sum for output unit j, as in Equation 2. After these error values are known, weights on the incoming connections to each output neuron then can be updated.
Figure 8 illustrates the updating of the weight along a single connection; the error delta value of the target neuron is multiplied by the activation value of the source neuron. Figure 9 illustrates the backward flow of computations during this training paradigm. The derivation for these equations is based directly on the gradient descent approach, and uses the chain rule and the interconnected structure of the neural network.
Mathematical details were given by Rumelhart and McClelland 5 and Mehrotra et al. (A) In the forward propagation step, the activation levels of the hidden layer are calculated and the activation levels of the output layer then are calculated. The activation levels of the output layer become the output pattern, which is aligned with the target pattern (the desired output). Then, for each output unit, weights on its incoming connections are adjusted. (C) An error delta value is calculated for each unit in the hidden layer. Next, for each hidden unit, weights on its incoming connections are adjusted.
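The forward and backward steps described above can be sketched for a very small network. This is a minimal illustration, assuming a 2-2-1 network without bias units, a sigmoid squashing function, a per-pattern error of half the squared difference, and an arbitrary learning rate of 0.5. It is not the implementation from the cited references. Each weight change is the learning rate times the error delta of the target unit times the activation of the source unit.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, w_hidden, w_output):
    """Forward propagation: hidden activations first, then the output."""
    hidden = [sigmoid(sum(w * a for w, a in zip(row, inputs))) for row in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_output, hidden)))
    return hidden, output

def backprop_step(inputs, desired, w_hidden, w_output, learning_rate=0.5):
    """One training step on one pattern; returns new weights and the deltas."""
    hidden, output = forward(inputs, w_hidden, w_output)
    # For the sigmoid, f'(S) = f(S) * (1 - f(S)).
    delta_out = (desired - output) * output * (1.0 - output)
    # Hidden deltas: the output delta sent backward along the old weights.
    delta_hidden = [h * (1.0 - h) * delta_out * w for h, w in zip(hidden, w_output)]
    # Weight change = learning rate * delta of target * activation of source.
    new_w_output = [w + learning_rate * delta_out * h for w, h in zip(w_output, hidden)]
    new_w_hidden = [
        [w + learning_rate * delta * a for w, a in zip(row, inputs)]
        for row, delta in zip(w_hidden, delta_hidden)
    ]
    return new_w_hidden, new_w_output, delta_out, delta_hidden

def pattern_error(inputs, desired, w_hidden, w_output):
    """Half the squared difference between desired and actual output."""
    _, output = forward(inputs, w_hidden, w_output)
    return 0.5 * (desired - output) ** 2

# Arbitrary starting weights and a single training pattern.
x, d = [1.0, 0.0], 1.0
w_h = [[0.3, -0.2], [-0.4, 0.1]]
w_o = [0.2, -0.5]
error_before = pattern_error(x, d, w_h, w_o)
for _ in range(100):
    w_h, w_o, _, _ = backprop_step(x, d, w_h, w_o)
error_after = pattern_error(x, d, w_h, w_o)
print(error_before, error_after)
```

Repeating the step on the same pattern should drive the error down. A real training run would cycle through many patterns and monitor a separate data set to decide when to stop.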
Reproduced with permission from Dayhoff JE. Neural network architectures: an introduction. New York: Van Nostrand Reinhold, There are other training algorithms that seek to find a minimum in the error surface e. Some of these algorithms are alternatives to gradient descent, and others are strategies that are added to the gradient descent algorithm to obtain improvement.
These alternative strategies include jogging the weights, reinitializing the weights, the conjugate gradient technique, the Levenberg-Marquardt method, and the use of momentum. Some of these techniques are tailored toward finding a global rather than a local minimum in the error surface, such as using genetic algorithms for training.
Other techniques are good for speeding the training computations, 29 or may provide other advantages in the search for a minimal point. In addition, there is a set of techniques for improving the activation function, which usually has been a sigmoid function, but has been expanded to include other nonlinear functions 30 , 31 and can be optimized during training. It should be recognized that results with neural networks always depend on the data with which they are trained.
Neural networks are excellent at identifying and learning patterns that are in data. However, if a neural network is trained to predict a medical outcome, then there must be predictive factors among the inputs to the network before the network can learn to perform this prediction successfully. In more general terms, there have to be patterns present in the data before the neural network can learn the patterns successfully.
If the data contain no patterns or no predictive factors, then the neural network performance cannot be high. Thus, neural networks are not only dependent on the data, but are limited by the information that is contained in those data. It is important to consider that a neural network is simply a set of equations. If one considers all different types of neural networks, with varied node updating schemes and different training paradigms, then one has a collection or class of systems of equations.
As such, neural networks can be considered a kind of language of description for certain sets of systems of equations. These equations are linked together, through shared variables, in a formation diagrammed as a set of interconnected nodes in a network. For example, when a neural network is used to describe a set of equations, the network diagram immediately shows how those equations are related, showing the inputs, outputs, and desired outputs and intuitively is easier to conceptualize compared with methods that involve equations alone.
The conceptual paradigm of an array of simple computing elements highly interconnected with weighted connections has inspired and will continue to inspire new and innovative systems of equations for multivariate analyses. These systems serve as robust solvers of function approximation problems and classification problems. Substantial theory has been published to establish that multilayered networks are capable of general function approximation.
This is a powerful computational property that is robust and has ramifications for many different applications of neural networks. Neural networks can approximate a multifactorial function e. This capability gives neural networks a decided advantage over traditional statistical multivariate regression techniques. Figure 10 illustrates a function approximation problem addressed by a neural network. The plot at the top is a function for which a neural network approximation is desired.
The neural network is trained to input the value of x, and to produce, as output, an approximation to the value f(x). The neural network is trained on a section of the function. Theoretical results show us that regardless of the shape of the function, a neural network with appropriate weights and configuration can approximate the function's values e. Function approximation.
Top: a function f(x). Bottom: a neural network configured to determine an approximation to f(x), given the input x. Neural network weights exist that approximate any arbitrary nonlinear function. Sometimes function approximation problems occur directly in medicine, such as in drug dosing applications.
Time series prediction tasks, such as predicting the next value on a monitor or blood test, also are useful in medicine and can be restated in terms of function approximation. Classification tasks, such as classifying tissue into malignant stages, occur frequently in medicine and can be restated as function approximation problems in which a particular function value represents a class, or a degree of membership in a class.
Diagnosis is an instance of a classification task in which the neural network determines a diagnosis e. Thus, the neural network's function approximation capabilities are directly useful for many areas of medicine, including diagnosis, prediction, classification, and drug dosing. Figure 11 illustrates the approach for applying neural networks to pattern recognition and classification. On the left are silhouettes of three different animals: a cat, a dog, and a rabbit.
Each image is represented by a vector, which could be the pixel values of the image or a preprocessed version of the image. This vector has different characteristics for the three different animals. The neural network is to be trained to activate a different output unit for each animal. After each presentation, the internal weights of the neural network are adjusted. After many presentations of the set of animals to be learned, the weights are adjusted until the neural network's output matches the desired outputs on which it was trained.
The result is a set of internal weight values that has learned all three patterns at the same time. How does such a network recognize three different objects with the same internal parameters? An explanation lies in its representation of the middle hidden layer of units as feature detectors. The first layer of weights combines specific parts of the input pattern to activate the appropriate feature detectors.
Alternatively, the output value in the 0 to 1 interval, in response to a pattern presented to the network, may be interpreted as an index reflecting a probability or degree of membership that associates the pattern with the class corresponding to the output node.
The general function approximation theorem then is extensible, and can be used to demonstrate the general and powerful capabilities of neural networks in classification problems. There exist many different performance measurements for neural networks. Simple performance measures can be employed to express how well the neural network output matches data with known outcomes.
The area under the ROC plot is a more extensive performance measure to use with classification neural networks. Fuller use of ROC plots is a more elaborate way of demonstrating the performance of classification neural networks and will be discussed in more detail later. A comparison with experts can be conducted to measure performance. Does the neural network predict an outcome or diagnosis as often as a trained expert does?
Does the neural network agree with the experts as often as they agree with each other? It is important to note that when training a neural network, three nonoverlapping sets of data must be used. Typically, a researcher would start with a compendium of data from a single population of patients. This data then is divided, at random, into three subsets: the training set, the validation or testing set, and the verification set.
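The random division into three nonoverlapping subsets can be sketched as follows. The 60/20/20 proportions, the fixed seed, and the function name `three_way_split` are assumptions made for illustration; the text does not prescribe particular proportions.

```python
import random

def three_way_split(records, train_frac=0.6, validation_frac=0.2, seed=0):
    """Randomly divide records into training, validation, and verification sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * validation_frac)
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    verification = shuffled[n_train + n_val:]  # whatever remains
    return training, validation, verification

# Hypothetical patient identifiers standing in for a compendium of data.
patient_ids = list(range(100))
training, validation, verification = three_way_split(patient_ids)
print(len(training), len(validation), len(verification))
```

Because each record is placed in exactly one slice of the shuffled list, the three sets cannot overlap.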
The training set is used for the adjustment of weights during training. The testing or validation set is used to decide when to stop training. The need for the testing set is motivated by the graph in Figure 12, which illustrates the root mean square (RMS) error of the training and testing sets plotted as a function of the number of training iterations. The RMS on the training set decreases with successive training iterations, as does the RMS on the test set, up to a point. Then, at n1 iterations, the RMS on the test set begins to increase whereas the RMS on the training set continues to decrease.
A classic training curve in which a training set gives decreasing error as more iterations of training are performed, but the testing (validation) set has a minimum at n1. To avoid overtraining, the training is stopped when the error on the test set is at a minimum, which occurs at n1 iterations in the figure. To report the performance of the network, a separate dataset, called the verification set, then is used.
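The stopping rule can be sketched as a scan over the validation errors recorded after each training iteration. The error values below are invented to mimic the classic curve, and the `patience` parameter (how many non-improving iterations to tolerate before halting) is an assumption that the text does not mention.

```python
def early_stopping_point(validation_errors, patience=3):
    """Return (best_iteration, best_error) for a sequence of validation errors.

    Scanning stops once the error has failed to improve for `patience`
    consecutive iterations, mimicking a training loop that halts itself.
    """
    best_iteration, best_error = 0, float("inf")
    for iteration, err in enumerate(validation_errors):
        if err < best_error:
            best_iteration, best_error = iteration, err
        elif iteration - best_iteration >= patience:
            break
    return best_iteration, best_error

# Invented RMS values: the validation error falls, bottoms out, then rises
# as the network begins to overtrain.
validation_rms = [0.52, 0.43, 0.36, 0.31, 0.29, 0.30, 0.33, 0.37, 0.42]
n1, best_rms = early_stopping_point(validation_rms)
print(n1, best_rms)
```

In practice the weights saved at the best iteration would be restored before the network is evaluated on the verification set.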
The performance on this set is reported and used in determining the validity of the neural network results. The verification set is not used at all during training or testing (validation). The hyper-parameterized structure of neural networks creates complex functions from the input that can approximate observed outcomes with minimal error 6. As such, ANNs can approximate any continuous function, as postulated by the Universal Approximation Theorem, but the immediate structures of a fitted model do not provide insights into the relative importance, underlying relationships, or structures of the predictors or covariates with the modelled outcomes.
As an example, neural networks can be used to predict clinical deterioration in adult hematologic malignancy patients 7. The input is a set of predictors P (diastolic blood pressures, heart rate, white blood cell count, etc.). The black box issue is that the approximation given by the neural network will not provide insight into the form of f, as there is often no simple relationship between the network weights and the property being modeled.
Even the analysis of the relevance of input variables is challenging 8 , and neural networks do not generate a statistically identifiable deterministic model. For a given training dataset and network topology, there can be multiple neural networks with different weights that generate very similar predictions of the modeled property, complicating understanding of the ANN and relevant predictors.
Recognizing the issue, many methods have been developed to help subject-matter audiences to understand the underlying functions of ANN. This article provides an overview of several of the most common algorithms and illustrates how they perform using examples. The R statistical programming language version 3.
The above code generates seven predictors (descriptors): age, gender, lactate (lac), type of patients (type), use of vasopressor (vaso), white blood cell count (wbc), and C-reactive protein (crp). The mortality (mort) is a binary clinical outcome variable, which takes two values: 0 for alive and 1 for deceased. The relationship between mort and the predictors is built under the logistic regression model framework. However, we will build a neural network model in the following example. There are several types of machine learning methods that could be used to generate models, but we use a neural network model in our examples.
Several packages are available in R to develop an ANN. The nnet package version 7. The caret (Classification And Regression Training) package version 6. The above code fits two neural network models. The first model, mod, is fit with the train function using all seven predictors. In the model, cross validation is used to select the best model by tuning the parameters size and decay.
In general, the size parameter defines the number of hidden nodes in the network, which are essentially free parameters that allow flexibility in the model fit between input and output layers. Increasing the number of hidden nodes increases the flexibility of the model but at the risk of over-fitting. The decay parameter is more abstract, in that it controls the rate of decay for changing the weights as used by the back-propagation fitting algorithm. In the example, the number of hidden units (size parameter) is chosen from 5, 10 and 15, and the decay parameter is chosen from 0, 0.
The second model, modcont, contains only continuous variables as predictors: age, lac and wbc. The weights connecting neurons in an ANN are partially analogous to the coefficients in a generalized linear model. The combined effects of the weights on the model predictions represent the relative importance of predictors in their associations with the outcome variable.
However, there are many weights connecting one predictor to the outcome in an ANN. The large number of adjustable weights in an ANN makes it very flexible in modeling nonlinear effects but imposes challenges for the interpretation. Garson proposed that the relative importance of a predictor can be determined by dissecting the model weights 11 , All connections between each predictor of interest and the outcome are identified.
Pooling and scaling all weights specific to a predictor generates a single value ranging from 0 to 1 that reflects relative predictor importance. The relative importance can be computed in R with the NeuralNetTools version 1. The relative importance of each predictor is shown in the above output. The results suggest that surgery type is the most important predictor of mortality outcome, followed by lactate (lac) and age.
Figure 1 displays the relative importance of each predictor.
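The pooling and scaling of weights described above can be sketched for a network with one hidden layer and a single output. This is a plain Python illustration of the commonly cited form of Garson's algorithm, not the NeuralNetTools implementation; the function name, the weight layout, and the example weights are assumptions, and bias weights are ignored.

```python
def garson_importance(input_hidden, hidden_output):
    """Relative importance of each input for a single-hidden-layer network.

    input_hidden[i][h]: weight from input i to hidden node h.
    hidden_output[h]:   weight from hidden node h to the single output.
    Only absolute values are used, so the direction of an effect is lost.
    Assumes every hidden node has at least one nonzero incoming weight.
    """
    n_inputs = len(input_hidden)
    n_hidden = len(hidden_output)
    totals = [0.0] * n_inputs
    for h in range(n_hidden):
        # Contribution of each input to the output through hidden node h.
        contrib = [abs(input_hidden[i][h]) * abs(hidden_output[h]) for i in range(n_inputs)]
        node_sum = sum(contrib)
        for i in range(n_inputs):
            # Share of hidden node h's signal attributable to input i.
            totals[i] += contrib[i] / node_sum
    grand_total = sum(totals)
    return [t / grand_total for t in totals]

# Invented weights: three inputs, two hidden nodes.
input_hidden = [[2.0, -1.0], [0.5, 0.5], [-0.5, 1.5]]
hidden_output = [1.0, -2.0]
importance = garson_importance(input_hidden, hidden_output)
print(importance)
```

The returned values sum to 1, so each can be read as a share of the total importance.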
The garson function returns a ggplot object 14 , and the default aesthetics can be further modified with the following code. Figure 2 shows how the plot aesthetics can be changed from the default output of the garson function. Furthermore, the neural network model can also be visualized with the plotnet function. Figure 3 is a diagram of the neural network architecture. The black lines indicate positive weights and grey lines indicate negative weights.
Line thickness is in proportion to the relative magnitude of each weight. The first layer receives the input variables I1 through I8 and each is connected to all nodes in the hidden layer H1 through H5. The output layer O1 is connected to all hidden layer nodes. Bias nodes provide a function that is similar to the intercept term in a linear model and are shown as connections to the hidden and output layers in the plot.
The relationship between an outcome and a predictor might differ given the context of the other predictors i. In essence, the method generates a partial derivative of the response with respect to each descriptor and can provide insight into these complex relationships described by the model. By holding other predictors at their minima, at the 20th, 40th, 60th, and 80th quantiles, and at their maxima (6 groups in the figure), the relationship between the outcome probability and the predictor of interest varies widely for the variable wbc.
If there are categorical (discrete) variables in a given dataset, the following code can be used to look at the model response across the range of values for one explanatory variable at a time. The final plot was created using facets for selected levels of the discrete explanatory variables. You can specify which continuous explanatory variable to evaluate with the varex object and you can change the quantile at which the other explanatory variables are held constant using the quant object.
Continuous variables are held at their median, generating only 1 level. Partial dependence of an outcome variable on a predictor of interest can be calculated as follows 18 :. The plot for a single predictor can be created with the pdp package version 0. The above code loads and attaches the pdp and viridisLite packages to the R workspace. The viridisLite package version 0. The package is also designed to aid perception by readers with the most common form of color blindness.
The output in Figure 6 shows the relationship between age and yhat. The response variable is shown in logit scale.
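The idea behind a partial dependence curve can be sketched independently of any package. The sketch below is not the pdp package; `toy_model` is a hypothetical stand-in for a fitted network, and the rows, coefficients, and grid are invented. For each grid value of the predictor of interest, that value is substituted into every row of the data and the model predictions are averaged.

```python
import math

def partial_dependence(predict, rows, feature, grid):
    """Average prediction over the data with one feature forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)       # copy so the original data is untouched
            modified[feature] = value  # override only the feature of interest
            total += predict(modified)
        curve.append(total / len(rows))
    return curve

def toy_model(row):
    """Hypothetical stand-in for a fitted network: returns a probability."""
    logit = -6.0 + 0.05 * row["age"] + 0.8 * row["lac"]
    return 1.0 / (1.0 + math.exp(-logit))

# Invented observations with two of the predictors named in the text.
rows = [
    {"age": 45, "lac": 1.2},
    {"age": 60, "lac": 3.5},
    {"age": 72, "lac": 2.1},
]
age_grid = [30, 50, 70, 90]
pd_curve = partial_dependence(toy_model, rows, "age", age_grid)
print(list(zip(age_grid, pd_curve)))
```

This sketch averages on the probability scale; averaging on the logit scale instead would only require changing what `predict` returns.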
The plotPartial function is used to display a more detailed partial plot. It operates on objects returned by the partial function and provides many options to modify the plot. The pdp package can also be used to plot the response variable and two predictors as a 2-D or 3-D plot. Fortunately, the pdp package makes this work easy. To investigate the simultaneous effect of two predictors on the predicted outcome, the pred.