Weather Forecasting ... On-Line

Verification Measures


Introductory Comments

Over the years many verification measures have been devised. Some are specific to one type of forecast while others can be applied to multiple forecast elements. The purpose of this web page is to describe several verification indices or scores that are commonly used in meteorology. You are referred to the two references listed at the end of this web page for details on these measures.

2x2 Contingency Table

Let's consider a forecast event that either occurs or does not occur. This event is categorical, non-probabilistic, and discrete. Examples of this type of forecast include rain versus no rain, or a severe weather warning. This type of forecast can be represented by a 2x2 contingency table.

                          Observed
                        Yes     No
    Forecast    Yes      a       b        a+b
                No       c       d        c+d
                        a+c     b+d      n = a+b+c+d

This table looks at four possible outcomes:

  • an event is forecast and the event occurs (a)
  • an event is forecast and the event does not occur (b)
  • an event is not forecast and the event occurs (c)
  • an event is not forecast and the event does not occur (d)

Several measures can be derived from this table of data.

    Percent Correct (PC)

    The percent correct is the percent of forecasts that are correct. Specifically,

    PC = (a+d)/n

    PC ranges from zero (0) for no correct forecasts to one (1) when all forecasts are correct.

    It is not useful for low frequency events such as severe weather warnings. In these cases there is a high frequency of "not forecast/not occurred" (d) events. This gives high PC values that are misleading with regard to the forecasting of the low frequency event. This shortcoming is compensated for by the next three scores.

    Hit Rate (H)

    The Hit Rate is the fraction of observed events that is forecast correctly. It is calculated as follows:

    H = a/(a+c)

    It is also known as the Probability of Detection (POD). It ranges from zero (0) at the poor end to one (1) at the good end.

    False Alarm Ratio (FAR)

    The False Alarm Ratio is the fraction of "yes" forecasts that were wrong, i.e., were false alarms. It is calculated as follows:

    FAR = b/(a+b)

    It ranges from zero (0) at the good end to one (1) at the poor end.

    Threat Score (TS)

    The Threat Score (TS) or Critical Success Index (CSI) combines Hit Rate and False Alarm Ratio into one score for low frequency events. It is calculated as follows:

    TS = CSI = a/(a+b+c)

    This score ranges from zero (0) at the poor end to one (1) at the good end. It does not consider "not forecast/not occurred" (d) events.

    CSI, POD and FAR are used extensively by the National Weather Service to verify severe thunderstorm and tornado warnings.

    Bias (B)

    Bias compares the number of times an event was forecast to the number of times an event was observed. Specifically,

    B = (a+b)/(a+c)

  • if B=1 (unbiased), the event was forecast the same number of times that it was observed
  • if B>1 (overforecast), the event was forecast more than it was observed
  • if B<1 (underforecast), the event was forecast less than it was observed
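
    The sketch below collects the five measures above into a single Python function. The function name and dictionary keys are illustrative, not from any particular verification package.

        def contingency_scores(a, b, c, d):
            """Verification measures from a 2x2 contingency table.

            a: forecast yes, observed yes      b: forecast yes, observed no
            c: forecast no,  observed yes      d: forecast no,  observed no
            """
            n = a + b + c + d
            return {
                "PC":  (a + d) / n,        # Percent Correct
                "POD": a / (a + c),        # Hit Rate / Probability of Detection
                "FAR": b / (a + b),        # False Alarm Ratio
                "TS":  a / (a + b + c),    # Threat Score / Critical Success Index
                "B":   (a + b) / (a + c),  # Bias
            }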

    Finley Tornado Forecasts

    John Finley was a sergeant in the U.S. Army Signal Service in the 1880s. He made 2,803 tornado forecasts for 18 regions east of the Rocky Mountains. His results can be examined using the 2x2 contingency table.

                                 Tornadoes Observed
                                  Yes         No
      Tornadoes Forecast   Yes     28         72        100
                           No      23      2,680      2,703
                                   51      2,752      2,803

    These data produce the following statistics:

  • PC = 0.966
  • H = POD = 0.549
  • FAR = 0.720
  • TS = CSI = 0.228
  • B = 1.96

    You can see why the PC is not a good measure for tornado forecasting. These statistics say that the forecasts were correct 96.6 percent of the time. However, 95.6 percentage points of that came from the "not forecast/not occurred" (d) cases. A POD of 54.9 percent is admirable considering the state of meteorology in the 1880s, but a FAR of 72.0 percent is rather high. The bias implies a tendency to overforecast the occurrence of tornadoes.
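
    As a check, feeding the Finley counts into the contingency_scores sketch above reproduces these numbers:

        finley = contingency_scores(a=28, b=72, c=23, d=2680)
        # finley (rounded): PC 0.966, POD 0.549, FAR 0.720, TS 0.228, B 1.961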

    An interesting slant on these statistics occurs when the Finley data are modified to indicate that no tornadoes are forecast.

                                 Tornadoes Observed
                                  Yes         No
      Tornadoes Forecast   Yes      0          0          0
                           No      51      2,752      2,803
                                   51      2,752      2,803

    For this case these data produce the following revised statistics:

  • PC = 0.982
  • H = POD = 0
  • FAR = 0/0 (undefined; no "yes" forecasts were made)
  • TS = CSI = 0
  • B = 0

    In this case, where tornadoes were never forecast, the PC went up to 98.2 percent even though the forecasts have no ability to detect a tornado (POD = 0). This again shows why PC is a poor measure for low frequency events.

    Skill Scores

    Skill Score (SS) measures forecast accuracy relative to a control or reference forecast. It essentially answers the question:

    Is my forecast better or worse than the control or reference forecast?

    Common choices for the control or reference forecast include:

  • Climatological Average Value:
    daily climatological values of weather parameters are available to serve as a reference value; in middle latitudes forecasts should be able to do better than climatology
  • Persistence Forecast:
    in situations where there is typically little change over the forecast period, persistence can be a useful measure of forecast skill
  • Random Forecasts:
    randomly generated forecasts can be a useful control
  • Model Output Statistics (MOS):
    computer generated statistical forecasts are common today; if you cannot improve on these forecasts, perhaps you are not needed
  • An Older Forecast Method:
    if you change forecast methods or start using a new forecast model, you can use the old method or model as the control to see if the newer method or model is better

    Skill Score is basically the percentage improvement over the reference forecast. It is expressed as follows:

    SS = [ ( A - Aref ) / ( Aperf - Aref ) ] x 100%

    where:

  • A ... measure of accuracy
  • Aref ... measure of accuracy for the reference forecast
  • Aperf ... measure of accuracy for a perfect forecast

    If A = Aperf, SS = 100%.

    If A = Aref, SS = 0 (no skill).

    Please note that SS can be either positive or negative.
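
    As a minimal sketch, the general formula can be written as a small helper (the argument names are illustrative):

        def skill_score(A, A_ref, A_perf):
            """Percentage improvement of accuracy A over the reference forecast."""
            return 100.0 * (A - A_ref) / (A_perf - A_ref)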

    Two skill scores can be applied to the 2x2 contingency table. These are the Heidke Skill Score and the Gilbert Skill Score.

    Heidke Skill Score

    For the Heidke Skill Score (HSS), the reference measure is the proportion correct that would be expected by random forecasts that are statistically independent of the observations.

    From the 2x2 contingency table, the marginal probability of a yes forecast is (a+b)/n.

    From the 2x2 contingency table, the marginal probability of a yes observation is (a+c)/n.

    Thus, the probability of a correct yes forecast by chance is:

    Pyes = [(a+b)/n][(a+c)/n] = (a+b)(a+c)/n²

    The probability of a correct no forecast by chance is:

    Pno = [(b+d)/n][(c+d)/n] = (b+d)(c+d)/n²

    Let:

  • Aref = Pyes + Pno
  • A = (a+d)/n
  • Aperf = 1

    Substituting these values into the general skill score formula gives the following expression for HSS:

    HSS = 2(ad-bc)/[(a+c)(c+d) + (a+b)(b+d)]

    HSS is independent of n. HSS = 1 for a perfect forecast; HSS = 0 shows no skill. If HSS < 0, the forecast is worse than the reference forecast.
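
    A sketch of the calculation, using the simplified formula above:

        def heidke_skill_score(a, b, c, d):
            """Heidke Skill Score from a 2x2 contingency table."""
            return 2.0 * (a * d - b * c) / ((a + c) * (c + d) + (a + b) * (b + d))

        # For the Finley counts above this gives about 0.355.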

    Gilbert Skill Score

    For the Gilbert Skill Score (GSS), the reference measure is the threat score (TS or CSI) for random forecasts using the following:

  • Tref = aref / (a+b+c)
  • aref = (a+b)(a+c)/n (the number of hits expected by chance: the chance probability from the HSS section multiplied by n)
  • Tperf = 1
  • TS = a / (a+b+c)

    This gives the following expression for GSS:

    GSS = ( a - aref )/( a - aref + b + c )

    In this formula, aref depends upon n.

    GSS is also known as the Equitable Threat Score (ETS).
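
    A corresponding sketch for GSS/ETS:

        def gilbert_skill_score(a, b, c, d):
            """Gilbert Skill Score (Equitable Threat Score) from a 2x2 contingency table."""
            n = a + b + c + d
            a_ref = (a + b) * (a + c) / n          # hits expected by chance
            return (a - a_ref) / (a - a_ref + b + c)

        # For the Finley counts above this gives about 0.216.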

    Mean Absolute Error

    Mean Absolute Error (MAE) is a scalar accuracy measure that is calculated as follows:

    MAE = ( SUM | yk - ok | ) / n

    where:

  • yk = kth forecast value
  • ok = kth observation value
  • n = number of forecast-observation pairs

    Each forecast-observation pair gives an error value. This measure sums the absolute values of these errors and divides by the number of forecasts to give an average error. MAE = 0 for a perfect forecast.

    MAE is commonly used for verifying maximum and minimum temperature forecasts.
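
    A minimal sketch, assuming equal-length sequences of forecasts and verifying observations:

        def mean_absolute_error(forecasts, observations):
            """Average of the absolute forecast-observation errors."""
            errors = [abs(y - o) for y, o in zip(forecasts, observations)]
            return sum(errors) / len(errors)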

    Mean Square Error

    Mean Square Error (MSE) is a scalar accuracy measure that is calculated as follows:

    MSE = [ SUM ( yk - ok )² ] / n

    It is similar to MAE in that:

  • yk = kth forecast value
  • ok = kth observation value
  • n = number of forecast-observation pairs

    In this case the forecast-observation errors are squared before they are averaged. MSE is more sensitive to large errors (outliers) than MAE because squaring makes large errors contribute more to the average than a linear difference would. MSE = 0 for a perfect forecast.

    The square root of MSE is the Root Mean Square Error (RMSE).
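
    The corresponding sketches for MSE and RMSE:

        import math

        def mean_square_error(forecasts, observations):
            """Average of the squared forecast-observation errors."""
            errors = [(y - o) ** 2 for y, o in zip(forecasts, observations)]
            return sum(errors) / len(errors)

        def root_mean_square_error(forecasts, observations):
            """Square root of MSE, in the same units as the forecast variable."""
            return math.sqrt(mean_square_error(forecasts, observations))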

    Mean Error

    Mean Error (ME) is a scalar accuracy measure that is calculated as follows:

    ME = [ SUM ( yk - ok ) ] / n

    It is similar to MAE and MSE in that:

  • yk = kth forecast value
  • ok = kth observation value
  • n = number of forecast-observation pairs

    ME allows both positive and negative errors to be used in the average. As a result, ME is also known as bias.

  • if ME = 0, there is no bias
  • if ME > 0, forecasts, on average, are too high
  • if ME < 0, forecasts, on average, are too low
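
    A sketch of the calculation:

        def mean_error(forecasts, observations):
            """Average signed forecast-observation error (additive bias)."""
            errors = [y - o for y, o in zip(forecasts, observations)]
            return sum(errors) / len(errors)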

    Brier Score

    Brier Score (BS) is an accuracy measure for probabilistic forecasts of dichotomous events. A dichotomous event is one that either occurs or does not occur. For example, rain either occurs or does not occur. Brier Score is calculated as follows:

    BS = [ SUM ( yk - ok )² ] / n

    where:

  • yk = kth probability forecast value
  • ok = 1 if the event occurs
  • ok = 0 if the event does not occur
  • n = number of forecast-observation pairs

    This is essentially the formula for MSE. BS ranges from zero (0) to one (1), with BS = 0 being a perfect forecast.
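
    A minimal sketch, where the forecasts are probabilities between 0 and 1 and the outcomes are coded 1 (occurred) or 0 (did not occur):

        def brier_score(prob_forecasts, outcomes):
            """Mean squared error of probability forecasts against 0/1 outcomes."""
            errors = [(p - o) ** 2 for p, o in zip(prob_forecasts, outcomes)]
            return sum(errors) / len(errors)

        # e.g., brier_score([0.7, 0.2, 0.9], [1, 0, 0])
        #     = (0.09 + 0.04 + 0.81) / 3 = 0.313...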

    Brier Score can be converted to a skill score by assuming the following:

  • A = BS
  • Aref = BSref, the Brier Score of the reference forecast (climatology or MOS)
  • Aperf = 0, the Brier Score of a perfect forecast

    Substituting into the general skill score formula (written as a fraction rather than a percentage) gives (BS - BSref)/(0 - BSref), so the formula for Brier Skill Score (BSS) is:

    BSS = 1 - (BS/BSref)

    If BS > BSref, then BSS < 0 or your forecast is worse than the reference forecast.

    If BS < BSref, then BSS > 0 or your forecast is better than the reference forecast.

    If you use MOS as the reference forecast, and your BSS is negative, you can be replaced by MOS.
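
    In code the skill score is a one-liner, assuming the reference Brier Score has already been computed:

        def brier_skill_score(bs, bs_ref):
            """Skill of a Brier Score relative to a reference forecast (climatology, MOS, ...)."""
            return 1.0 - bs / bs_ref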

    Reliability Diagram

    Another approach to evaluating probability forecasts is the Reliability Diagram. This diagram plots probability along the x-axis and the verifying frequency of occurrence of each probability value along the y-axis.

    For example, if for all of your 40 percent probability of precipitation (POP) forecasts you found that it rained on 35 percent of these forecasts, you would plot 0.35 in the y-axis direction for 40 percent POP on the x-axis. Ideally, you would like to see a 40 percent frequency of occurrence for your 40 percent POP forecasts.
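
    One way to assemble the points for such a diagram is to group forecasts by their issued probability (e.g., POP in 10 percent steps) and compute the observed frequency in each group. A sketch, assuming 0/1 outcomes as in the Brier Score section:

        from collections import defaultdict

        def reliability_points(prob_forecasts, outcomes):
            """(forecast probability, observed frequency) pairs for a reliability diagram."""
            counts = defaultdict(int)   # number of forecasts issued at each probability value
            hits = defaultdict(int)     # number of those forecasts that verified
            for p, o in zip(prob_forecasts, outcomes):
                counts[p] += 1
                hits[p] += o
            return sorted((p, hits[p] / counts[p]) for p in counts)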

    Evaluating Gridded Forecasts

    Most of the verification measures discussed up to now were applied to point forecasts. However, you can also verify model grid forecasts using some of these measures. Described below are several verification measures that have been used for any type of gridded forecast data.

    Mean Squared Error

    For any grid of forecast values, you can apply the MSE and RMSE formulae from above. The forecast-observation pairs are corresponding forecast and analyzed grid values. Thus for any forecast grid you can calculate an MSE/RMSE value for that grid. These numbers are a broad measure of accuracy in terms of an average error across the grid.

    If you are interested in more detail about grid errors you will likely look at other measures. Something as simple as a plot of the forecast value minus the analyzed value at each grid point provides such detail.
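
    A sketch of the grid-wide calculation, assuming the forecast and the verifying analysis are two-dimensional arrays (lists of rows) on the same grid:

        import math

        def grid_rmse(forecast_grid, analysis_grid):
            """RMSE over corresponding points of a forecast grid and an analysis grid."""
            errors = [y - o
                      for frow, arow in zip(forecast_grid, analysis_grid)
                      for y, o in zip(frow, arow)]
            return math.sqrt(sum(e * e for e in errors) / len(errors))

        # A difference field for plotting is just forecast minus analysis at each point:
        # diff = [[y - o for y, o in zip(frow, arow)]
        #         for frow, arow in zip(forecast_grid, analysis_grid)]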

    Anomaly Correlation

    Anomaly Correlation is a more complex statistical approach to grid verification. You start by determining the anomaly at each grid point using the following method:

  • For each grid point, subtract the climatological average value (cm) of the observed field from both the forecast value (ym) and the observed (analysis) value (om).
  • For the forecast grid point: y'm = ym - cm
  • For the observed grid point: o'm = om - cm
  • The primed values are the anomalies.
  • Run a standard statistical correlation on the anomaly pairs to measure how well the forecast field matches the observed field (a sketch follows below).
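
    A sketch of the calculation for flattened grids (lists of grid-point values), with the "standard statistical correlation" implemented here as a Pearson correlation of the anomaly pairs; some centers use an uncentered version that skips the mean removal.

        import math

        def anomaly_correlation(forecast, observed, climatology):
            """Correlation between forecast and observed anomalies over a grid."""
            y_anom = [y - c for y, c in zip(forecast, climatology)]
            o_anom = [o - c for o, c in zip(observed, climatology)]
            y_bar = sum(y_anom) / len(y_anom)
            o_bar = sum(o_anom) / len(o_anom)
            cov   = sum((y - y_bar) * (o - o_bar) for y, o in zip(y_anom, o_anom))
            var_y = sum((y - y_bar) ** 2 for y in y_anom)
            var_o = sum((o - o_bar) ** 2 for o in o_anom)
            return cov / math.sqrt(var_y * var_o)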

    Probability Ellipse

    Another question that may arise is: How good are the patterns generated by the forecast models? For example, how close are the surface low pressure center forecasts to the observed surface low pressure center location?

    One approach is to plot the error in location as a function of error along the track and error perpendicular to the track. This type of plot gives you a sense of whether the surface low forecast is fast or slow, or to the left or to the right of the track.

    Using these data you can develop a set of probability ellipses that indicates the chance of a surface low being within a specific distance of its forecast position.
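
    A sketch of the decomposition for a single forecast-observed pair of low centers, assuming map-projected (x, y) coordinates and a forecast track direction measured counterclockwise from the +x axis; the names and conventions are illustrative only.

        import math

        def track_error_components(obs_xy, fcst_xy, track_dir_deg):
            """Split the position error (observed minus forecast) into along-track
            and cross-track components."""
            dx = obs_xy[0] - fcst_xy[0]
            dy = obs_xy[1] - fcst_xy[1]
            ux = math.cos(math.radians(track_dir_deg))   # unit vector along the track
            uy = math.sin(math.radians(track_dir_deg))
            along = dx * ux + dy * uy     # > 0: observed low is ahead of the forecast (forecast slow)
            cross = -dx * uy + dy * ux    # > 0: observed low is to the left of the forecast track
            return along, cross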

    Concluding Remarks

    The purpose of this web page was to describe several verification methods that are commonly used in meteorology. Many more measures than those described here are available, and in some cases variations on these measures are designed to fit what is being forecast.

    If you are interested in more details on what has been described here, you are referred to the two texts listed in the references below. Remember, however, verification can become very heavy from a statistical perspective. Be prepared.



    References

  • Jolliffe, I.T., and D.B. Stephenson, 2003: Forecast Verification: A Practitioner's Guide in Atmospheric Science. Wiley, Hoboken, NJ, 240 pp.
  • Wilks, D.S., 2006: Statistical Methods in the Atmospheric Sciences, 2nd ed. Academic Press/Elsevier, New York, 627 pp.
