Journal of Building Performance Simulation, 2017
ISSN: 1940-1493 (Print), 1940-1507 (Online)
https://doi.org/10.1080/19401493.2017.1387607
Published online: 20 Oct 2017.

Performance testing of energy models: are we using the right statistical metrics?

Debaditya Chakraborty* and Hazem Elzarka
College of Engineering and Applied Science, University of Cincinnati, P.O. Box 210071, Cincinnati, OH 45221-0071, USA

(Received 3 April 2017; accepted 29 September 2017)

Testing the predictive performance of energy models (EMs) is necessary to evaluate their accuracy. This paper investigates the adequacy of the existing statistical metrics that are often used by professionals and researchers to test EMs. It discerns that the coefficient of variance of root mean squared error (CVRMSE) and mean bias error (MBE), which are prescribed in ASHRAE Guideline 14, are not suitable for system-level energy model testing. It points out the limitations of CVRMSE, MBE, and also root mean squared error (RMSE). The analysis shows that the normalizing term of a statistical metric influences its accuracy in determining the predictive performance of EMs.
An alternative metric (range normalized root mean squared error, RN_RMSE) is proposed that normalizes the RMSE by the range of the data, as a replacement for CVRMSE. It is shown that RN_RMSE, when used in tandem with R2, can provide a more meaningful and accurate representation of the performance of system-level EMs.

Abbreviations: EMs: energy models; IDFs: input data files; R2: coefficient of determination; RMSE: root mean squared error; CVRMSE: coefficient of variance of root mean squared error; MBE: mean bias error; RN_RMSE: range normalized root mean squared error

Keywords: energy modelling; statistical metrics; performance testing of energy models; normalized root mean squared error

*Corresponding author. Email: email@example.com
© 2017 International Building Performance Simulation Association (IBPSA)

1. Introduction

Whole-building energy models (EMs) have been used in the past for building design and retrofit, measurement and verification, code compliance, green certification, qualification for tax credits, and utility incentives.1 Recently, the trend is shifting towards more advanced system-level energy modelling techniques, algorithms, and tools for model-based control, fault detection, optimization of building operation, predictive analysis, etc. (e.g. Al-Homoud 2001; Clarke et al. 2002; Lee, House, and Kyong 2004; Wetter 2011; Zhao and Magoulès 2012). The existing statistical metrics for testing the predictive performance of EMs were proposed a few decades ago for whole-building measurement and verification (M&V) of energy conservation measures (Reddy and Claridge 2000; Reddy 2006; ASHRAE 2014a). For example, ASHRAE still prescribes the use of the coefficient of variance of root mean squared error (CVRMSE) and mean bias error (MBE) for the calibration of whole-building EMs. This paper evaluates whether the widely used statistical metrics, including CVRMSE and MBE, are suitable for testing the predictive performance of system-level EMs, such as models of cooling and heating energy consumption. We focus on system-level energy modelling because it is often considered for in-depth analysis of energy uses and for identifying and eliminating energy wastage.

'Testing the predictive performance', sometimes referred to as 'validation of models' in the existing literature, is an important step in energy modelling. The history and development of testing techniques for energy modelling software are well established in Judkoff et al. (1983), Bloomfield (1999), ASHRAE (2013, Chapter 19), and ASHRAE (2014b, Annex 23), and include the following elements:

• Analytical tests – in which predicted results from a program or algorithm are compared to results from a known analytical solution or a generally accepted mathematical method.
• Comparative tests – in which a program is compared to itself or to other programs. Bloomfield (1999) showed that this type of comparison can be a very powerful way of identifying errors. This approach is adopted in this research paper.
• Empirical tests – in which calculated results from a program or algorithm are compared to monitored data from a real building, test cell, or laboratory experiment. Such comparisons are often used to calibrate EMs for M&V and to cross-validate data-driven EMs.

Every type of test mentioned above requires statistical metrics to report the results. For example, in comparative testing, the statistical metrics are used to represent the gap between the modelled results of one program and those of another. Different types of statistical metrics have been used by researchers for testing the predictive performance of EMs, including CVRMSE and MBE, which are prescribed in ASHRAE Guideline 14 to test calibrated EMs. Garrett and New (2016) have questioned the suitability of CVRMSE and MBE.
They suggested that further work is required to evaluate the adequacy of such metrics. In this regard, the objectives of our research paper are set as follows:

• To analyse the adequacy of prescribed and commonly used statistical performance metrics for validating system-level EMs.
• To identify the limitations associated with each one of these statistical performance metrics.
• To suggest viable alternative statistical performance metrics to overcome the identified limitations of the widely used metrics.

In this research, we have used prototype IDFs (input data files) (available online2) developed by the U.S. Department of Energy (DOE) with EnergyPlus to generate a synthetic database for comparative testing. These prototype IDFs are used to generate various types of energy consumption data for benchmark buildings as per ASHRAE standards. The IDFs provide a consistent basis for research, energy code development, appliance standards, and measurement of progress toward the DOE energy goals (Torcellini et al. 2008). Using these IDFs, a controlled test set-up is created to generate different sets of energy-related data by varying the time-step granularity: a high-frequency time step (1 min) provides the synthetic baseline dataset, and lower frequency time steps (15 min, 30 min, and 1 h) provide the modelled datasets.

The rest of the paper is organized as follows. Section 2 explains the existing statistical metrics that are commonly used by researchers to test the predictive performance of EMs. Section 3 introduces an alternative statistical metric and explains its theoretical advantages. Section 4 outlines the methodology adopted in this paper to evaluate the adequacy of various statistical metrics. The results are provided and discussed in Section 5. Section 6 concludes the paper by reporting the major findings from this research, and Section 7 highlights the future scope of work.
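The controlled set-up described above (a 1-min synthetic baseline compared against coarser time steps at a common hourly resolution) can be sketched in code. A minimal illustration with NumPy, where a randomly generated per-minute series stands in for an exported EnergyPlus run (the data are hypothetical, not from the paper's simulations):

```python
import numpy as np

# Hypothetical stand-in for a 1-min EnergyPlus run: one day of per-minute
# energy values (kWh). The real study exports such series from EnergyPlus.
rng = np.random.default_rng(0)
baseline_1min = rng.uniform(0.5, 2.0, 24 * 60)

# Energy is reported per time step, so the hourly baseline series is the
# sum over each block of 60 one-minute values.
baseline_hourly = baseline_1min.reshape(24, 60).sum(axis=1)

# A coarser run (e.g. a 15-min time step) would be aggregated the same way
# (sum of 4 quarter-hour values per hour) before the hourly series are
# compared pairwise with statistical metrics.
print(baseline_hourly.shape)  # (24,)
```

The same aggregation applies to each of the lower frequency runs, so that all comparisons are made on hourly series of equal length.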
Table 1. Summary of statistical performance metrics used by researchers in the last two decades.

R2: Dhar, Reddy, and Claridge (1999); Aydinalp-Koksal, Ugursal, and Fung (2002); Ben-Nakhi and Mahmoud (2004); Aydinalp-Koksal and Ugursal (2008); Lam et al. (2010); Jacob et al. (2010); Zhang et al. (2015); Deb et al. (2016)

RMSE: Aydinalp-Koksal, Ugursal, and Fung (2002); Zhang et al. (2015)

CVRMSE: Dhar, Reddy, and Claridge (1999); Aydinalp-Koksal, Ugursal, and Fung (2002); Dong, Cao, and Lee (2005); Pan, Huang, and Wu (2007); Aydinalp-Koksal and Ugursal (2008); Lam et al. (2010); Kwok, Yuen, and Lee (2011); Ke, Yeh, and Jian (2013); Kandil and Love (2014); Zhang et al. (2015)

MBE: Pan, Huang, and Wu (2007); Lam et al. (2010); Ke, Yeh, and Jian (2013); Kandil and Love (2014); Gestwick and Love (2014); Zhang et al. (2015)

Note: CVRMSE and MBE are the statistical metrics recommended in ASHRAE Guideline 14 for energy model calibration. Their respective tolerance values for calibrated simulated EMs should be within 15% and ±5% when monthly data are used, and within 30% and ±10% when hourly data are used. R2 denotes the coefficient of determination. RMSE denotes root mean squared error.

2. Statistical metrics for energy model performance testing

The predictive performance of EMs is quantified by summarizing the pairwise distance between the baseline values and the modelled values using statistical metrics; this is also referred to as 'model accuracy', 'error', or 'goodness-of-fit' (e.g. Granderson and Price 2014). The statistical metrics commonly used by researchers to test the predictive performance of EMs, along with their occurrences in research papers, are summarized in Table 1. Based on the literature review, the authors found that the statistical metrics used for testing the predictive performance of EMs varied from article to article. As a result, future researchers may have difficulty in replicating or comparing their results to previous work.
In this research, the various statistical metrics are described and compared quantitatively to examine their robustness and flexibility. The following are descriptions of the statistical metrics given in Table 1. (For all equations described below, n is the total number of data points in the dataset; k = 1, 2, ..., n; Y_k denotes the baseline values, \hat{Y}_k denotes the modelled values, and \mu is the mean of the baseline dataset Y.)

• Coefficient of determination (R^2) indicates how well the regression estimate fits the data. The formula is given by

R^2 = 1 - \frac{\sum_{k=1}^{n} (Y_k - \hat{Y}_k)^2}{\sum_{k=1}^{n} (Y_k - \mu)^2}.  (1)

A value of 1 suggests that the model is perfect, whereas 0 indicates that there is no correlation between the modelled and baseline values. In other words, a value closer to 1 is desirable. Sometimes R^2 can take negative values, which indicates that the variance of the error is more than the variance of the baseline data. In such cases, the mean of the baseline data is a better predictor than the model, and such models should be treated accordingly before implementation. More details about this are provided in Section 5.4. An R^2 value of 0.9 may be interpreted as: 'Ninety percent of the variance in the baseline values can be explained by the modelled values.'

• Root mean squared error (RMSE) represents the sample standard deviation of the differences between modelled and baseline values, which is a measure of accuracy in the modelled values. Mathematically, it is represented as follows:

RMSE = \sqrt{\frac{\sum_{k=1}^{n} (Y_k - \hat{Y}_k)^2}{n}}.  (2)

This metric is sensitive to the scale of the data and may range between 0 and \infty, where a lower value is desirable.

• Coefficient of variance of root mean squared error (CVRMSE) is derived by normalizing the RMSE with the mean of the data and has the advantage of providing a unit-less percentage value representing the accuracy of EMs.
It is mathematically defined as follows:

CVRMSE(\%) = \frac{\sqrt{\sum_{k=1}^{n} (Y_k - \hat{Y}_k)^2 / n}}{\mu} \times 100 = \frac{RMSE}{\mu} \times 100.  (3)

This metric was proposed to eliminate the dependency of RMSE on the scale of the data. As per ASHRAE (2014a), the CVRMSE value for calibrated EMs developed using hourly data must be less than 30%. Lower values of this metric are desirable.

• MBE indicates how well the modelled values match the baseline values. Mathematically, it is represented as follows:

MBE(\%) = \frac{\sum_{k=1}^{n} (Y_k - \hat{Y}_k)}{\sum_{k=1}^{n} Y_k} \times 100.  (4)

Positive values indicate that the model underpredicts the baseline values; negative values indicate that the model overpredicts the baseline values. As per ASHRAE (2014a), the value of MBE for calibrated EMs developed using hourly data must be within ±10%. An MBE of 0 suggests that there is no bias in the model. A major disadvantage associated with this metric is that it depicts the percentage of the total difference between baseline and modelled values with respect to the total baseline value over the entire simulated period. As a result, this metric suffers from the cancellation of positive and negative errors, which can be misleading when interpreting the true performance of EMs.

3. Proposed alternative statistical metric – range normalized root mean squared error (RN_RMSE)

Normalized forms of statistical metrics are useful for comparing models developed on different datasets having dissimilar properties. As the name suggests, RN_RMSE is a normalized form of RMSE in which the range of the dataset is used for normalization. RN_RMSE is mathematically defined as follows:

RN\_RMSE(\%) = \frac{\sqrt{\sum_{k=1}^{n} (Y_k - \hat{Y}_k)^2 / n}}{\max(Y) - \min(Y)} \times 100 = \frac{RMSE}{range(Y)} \times 100.  (5)

This metric evaluates the accuracy of models, and a lower value is desirable. It is hypothesized that RN_RMSE can provide more reliable estimates of the predictive performance of EMs than CVRMSE. The reasons behind this hypothesis are discussed below in Section 3.1.

3.1. Advantage of RN_RMSE over CVRMSE from a theoretical perspective

In the analysis of data, it is often desirable to convert the original data to a norm or common standard (Dodge 2003). Converting the original data to z-scores can set a standard scale for different datasets (Salkind 2007). The z-score (Z) transformation is a technique in which the mean (\mu) and standard deviation (\sigma) of the original data (Y) are transformed to zero and one, respectively, using

Z = \frac{Y - \mu}{\sigma}.  (6)

Transforming the baseline and predicted energy data by the z-score technique was explored in this work. However, the drawback of such a process is that the importance of the mean (\mu) and standard deviation (\sigma) is lost. Also, a metric based on the z-scores of the baseline and predicted energy data will not be able to determine the bias in the model, because the transformation always brings the basis back to zero. For example, consider a case where all the predicted values (\hat{Y}) are biased from the baseline values (Y) by a constant large number (H), resulting in

\hat{Z} = \frac{\hat{Y} - \hat{\mu}}{\hat{\sigma}} = \frac{(Y + H) - (\mu + H)}{\sigma} = \frac{Y - \mu}{\sigma} = Z.  (7)

Even though there is a large bias (H) in the predictions, the resulting metric value (based on the respective z-scores) will turn out to be zero, which is undesirable. Readers may note that \hat{\sigma} = \sigma in Equation (7) because a constant additive difference between datasets affects only the measures of central tendency (e.g. mean, median, mode) but not the measures of variability (e.g. standard deviation, range). Although the z-score transformation was not utilized to develop the alternative metric, its basic properties were exploited to evaluate the adequacy of RN_RMSE over CVRMSE. The z-score transformation works on the principle of shifting (by the mean) and scaling (by the standard deviation) of data to allow comparison between datasets.
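The bias-masking behaviour of Equation (7) is easy to confirm numerically. A small sketch with hypothetical data (the values and the bias H = 50 are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.uniform(110.0, 230.0, 1000)   # hypothetical baseline values
y_hat = y + 50.0                      # predictions biased by a constant H = 50

def zscore(x):
    # z-score transformation of Equation (6)
    return (x - x.mean()) / x.std()

# The constant bias shifts the mean but not the standard deviation, so the
# z-scores of the biased predictions equal those of the baseline (Eq. (7)),
# and any metric built purely on z-scores would report zero error.
print(np.allclose(zscore(y), zscore(y_hat)))  # True
```

This is exactly why a metric defined on z-scored data cannot detect a constant bias, even a large one.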
Accordingly, if a statistical metric can account for shifting and scaling differences between datasets, then it can be considered a well-normalized metric that is suitable for comparison. Scaled (referred to as multiplicative) differences are witnessed when datasets have different units of measurement, such as kilowatt-hours (kWh) or megajoules (MJ). Normalization techniques utilized in statistical performance metrics should be able to bring such differences to a notionally common scale. This is analogous to the scaling of data. Shifted (referred to as additive) differences are witnessed when datasets occupy different regions in Euclidean space. Normalization techniques utilized in statistical performance metrics should also be able to account for such differences by normalizing w.r.t. (with respect to) the respective regions of the datasets and not w.r.t. the global origin (0, 0). Otherwise, residuals (Y_k - \hat{Y}_k) resulting from datasets that are far off from the origin (0, 0) will give the illusion of being smaller in comparison to the residuals from datasets that are nearer to the origin. This is analogous to the shifting of data. In the case of CVRMSE, the numerator (RMSE) is adjusted for models based on different data scales to a notionally common scale. The basic idea is that when two datasets differ in a multiplicative way by a constant m, i.e. when every data point is multiplied by m, then RMSE is not useful for comparison. In contrast, CVRMSE can be compared, as the resulting values RMSE/\mu and (m × RMSE)/(m × \mu) are equal, where \mu is the mean of the baseline data. Therefore, for multiplicative differences between datasets, CVRMSE provides both interesting and useful information about comparative model performance. The problem with CVRMSE arises when datasets differ in an additive way by a constant a, i.e. when a is added to every data point in the dataset.
In that case, the respective CVRMSE values RMSE/\mu and RMSE/(a + \mu) are no longer useful for comparison. In such cases, RN_RMSE can provide meaningful and interesting information, as the resulting values RMSE/(maximum(Y) - minimum(Y)) and RMSE/((maximum(Y) + a) - (minimum(Y) + a)) are still equal. Notice that in the case of additive differences between datasets, a does not appear in the numerator, because a uniform shift of all data points by a distance a relative to the origin does not change the resulting RMSE value. In addition, RN_RMSE can also account for multiplicative differences between datasets, as RMSE/(maximum(Y) - minimum(Y)) and (m × RMSE)/(m × (maximum(Y) - minimum(Y))) are equal. The above-mentioned concepts are illustrated using an example as shown in Figure 1. The statistical drawbacks of RMSE, CVRMSE and MBE discussed above are conceptually illustrated in Figure 1 with linear models developed using the least squares method in MATLAB.3 In these plots, hypothetical peak cooling energy consumption datasets are used to model their relationship w.r.t. the outside temperature. The drawbacks are interpreted via two cases, as follows:

Case 1 – Consider the comparison between Figure 1(a) and 1(b), where both datasets have the same variation (\sigma) but different means (\mu). The dataset in Figure 1(a) occupies the rectangular region between (18, 107) and (28, 230), whereas the dataset in Figure 1(b) occupies the rectangular region between (18, 207) and (28, 330). This is a case where an additive difference exists between the datasets, as mentioned above, with Y2 = Y1 + 100. Both models result in exactly the same residual terms, as shown in the plots. The corresponding CVRMSE and MBE values are different, whereas the RMSE, RN_RMSE, and R2 values are the same for these two models.
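Case 1 can be reproduced numerically. The sketch below implements Equations (1)–(5) and applies them to a hypothetical dataset and to the same dataset shifted by a constant (all names and data are illustrative, not taken from the paper's simulations):

```python
import numpy as np

def metrics(y, y_hat):
    """Equations (1)-(5): R2, RMSE, CVRMSE, MBE and RN_RMSE."""
    resid = y - y_hat
    rmse = np.sqrt(np.mean(resid ** 2))
    return {
        "R2": 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2),  # Eq. (1)
        "RMSE": rmse,                                                  # Eq. (2)
        "CVRMSE": rmse / y.mean() * 100.0,                             # Eq. (3)
        "MBE": np.sum(resid) / np.sum(y) * 100.0,                      # Eq. (4)
        "RN_RMSE": rmse / (y.max() - y.min()) * 100.0,                 # Eq. (5)
    }

# Two datasets that differ only by an additive constant, so that the
# residuals are identical (the Y2 = Y1 + 100 situation of Case 1).
rng = np.random.default_rng(1)
y1 = rng.uniform(110.0, 230.0, 500)           # hypothetical cooling energy (kWh)
y1_hat = y1 + rng.normal(0.0, 18.0, 500)      # modelled values with noise
m_a = metrics(y1, y1_hat)                     # dataset A
m_b = metrics(y1 + 100.0, y1_hat + 100.0)     # dataset B = A + 100

# RMSE, RN_RMSE and R2 are identical for A and B, while CVRMSE and MBE
# differ even though the residuals are exactly the same.
```

The shift leaves the residuals, the range, and the variance of the baseline unchanged, so only the mean-normalized and sum-normalized metrics move.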
Therefore, the question is: when both models are developed on very similar datasets, involve the same level of difficulty in model development, are generated using the same algorithm, and result in exactly the same residual terms, why are the respective CVRMSE and MBE values different?

Case 2 – Consider the comparison between Figure 1(a) and 1(c), where the datasets have different variation (\sigma) and mean (\mu). This is a case where a multiplicative difference exists between the datasets, as mentioned above, with Y3 = Y1 × 3.6. Notice that the normalized metrics (CVRMSE, MBE, R2 and RN_RMSE) do not change their respective values in the case of a multiplicative difference, which is desirable.

The cases discussed above imply that RN_RMSE and R2 provide more accurate normalized estimates of the predictive performance of models because they can normalize additive differences between datasets. In general, energy consumption datasets may differ additively, multiplicatively, or in a combination of both. For example, Figure 2 shows that it is not uncommon for additive differences to exist between energy consumption datasets.
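Case 2 can likewise be checked numerically with a unit conversion (1 kWh = 3.6 MJ, the same factor as Y3 = Y1 × 3.6 above); the dataset below is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
y_kwh = rng.uniform(110.0, 230.0, 500)          # hypothetical baseline (kWh)
y_kwh_hat = y_kwh + rng.normal(0.0, 18.0, 500)  # modelled values (kWh)
y_mj, y_mj_hat = 3.6 * y_kwh, 3.6 * y_kwh_hat   # same data expressed in MJ

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))

# Raw RMSE scales with the unit conversion ...
print(np.isclose(rmse(y_mj, y_mj_hat), 3.6 * rmse(y_kwh, y_kwh_hat)))  # True

# ... but the normalized metrics do not: CVRMSE (Eq. (3)) and RN_RMSE
# (Eq. (5)) are identical whether the data are in kWh or MJ.
cv_kwh = rmse(y_kwh, y_kwh_hat) / y_kwh.mean() * 100.0
cv_mj = rmse(y_mj, y_mj_hat) / y_mj.mean() * 100.0
rn_kwh = rmse(y_kwh, y_kwh_hat) / (y_kwh.max() - y_kwh.min()) * 100.0
rn_mj = rmse(y_mj, y_mj_hat) / (y_mj.max() - y_mj.min()) * 100.0
print(np.isclose(cv_kwh, cv_mj), np.isclose(rn_kwh, rn_mj))  # True True
```

The constant m cancels in every ratio, which is why all of the normalized metrics handle the multiplicative case correctly; only the additive case separates CVRMSE from RN_RMSE.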
[Figure 1. Conceptual plots to illustrate the advantage of the proposed metrics from a statistical perspective: (a) linear model on dataset A (Y1, kWh; \sigma = 42.28, \mu = 187.65, range = 120.53; model Y1 = 12.98X - 122.4; MBE = -0.025%, CVRMSE = 10.03%, RMSE = 18.82 kWh, RN_RMSE = 15.61%, R2 = 0.78); (b) linear model on dataset B (Y2, kWh; \sigma = 42.28, \mu = 287.65, range = 120.53; model Y2 = 12.98X - 22.4; MBE = -0.016%, CVRMSE = 6.54%, RMSE = 18.82 kWh, RN_RMSE = 15.61%, R2 = 0.78); (c) linear model on dataset C (Y3, MJ; \sigma = 152.21, \mu = 675.54, range = 433.91; model Y3 = 46.728X - 440.64; MBE = -0.025%, CVRMSE = 10.03%, RMSE = 67.75 MJ, RN_RMSE = 15.61%, R2 = 0.78). Note: \sigma is the standard deviation; \mu is the mean of the data.]

[Figure 2. Plot to illustrate that additive difference exists between energy consumption datasets: total electricity consumed by the facility vs. electricity consumed for cooling over 800 hours. The shift of region in Euclidean space signifies the presence of additive difference between the two datasets. Note: The data shown are synthetically produced in EnergyPlus with a prototype input data file as described in Section 1.]

Notice the shift of region in Euclidean space corresponding to each dataset, which is similar to the situation mentioned in Case 1.
In reference to Figure 2, it is obvious that residuals resulting from any model built on the total electricity consumption dataset will appear to be comparatively smaller when normalized w.r.t. the reference point (0, 0) in Euclidean space. Therefore, statistical metric values may appear comparatively smaller or bigger when the mean or sum of the data is used for normalization. However, if the actual originating points (minimum values) of the respective datasets are included in the denominator term of normalized metrics, then more accurate estimates of model performance can be obtained. RN_RMSE (Equation (5)) directly includes the minimum (originating point) in the denominator term, whereas R2 (Equation (1)) refers to the minimum indirectly in the denominator term. Variability in energy consumption datasets provides a way to describe how much they differ and allows comparison between them. Differences in the variability between datasets influence the difficulty in modelling and the resulting residuals. Since model residuals are influenced by the variability in the data, it is also important to include a measure of data variability in the denominator term of normalized statistical performance metrics. Otherwise, the numerator and denominator may be out of proportion, which can create the illusion of an inflated or deflated metric value. Thus, measures of variability such as the range (in Equation (5)) and the standard deviation (in Equation (1)) are preferred for normalizing the residuals, as compared to measures of central tendency such as the mean (in Equation (3)) or the sum (in Equation (4)) of the baseline datasets. The theoretical points discussed in this section suggest that RN_RMSE and R2 can not only provide more accurate estimates of model performance but also allow fair comparison between models developed on different datasets. Therefore, this paper suggests that RN_RMSE should be used as the primary metric for validation of EMs instead of CVRMSE.
R2 should be used as a secondary metric for validation of EMs, the reason for which is explained in Section 5.4.

4. Methodology

In Section 3, the advantages of using RN_RMSE and R2 for testing the predictive performance of EMs are explained at a conceptual level. To further affirm these advantages, a comparative testing methodology is employed using synthetic energy consumption data, as described previously in Section 1. The synthetic data are obtained from EnergyPlus simulations with prototype IDFs representing small- and medium-size office buildings, which are elaborated in Section 4.1. The technique adopted in this research to prepare and compare multiple sets of data points for each building under study is explained in Section 4.2. The visualization technique that is used to identify inadequacies of existing statistical metrics is explained in Section 4.3.

4.1. EnergyPlus prototype IDFs

EnergyPlus is a simulation engine for energy modelling in buildings, which is supported by the United States Department of Energy and promoted through the Building and Technology Program of the Energy Efficiency and Renewable Energy Office. EnergyPlus is well known as an efficient tool in the building energy analysis community that combines the best capabilities and features from BLAST and DOE-2 along with various new capabilities (Crawley et al. 2001; EnergyPlus 2017). Applications of EnergyPlus include load calculation, energy simulation, building performance simulation, energy performance, and heat and mass balance, which can be used to model cooling, heating, ventilating, lighting, and water consumption in buildings (Fumo, Mago, and Luck 2010). EnergyPlus is also open source, free to use, and can be downloaded online.4

[Figure 3. Visualizing bad vs. good models using scatter plots between modelled and baseline values: (a) bad model and (b) good model.]

The U.S.
Department of Energy's Building Technologies Program, in collaboration with the Pacific Northwest National Laboratory, Lawrence Berkeley National Laboratory, and National Renewable Energy Laboratory, has developed and made available prototype IDFs for different types of buildings in various locations representing all U.S. climate zones. Synthetic energy datasets are generated in EnergyPlus with prototype IDFs for small- and medium-size office buildings located in Fairbanks (subarctic climate), Phoenix (hot and dry climate), and San Francisco (warm and marine climate).

The small office building is rectangular in shape with a total area of 510.97 m2, covering one floor with four perimeter zones, one core zone, and an attic zone. The window-to-wall ratio is 24.4% for the south orientation and 19.8% for the other three orientations. An air-source heat pump is used for conditioning the building, and a gas furnace is used as a backup for additional heating requirements. Conditioned air is supplied to the building through constant air volume units. The medium office building is also rectangular in shape with a total area of 4979.6 m2, covering three floors, where each floor has four perimeter zones and one core zone. The window-to-wall ratio is 33%. A packaged air-conditioning unit (PACU) is used for cooling requirements, and a gas furnace inside the PACU is used for heating requirements. Conditioned air is supplied to the building through variable air volume units with dampers and electric reheating coils.

4.2. Adequacy of various statistical performance metrics

As mentioned in Section 1, this paper studies the adequacy of various statistical metrics for validating the performance of system-level EMs, because system-level energy modelling is often necessary for in-depth analysis of energy uses and for identifying and eliminating energy wastage.
All properties of both the small- and medium-size prototype building IDFs are kept unchanged except for the time step object, which is set to 60 min (1 h), 30 min, 15 min, and 1 min, so as to generate four different models for each energy consumption type (cooling and heating) for each of the three building locations mentioned above. Therefore, in total 2 × 4 × 2 × 3 = 48 EMs are developed during this research. The time step object specifies the time interval between successive zone heat and mass balance calculations in the EnergyPlus simulation software. Usually, the smaller the time step, the longer the software takes to run, but the more accurate the results.5 Thus, in this paper, the energy consumption data generated using prototype IDFs with a one-minute time step are treated as the more accurate baseline data, against which the energy consumption data obtained from the 15, 30 and 60 min time steps are compared. Such comparisons aid in understanding the adequacy of various statistical performance metrics. Henceforth, in this paper, the developed EMs corresponding to different time steps are referred to by the following names:

• 1 min time step – Baseline
• 60 min (1 h) time step – Model 1
• 30 min time step – Model 2
• 15 min time step – Model 3

In Section 5, the hourly cooling and heating energy consumption data from models 1, 2, and 3 are compared to the baseline data using the various statistical performance metrics mentioned in Sections 2 and 3. Therefore, in this paper, nine different comparison reports are presented (3 models × 3 building locations) for each energy consumption type and each building type. The statistical metrics are subsequently evaluated by comparing their relative values and graphically visualizing the differences in the data using scatter plots.

Table 2. Performance testing results for cooling electricity consumption models for small office buildings.

Metric          Comparison            Fairbanks   Phoenix   San Francisco
MBE* (%)        Baseline – Model 1      6.283      2.934       9.212
                Baseline – Model 2      3.804      1.864       5.504
                Baseline – Model 3      3.841      1.898       5.301
CVRMSE* (%)     Baseline – Model 1     19.958     13.563      25.222
                Baseline – Model 2     13.643      7.036      15.102
                Baseline – Model 3     11.519      4.564      12.242
RMSE (kWh)      Baseline – Model 1      0.036      0.142       0.061
                Baseline – Model 2      0.025      0.074       0.036
                Baseline – Model 3      0.021      0.048       0.029
RN_RMSE (%)     Baseline – Model 1      1.020      1.679       1.593
                Baseline – Model 2      0.697      0.871       0.954
                Baseline – Model 3      0.589      0.565       0.773
R2 (0–1)        Baseline – Model 1      0.996      0.995       0.989
                Baseline – Model 2      0.998      0.999       0.996
                Baseline – Model 3      0.999      0.999       0.997

Note: RMSE, CVRMSE, MBE and RN_RMSE are estimates of model accuracy, and lower values are desirable. R2 is also an estimate of model accuracy; 1 indicates a perfect fit, whereas 0 or negative values indicate a poor fit. 8760 data points were present in each of the test sets used to evaluate model performance.
*Prescribed in ASHRAE Guideline 14.

[Figure 4. Cooling electricity consumption for small office buildings located in Fairbanks, Phoenix and San Francisco: (a) Fairbanks cooling, (b) Phoenix cooling and (c) San Francisco cooling. Note: 1 min time step – Baseline; 60 min (1 h) time step – Model 1; 30 min time step – Model 2; 15 min time step – Model 3. The axes are different for the different cities.]

4.3. Visualizing model performance using scatter plots

Testing model predictions is a critical step in energy model validation. Scatter plots of modelled vs. baseline values are a common and reliable way to evaluate model predictions (Piñeiro et al. 2008).
Plotting the data and showing the dispersion of the values is one way to find out how much the modelled values vary from the baseline values. In an ideal world, modelled values would be exactly equal to the baseline values; that is, the relationship between modelled data (x) and baseline data (y) could be represented as y = x. However, in the real world, modelled data often deviate from baseline data; therefore, the primary objective is always to keep the dispersion of data points around the y = x line to a minimum. This is the fundamental statistical principle that will be used in this paper to comparatively evaluate the adequacy of various statistical metrics. For example, the difference between a bad and a good model is illustrated in Figure 3(a) and 3(b), respectively. Although both models consist of the same number of data points, the dispersion of these points from the y = x line in Figure 3(a) is much larger than that in Figure 3(b), which suggests that the modelled values in Figure 3(a) are inaccurate relative to those in Figure 3(b). This visualization technique aids in an adequate estimation of relative model performance, since it allows detailed observation of the deviation of each individual data point. In contrast, statistical metrics are summarized representations of the deviations between modelled and baseline data points, as a result of which some information about model performance is lost. Therefore, comparing the resulting metric values with detailed scatter plots between modelled and baseline data points can provide meaningful insight regarding the adequacy of various statistical metrics. Although scatter plots are one of the most powerful and widely used techniques for visual data exploration (Keim et al. 2010), other methods may also be appropriate depending upon the statistical context.
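The bad-vs-good comparison of Figure 3 can be reproduced with a few lines of plotting code. A sketch with hypothetical data, assuming matplotlib is available (the file name and noise levels are arbitrary):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; the figure is written to a file
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
baseline = rng.uniform(100.0, 300.0, 400)        # hypothetical baseline (kWh)
good = baseline + rng.normal(0.0, 5.0, 400)      # small dispersion around y = x
bad = baseline + rng.normal(0.0, 40.0, 400)      # large dispersion around y = x

fig, axes = plt.subplots(1, 2, figsize=(9, 4), sharex=True, sharey=True)
for ax, modelled, title in [(axes[0], bad, "(a) bad model"),
                            (axes[1], good, "(b) good model")]:
    ax.scatter(modelled, baseline, s=8)          # modelled (x) vs baseline (y)
    lims = [baseline.min(), baseline.max()]
    ax.plot(lims, lims, "k--", label="y = x")    # perfect-prediction line
    ax.set_title(title)
    ax.set_xlabel("modelled (kWh)")
    ax.set_ylabel("baseline (kWh)")
    ax.legend()
fig.savefig("scatter_check.png")
```

The wider cloud around the y = x line in panel (a) is the visual signature of an inaccurate model; the metric values summarize this dispersion in a single number.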
It is important to mention that statistical metrics are most useful for computational purposes, such as the automatic selection of the best model from a variety of different models, and for setting threshold limits in codes and standards. Therefore, it is inadvisable to overlook statistical metrics and simply rely on visualization techniques to evaluate the performance of EMs.

5. Results and discussion

As discussed in Section 4, a comparative model assessment technique is adopted to determine the adequacy of the statistical performance metrics used by researchers to test the predictive performance of EMs. The statistical metric values corresponding to each individual model are calculated using Equations (1)–(5).

5.1. Case 1a: Small office building simulation results and discussion for cooling energy

The statistical metric values based on simulation results are tabulated in Table 2 and the scatter plots of modelled vs. baseline values are shown in Figure 4. The following points can be inferred by analysing and comparing the metric values in Table 2 with the corresponding scatter plots in Figure 4, as explained in Section 4.3:

(1) For all three regions, model 3 is more accurate than model 2, which in turn is more accurate than model 1. This is also intuitive, since model 3 has a time step that is closer to the time step of the baseline model than the other two. However, this intrinsic property of the models cannot be inferred from the MBE values: as per the MBE values for Fairbanks and Phoenix, model 2 appears to be more accurate than model 3. Except for MBE, all other metrics clearly depict this property.

(2) The scatter plot for Phoenix model 1 clearly shows more deviation of data points from the y = x line than those for Fairbanks model 1 and San Francisco model 1. Nonetheless, as per the CVRMSE and MBE values, Phoenix model 1 appears to be the most accurate of the three, which is clearly an incorrect interpretation of model performance.

Table 3.
Performance testing results for heating electricity consumption models for small office buildings.

Metrics        Comparison            Fairbanks    Phoenix    San Francisco
MBEa (%)       Baseline – Model 1        0.196     −0.157           −1.260
               Baseline – Model 2       −0.078     −3.213           −2.715
               Baseline – Model 3       −0.816     −4.726           −3.982
CVRMSEa (%)    Baseline – Model 1       65.654    107.094           63.717
               Baseline – Model 2       36.575     68.559           41.303
               Baseline – Model 3       18.453     54.857           30.320
RMSE (kWh)     Baseline – Model 1        0.249      0.022            0.028
               Baseline – Model 2        0.139      0.014            0.018
               Baseline – Model 3        0.070      0.011            0.013
RN_RMSE (%)    Baseline – Model 1        3.558      0.686            0.806
               Baseline – Model 2        1.982      0.439            0.522
               Baseline – Model 3        1.000      0.352            0.383
R2 (0–1)       Baseline – Model 1        0.930      0.985            0.988
               Baseline – Model 2        0.978      0.994            0.995
               Baseline – Model 3        0.994      0.996            0.997

Note: RMSE, CVRMSE, MBE and RN_RMSE are estimates of model accuracy and a lower value is desirable. R2 is also an estimate of model accuracy: 1 indicates a perfect fit, whereas 0 or negative values indicate a poor fit. (8760 data points were present in each of the test sets used to evaluate model performance.)
a Prescribed in ASHRAE Guideline 14.

(3) Since RMSE is not a normalized statistical metric, it can only be compared between models whose errors are measured on the same scale and in the same units. In other words, it cannot account for multiplicative differences between datasets, as mentioned previously in Section 3.

(4) RN_RMSE does not suffer from the limitations mentioned in points 1–3 and can provide an accurate interpretation of corresponding model performance.

(5) Although it is hard to clearly distinguish between these models' performance using R2, as its corresponding values are very close to each other, this metric does not provide an incorrect estimate of relative model performance.

(6) It can be critically argued that a lower CVRMSE for Phoenix is somehow justifiable because the average cooling energy consumption is relatively higher in Phoenix than in the other two regions. However, as explained in Section 3, a relatively higher average energy consumption does not necessarily indicate that the modelling task is more difficult, that the dataset is more complicated, or that the resulting models are relatively different in terms of performance. Therefore, it is clear that CVRMSE will provide unfair estimates of the performance of models developed on a dataset with a lower relative average energy consumption. This phenomenon is even more apparent when EMs are developed on heating gas/electricity consumption data, because heating requirements often vary widely over the year, creating low average values for datasets with high variability. The result is very high CVRMSE values that can mislead analysts into discarding good EMs (see Sections 5.2 and 5.4 for more details).

Figure 5. Heating electricity consumption for small office buildings located in Fairbanks, Phoenix and San Francisco: (a) Fairbanks heating, (b) Phoenix heating and (c) San Francisco heating. Note: 1 minute time step – Baseline. 60 min (1 hour) time step – Model 1. 30 min time step – Model 2. 15 min time step – Model 3. The axes are different for different cities.

5.2. Case 1b: Small office building simulation results and discussion for heating energy

The statistical metric values based on simulation results are tabulated in Table 3 and the scatter plots of modelled vs. baseline values are shown in Figure 5. Analysis details for this case are provided as follows:

(1) The argument against MBE provided in Section 5.1 also holds true in this case.
As shown in Table 3 and Figure 5, no patterns emerge from the MBE values that would suggest that Model 3 is the closest and Model 1 the farthest from the baseline for each location, thereby providing misleading information about corresponding model performance.

(2) As mentioned in Section 5.1, CVRMSE can often provide a biased estimate of model performance. This point is further consolidated by the results provided in Table 3 and Figure 5. It is important to notice that the inflated CVRMSE values are for Phoenix and San Francisco, although there is no evidence of such model inaccuracy in the scatter plots provided in Figure 5. Since Phoenix and San Francisco do not require heating throughout the year, the average yearly consumption value, which is the denominator term in Equation (3), is very low; this is what causes the CVRMSE values to be so high.

(3) RMSE clearly suggests that Model 3 is the closest and Model 1 the farthest from the baseline for each location, which is also intuitive since the time step for Model 3 is closest to, and the time step for Model 1 farthest from, that of the baseline. Since RMSE is an absolute metric, it is not meaningful to compare its values across different locations.

(4) It can be seen from the RN_RMSE values in Table 3 and Figure 5 that this metric does not suffer from the limitations of MBE and CVRMSE mentioned in points 1 and 2, thereby providing analysts with more clarity about model performance.

(5) Although R2 does not suffer from the limitations mentioned above, it is not particularly useful in this case because the corresponding values are very close to each other, which can make it difficult for an analyst to distinguish clearly between the models.

Table 4. Performance testing results for cooling electricity consumption models for medium office buildings.

Metrics        Comparison            Fairbanks    Phoenix    San Francisco
MBEa (%)       Baseline – Model 1        8.472     10.485           12.513
               Baseline – Model 2        8.562     10.291           10.389
               Baseline – Model 3        7.271      7.514            9.136
CVRMSEa (%)    Baseline – Model 1       54.685     35.236           42.000
               Baseline – Model 2       40.567     24.870           28.342
               Baseline – Model 3       29.326     17.169           22.101
RMSE (kWh)     Baseline – Model 1        1.233      6.982            1.428
               Baseline – Model 2        0.915      4.928            0.963
               Baseline – Model 3        0.661      3.402            0.751
RN_RMSE (%)    Baseline – Model 1        2.569      6.304            3.171
               Baseline – Model 2        1.906      4.449            2.140
               Baseline – Model 3        1.378      3.072            1.669
R2 (0–1)       Baseline – Model 1        0.966      0.917            0.951
               Baseline – Model 2        0.981      0.959            0.978
               Baseline – Model 3        0.990      0.980            0.986

Note: RMSE, CVRMSE, MBE and RN_RMSE are estimates of model accuracy and a lower value is desirable. R2 is also an estimate of model accuracy: 1 indicates a perfect fit, whereas 0 or negative values indicate a poor fit. (8760 data points were present in each of the test sets used to evaluate model performance.)
a Prescribed in ASHRAE Guideline 14.

5.3. Case 2a: Medium office building simulation results and discussion for cooling energy

It is theoretically explained in Section 3 that the proposed metric RN_RMSE can overcome the limitations of CVRMSE and MBE. The benefits of the proposed metric are further consolidated using simulation results from a medium-size office building. The statistical metric values based on simulation results for cooling energy consumption are tabulated in Table 4 and the scatter plots of modelled vs. baseline values are shown in Figure 6. The results in Table 4 and Figure 6 support the same inferences as those in Sections 5.1 and 5.2. For example:

(1) MBE for Fairbanks Model 2 is higher than for Fairbanks Model 1, thereby suggesting that Model 1 is closer to the baseline than Model 2. This is clearly not the case based on the visualization provided in Figure 6 as well as the known time steps of the corresponding models.
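Point (1) above, where MBE ranks a worse model ahead of a better one, follows from sign cancellation in the MBE numerator. A tiny synthetic example (hypothetical numbers, with MBE expressed as a percentage of total baseline consumption in the ASHRAE style) makes this concrete:

```python
import numpy as np

def mbe_pct(y, yhat):
    """Mean bias error, %: positive and negative errors cancel in the sum."""
    return 100.0 * np.sum(y - yhat) / np.sum(y)

def rmse(y, yhat):
    """Root mean squared error: errors cannot cancel."""
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

baseline = np.array([1.0, 2.0, 3.0, 4.0])
model_a = baseline + np.array([0.5, -0.5, 0.5, -0.5])  # large but offsetting errors
model_b = baseline + 0.05                               # small, consistent bias

print(round(mbe_pct(baseline, model_a), 3))  # 0.0   -> looks "perfect"
print(round(mbe_pct(baseline, model_b), 3))  # -2.0  -> looks worse than model A
print(round(rmse(baseline, model_a), 3))     # 0.5   -> actually much less accurate
print(round(rmse(baseline, model_b), 3))     # 0.05
```

By MBE alone, model A (MBE = 0) would be preferred over model B (MBE = −2%), even though its pointwise errors are ten times larger, which is the same failure mode seen in the tables.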
(2) CVRMSE values for the Phoenix models are comparatively lower than those for the other two locations, but the scatter plots in Figure 6 impart a different picture. The scatter plots clearly reveal that the dispersion of data points from the y = x line is largest for the Phoenix models compared to the other locations (refer to Section 4.3 for an explanation). Therefore, it is shown once again that CVRMSE provides an incorrect estimate of the models' performance.

(3) As expected, RN_RMSE and R2 correctly depict the relative performance of the EMs in this case as well.

Figure 6. Cooling electricity consumption for medium office buildings located in Fairbanks, Phoenix and San Francisco: (a) Fairbanks cooling, (b) Phoenix cooling and (c) San Francisco cooling. Note: 1 minute time step – Baseline. 60 min (1 hour) time step – Model 1. 30 min time step – Model 2. 15 min time step – Model 3. The axes are different for different cities.

5.4. Case 2b: Medium office building simulation results and discussion for heating energy

The statistical metric values based on simulation results for heating energy consumption in the medium-size office building are tabulated in Table 5 and the scatter plots of modelled vs. baseline values are shown in Figure 7. It was briefly mentioned in Section 3 that the proposed metric RN_RMSE should be used in tandem with R2 to correctly evaluate the performance of EMs. Analysing this case provides some very interesting insights about the capability of RN_RMSE when used together with R2.

Table 5. Performance testing results for heating gas consumption models for medium office buildings.

Metrics        Comparison            Fairbanks     Phoenix    San Francisco
MBEa (%)       Baseline – Model 1       −8.589      −224.1         −142.433
               Baseline – Model 2       −5.749    −202.717         −109.047
               Baseline – Model 3       −4.643     −187.53           −94.67
CVRMSEa (%)    Baseline – Model 1       16.704    1822.256          724.472
               Baseline – Model 2       10.756    1669.183          578.161
               Baseline – Model 3        8.412    1558.341          511.576
RMSE (kWh)     Baseline – Model 1        5.713       0.881            1.039
               Baseline – Model 2        3.679       0.807            0.829
               Baseline – Model 3        2.877       0.753            0.734
RN_RMSE (%)    Baseline – Model 1        2.107       6.521             4.34
               Baseline – Model 2        1.357       5.974            3.463
               Baseline – Model 3        1.061       5.577            3.064
R2 (0–1)       Baseline – Model 1        0.992      −1.197            0.114
               Baseline – Model 2        0.996      −0.844            0.436
               Baseline – Model 3        0.998      −0.607            0.558

Note: RMSE, CVRMSE, MBE and RN_RMSE are estimates of model accuracy and a lower value is desirable. R2 is also an estimate of model accuracy: 1 indicates a perfect fit, whereas 0 or negative values indicate a poor fit. (8760 data points were present in each of the test sets used to evaluate model performance.)
a Prescribed in ASHRAE Guideline 14.

The following points can be inferred from the results given in Table 5 and Figure 7:

(1) Observing the MBE and CVRMSE values, it at first seems obvious that all models except the Fairbanks models are unacceptable and must therefore be discarded. These values are absurdly high, suggesting that the models are poor, but they provide analysts with no information about the reasons behind the poor performance.

(2) Interestingly, the RN_RMSE values are low, yet the R2 values are negative and low for the Phoenix and San Francisco models, respectively. Statistically, it is well known that high bias, high variance or a nonlinear relationship between modelled and baseline values can result in poor R2 values.
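The combination observed in Table 5, a low RN_RMSE alongside a negative R2, can be reproduced with a small synthetic sketch (hypothetical data, standard metric definitions as assumed earlier): a heating-like series that is mostly zero with a few spikes, predicted through a strongly nonlinear (saturating) mapping, yields an RMSE that is small relative to the range of the data but larger than the baseline standard deviation:

```python
import numpy as np

# Heating-like baseline: mostly zero with a few spikes, so the
# standard deviation is tiny compared to the range (highly skewed data)
y = np.zeros(1000)
y[:5] = 10.0

# Hypothetical model related to the baseline through a saturating nonlinearity
yhat = 3.0 * (1.0 - np.exp(-(y + 0.3)))

rmse = float(np.sqrt(np.mean((y - yhat) ** 2)))
rn_rmse = 100.0 * rmse / (y.max() - y.min())   # normalized by the range
cvrmse = 100.0 * rmse / y.mean()               # normalized by the (tiny) mean
r2 = 1.0 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

print(round(rn_rmse, 1))  # 9.2    -> modest relative to the range
print(round(cvrmse, 0))   # 1840.0 -> absurdly inflated, much like Table 5
print(round(r2, 2))       # -0.7   -> negative despite the small RN_RMSE
```

Note that the inflated CVRMSE here mirrors the four-digit CVRMSE values in Table 5, while the low RN_RMSE and negative R2 together point to a nonlinear model–baseline relationship rather than to sheer inaccuracy.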
Since the RN_RMSE (normalized RMSE) values are low, it is obvious that the models do not suffer from either high bias or high variance. Therefore, the only remaining possibility is a nonlinear relationship between the modelled and baseline values. This is indeed true, as can be validated by observing the scatter plots in Figure 7. Readers may refer to any introductory statistical learning textbook, such as James et al. (2013), to understand the basics of bias, variance, and model error. These two metrics (RN_RMSE and R2), when used together, can therefore reveal important and interesting insights about the performance of EMs to analysts.

(3) MBE and CVRMSE serve as pointless measures in this case, as they do not provide the analyst with any information about the problems in the models. In contrast, using RN_RMSE and R2 together can prove beneficial for the analyst. For example, consider Phoenix Model 3 in this case. Since a low RN_RMSE value combined with a negative R2 value suggests a possible nonlinear relationship between model and baseline, an analyst may decide to apply a post-processing technique to the modelled values to improve model performance. For instance, by passing the modelled values through a second-order equation (0.0246 × X² + 0.05323 × X − 0.0006672), an analyst can significantly improve the performance of this model. The MBE, CVRMSE, RMSE, RN_RMSE, and R2 of the improved model are found to be −0.01%, 122.837%, 0.059 kWh, 0.44% and 0.99, respectively. The second-order equation used above for post-processing the modelled values was obtained using the Curve Fitting Toolbox in MATLAB.6 Several post-processing techniques exist to improve the performance of resulting models; discussing them is beyond the scope of this research paper.

6. Conclusion

Testing the predictive performance (also called validation) of EMs is an important step, which requires as much attention as simulation programs, algorithms, and model development.
Figure 7. Heating energy consumption for medium office buildings located in Fairbanks, Phoenix and San Francisco: (a) Fairbanks heating, (b) Phoenix heating and (c) San Francisco heating. Note: 1 minute time step – Baseline. 60 min (1 hour) time step – Model 1. 30 min time step – Model 2. 15 min time step – Model 3. The axes are different for different cities.

The widely used statistical metrics for validating calibrated EMs were proposed a few decades ago from the perspective of whole-building measurement and verification of energy conservation measures. A recent shift in trend towards system-level energy modelling demands better and more reliable statistical metrics to test the predictive performance of the resulting models. This paper demonstrates that metrics such as RMSE, CVRMSE, and MBE are not as reliable as other measures, especially for the validation of system-level EMs. CVRMSE and MBE cannot normalize additive differences between datasets. MBE also suffers from the cancellation of positive and negative errors, and hence it cannot be used to evaluate the total variance in the error of resulting models. RMSE is not a unitless/dimensionless metric, and thus it cannot be used to compare the predictive performance of EMs based on different datasets. Ideally, a statistical performance metric should be scale- and unit-invariant; should not be prone to overestimation or underestimation of the performance of EMs; and should be universally applicable to all types of EMs. Unfortunately, as illustrated in this paper, none of the widely used statistical metrics satisfies all these criteria. Therefore, an alternative metric named range normalized root mean squared error (RN_RMSE) is proposed in this paper. It is shown that RN_RMSE can successfully normalize multiplicative as well as additive differences between datasets.
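This normalization claim can be checked directly: shifting both series by a constant (an additive difference between datasets) or scaling both by a factor (a multiplicative difference) leaves RN_RMSE unchanged, whereas CVRMSE is distorted by the shift. A short sketch with hypothetical data and the standard definitions assumed earlier:

```python
import numpy as np

def rmse(y, yhat):
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

def cvrmse(y, yhat):
    return 100.0 * rmse(y, yhat) / np.mean(y)              # normalized by the mean

def rn_rmse(y, yhat):
    return 100.0 * rmse(y, yhat) / (np.max(y) - np.min(y))  # normalized by the range

rng = np.random.default_rng(0)
y = np.linspace(1.0, 10.0, 100)
yhat = y + rng.normal(0.0, 0.3, 100)

# Additive difference: same model errors, dataset shifted by +100
assert np.isclose(rn_rmse(y + 100.0, yhat + 100.0), rn_rmse(y, yhat))    # unchanged
assert not np.isclose(cvrmse(y + 100.0, yhat + 100.0), cvrmse(y, yhat))  # deflated by the larger mean

# Multiplicative difference: same relative errors, dataset scaled by 3
assert np.isclose(rn_rmse(3.0 * y, 3.0 * yhat), rn_rmse(y, yhat))  # unchanged
assert np.isclose(cvrmse(3.0 * y, 3.0 * yhat), cvrmse(y, yhat))    # also unchanged
```

The asymmetry is visible in the second pair of checks: CVRMSE copes with multiplicative differences but not additive ones, while RN_RMSE is invariant to both, since its denominator (the range) shifts and scales together with its numerator.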
Also, RN_RMSE does not suffer from overestimation and underestimation problems, because its denominator, which is a measure of the variability of the data, is proportional to its numerator, which is a measure of the variability of the residuals. Finally, it is suggested that R2 be used along with RN_RMSE to identify any existing nonlinear relationship between the modelled results and the baseline data.

7. Future work

It should be pointed out that data from real buildings often suffer from measurement-related inaccuracies that may harm the accuracy of the models. Therefore, data preprocessing is an essential step that must be carried out before energy data analysis and modelling. In this research, the necessity for data preprocessing was eliminated by using synthetic data as a surrogate for sensor data from real buildings. Future research may include evaluating the adequacy of the proposed metric using measured data from real buildings.

The notion of having an absolute cut-off criterion for statistical metrics is baseless, because case-specific energy modelling tasks are unique. In some cases, the energy consumption may be important whereas, in others, the peak load or its time of occurrence may be more critical. Cases may also vary for different building or system types. Therefore, it is necessary to group cut-off criteria based on the similarity of cases. Future research may also include (1) synthesizing larger and case-specific datasets; (2) using those datasets to explore the range of values that the proposed statistical metrics can take when different algorithms are applied, which will allow the range of metric values corresponding to good EMs developed using superior algorithms to be evaluated; and (3) developing a way to determine cut-off criteria that must be met subject to specific applications, such as M&V, predictive analysis and fault detection.
In this paper, we have only considered system-level energy consumption types. According to our preliminary analysis, whole-building energy consumption profiles also vary additively from one another. Therefore, the proposed alternative metrics are also likely to be more useful for testing the predictive performance of whole-building energy models. Future research may include further analysis in this regard.

Notes
1. https://energy.gov/eere/buildings/about-building-energy-modeling
2. https://www.energycodes.gov/development/commercial/prototype_models
3. https://www.mathworks.com/help/curvefit/least-squares-fitting.html
4. https://energyplus.net
5. http://bigladdersoftware.com/epx/docs/8-0/input-output-reference/page-006.html
6. https://www.mathworks.com/help/curvefit/curve-fitting.html

Disclosure statement
No potential conflict of interest was reported by the authors.

ORCID
Debaditya Chakraborty http://orcid.org/0000-0002-2165-9440

References
Al-Homoud, M. S. 2001. "Computer-Aided Building Energy Analysis Techniques." Building and Environment 36 (4): 421–433.
ASHRAE. 2013. Fundamentals Handbook. IP Edition. Atlanta, GA: ASHRAE.
ASHRAE. 2014a. "Guideline 14-2014, Measurement of Energy, Demand and Water Savings." American Society of Heating, Ventilating, and Air Conditioning Engineers, Atlanta, GA.
ASHRAE. 2014b. "Standard 140-2014: Standard Method of Test for the Evaluation of Building Energy Analysis Computer Programs." ASHRAE, Atlanta.
Aydinalp-Koksal, M., and V. I. Ugursal. 2008. "Comparison of Neural Network, Conditional Demand Analysis, and Engineering Approaches for Modeling End-Use Energy Consumption in the Residential Sector." Applied Energy 85 (4): 271–296.
Aydinalp-Koksal, M., V. I. Ugursal, and A. S. Fung. 2002. "Modeling of the Appliance, Lighting, and Space-Cooling Energy Consumptions in the Residential Sector using Neural Networks." Applied Energy 71 (2): 87–110.
Ben-Nakhi, A. E., and M. A. Mahmoud. 2004.
"Cooling Load Prediction for Buildings using General Regression Neural Networks." Energy Conversion and Management 45 (13): 2127–2141.
Bloomfield, D. P. 1999. "An Overview of Validation Methods for Energy and Environmental Software." ASHRAE Transactions 105: 685.
Clarke, J., J. Cockroft, S. Conner, J. Hand, N. Kelly, R. Moore, T. O'Brien, and P. Strachan. 2002. "Simulation-Assisted Control in Building Energy Management Systems." Energy and Buildings 34 (9): 933–940.
Crawley, D. B., L. K. Lawrie, F. C. Winkelmann, W. F. Buhl, Y. J. Huang, C. O. Pedersen, R. K. Strand, et al. 2001. "EnergyPlus: Creating a New-Generation Building Energy Simulation Program." Energy and Buildings 33 (4): 319–331.
Deb, C., L. S. Eang, J. Yang, and M. Santamouris. 2016. "Forecasting Diurnal Cooling Energy Load for Institutional Buildings using Artificial Neural Networks." Energy and Buildings 121: 284–297.
Dhar, A., T. A. Reddy, and D. Claridge. 1999. "A Fourier Series Model to Predict Hourly Heating and Cooling Energy Use in Commercial Buildings with Outdoor Temperature as the Only Weather Variable." Journal of Solar Energy Engineering 121 (1): 47–53.
Dodge, Y. 2003. The Oxford Dictionary of Statistical Terms. Oxford: Oxford University Press on Demand.
Dong, B., C. Cao, and S. E. Lee. 2005. "Applying Support Vector Machines to Predict Building Energy Consumption in Tropical Region." Energy and Buildings 37 (5): 545–553.
EnergyPlus. 2017. "The Encyclopedic Reference to EnergyPlus Input and Output." https://energyplus.net/sites/default/files/pdfs/pdfs_v8.3.0/InputOutputReference.pdf.
Fumo, N., P. Mago, and R. Luck. 2010. "Methodology to Estimate Building Energy Consumption using EnergyPlus Benchmark Models." Energy and Buildings 42 (12): 2331–2337.
Garrett, A., and J. New. 2016. "Suitability of ASHRAE Guideline 14 Metrics for Calibration." ASHRAE Transactions 122 (1): 469–477.
Gestwick, M. J., and J. A. Love. 2014.
"Trial Application of ASHRAE 1051-RP: Calibration Method for Building Energy Simulation." Journal of Building Performance Simulation 7 (5): 346–359.
Granderson, J., and P. N. Price. 2014. "Development and Application of a Statistical Methodology to Evaluate the Predictive Accuracy of Building Energy Baseline Models." Energy 66: 981–990.
Jacob, D., S. Dietz, S. Komhard, C. Neumann, and S. Herkel. 2010. "Black-Box Models for Fault Detection and Performance Monitoring of Buildings." Journal of Building Performance Simulation 3 (1): 53–62.
James, G., D. Witten, T. Hastie, and R. Tibshirani. 2013. An Introduction to Statistical Learning, Vol. 112. New York: Springer.
Judkoff, R., D. Wortman, B. O'Doherty, and J. Burch. 1983. "A Methodology for Validating Building Energy Analysis Simulations." Tech. Rep. TR-254-1508, Solar Energy Research Institute.
Kandil, A.-E., and J. A. Love. 2014. "Signature Analysis Calibration of a School Energy Model Using Hourly Data." Journal of Building Performance Simulation 7 (5): 326–345.
Ke, M.-T., C.-H. Yeh, and J.-T. Jian. 2013. "Analysis of Building Energy Consumption Parameters and Energy Savings Measurement and Verification by Applying eQUEST Software." Energy and Buildings 61: 100–107.
Keim, D. A., M. C. Hao, U. Dayal, H. Janetzko, and P. Bak. 2010. "Generalized Scatter Plots." Information Visualization 9 (4): 301–311.
Kwok, S. S., R. K. Yuen, and E. W. Lee. 2011. "An Intelligent Approach to Assessing the Effect of Building Occupancy on Building Cooling Load Prediction." Building and Environment 46 (8): 1681–1690.
Lam, J. C., K. K. Wan, S. Wong, and T. N. Lam. 2010. "Principal Component Analysis and Long-Term Building Energy Simulation Correlation." Energy Conversion and Management 51 (1): 135–139.
Lee, W.-Y., J. M. House, and N.-H. Kyong. 2004.
"Subsystem Level Fault Diagnosis of a Building's Air-Handling Unit using General Regression Neural Networks." Applied Energy 77 (2): 153–170.
Pan, Y., Z. Huang, and G. Wu. 2007. "Calibrated Building Energy Simulation and Its Application in a High-Rise Commercial Building in Shanghai." Energy and Buildings 39 (6): 651–657.
Piñeiro, G., S. Perelman, J. P. Guerschman, and J. M. Paruelo. 2008. "How to Evaluate Models: Observed vs. Predicted or Predicted vs. Observed?" Ecological Modelling 216 (3): 316–322.
Reddy, T. A. 2006. "Literature Review on Calibration of Building Energy Simulation Programs: Uses, Problems, Procedures, Uncertainty, and Tools." ASHRAE Transactions 112 (1): 226–240.
Reddy, T. A., and D. E. Claridge. 2000. "Uncertainty of 'Measured' Energy Savings from Statistical Baseline Models." HVAC&R Research 6 (1): 3–20.
Salkind, N. J. 2007. Encyclopedia of Measurement and Statistics, Vol. 2. Thousand Oaks, CA: Sage.
Torcellini, P., M. Deru, B. Griffith, K. Benne, M. Halverson, D. Winiarski, and D. Crawley. 2008. "DOE Commercial Building Benchmark Models." ACEEE 2008 Summer Study on Energy Efficiency in Buildings. NREL Conference Paper NREL/CP-550-43291, 17–22. http://www.nrel.gov/docs/fy08osti/43291.pdf.
Wetter, M. 2011. "A View on Future Building System Modeling and Simulation." In Building Performance Simulation for Design and Operation, edited by J. L. Hensen and R. Lamberts, 481–503. London: Routledge.
Zhang, Y., Z. O'Neill, B. Dong, and G. Augenbroe. 2015. "Comparisons of Inverse Modeling Approaches for Predicting Building Energy Performance." Building and Environment 86: 177–190.
Zhao, H.-X., and F. Magoulès. 2012. "A Review on the Prediction of Building Energy Consumption." Renewable and Sustainable Energy Reviews 16 (6): 3586–3592.