Shedding Light on the Role of Sample Sizes and Splitting Proportions in Out-of-Sample Tests: A Monte Carlo Cross-Validation Approach

Christian Janze

Abstract


We examine whether the popular 2/3 rule-of-thumb splitting criterion used in out-of-sample evaluations of predictive econometric and machine learning models is justified. We conduct simulations of the predictive performance of logistic regression and decision tree algorithms under varying splitting proportions and sample sizes. Our non-exhaustive repeated random sub-sampling approach, known as Monte Carlo cross-validation, indicates that while the 2/3 rule-of-thumb works, a spectrum of other splitting proportions yields equally compelling results. Furthermore, our results indicate that the size of the complete sample has little impact on the applicability of the 2/3 rule-of-thumb. However, our analysis reveals that when the training sample is very small or very large relative to the complete sample, the variance of the predictive accuracy grows and can produce misleading results. These findings are especially important for IS researchers who consider using out-of-sample methods to evaluate their predictive models.
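To make the procedure concrete, below is a minimal sketch of Monte Carlo cross-validation with a varying training proportion, assuming scikit-learn, a synthetic binary classification task, and a logistic regression learner. The dataset, split grid, and repetition count are illustrative assumptions, not the paper's original setup.

```python
# Sketch of Monte Carlo cross-validation (repeated random sub-sampling):
# for each candidate training proportion, repeatedly draw a random
# train/test split, fit the model, and record out-of-sample accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ShuffleSplit

# Synthetic binary classification data (stand-in for the paper's samples).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Sweep over training proportions, including the 2/3 rule of thumb.
for train_prop in (0.10, 0.33, 0.50, 2 / 3, 0.90):
    splitter = ShuffleSplit(n_splits=100, train_size=train_prop,
                            random_state=0)
    scores = []
    for train_idx, test_idx in splitter.split(X):
        model = LogisticRegression(max_iter=1000)
        model.fit(X[train_idx], y[train_idx])
        scores.append(accuracy_score(y[test_idx],
                                     model.predict(X[test_idx])))
    print(f"train proportion {train_prop:.2f}: "
          f"mean accuracy {np.mean(scores):.3f} (sd {np.std(scores):.3f})")
```

Comparing the mean and standard deviation of accuracy across proportions mirrors the analysis described above: extreme training proportions inflate the variance of the accuracy estimate, which is what makes them potentially misleading.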

Full Text:

PDF (English)


DOI: http://dx.doi.org/10.18803/capsi.v17.245-259
