A Sampling of Statistical Models (Part 2)
In part 1 of this series, we discussed what a model is. If you haven’t done so, please read that blog before starting this one. In this post, we focus on what makes a model statistical. We begin by discussing what it means to specify and estimate a model. Having done so, we describe how statistics enters the picture to make a model statistical. After reading this post, you will know what statistical models are and how to use them. Let’s begin!
Model Specification Entails All Factors That Impact How Well A Model Represents A Data Generating Process
Get closer to the truth
It helps to start with how to interpret the results of a statistical model. Assume that we have specified and estimated a statistical model, and the result is that each additional year of education is associated with a $1,000 increase in a person's annual income. How do we interpret this result? We say that "if the model is true, then we expect a $1,000 increase in annual income for every additional year of education completed."
This raises the question of what the word "true" means in this context. Being a representation of a data generating process, a model by definition cannot be the data generating process itself, any more than a map of a country can be the country it represents. More practically, however, we can always ask whether a model represents something well or not. This is a matter of degree, and it is what the word "true" means here. A model is "true" when it represents a data generating process well. And, as you may have guessed, a model that doesn't represent a data generating process well is no more useful than a map of the wrong location.
But watch out for potential problems
What, then, are the factors that can enhance or detract from how well a model represents a data generating process? They include but are not limited to whether the model is appropriate for the dependent variable in question, whether the functional form reasonably captures how the dependent variable changes with the independent variables, and whether certain independent variables have been left out of the model. There are resources that address these issues in depth. However, the key takeaway is that model specification is about how investigators identify and address factors that impact how well a model represents a data generating process.
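To make this concrete, here is a minimal sketch in Python of one of these specification issues: leaving a relevant variable out of the model. It reuses the education and income example from above plus a hypothetical family-wealth variable; all names and numbers are simulated for illustration and are not results from any real study.

```python
# A minimal sketch of omitted-variable bias, using simulated data.
# The variables (education, wealth, income) and every number below are
# illustrative assumptions, not results from any real study.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 1_000

# Family wealth influences both years of education and annual income.
wealth = rng.normal(50, 10, n)                        # family wealth, $1,000s
education = 10 + 0.2 * wealth + rng.normal(0, 2, n)   # years of schooling
income = 20_000 + 1_000 * education + 300 * wealth + rng.normal(0, 5_000, n)

df = pd.DataFrame({"education": education, "wealth": wealth, "income": income})

# Specification that includes the confounding variable.
full = smf.ols("income ~ education + wealth", data=df).fit()

# Misspecified model: family wealth is left out.
omitted = smf.ols("income ~ education", data=df).fit()

print(full.params["education"])     # close to the simulated $1,000 per year
print(omitted.params["education"])  # biased upward because wealth is omitted
```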
At Social Scientific, a combination of our clients' industry experience and the extant literature informs model specification. This improves statistical inference by helping to ensure that our statistical models capture data generating processes as much as possible.
Estimating A Model Means Calculating Its Parameters Using Data
Take advantage of technology
In the section about model specification, we mentioned that the result of our model was that a unit increase in the number of years of education is associated with a $1,000 increase in annual income. How exactly did we get this result? This is the domain of model estimation. In model estimation, we collect data (in this example, on education, income, and, say, family wealth) from a sample of some population of interest and apply the tools of mathematics (usually multivariate calculus, matrix algebra, and statistics) to identify patterns in the data (correlations, intercepts, error terms, and so on, collectively known as parameters).
However, with the advent of statistical software, it is much more efficient to feed the data into a computer program, provide specific instructions, and let the computer find the patterns with great dispatch. That said, software by no means obviates the need for a good understanding of the issues surrounding estimation.
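As a minimal sketch of what this looks like in practice, the snippet below estimates the education/income model two ways: once "by hand" via the least-squares normal equations, and once with the statsmodels library. The tiny dataset is made up purely for illustration.

```python
# A minimal sketch of what "letting the computer estimate the model" means.
# The small dataset below is made up purely for illustration.
import numpy as np
import statsmodels.api as sm

education = np.array([10, 12, 12, 14, 16, 16, 18, 20])      # years of schooling
income = np.array([38, 41, 43, 45, 52, 50, 55, 61]) * 1_000  # annual income

# By hand: solve the least-squares normal equations (X'X) b = X'y.
X = np.column_stack([np.ones_like(education), education])
beta = np.linalg.solve(X.T @ X, X.T @ income)
print(beta)  # [intercept, slope]; the slope is the estimated $ per extra year

# With software: the same estimate, plus standard errors, t-tests, and so on.
model = sm.OLS(income, X).fit()
print(model.params, model.bse)
```

Either route gives the same slope; the software route additionally reports the standard errors and tests that matter for the inference discussed in the next section.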
But know that technology can’t think for you
What, then, are some of the issues with model estimation? There are many, but for the purposes of this blog we focus on the generalizability of the estimated results to the population of interest. Note that the dataset we used to estimate the parameters of the model is one of many datasets we could have used. If we collected data from a different sample, created a dataset out of that data, and used it to estimate the model, we would likely get different results. To drive this point home, let’s think of a scenario. Assume that the United States Census Bureau has told us that the average height of men in Cincinnati, Ohio is 6 ft.
This doesn’t mean that if you go to Cincinnati and measure the height of the men there, each one of them will be 6 ft tall. Indeed, what you’ll observe is that some of the men are shorter than 6 ft and others are taller. Rather, an average height of 6 ft means that if you are asked the height of a man in Cincinnati and you have no other information, 6 ft is your best guess. Similarly, the result of a model is one of many results you could have gotten if you estimated the model repeatedly, using a different dataset each time. The current result is simply a good guess of what the actual situation in the population looks like.
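Here is a minimal sketch of this idea using simulated data: the same model is re-estimated on many random samples drawn from a simulated population, and the estimated slope varies from sample to sample while clustering around the population value. The numbers are illustrative assumptions.

```python
# A minimal sketch of sampling variability: re-estimating the same model on
# many different random samples gives a spread of results, not one fixed number.
# The "population" below is simulated; all numbers are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)

def draw_sample_and_estimate(n=200):
    education = rng.normal(14, 2, n)
    income = 20_000 + 1_000 * education + rng.normal(0, 8_000, n)
    X = np.column_stack([np.ones(n), education])
    intercept, slope = np.linalg.lstsq(X, income, rcond=None)[0]
    return slope

slopes = np.array([draw_sample_and_estimate() for _ in range(1_000)])
print(slopes.mean())  # centered near the simulated $1,000 per year of education
print(slopes.std())   # but any single sample's estimate can land well away from it
```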
So, if we can get different results each time we use a different dataset to estimate a statistical model, how can we be confident that our current result generalizes to the population of interest? Welcome to the domain of statistics, which is the focus of the next section.
A Model Becomes Statistical When Methods in Inferential Statistics Are Used To Estimate It
Known as the science of uncertainty, statistics helps us make inferences about a population of interest using data collected and analyzed from a sample of that population. The power of statistics is akin to that of a medical blood test. Doctors need not draw all of a patient’s blood to test for, say, cancer. Instead, they draw a small amount, run tests on it, and infer the patient’s condition from the test results.
There are two broad groups of statistical methods you can use when you want to extrapolate a model’s results to a wider population of interest: parametric and non-parametric methods. Parametric methods make more assumptions about the population from which the data were collected than non-parametric methods do; in particular, they assume that the model’s error terms follow a specific probability distribution. Each type of method has benefits and costs. Notably, in cases where the data in the population of interest are not normally distributed, non-parametric methods perform just as well as their parametric counterparts.
In cases where the data are normally distributed, however, non-parametric methods provide relatively less precision. Examples of parametric methods for estimating model parameters include Ordinary Least Squares (OLS) with t-tests, Maximum Likelihood Estimation (MLE), and Bayesian Estimation (BE). Non-parametric methods include rank-based methods (asymptotically distribution-free rank-based tests, estimators associated with the Theil statistic, etc.) and non-rank-based methods (kernel regression smoothers, spline smoothers, local regression smoothers, wavelet smoothers, etc.). Although the specifics of these methods are beyond the scope of this blog, we hope they give you an idea of the statistical methods you can use to extrapolate a model’s results to a broader population of interest.
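As a minimal sketch of the trade-off, the snippet below compares a parametric slope estimate (OLS) with a non-parametric, rank-based one (the Theil-Sen estimator, related to the Theil statistic mentioned above) on made-up data containing a single gross outlier. The specific numbers are illustrative assumptions.

```python
# A minimal sketch contrasting a parametric estimate (OLS) with a
# non-parametric, rank-based one (the Theil-Sen slope).
# The data are made up; one deliberately extreme point shows the difference.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = np.arange(1, 31, dtype=float)
y = 3.0 * x + rng.normal(0, 2, x.size)
y[-1] += 200  # a single gross outlier

ols = stats.linregress(x, y)
theil = stats.theilslopes(y, x)

print(ols.slope)  # pulled away from the simulated slope of 3 by the outlier
print(theil[0])   # the median-of-pairwise-slopes estimate stays near 3
```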