Numerical variables describe data that arise from counts or measurements. A numerical variable is said to be discrete if the data can take only specific values, such as parity of women or the number of complications following a procedure. Conversely, it is said to be continuous if the data can take any value within a range, like height or length of hospital stay. The values of a discrete variable are usually integers and equally spaced. While in theory the values of a continuous variable can take any fraction of a measurement, in practice a finite set of values is usually possible and/or reasonable for relevant interpretation.
It is often useful to summarize data in just a few numbers that help reveal relevant information. Measures of central tendency of a distribution are usually good indicators of what can be expected from the sample as a whole. The mean, or arithmetic mean, is the average of the numbers in the set of data. It is calculated by dividing the sum of the values that make up the distribution by the number of data points. The truncated or trimmed mean is the mean calculated after discarding data points at both the low and high ends. This can be a useful tool for comparisons, but it does not tell us much about the individual values in the set of data. The median is the value found directly at the midpoint of the distribution, which is why it is also said to be a measure of location. It is the 0.5 quantile or 50th percentile, meaning that one half of the values in the distribution lie above the median and the other half below it. Unlike the mean, the median is not influenced by unusually small or large values, called outliers. A quantile divides the distribution into set proportions of data points. Quartiles divide the distribution into four equal portions; the median is therefore the second quartile.
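The measures above can be sketched in a few lines of Python. The sample below is hypothetical, chosen so that a single outlier shows the different behavior of the mean and the median.

```python
# Hypothetical sample: lengths of hospital stay in days; 30 is an outlier.
stays = [1, 2, 2, 3, 3, 4, 4, 5, 30]

# Arithmetic mean: sum of the values divided by the number of data points.
mean = sum(stays) / len(stays)

# Trimmed mean: discard one value at each end before averaging.
trimmed = sorted(stays)[1:-1]
trimmed_mean = sum(trimmed) / len(trimmed)

# Median: midpoint of the sorted data (the 0.5 quantile).
s = sorted(stays)
n = len(s)
median = s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

print(mean)          # pulled upward by the outlier
print(trimmed_mean)  # less affected once the extremes are discarded
print(median)        # unaffected by the outlier
```

Here the mean (6.0) is dragged well above the median (3) by the single large value, illustrating why the median is preferred for skewed data.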
A distribution is said to be symmetrical if the distance from the highest value to the median is the same as the distance from the lowest value to the median. In this case, the mean and the median are approximately equal. If, however, the distance is much greater on one side, the distribution is said to be skewed. A distribution is positively skewed if the distance from the median to the highest value is greater than the distance to the lowest value; the tail on the right is longer. In a negatively skewed distribution the tail on the left is longer because the distance from the median to the lowest value is greater than the distance to the highest value. Kurtosis is a measure of the tailedness of a distribution and is highly influenced by outlier data points. A kurtosis greater than three indicates a distribution with long tails, while a kurtosis lower than three indicates a distribution with short tails. This statistic may be a useful indicator of normality, as a normal distribution has a kurtosis of three.
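A minimal sketch of the (population) kurtosis described above: the fourth central moment divided by the squared second central moment. The sample values are hypothetical, picked only to contrast short and long tails.

```python
def kurtosis(data):
    """Population kurtosis: fourth central moment over squared variance."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in data) / n  # fourth central moment
    return m4 / m2 ** 2

print(kurtosis([1, 2, 3, 4, 5]))      # short tails: below 3
print(kurtosis([1, 2, 3, 4, 5, 50]))  # long right tail: above 3
```

Note that some statistical packages report "excess kurtosis" (this value minus three), so a normal distribution is reported as zero rather than three.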
Measures of dispersion of a distribution tell us about the spread of individual data points. The range is the difference between the highest and the lowest data points. A relevant disadvantage of the range is that it depends on extreme data points, which may be outliers. The interquartile range is the difference between the first and the third quartiles, in other words, the range covering the middle fifty percent of the data. An outlier can be defined as a data point that lies more than 1.5 times the interquartile range below the first quartile or above the third quartile. A disadvantage of the interquartile range is that it changes from sample to sample of the same population. Box plots are simple graphical representations of these parameters that provide a quick summary of the distribution. The box spans the interquartile range, a horizontal line across the box represents the median, and the lines (whiskers) at either extreme represent the range. Outliers are shown as separate dots.
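The interquartile range and the 1.5 × IQR outlier rule can be sketched with the standard library's `statistics.quantiles`. The sample data are hypothetical; note that quartile values can differ slightly between software packages depending on the interpolation method used.

```python
import statistics

# Hypothetical sample with one extreme value.
data = [1, 2, 2, 3, 3, 4, 4, 5, 30]

# Cut the distribution into four equal portions; q2 is the median.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Fences for the 1.5 * IQR outlier rule.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(iqr)       # interquartile range: middle fifty percent of the data
print(outliers)  # points beyond the fences, drawn as dots on a box plot
```

These are exactly the quantities a box plot displays: the box from q1 to q3, the median line at q2, and outliers plotted individually beyond the fences.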
A preferred measure of dispersion is the standard deviation, because it provides information about the spread of data points around the mean. It is a relevant tool for interpreting data drawn from a variable that follows a normal distribution: about 68% of data points are expected to fall within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations. As a measure of scatter about the mean, the standard deviation is found by first calculating the difference between each data point and the mean. The resulting deviations from the mean will be both negative and positive, so their sum would be zero. Since it is not a matter of direction but of size, we square the deviations before adding them. This gives the sum of squares about the mean. To get the mean of squares, we divide the sum of squares by the number of data points minus one, the degrees of freedom of the variance estimate. This is because a minimum of two data points is needed to measure variability: if the number of data points were 1, the denominator would be 0, reflecting the impossibility of estimating variability from a single data point. This estimate of variability is called the variance. Since it is calculated from the squares of the individual deviations from the mean, its units differ from those of the original data. The square root of the variance, which has the same units as the data, gives the standard deviation. The standard deviation of the sampling distribution of the sample mean is the standard error of the mean. The standard error is a measure of the precision of estimates, and can be thought of as the dispersion of sample means around the population mean.
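The steps above, from deviations to the standard error, can be followed one at a time in code. The sample values are hypothetical.

```python
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
mean = sum(data) / n

# Deviations from the mean are both positive and negative; they sum to zero.
deviations = [x - mean for x in data]

# Square the deviations so size matters but direction does not.
sum_of_squares = sum(d ** 2 for d in deviations)

# Divide by n - 1 (the degrees of freedom) to get the variance;
# its units are the square of the data's units.
variance = sum_of_squares / (n - 1)

# The square root restores the original units: the standard deviation.
std_dev = math.sqrt(variance)

# Standard error of the mean: precision of the sample mean as an estimate.
std_error = std_dev / math.sqrt(n)

print(variance, std_dev, std_error)
```

The same values can be obtained with `statistics.variance` and `statistics.stdev`, which also use the n − 1 denominator.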