A Basic Review of Statistics Definitions and Concepts

 

Population: The universe of event numbers under study. 

Sample and sampling: A portion of the population used for statistical analysis. Sampling is the process by which values are selected from the population. Sample statistics, if they are unbiased, are economical ways to draw inferences about the larger population. The requirement here is that the sample drawn should be representative of the population and large enough to give a reliable picture of it.

Probability sampling: A selection process that considers the likelihood of selecting a particular number or attribute from the population.

Random sample: A sampling process whereby every item in the population has an equal chance of being chosen and placed in the sample.  Lack of a random sample may result in the estimated statistic(s) being biased. 
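
As a minimal sketch of drawing a random sample, the Python snippet below uses the standard library's random module. The population (the integers 1 through 1000) and the sample size of 50 are made-up values chosen only for illustration.

```python
import random

# Hypothetical population: the integers 1..1000 stand in for any list of items.
population = list(range(1, 1001))

rng = random.Random(42)  # fixed seed only so the example is repeatable

# sample() draws without replacement, and every item is equally likely to be chosen.
sample = rng.sample(population, k=50)

population_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)
print(population_mean, sample_mean)
```

The sample mean will usually be close to, but not exactly equal to, the population mean.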

Variable: An attribute that can take on different values. Variables are the characteristics of a sample to which statistical analysis is applied. Variables can be categorical or numerical. For example, a person's religion would be a categorical variable whereas that person's disposable income would be a numerical variable.

Parameter: A numerical measure describing a particular characteristic of a population. Since we typically do not study entire populations, parameters are often unobservable and must be estimated. An unbiased parameter estimate is one whose expected value equals the true population parameter.

Statistic: A numerical measure that describes some property of a sample. A statistic is computed from a sample and is used to estimate the corresponding population parameter. We hope the statistic estimated from the sample is statistically equal to the value we would obtain if we could measure the entire population.

Published sources of data: Published data come from either a primary or a secondary source. Primary data is data collected by the analyst who uses it. Secondary data is data that someone else has compiled from primary sources.

Experimental data: Data about a variable that has been collected by allowing only one (or a selected) group of variables to change. All other variables are held constant. Experimental data is typically seen in the hard sciences. Non-experimental data is typically seen in the social sciences where it is impossible to "hold everything else constant." 

Survey data: Data collected from the responses of a group of participants.

Frame data: Data collected using a pre-specified list establishing the guidelines that will be used in assembling the sample from the population.  Frames should be selected so that the resulting sample will represent the population.

Bar chart: A chart made from categorical data in which the heights of the bars represent the frequency (or relative frequency, that is, percent) of each category of the variable. Unlike a histogram, the width of the bars carries no meaning.

Histogram: A graph made from quantitative data in which the range of the data is divided into intervals called bins, and bars are constructed above each bin such that the heights of the bars represent the frequency or relative frequency of data in that bin. Unlike a bar chart, the width of the bars is an important characteristic of the graph.

Box and whisker plot: A plot that incorporates the median and the upper and lower quartiles to graphically display the data range. It is also particularly useful for displaying outliers when they are present in the data.

Time series plot: The plot of a specified variable over time.

Cross-sectional data: Data compared at one point in time. Comparisons can be intra-data or with a benchmark data point.

Probability: The mathematical likelihood a particular outcome will occur.

Probability distribution: A scaling of possible event outcomes based upon their likelihood (probability) of occurring and described by a probability function.

Discrete probability distribution: A probability distribution in which the variable can take only certain values in any particular interval (only whole-number values, for example).

Continuous probability distribution: A probability distribution in which the variable can take any value within the range of possible values.

Symmetrical probability distribution: A probability function that has a vertical line of symmetry creating left/right mirror images. The most well-known example is the bell-shaped Normal distribution that is fully described by its mean and standard deviation.

Left-skewed probability distribution: A set of data values in which the mean is generally less than the median. The left tail of the distribution is longer than the right tail of the distribution.

Right-skewed probability distribution: A set of data values in which the mean is generally greater than the median. The right tail of the distribution is longer than the left tail of the distribution.

Central Limit Theorem: The statistical law that states that regardless of the shape of the distribution of the individual values in the population, as the sample size gets larger, the sampling distribution of the mean can be approximated by a normal distribution.
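
The Central Limit Theorem can be illustrated with a small simulation. The sketch below is only an illustration under assumed settings: the population is taken to be exponential with mean 1 (a strongly right-skewed shape), the sample size is 50, and 2000 samples are drawn. None of these choices come from the definition above.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

def mean_of_sample(n):
    # Draw n values from an exponential distribution with mean 1 (right-skewed).
    return statistics.fmean([rng.expovariate(1.0) for _ in range(n)])

sample_size = 50
means = [mean_of_sample(sample_size) for _ in range(2000)]

center = statistics.fmean(means)  # should sit near the population mean of 1
spread = statistics.stdev(means)  # should sit near 1 / sqrt(50), about 0.14
print(round(center, 3), round(spread, 3))
```

Even though the individual values are skewed, the sample means cluster around the population mean, with a spread close to the population standard deviation divided by the square root of the sample size.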

Degree of freedom: The number of independent data values available to estimate the population's standard deviation. The degrees of freedom equal the number of observations in the sample (N) minus the number of parameters to be estimated (K).

Student's t-distribution: A family of curves each of which is a symmetrical bell-shaped distribution that has greater area in the tails than the normal probability distribution. Each distribution will be defined by its degrees of freedom. As the degrees of freedom increase the t-distribution approaches that of the normal distribution. 

Z-value: A statistic generated from a normal probability distribution. It is a standardized value in that it divides the difference between an observation and the mean value by the standard deviation of the observations.
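
A short sketch of the Z-value calculation follows. The five observations are made-up numbers, and the sample standard deviation (dividing by N minus 1) is used here as an assumption; the population standard deviation could be substituted when the whole population is observed.

```python
import statistics

observations = [62, 70, 74, 78, 86]  # hypothetical data

mean = statistics.fmean(observations)
sd = statistics.stdev(observations)  # sample standard deviation (assumption)

# Each z-value is the distance from the mean, measured in standard deviations.
z_values = [(x - mean) / sd for x in observations]

for x, z in zip(observations, z_values):
    print(x, round(z, 3))
```

An observation equal to the mean has a Z-value of zero, and observations above (below) the mean have positive (negative) Z-values.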

Pie chart: A circular graph in which wedge-shaped slices represent proportions of the total.

Pareto chart: A bar chart that displays the count of each item as a number or percentage in descending order from left to right. A Pareto chart typically also includes a line showing the cumulative percentage, which reaches 100% at the last bar.

Frequency: The number or percent occurrence of a particular outcome out of N trials.

Frequency table: A grouping of data into mutually exclusive classes showing the number of observations in each class. Relative frequency classes are derived from a frequency table by computing the percentage of the total observations made up by each class.
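
As an illustration of a frequency table and its relative frequencies, the following sketch counts a made-up list of letter grades with collections.Counter from the Python standard library.

```python
from collections import Counter

grades = ["B", "A", "C", "B", "B", "A", "C", "B"]  # hypothetical observations

frequency = Counter(grades)  # number of observations in each class
n = len(grades)

# Relative frequency: the share of all observations falling in each class.
relative_frequency = {grade: count / n for grade, count in frequency.items()}

for grade in sorted(frequency):
    print(grade, frequency[grade], relative_frequency[grade])
```

The classes are mutually exclusive, so the frequencies sum to N and the relative frequencies sum to 1.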

Joint frequency distribution: A table consisting of paired responses for two variables. 

Scatter diagram: A graph that plots the coordinates from two series of data points. In a typical scatter diagram the X axis (the horizontal axis) represents the units of one variable while the Y axis (the vertical axis) represents the units of the second variable. Scatter diagrams can reveal patterns among data.

Mean: A measure of central tendency. It is computed by summing all data values and dividing by the number of data values summed. In this context the mean (average) is an ex post number. It is computed after-the-fact. If the observations include all the values in a population the average is referred to as a population mean. If the values used in the computation only include those from a sample, the result is referred to as a sample mean.

Expected mean: A measure of central tendency. All data values are weighted by their probability of occurring and then summed. The expected mean is an ex ante calculation (sometimes referred to as a weighted mean where the probabilities are the weights). The expected mean can be from a population or from a sample. Typically it is computed from a sample. The expected mean is also referred to as an expected value.
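
The probability-weighted calculation can be sketched as follows. The three outcomes and their probabilities are hypothetical values chosen for illustration.

```python
# Hypothetical possible rates of return and the probability of each occurring.
outcomes = [-0.10, 0.05, 0.20]
probabilities = [0.2, 0.5, 0.3]

# The probabilities must cover all outcomes, so they should sum to 1.
assert abs(sum(probabilities) - 1.0) < 1e-12

# Weight each outcome by its probability, then sum.
expected_value = sum(p * x for p, x in zip(probabilities, outcomes))
print(round(expected_value, 4))
```

Here the expected value works out to 0.2(-0.10) + 0.5(0.05) + 0.3(0.20) = 0.065.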

Median: A center value that divides the data array into two halves. The median is not affected by extreme observation values in the data set.

Mode: The value in the data set that occurs most frequently. Some data sets may have more than one mode if two different values tie for the most frequently occurring value. For example, a distribution of values may be bi-modal in nature.
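
The following sketch compares the mean, median, and mode on a small made-up data set that contains one extreme value. It uses the statistics module from the Python standard library (multimode requires Python 3.8 or later).

```python
import statistics

data = [2, 3, 3, 4, 5, 5, 100]  # hypothetical data; 100 is an extreme observation

mean = statistics.fmean(data)       # pulled upward by the extreme value
median = statistics.median(data)    # middle value of the sorted data
modes = statistics.multimode(data)  # every value tied for most frequent

print(round(mean, 2), median, modes)
```

The extreme value pulls the mean well above most of the data while the median stays at the center value, and the data set is bi-modal because 3 and 5 each occur twice.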

Population variance: The sum of the squared differences of the data values from the mean value of observations, divided by the number of observations (N). In other words, it is the average squared difference from the mean.

Sample variance: The sum of the squared differences of the data values from the mean value of observations where this sum is divided by the number of observations (N) minus 1. We divide by N – 1 rather than N because dividing by N would, on average, underestimate the population variance; the bias is most noticeable when the number of observations is small.
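
A brief sketch of the two variance calculations on the same made-up data is shown below; the only difference is the divisor.

```python
data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

n = len(data)
mean = sum(data) / n
sum_of_squares = sum((x - mean) ** 2 for x in data)

population_variance = sum_of_squares / n    # treat the data as the whole population
sample_variance = sum_of_squares / (n - 1)  # treat the data as a sample

print(population_variance, round(sample_variance, 4))
```

The sample variance is always the larger of the two, and the gap shrinks as N grows.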

Sample standard deviation: The square root of the sample variance. The sample standard deviation represents the typical distance from the mean to an observation in the data. The sample standard deviation is a measure of risk.

Sample coefficient of variation: The ratio obtained by dividing the sample standard deviation by the sample mean. This calculation is useful for comparing two data sets that have different means and standard deviations. For two independent data sets we typically choose the data set with the lower coefficient of variation, that is, less variation per unit of expected value.
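
As a sketch of how the sample standard deviation and coefficient of variation might be computed, the snippet below compares two made-up series of annual returns. The names investment_a and investment_b are hypothetical.

```python
import statistics

# Hypothetical annual rates of return for two investments.
investment_a = [0.06, 0.08, 0.10]
investment_b = [0.02, 0.12, 0.22]

def coefficient_of_variation(returns):
    # Sample standard deviation divided by the sample mean.
    return statistics.stdev(returns) / statistics.fmean(returns)

cv_a = coefficient_of_variation(investment_a)
cv_b = coefficient_of_variation(investment_b)
print(round(cv_a, 4), round(cv_b, 4))
```

Investment B has the higher mean return but also much more variation per unit of return, so investment A has the lower coefficient of variation.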

Point estimate: A single statistic (number) that is determined from a sample. It is used to estimate the corresponding population parameter.

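Sampling error: The difference between a sample statistic and the corresponding population parameter that occurs due to random chance in which items are selected for the sample.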

Confidence interval: An interval computed from a sample that is expected to contain the population parameter with a given level of confidence.
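
The sketch below computes an approximate 95% confidence interval for a population mean. It assumes a normal critical value (about 1.96), which is reasonable for large samples; for a small sample such as this made-up one, a critical value from the t-distribution with N minus 1 degrees of freedom would be larger and the interval wider.

```python
import math
import statistics

sample = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7, 12.0, 12.1]  # hypothetical

n = len(sample)
point_estimate = statistics.fmean(sample)
standard_error = statistics.stdev(sample) / math.sqrt(n)

# Normal critical value for 95% confidence (an assumption; see the note above).
z = statistics.NormalDist().inv_cdf(0.975)

lower = point_estimate - z * standard_error
upper = point_estimate + z * standard_error
print(round(lower, 3), round(upper, 3))
```

The interval is centered on the point estimate and widens as the sample standard deviation increases or the sample size decreases.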

Null hypothesis: The belief that a population parameter is equal to a specific value. The null can be rejected via statistical inference.

Alternative hypothesis: The claim about a population parameter that the researcher accepts when the test result leads to rejection of the null hypothesis at a pre-specified level of confidence. The null and alternative hypotheses are mutually exclusive states.

Correlation: The strength of linear association between two variables. Correlation is not causality. A causal relationship exists when the independent variable is the underlying contributing determinant of the dependent variable. A causal relationship may be suggested by correlation; however, correlation is not proof that a causal relationship exists.

Correlation coefficient: A numerical measure of the sign and strength of the linear association between two variables. The correlation coefficient will range between -1.00 (negative correlation) and +1.00 (positive correlation).
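
One common version of this measure is the Pearson correlation coefficient, sketched below from its definition on made-up paired data. The function name pearson_r is hypothetical.

```python
import math

def pearson_r(x, y):
    # Sum of cross-products of deviations, scaled by the spread of each variable.
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]  # hypothetical data
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
print(round(r, 4))
```

Data that fall exactly on an upward-sloping line give a coefficient of +1.00, and data on a downward-sloping line give -1.00.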

Linear regression: A statistical method in which a straight line is "fit" to a scatter of point coordinates so as to determine an estimated intercept and slope (the regression coefficients). Once estimated the intercept and slope allow the value of the dependent variable to be obtained from the value of an independent variable. Multiple linear regression uses two or more independent variables to explain a dependent variable.

Regression residual: The difference between an observed value of the dependent variable and the corresponding estimated value from the regression model. Small residuals mean the model leads to more accurate predictions than large residuals. The least squares regression model is one in which the sum of squared residuals is minimized.
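
A minimal sketch of a least squares fit with one independent variable follows. It uses the standard closed-form expressions for the slope and intercept and made-up data; it is an illustration, not a substitute for a statistics package.

```python
x = [1, 2, 3, 4, 5]  # hypothetical independent variable
y = [2, 4, 5, 4, 5]  # hypothetical dependent variable

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least squares slope and intercept for a single independent variable.
slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sum(
    (a - mean_x) ** 2 for a in x
)
intercept = mean_y - slope * mean_x

# Residual: observed value minus the value estimated by the fitted line.
residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
sum_squared_residuals = sum(e ** 2 for e in residuals)

print(slope, intercept, round(sum_squared_residuals, 4))
```

With an intercept in the model the residuals sum to zero, and any other straight line through these points would produce a larger sum of squared residuals.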

Rate of return: The difference between the amount received at the end of the period and the dollar amount invested at the beginning of the period, divided by the amount invested. Stocks can have return both from capital appreciation and from dividends. Bonds can have return both from capital appreciation and from interest payments. Rates of return are typically computed on a yearly basis. Annual rates of return can compound over time. An annualized rate of return can be positive or negative. Rates of return are subject to variation (risk). This risk can be measured by the security's standard deviation in an isolated setting. If we replace an actual ending period dollar amount with an expected ending period dollar amount we have what is termed an expected rate of return.
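
The rate of return calculation can be sketched as below. The dollar amounts and the three annual returns are made up, and the amount received is assumed to already include any dividends or interest.

```python
def rate_of_return(amount_invested, amount_received):
    # Gain or loss over the period as a fraction of the amount invested.
    return (amount_received - amount_invested) / amount_invested

gain = rate_of_return(1000.0, 1080.0)  # ends with more than was invested
loss = rate_of_return(1000.0, 950.0)   # ends with less than was invested

# Compounding: each year's return applies to the prior year's ending value.
value = 1000.0
for annual_return in [0.10, -0.05, 0.08]:
    value *= 1 + annual_return
cumulative = rate_of_return(1000.0, value)

print(gain, loss, round(cumulative, 4))
```

Because returns compound, the cumulative three-year return is not simply the sum of the three annual returns.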

The "market" and return on the market: When we say the "market" in finance we are referring to the overall market for stocks. Since this market contains many stocks it is completely diversified. The only risk associated with the market is non-diversifiable or "market risk." The return on the market (say for one year) is measured using one of the many stock indexes. Perhaps the most popular is the S&P 500 stock index. This index contains 500 stocks from a broad cross-section of U.S. companies. The return on the market is computed as [Ending index value – Beginning index value]/Beginning index value. The market return is important because it is the benchmark against which other individual stock returns are judged.

Diversification: The effect that reduces portfolio risk if the securities making up the portfolio are not perfectly positively correlated. Cross security returns tend to moderate each other over time, thereby reducing portfolio volatility relative to the weighted average volatility of the securities held in isolation. A broad market index will be completely diversified and will demonstrate only non-diversifiable or market risk.
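
The diversification effect can be sketched with the standard two-security portfolio formula, in which portfolio variance depends on each security's weight and standard deviation and on the correlation between the two. The formula is not stated in this review, and the weights, standard deviations, and correlations below are made-up inputs.

```python
import math

def portfolio_sd(w1, sd1, w2, sd2, correlation):
    # Standard deviation of a two-security portfolio.
    variance = (
        (w1 * sd1) ** 2
        + (w2 * sd2) ** 2
        + 2 * w1 * w2 * correlation * sd1 * sd2
    )
    return math.sqrt(variance)

w1, w2 = 0.5, 0.5      # hypothetical portfolio weights
sd1, sd2 = 0.20, 0.30  # hypothetical standard deviations of return

weighted_average_sd = w1 * sd1 + w2 * sd2
no_benefit = portfolio_sd(w1, sd1, w2, sd2, correlation=1.0)
diversified = portfolio_sd(w1, sd1, w2, sd2, correlation=0.2)

print(round(weighted_average_sd, 4), round(no_benefit, 4), round(diversified, 4))
```

With perfect positive correlation the portfolio's standard deviation equals the weighted average of the two securities' standard deviations; with any lower correlation it falls below that average.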

 

Test your knowledge, try this Quiz Me Activity!

 

Match the items.
The task is to match the lettered items with the correct numbered items. Appearing below is a list of lettered items. Following that is a list of numbered items. Each numbered item is followed by a drop-down. Select the letter in the drop down that best matches the numbered item with the lettered alternatives.
a. Data collected from the responses of a group of participants.
b. Data collected using a pre-specified list establishing the guidelines that will be used in assembling the sample from the population.
c. The universe of event numbers under study
d. A plot that incorporates the median and upper and lower quartiles to graphically display the data range. Also particularly useful for displaying outliers when they are present in the data.
e. A set of data values in which the mean is generally greater than the median. The right tail of the distribution is longer than the left tail of the distribution
f. A probability function that has a vertical line of symmetry creating left/right mirror images. The most well-known example is the bell-shaped Normal distribution that is fully described by its mean and standard deviation.
g. A set of data values in which the mean is generally less than the median. The left tail of the distribution is longer than the right tail of the distribution.
h. A numerical measure that describes some property of a sample.
i. Data compared at one point in time. Comparisons can be intra-data or with a benchmark data point.
j. The number of independent data values available to estimate the population's standard deviation.