Properties of operations for calculating quantitative characteristics of random variables. Main characteristics of random variables: dispersion and standard deviation

The purpose of correlation analysis is to estimate the strength of the relationship between random variables (features) that characterize some real process.
Problems of correlation analysis:
a) measuring the degree of connection (closeness, strength, intensity) between two or more phenomena;
b) selecting the factors that have the most significant impact on the resulting attribute, based on measuring the degree of connection between phenomena; factors significant in this respect are then used in regression analysis;
c) detecting unknown causal relationships.

The forms in which interrelations manifest themselves are very diverse. Their most common types are the functional (complete) and the correlation (incomplete) relationship.
A correlation relationship manifests itself on average, over mass observations, when given values of the independent variable correspond to a set of probable values of the dependent variable. A relationship is called functional if each value of the factor attribute corresponds to a well-defined, non-random value of the resultant attribute.
The correlation field serves as a visual representation of the correlation table. It is a graph on which the X values are plotted along the abscissa axis, the Y values along the ordinate axis, and the combinations of X and Y are shown as dots. The presence of a relationship can be judged from the location of the dots.
Closeness indicators make it possible to characterize how the variation of the resulting attribute depends on the variation of the factor attribute.
A better indicator of the degree of closeness of a correlation relationship is the linear correlation coefficient. This indicator takes into account not only the signs of the deviations of individual values of the attribute from the mean, but also the magnitudes of these deviations.

The key issues of this topic are: the regression equation relating the resulting feature to the explanatory variable; the least squares method for estimating the parameters of the regression model; analysis of the quality of the resulting regression equation; and construction of confidence intervals for predicting values of the resulting feature using the regression equation.

Example 2


System of normal equations.
a·n + b·∑x = ∑y
a·∑x + b·∑x² = ∑xy
For our data, the system of equations has the form
30a + 5763b = 21460
5763a + 1200261b = 3800360
From the first equation we express a and substitute it into the second equation.
We get b = -3.46, a = 1379.33.
Regression equation:
y = -3.46 x + 1379.33
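The same system can also be solved numerically. Below is a minimal sketch in Python (assuming NumPy is available); it is an illustration, not part of the original solution:

```python
# Solve the normal equations for the pair (a, b):
#   30a   + 5763b    = 21460
#   5763a + 1200261b = 3800360
import numpy as np

A = np.array([[30.0, 5763.0],
              [5763.0, 1200261.0]])
rhs = np.array([21460.0, 3800360.0])

a, b = np.linalg.solve(A, rhs)
print(f"a = {a:.2f}, b = {b:.2f}")  # a ≈ 1379.33, b ≈ -3.46
```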

2. Calculation of the parameters of the regression equation.
Sample means: x̄ = ∑x / n = 5763 / 30 = 192.1; ȳ = ∑y / n = 21460 / 30 = 715.33.



Sample variances:


Standard deviation:


1.1. Correlation coefficient.
Covariance:

We calculate the indicator of the closeness of the relationship. Such an indicator is the sample linear correlation coefficient, which is calculated by the formula:

r_xy = cov(x, y) / (S_x · S_y)
The linear correlation coefficient takes values from -1 to +1.
Relationships between features can be weak or strong (close). Their strength is evaluated on the Chaddock scale:
0.1 < |r_xy| < 0.3: weak;
0.3 < |r_xy| < 0.5: moderate;
0.5 < |r_xy| < 0.7: noticeable;
0.7 < |r_xy| < 0.9: high;
0.9 < |r_xy| < 1: very high.
In our example, the relationship between feature Y and factor X is high and inverse.
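For illustration, the Chaddock scale can be encoded as a small helper; the function below is a hypothetical sketch, with the thresholds taken from the scale above:

```python
def chaddock(r: float) -> str:
    """Classify the strength of a linear relationship by |r| (Chaddock scale)."""
    s = abs(r)
    if s < 0.1:
        return "practically absent"  # below the scale's first threshold
    if s < 0.3:
        return "weak"
    if s < 0.5:
        return "moderate"
    if s < 0.7:
        return "noticeable"
    if s < 0.9:
        return "high"
    return "very high"

r_xy = -0.74  # the coefficient from this example
print(chaddock(r_xy), "and", "inverse" if r_xy < 0 else "direct")  # high and inverse
```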
In addition, the linear pair correlation coefficient can be determined in terms of the regression coefficient b:

r_xy = b · (S_x / S_y)
1.2. Regression equation (estimation of the regression equation).

The linear regression equation is y = -3.46 x + 1379.33

The coefficient b = -3.46 shows the average change in the effective indicator (in units of y) when the factor x increases or decreases by one unit of its measurement. In this example, when x increases by 1 unit, y decreases by 3.46 on average.
The coefficient a = 1379.33 formally shows the predicted level of y at x = 0, but this is meaningful only if x = 0 is close to the sample values.
If x = 0 is far from the sample values of x, a literal interpretation can lead to incorrect results; even if the regression line describes the observed sample accurately, there is no guarantee that it will do so when extrapolating to the left or to the right.
By substituting the corresponding values of x into the regression equation, one can determine the aligned (predicted) values of the effective indicator y(x) for each observation.
The sign of the regression coefficient b determines the direction of the relationship between y and x (b > 0: direct relationship; otherwise: inverse). In our example, the relationship is inverse.
1.3. Elasticity coefficient.
It is undesirable to use regression coefficients (b in our example) for directly assessing the influence of factors on the effective attribute when the units of measurement of the effective indicator y and of the factor attribute x differ.
For these purposes, elasticity coefficients and beta coefficients are calculated.
The average elasticity coefficient E shows by how many percent, on average, the result y changes from its average value when the factor x changes by 1% of its average value.
The elasticity coefficient is found by the formula:

E = b · (x̄ / ȳ)


The elasticity coefficient is less than 1 in absolute value. Therefore, if X changes by 1%, Y will change by less than 1%. In other words, the influence of X on Y is not substantial.
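As a check, the elasticity can be recomputed from the sums used in the normal equations above (n = 30, ∑x = 5763, ∑y = 21460); this is a sketch under those assumptions:

```python
n, sum_x, sum_y, b = 30, 5763.0, 21460.0, -3.46

x_mean, y_mean = sum_x / n, sum_y / n   # 192.1 and 715.33
E = b * x_mean / y_mean                 # average elasticity E = b * x̄ / ȳ
print(f"E = {E:.2f}")                   # ≈ -0.93, so |E| < 1
```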
The beta coefficient shows by what part of its standard deviation the value of the effective attribute changes on average when the factor attribute changes by one of its standard deviations, with the remaining independent variables fixed at a constant level:

β = b · (S_x / S_y)
That is, an increase of x by one standard deviation S_x leads, on average, to a decrease of Y by 0.74 of its standard deviation S_y.
1.4. Approximation error.
Let us evaluate the quality of the regression equation using the mean absolute approximation error, the average relative deviation of the calculated values from the actual ones:

A = (1/n) · ∑ |(yᵢ - y(xᵢ)) / yᵢ| · 100%


Since the error is less than 15%, the equation can be used as a regression model.
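A sketch of this check in Python; the x and y arrays here are hypothetical placeholders, since the original data table is not reproduced in the text:

```python
import numpy as np

x = np.array([150.0, 180.0, 210.0])  # hypothetical factor values
y = np.array([850.0, 760.0, 640.0])  # hypothetical observed values

y_hat = -3.46 * x + 1379.33          # fitted values from the equation above
A = np.mean(np.abs((y - y_hat) / y)) * 100  # mean approximation error, %
print(f"A = {A:.1f}%")               # usable as a regression model if below ~15%
```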
1.5. Analysis of variance.
The task of analysis of variance is to decompose the variance of the dependent variable:
∑(yᵢ - ȳ)² = ∑(y(xᵢ) - ȳ)² + ∑(yᵢ - y(xᵢ))²
where
∑(yᵢ - ȳ)² is the total sum of squared deviations;
∑(y(xᵢ) - ȳ)² is the sum of squared deviations due to regression ("explained" or "factorial");
∑(yᵢ - y(xᵢ))² is the residual sum of squared deviations.
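The identity TSS = ESS + RSS holds exactly for a least-squares fit with an intercept. A minimal sketch with placeholder data (not the original sample):

```python
import numpy as np

x = np.array([150.0, 180.0, 210.0, 240.0])  # hypothetical factor values
y = np.array([850.0, 770.0, 640.0, 560.0])  # hypothetical observations

b, a = np.polyfit(x, y, 1)             # OLS slope and intercept
y_hat = a + b * x

tss = np.sum((y - y.mean()) ** 2)      # total sum of squares
ess = np.sum((y_hat - y.mean()) ** 2)  # explained ("factorial") sum
rss = np.sum((y - y_hat) ** 2)         # residual sum
print(np.isclose(tss, ess + rss))      # True: the decomposition holds
```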
The theoretical correlation ratio for a linear relationship is equal to the correlation coefficient r_xy.
For any form of dependence, the closeness of the relationship is determined using the multiple correlation coefficient:

R = √(1 - ∑(yᵢ - y(xᵢ))² / ∑(yᵢ - ȳ)²)

This coefficient is universal: it reflects both the closeness of the relationship and the accuracy of the model, and it can be used for any form of relationship between variables. When a one-factor correlation model is constructed, the multiple correlation coefficient equals the pair correlation coefficient r_xy.
1.6. Coefficient of determination.
The square of the (multiple) correlation coefficient is called the coefficient of determination, which shows the proportion of the variation of the resultant attribute explained by the variation of the factor attribute.
Most often, giving an interpretation of the coefficient of determination, it is expressed as a percentage.
R² = r_xy² ≈ 0.5413,
i.e. 54.13% of the variation in y is explained by the variation in x. In other words, the accuracy of the fit of the regression equation is average. The remaining 45.87% of the change in Y is due to factors not taken into account in the model.


The company employs 10 people. Table 2 shows data on their work experience and monthly salary.

Calculate from these data:

  • the value of the sample covariance estimate;
  • the value of the sample Pearson correlation coefficient;
  • the direction and strength of the relationship, based on the values obtained;
  • how legitimate the claim is that this company uses the Japanese management model, which assumes that the longer an employee works for the company, the higher his salary should be.

Based on the correlation field, one can hypothesize (for the general population) that the relationship between all possible values of X and Y is linear.

To calculate the regression parameters, we will build a calculation table.

Sample means.

Sample variances:

The estimated regression equation has the form

y = bx + a + e,

where eᵢ are the observed values (estimates) of the errors εᵢ, and a and b are the estimates of the parameters α and β of the regression model that are to be found.

To estimate the parameters α and β, we use the least squares method.

System of normal equations.

a·n + b·∑x = ∑y
a·∑x + b·∑x² = ∑xy

For our data, the system of equations has the form

(1) 10a + 307b = 33300
(2) 307a + 10857b = 1127700

We multiply equation (1) of the system by (-30.7) and obtain a system that we solve by the method of algebraic addition (elimination).

-307a - 9424.9b = -1022310
307a + 10857b = 1127700

We get:

1432.1b = 105390

From this, b = 73.5912.

Now we find the coefficient "a" from equation (1):

10a + 307b = 33300
10a + 307 · 73.5912 = 33300
10a = 10707.49

We get empirical regression coefficients: b = 73.5912, a = 1070.7492

Regression equation (empirical regression equation):

y = 73.5912 x + 1070.7492

Covariance:
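The covariance can be recovered directly from the sums already used in the normal equations (n = 10, ∑x = 307, ∑y = 33300, ∑xy = 1127700); a sketch under those assumptions:

```python
n, sum_x, sum_y, sum_xy = 10, 307.0, 33300.0, 1127700.0

x_mean, y_mean = sum_x / n, sum_y / n
cov = sum_xy / n - x_mean * y_mean  # sample covariance
print(cov)                          # 10539.0 > 0: the relationship is direct

# With ∑x² = 10857, var(x) = 10857/10 - 30.7² = 143.21,
# and the slope cov / var(x) ≈ 73.59 matches b above.
```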

In our example, the relationship between feature Y and factor X is high and direct.

Therefore, we can safely say that the more time an employee works in a given company, the higher his salary.

4. Testing statistical hypotheses. When solving this kind of problem, the first step is to formulate the hypothesis to be tested and the alternative hypothesis.

Testing the equality of population proportions.

A study was conducted on student performance at two faculties. The results for the variants are shown in Table 3. Can it be argued that both faculties have the same percentage of excellent students?

Simple arithmetic mean:

We test the hypothesis of the equality of the population proportions:

Let us find the observed value of Student's test statistic:

Number of degrees of freedom:

f = nx + ny - 2 = 2 + 2 - 2 = 2

We determine the critical value from the table of critical points of Student's distribution at the significance level α = 0.05 and the given number of degrees of freedom:

t_table(f; α/2) = t_table(2; 0.025) = 4.303, so t_cr = 4.303.

Since t_obs > t_cr, the null hypothesis is rejected: the population proportions of the two samples are not equal.
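For reference, the pooled-proportion statistic can be computed as in the sketch below; the counts are hypothetical placeholders, since Table 3 is not reproduced here:

```python
from math import sqrt

n1, m1 = 40, 12  # hypothetical: faculty 1, total students / excellent students
n2, m2 = 50, 6   # hypothetical: faculty 2

w1, w2 = m1 / n1, m2 / n2
p = (m1 + m2) / (n1 + n2)  # pooled proportion under the null hypothesis
t_obs = (w1 - w2) / sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
print(f"t_obs = {t_obs:.3f}")  # compare with the tabulated critical value
```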

Testing the uniformity of the population distribution.

The university administration wants to find out how the popularity of the Faculty of Humanities has changed over time. The number of applicants to this faculty was analyzed relative to the total number of applicants in the corresponding year (the data are given in Table 4). If we treat the number of applicants as a representative sample of all school graduates of a given year, can it be argued that schoolchildren's interest in the specialties of this faculty does not change over time?

Option 4

Solution: we build a table for calculating the indicators, with columns for the interval midpoint xᵢ, the cumulative frequency S, and the relative frequency fᵢ/n.

To evaluate the distribution series, we find the following indicators:

Weighted average:

x̄ = ∑xᵢ·fᵢ / ∑fᵢ

The range of variation is the difference between the maximum and minimum values of the attribute in the primary series:

R = 2008 - 1988 = 20

Variance characterizes the measure of spread of the values around the mean (a measure of dispersion, i.e. of deviation from the mean).

Standard deviation (mean sampling error):

Each value of the series deviates from the mean of 2002.66 by 6.32 on average.
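The weighted indicators can be computed as in the sketch below; the midpoints and frequencies are placeholders, not the values from Table 4:

```python
import numpy as np

x = np.array([1990.0, 1994.0, 1998.0, 2002.0, 2006.0, 2008.0])  # hypothetical midpoints
f = np.array([2, 3, 5, 8, 6, 4])                                # hypothetical frequencies

mean = np.average(x, weights=f)               # weighted average
var = np.average((x - mean) ** 2, weights=f)  # weighted variance
std = np.sqrt(var)                            # standard deviation
print(f"mean = {mean:.2f}, std = {std:.2f}")
```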

Testing the hypothesis about the uniform distribution of the general population.

To test the hypothesis that X is uniformly distributed, i.e. has density f(x) = 1/(b - a) on the interval (a, b), it is necessary to:

1) Estimate the parameters a and b, the ends of the interval in which the possible values of X were observed, by the formulas (an asterisk denotes a parameter estimate):

a* = x̄ - √3·s, b* = x̄ + √3·s

2) Find the probability density of the assumed distribution: f(x) = 1/(b* - a*).

3) Find the theoretical frequencies:

n₁ = n·P₁ = n·[1/(b* - a*)]·(x₁ - a*)
n₂ = n₃ = ... = n_(s-1) = n·[1/(b* - a*)]·(xᵢ - xᵢ₋₁)
n_s = n·[1/(b* - a*)]·(b* - x_(s-1))

4) Compare the empirical and theoretical frequencies using the Pearson criterion, taking the number of degrees of freedom k = s - 3, where s is the number of initial sampling intervals; if small frequencies, and hence the intervals themselves, have been combined, then s is the number of intervals remaining after the combination.

Let us find the estimates a* and b* of the uniform distribution by these formulas:

Let us find the density of the assumed uniform distribution:

f(x) = 1/(b* - a*) = 1/(2013.62 - 1991.71) = 0.0456

Let us find the theoretical frequencies:

n₁ = n·f(x)·(x₁ - a*) = 0.77 · 0.0456 · (1992 - 1991.71) = 0.0102

nᵢ = n·f(x)·(xᵢ - xᵢ₋₁) for the intermediate intervals

n₅ = n·f(x)·(b* - x₄) = 0.77 · 0.0456 · (2013.62 - 2008) = 0.2

Since the Pearson statistic measures the difference between the empirical and theoretical distributions, the larger its observed value K_obs, the stronger the argument against the main hypothesis.

Therefore, the critical region for this statistic is always right-sided.
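The whole uniformity check can be assembled as below. The interval estimates use x̄ = 2002.66 and s = 6.32 from the text; the frequency lists are hypothetical placeholders, and SciPy is assumed for the critical value:

```python
from math import sqrt
from scipy.stats import chi2

x_mean, s = 2002.66, 6.32
a_star = x_mean - sqrt(3) * s    # ≈ 1991.71
b_star = x_mean + sqrt(3) * s    # ≈ 2013.61
density = 1 / (b_star - a_star)  # ≈ 0.0456, used for the theoretical frequencies

n_emp   = [3.0, 5.0, 8.0, 6.0, 2.0]  # hypothetical empirical frequencies
n_theor = [4.8, 4.8, 4.8, 4.8, 4.8]  # hypothetical theoretical frequencies

chi2_obs = sum((e - t) ** 2 / t for e, t in zip(n_emp, n_theor))
k = len(n_emp) - 3               # degrees of freedom: k = s - 3
chi2_crit = chi2.ppf(0.95, k)    # right-sided critical region at α = 0.05
print(chi2_obs, chi2_crit, chi2_obs > chi2_crit)  # reject uniformity if True
```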

Influence of measurement errors on the value of the correlation coefficient. Suppose we want to estimate the degree of closeness of the correlation between the components of a two-dimensional normal random variable (ξ, η), but we can observe them only with random measurement errors ε_ξ and ε_η (see the dependence scheme D2 in the introduction). The experimental data (xᵢ, yᵢ), i = 1, 2, ..., n, are then effectively sample values of the distorted two-dimensional random variable (ξ′, η′), whose characteristics may differ significantly from those of the original (undistorted) variable: measurement errors always reduce the absolute value of the regression coefficient and also weaken the degree of closeness of the relationship (i.e. reduce the absolute value of the correlation coefficient r).

The method of regression analysis consists in deriving a regression equation (including the estimation of its parameters), with the help of which the average value of a random variable is found when the value of another variable (or of several others, in the case of multiple or multivariate regression) is known. (In contrast, correlation analysis is used to find and express the closeness of the relationship between random variables.)

When studying the correlation of attributes that are not connected by a consistent change in time, each attribute changes under the influence of many causes treated as random. In time series, a change over time is added to these causes for each series. This change leads to so-called autocorrelation, the influence of changes in the levels of earlier observations on subsequent ones. Therefore, the correlation between the levels of time series correctly shows the closeness of the relationship between the phenomena reflected in them only if there is no autocorrelation in either series. In addition, autocorrelation distorts the standard errors of the regression coefficients, which complicates the construction of confidence intervals for the regression coefficients, as well as testing their significance.

The theoretical and sample correlation coefficients, defined by the corresponding relations, can be formally calculated for any two-dimensional system of observations; they are measures of the degree of closeness of the linear statistical relationship between the analyzed features. However, only when the random variables under study are jointly normally distributed does the correlation coefficient r have a clear meaning as a characteristic of the degree of closeness of the relationship between them. In particular, in this case |r| = 1 corresponds to a purely functional linear relationship between the quantities under study, while r = 0 indicates their complete mutual independence. In addition, the correlation coefficient, together with the means and variances of the two random variables, constitutes the five parameters that provide comprehensive information about their joint (bivariate normal) distribution.

Having determined the equation of the theoretical regression line, it is necessary to quantify the closeness of the relationship between the two series of observations. The regression lines drawn in Fig. 4.1, b and c are the same, but in Fig. 4.1, b the points lie much closer to the regression line than in Fig. 4.1, c.

Correlation analysis assumes that the factors and responses are random and obey the normal distribution law.

The closeness of the relationship between random variables is characterized by the correlation ratio ρ_xy. Let us dwell in more detail on the physical meaning of this indicator. To do this, we introduce new concepts.

Residual dispersion characterizes the scatter of the observed points relative to the regression line and serves as an indicator of the error of predicting the parameter y from the regression equation (Fig. 4.6).