To request a blog written on a specific topic, please email James@StatisticsSolutions.com with your suggestion. Thank you!

Thursday, December 17, 2009

Autocorrelation

Autocorrelation occurs when the error term of one observation, such as a particular household or firm, is correlated with the error term of another observation rather than being independent of it. Autocorrelation arises most often in time series analysis.

Statistics Solutions is the country's leader in autocorrelation and dissertation statistics. Contact Statistics Solutions today for a free 30-minute consultation.

Autocorrelation is defined as the correlation that exists between the members of a series of observations that are ordered with respect to time.

Consider two types of data: cross sectional data and time series data. In cross sectional data, if a change in the income of one household affects the consumption expenditure of another household, then autocorrelation is present in the data. Similarly, in time series data, if output is low in one quarter because of a labor strike, and output remains low in the next quarter as well, then autocorrelation is present in the data.

More formally, autocorrelation is the lag correlation of a given series with itself, lagged by a number of time units. Serial correlation, on the other hand, is the lag correlation between two different series in time series data.
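
As a small illustration of lag correlation, here is a minimal sketch (in Python, with made-up data) that computes the correlation of a series with itself at lags of one, two, and three time units:

```python
# A minimal sketch of lag correlation: the autocorrelation of an
# illustrative series with itself at lags 1-3, computed with numpy.
import numpy as np

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=200))   # a sluggish, trending series

def autocorr(x, lag):
    """Correlation between the series and itself shifted by `lag` time units."""
    return np.corrcoef(x[:-lag], x[lag:])[0, 1]

for lag in (1, 2, 3):
    print(f"lag {lag}: {autocorr(series, lag):.3f}")
```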

There are certain patterns that are exhibited by autocorrelation.

Autocorrelation shows up as patterns among the residual errors; for example, the residuals may follow a cyclical pattern or a trend rather than being scattered randomly.

The major reason autocorrelation occurs is the inertia or sluggishness that is present in time series data.

Non-stationarity in time series data also gives rise to autocorrelation. Thus, to make the time series largely free of the problem of autocorrelation, the researcher should first make the data stationary.

The researcher should know that autocorrelation can be positive as well as negative. Economic time series generally exhibit positive autocorrelation because the series moves in sustained upward or downward swings. If the series instead alternates constantly between upward and downward movements, the autocorrelation is negative.

The major consequence of using ordinary least squares (OLS) in the presence of autocorrelation is that the estimators remain unbiased but are no longer efficient, and the estimated standard errors are biased. As a result, hypothesis testing procedures give inaccurate results.

There is a popular test called the Durbin-Watson test that detects the presence of autocorrelation. This test is conducted under the null hypothesis that there is no autocorrelation in the data. A test statistic called ‘d’ is computed, defined as the ratio of the sum of the squared differences between successive residuals (at time i and time i-1) to the sum of the squared residuals: d = Σ(e_i − e_(i−1))² / Σ e_i². If ‘d’ is greater than the upper critical value of the test, there is no evidence of positive autocorrelation. If ‘d’ is less than the lower critical value, positive autocorrelation is present; values between the two critical values are inconclusive.
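
As an illustration, here is a minimal sketch of the Durbin-Watson test in Python using statsmodels; the regression data are simulated with an AR(1) error term purely for demonstration:

```python
# A minimal sketch of the Durbin-Watson test, assuming a toy dataset;
# the variable names and data here are illustrative only.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
# Build an error term with positive autocorrelation (AR(1) structure).
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.7 * e[t - 1] + rng.normal()
y = 1.0 + 2.0 * x + e

# Fit OLS and compute d = sum((e_i - e_{i-1})^2) / sum(e_i^2) on the residuals.
model = sm.OLS(y, sm.add_constant(x)).fit()
d = durbin_watson(model.resid)
print(f"Durbin-Watson d = {d:.3f}")  # values well below 2 suggest positive autocorrelation
```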

If one detects autocorrelation in the data, then the first thing a researcher should do is try to find out whether the autocorrelation is pure or the result of model misspecification. If it is pure autocorrelation, then one can apply an appropriate transformation to the original model so that the transformed model is free from the problem.

Wednesday, December 16, 2009

Canonical Correlation

A canonical correlation is the correlation between two canonical, or latent, variates formed from two sets of observed variables. In canonical correlation, one set of variables is treated as the independent set and the other as the dependent set. It is important for the researcher to know that, unlike regression analysis, canonical correlation can examine the relationship between many dependent and many independent variables at the same time. A statistic called Wilks’ Lambda is used for testing the significance of the canonical correlation. The purpose of canonical correlation is the same as that of simple correlation: in both, the point is to report the proportion of variance in the dependent variate that is explained by the independent variate. So, canonical correlation is defined as the tool that measures the degree of the relationship between the two variates.

Statistics Solutions is the country's leader in canonical correlation and dissertation statistics. Contact Statistics Solutions today for a free 30-minute consultation.

Canonical correlation is considered a member of the multivariate general linear model family, and therefore the assumptions of multiple regression apply to canonical correlation as well.

There are concepts and terms associated with canonical correlation. These concepts and terms will help a researcher better understand canonical correlation. They are as follows:

1. Canonical variable or variate: A canonical variable in canonical correlation is defined as a linear combination of a set of original variables. These canonical variables are a form of latent variable.

2. Eigenvalues: The eigenvalues in canonical correlation are approximately equal to the squares of the canonical correlations. Each eigenvalue reflects the proportion of variance in the canonical variate that is explained by the canonical correlation relating the two sets of variables.

3. Canonical weight: The other name for the canonical weight is the canonical coefficient. Canonical weights should first be standardized; they are then used to assess the relative importance of each individual variable’s contribution.

4. Canonical communality coefficient: This coefficient in canonical correlation is defined as the sum of the squared structure coefficients for the given type of variable.

5. Redundancy coefficient, d: This coefficient in canonical correlation basically measures the percent of the variance of the original variables of one set that is predicted from the other set through canonical variables.

6. Likelihood ratio test: This significance test in canonical correlation is used to test the significance of all sources of linear relationship between the two canonical variates.

There are certain assumptions that are made by the researcher for conducting canonical correlation. They are as follows:

1. It is assumed that interval-level data are used to carry out canonical correlation.

2. It is assumed in canonical correlation that the relationships should be linear in nature.

3. It is assumed that there should be low multicollinearity in the data while performing canonical correlation. If the two sets of data are highly inter-correlated, then the coefficient of the canonical correlation is unstable.

4. There should be unrestricted variance in canonical correlation. If the variance is restricted, the canonical correlation may become unstable.

Many researchers are unsure how canonical correlation is computed in SPSS. In SPSS, canonical correlation is obtained while running MANOVA, with one set of variables specified as the dependent variables and the other set as the covariates.
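
Outside SPSS, canonical correlations can also be obtained in Python. The following is a minimal sketch using scikit-learn's CCA on two made-up sets of variables:

```python
# A minimal sketch of canonical correlation using scikit-learn's CCA;
# the two data matrices X and Y below are illustrative only.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 3))                                        # first set of variables
Y = X @ rng.normal(size=(3, 2)) + 0.5 * rng.normal(size=(n, 2))    # second, related set

cca = CCA(n_components=2)
X_c, Y_c = cca.fit_transform(X, Y)

# The canonical correlations are the correlations between paired canonical variates.
for k in range(2):
    r = np.corrcoef(X_c[:, k], Y_c[:, k])[0, 1]
    print(f"Canonical correlation {k + 1}: {r:.3f}")
```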

Tuesday, December 15, 2009

Chi Square test

Parametric tests are those tests that involve assumptions about population parameters, whereas the chi square test is a non-parametric test that does not require such assumptions.

Statistics Solutions is the country's leader in chi square tests and dissertation statistics. Contact Statistics Solutions today for a free 30-minute consultation.

There are several varieties of chi square tests used by researchers, such as the chi square test of independence (based on cross tabulation), the chi square test for goodness of fit, and the likelihood ratio chi square test.

The task of the chi square test is to test the statistical significance of the observed relationship with respect to the expected relationship. The chi square statistic is used by the researcher for determining whether or not a relationship exists.

In the chi square test, the null hypothesis assumes that there is no association between the two variables observed in the study. The chi square test is calculated by evaluating the cell frequencies, comparing the observed frequencies with the frequencies that would be expected if there were no association between the variables. The expected frequency for a cell is computed as the product of its row total and column total divided by the total sample size.

The chi square statistic is calculated as the sum, over all cells, of the squared deviation between the observed and the expected frequency divided by the expected frequency: χ² = Σ (O − E)² / E.

The researcher should know that the greater the difference between the observed and expected cell frequency, the larger the value of the chi square statistic in the chi square test.

In order to conclude that an association between the two variables exists, the computed value of the chi square statistic from the cross tabulation must be larger than the critical value at the chosen level of significance (equivalently, the p-value must be smaller than the significance level).
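
As an illustration, here is a minimal sketch of the chi square test of independence in Python using scipy, applied to a hypothetical cross tabulation (the counts are made up):

```python
# A minimal sketch of the chi square test of independence on a
# hypothetical 2x3 cross tabulation.
from scipy.stats import chi2_contingency

observed = [[30, 14, 6],
            [20, 26, 24]]

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.3f}, df = {dof}, p = {p:.4f}")
print("expected frequencies:", expected)
# If p is smaller than the chosen significance level (e.g., 0.05),
# the null hypothesis of no association is rejected.
```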

There is one more popular test called the chi square test for goodness of fit.

This type of chi square test, the chi square test for goodness of fit, helps the researcher understand whether a sample drawn from a certain population follows a specified distribution and therefore actually belongs to that distribution. This type of chi square test is typically applied to discrete distributions, like the Poisson and the binomial. It serves as an alternative to the non parametric Kolmogorov-Smirnov goodness-of-fit test.

The null hypothesis assumed by the researcher in this type of chi square test is that the data drawn from the population follow the specified distribution. The chi square statistic in this test is defined in the same manner as in the test of independence above. One important point for the researcher to note is that the expected frequency in each category should be at least five; the chi square test is not valid when an expected cell frequency is less than five.
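
As an illustration, here is a minimal sketch of the chi square test for goodness of fit in Python using scipy, assuming hypothetical counts tested against a fair six-sided die (every expected frequency is above five):

```python
# A minimal sketch of the chi square goodness-of-fit test; the observed
# counts are made up and tested against a uniform (fair die) distribution.
from scipy.stats import chisquare

observed = [18, 22, 16, 14, 12, 18]      # total n = 100
expected = [100 / 6] * 6                 # each expected count is about 16.7 (> 5)

stat, p = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {stat:.3f}, p = {p:.4f}")
# A large p-value means the data are consistent with the specified distribution.
```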

There are certain assumptions in the chi square test.

The random sampling of data is assumed in the chi square test.

In the chi square test, a sufficiently large sample size is assumed. If the chi square test is conducted on a small sample, it will yield inaccurate inferences, and the researcher might end up committing a Type II error.

In the chi square test, the observations are always assumed to be independent of each other.

In the chi square test, the observations must have the same fundamental distribution.

Tuesday, November 24, 2009

Analysis Of Variance (ANOVA)

The question that one usually asks about Analysis of Variance (ANOVA) is about the definition of Analysis of Variance (ANOVA). Analysis of Variance (ANOVA) is defined as the process of examining the differences among the means for two or more populations. The next question that arises in the researcher’s mind is what null hypothesis is assumed in the Analysis of Variance (ANOVA). The answer is that the null hypothesis is assumed as the following: “there exists no significant difference in the means of all the populations that are being examined in the Analysis of Variance (ANOVA).”

Statistics Solutions is the country's leader in Analysis of Variance (ANOVA) and dissertation statistics. Contact Statistics Solutions today for a free 30-minute consultation.

The type of variable on which Analysis of Variance (ANOVA) is applicable is also an important issue. Analysis of Variance (ANOVA) is applicable in cases involving an interval or ratio dependent variable and one or more categorical independent variables. The researcher should also note that the categorical variables are considered the factors in Analysis of Variance (ANOVA). The combinations of the factor levels, or categories, in Analysis of Variance (ANOVA) are generally termed the treatments.

The Analysis of Variance (ANOVA) technique that consists of only one categorical independent variable, in other words a single factor, is called one way Analysis of Variance (ANOVA). On the other hand, if the Analysis of Variance (ANOVA) technique involves two or more factors (categorical independent variables), then it is called n way Analysis of Variance (ANOVA), where the term ‘n’ refers to the number of factors.

Like regression analysis, Analysis of Variance (ANOVA) requires the calculation of several sums of squares for evaluating the test statistic used in testing the null and alternative hypotheses. One difference between Analysis of Variance (ANOVA) and regression analysis is that Analysis of Variance (ANOVA) uses the separate and combined means and variances of the samples when evaluating the sums of squares.

Researchers often ask what test statistic is used for testing the significance of the difference. The test statistic used in Analysis of Variance (ANOVA) is the F statistic, defined as the ratio of the between-group variance to the within-group variance. The task of the F test in Analysis of Variance (ANOVA) is to test the significance of the variability of the components existing in the study.
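
As an illustration, here is a minimal sketch of a one way Analysis of Variance (ANOVA) in Python using scipy; the three groups below are made up for demonstration:

```python
# A minimal sketch of one way ANOVA on three illustrative groups.
from scipy.stats import f_oneway

group_a = [23, 25, 21, 27, 24]
group_b = [30, 28, 33, 29, 31]
group_c = [26, 24, 27, 25, 28]

f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")
# F is the ratio of between-group variability to within-group variability;
# a small p-value leads to rejecting the null hypothesis of equal means.
```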

The most important question is about the assumptions in Analysis of Variance (ANOVA).

The first assumption of Analysis of Variance (ANOVA) is that each sample has been drawn from the population by the process of random sampling.

The second assumption of Analysis of Variance (ANOVA) is that the population from which each sample is randomly drawn follows a normal distribution. In other words, it is assumed in Analysis of Variance (ANOVA) that the error term is normally distributed with mean zero and variance σ²e.

The third assumption of Analysis of Variance (ANOVA) is that there is homogeneity within the variances of the populations from which the sample has been drawn.

The fourth assumption of Analysis of Variance (ANOVA) is that the population of random effects (A) is normally distributed with mean ‘0’ and variance σ²a.
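
As an illustration, here is a minimal sketch of checking the normality and homogeneity-of-variance assumptions in Python using scipy's Shapiro-Wilk and Levene tests; the three groups are the same made-up data used in the earlier F test sketch:

```python
# A minimal sketch of assumption checks before ANOVA on made-up groups.
from scipy.stats import shapiro, levene

group_a = [23, 25, 21, 27, 24]
group_b = [30, 28, 33, 29, 31]
group_c = [26, 24, 27, 25, 28]

for name, g in (("A", group_a), ("B", group_b), ("C", group_c)):
    stat, p = shapiro(g)
    print(f"group {name}: Shapiro-Wilk p = {p:.3f}")   # small p suggests non-normality

stat, p = levene(group_a, group_b, group_c)
print(f"Levene p = {p:.3f}")  # small p suggests unequal variances
```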

Thursday, November 19, 2009

Validity

Validity refers to the state in which the researcher or investigator can be assured that the inferences drawn from the data are error free or accurate. If the measurements made on the sample are valid, then valid inferences can be drawn about the population from which that sample has been drawn.

Statistics Solutions is the country's leader in validity and dissertation statistics. Contact Statistics Solutions today for a free 30-minute consultation.

There are basically four major types of validity: Internal Validity, External Validity, Statistical Conclusion Validity and Construct Validity.

Internal Validity is the type of validity that concerns the causal relationship between the variables. Internal Validity signifies whether the relationship between the dependent and the independent variable is truly causal, that is, whether the factors identified are really the ones affecting the dependent variable. This type of validity is most relevant in experimental designs where the treatments are randomly assigned.

External Validity refers to the extent to which a causal relationship between cause and effect can be generalized or transferred to different people, different treatment variables, and different measurement variables.

Statistical conclusion validity refers to the type of validity in which the researcher is interested in the inference about the degree of association between two variables. For instance, in a study of the association between two variables, the researcher reaches statistical conclusion validity only if he has performed statistical significance tests on the hypotheses he predicted. This type of validity is violated when the researcher commits either of two types of errors, namely Type I error and Type II error.

A Type I error violates this type of validity because in this type of error, the researcher rejects a null hypothesis that was in fact true.

A Type II error violates this type of validity because in this type of error, the researcher accepts a null hypothesis that was in fact false.

Construct Validity refers to the type of validity in which the construct of the test is involved in predicting the relationship for the dependent variable. For example, construct validity can be assessed with the help of Cronbach’s alpha. For Cronbach’s alpha, a value of 0.80 or above is generally considered good, and a value of about 0.70 is considered adequate. So, if the construct satisfies such conditions, then validity holds; otherwise, it does not.
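
As an illustration, here is a minimal sketch that computes Cronbach's alpha directly from the standard formula for a hypothetical four-item scale (the item scores are made up):

```python
# A minimal sketch of Cronbach's alpha for a hypothetical 4-item scale;
# the function is a plain implementation of the standard formula.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: 2-D array with one row per respondent, one column per item."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

scores = np.array([
    [4, 5, 4, 4],
    [3, 3, 4, 3],
    [5, 5, 5, 4],
    [2, 3, 2, 3],
    [4, 4, 5, 4],
])
print(f"alpha = {cronbach_alpha(scores):.2f}")  # 0.80+ good, about 0.70 adequate
```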

Convergent/divergent validation and factor analysis are also used to test this type of validity.

There is a strong relationship between validity and reliability. A test that is not reliable cannot be valid: reliability is a necessary property of the test, but it is not a sufficient condition for validity.

Thus, validity plays a significant role in making accurate inferences about the data.

There are certain things that act as a threat to validity. These are as follows:

If the researcher collects insufficient data, valid inferences are not feasible because insufficient data will not represent the population as a whole.

If the researcher measures the sample with too few measurement variables, then he also cannot achieve validity for that sample.

If the researcher selects the wrong type of sample, then he too cannot achieve validity in the inference about the population.

If the researcher selects an inaccurate measurement method during analysis, then the researcher would not be able to achieve validity.

Tuesday, November 17, 2009

Kaplan-Meier survival analysis (KMSA)

Kaplan-Meier survival analysis (KMSA) is a method that involves generating tables and plots of the survival or hazard function for event history data. Kaplan-Meier survival analysis (KMSA) does not determine the effect of covariates on either function. Kaplan-Meier survival analysis (KMSA) is a descriptive method for time-to-event data, where time is considered the most prominent variable.

Statistics Solutions is the country's leader in Kaplan-Meier survival analysis (KMSA) and dissertation statistics. Contact Statistics Solutions today for a free 30-minute consultation.

Kaplan-Meier survival analysis (KMSA) consists of certain terms that are very important to know and understand, as these terms form the basis of a strong understanding of Kaplan-Meier survival analysis (KMSA).

The censored cases in Kaplan-Meier survival analysis (KMSA) are those cases in which the event of interest has not yet occurred. The event is the variable of interest for the researcher, and Kaplan-Meier survival analysis (KMSA) can efficiently compute the survival function even when some cases are censored.

The time variable is treated as continuous in Kaplan-Meier survival analysis (KMSA). However, the researcher should note that the starting point from which time is measured must be clearly defined.

There is a variable called the status variable in Kaplan-Meier survival analysis (KMSA). This variable defines the terminal event; it is a categorical type of variable that indicates, for each case, whether the event occurred or the case was censored.

There is a variable called the stratification variable in Kaplan-Meier survival analysis (KMSA). As the name suggests, the stratification variable should be a categorical type of variable; it represents a grouping effect. In the medical field, the stratification variable in Kaplan-Meier survival analysis (KMSA) could be the type of cancer, like lung cancer, blood cancer, etc.

The researcher should note that Kaplan-Meier survival analysis (KMSA) gives misleading results when covariates other than time play a prominent role in determining the outcome.

There is a variable called the factor variable in Kaplan-Meier survival analysis (KMSA). The factor variable should be of a categorical type. This variable is used by the researcher to indicate the causal factor for a particular outcome. For example, continuing with the previous example, the treatment applied to reduce the effect of the cancer would be the factor variable in Kaplan-Meier survival analysis (KMSA).

The factor variable in Kaplan-Meier survival analysis (KMSA) is the main grouping variable, whereas the stratification variable is the sub grouping variable in Kaplan-Meier survival analysis (KMSA).

Kaplan-Meier survival analysis (KMSA) can be carried out by the researcher with the help of SPSS software.

The log rank test in Kaplan-Meier survival analysis (KMSA), provided in SPSS, allows the investigator to examine whether or not the survival functions of the groups are equivalent to each other by comparing them across the observed time points.
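
As an illustration, here is a minimal sketch of Kaplan-Meier survival analysis and a log rank test in Python using the lifelines package, with made-up durations, event indicators, and groups:

```python
# A minimal sketch of Kaplan-Meier survival analysis with lifelines;
# the durations, event flags, and group labels are illustrative only.
import pandas as pd
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

data = pd.DataFrame({
    "time":  [5, 8, 12, 3, 9, 15, 7, 11, 4, 10],
    "event": [1, 1, 0, 1, 0, 1, 1, 0, 1, 1],   # 1 = event occurred, 0 = censored
    "group": ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"],
})

kmf = KaplanMeierFitter()
for name, grp in data.groupby("group"):
    kmf.fit(grp["time"], event_observed=grp["event"], label=name)
    print(kmf.survival_function_)

# Log rank test comparing the two survival functions.
a, b = data[data.group == "A"], data[data.group == "B"]
result = logrank_test(a["time"], b["time"],
                      event_observed_A=a["event"], event_observed_B=b["event"])
print(f"log-rank p = {result.p_value:.4f}")
```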

There are certain assumptions that are made in Kaplan-Meier survival analysis (KMSA). For one, it is assumed in Kaplan-Meier survival analysis (KMSA) that the occurrence of the event depends only upon time, since survival is assumed to be a function of time alone. This implies that in Kaplan-Meier survival analysis (KMSA), the censored and the uncensored cases are assumed to behave in the same manner.

Monday, November 16, 2009

Hierarchical Linear Modeling

Suppose that a researcher wants to conduct Hierarchical Linear Modeling on educational data. Hierarchical linear modeling is a kind of regression technique that is designed to take the hierarchical structure of educational data into account.

Statistics Solutions is the country's leader in hierarchical linear modeling and dissertation statistics. Contact Statistics Solutions today for a free 30-minute consultation.

Hierarchical Linear Modeling is generally used to examine the relationship between a dependent variable (like test scores) and one or more independent variables (like a student’s background, his previous academic record, etc.).

Educational data typically violate the assumption of classical regression theory that the observations of any one individual are not systematically related to the observations of any other individual. Applying classical regression to such data therefore yields biased estimates, which is why Hierarchical Linear Modeling is used instead.

Hierarchical Linear Modeling is also called the method of multi level modeling. Hierarchical Linear Modeling allows the researcher working on educational data to systematically ask questions about how policies can affect a student’s test scores.

The advantage of Hierarchical Linear Modeling is that it allows the researcher to directly examine the effects of policy-relevant variables (like class size, or the introduction of a particular reform, etc.) on student test scores.

Hierarchical Linear Modeling is conducted by the researcher in two steps.

In the first step of Hierarchical Linear Modeling, the researcher must conduct the analyses individually for every school (in the case of educational data) or some other unit in the system.

The first step of Hierarchical Linear Modeling can be very well explained with the help of the following example. In the first step of Hierarchical Linear Modeling, the student’s academic scores in science are regressed on a set of student level predictor variables like a student’s background and a binary variable representing the student’s sex.

In the first step of Hierarchical Linear Modeling, the equation would be expressed mathematically as the following:

(Science)ij = β0j + β1j(SBG)ij + β2j(Male)ij + eij

In this first step of Hierarchical Linear Modeling, β0j signifies the level of performance for each school under consideration after controlling for SBG (student’s background) and sex. β1j and β2j indicate the extent to which inequalities exist among the students with respect to the two predictor variables taken under consideration.

In the second step of Hierarchical Linear Modeling, the regression parameters that are obtained from the first step of Hierarchical Linear Modeling become the outcome variables of interest.

The second step of Hierarchical Linear Modeling can be explained with the help of the following example. In the second step, the school-level regression parameters become the outcome variables, and the coefficients estimate the magnitude of the effect of each policy variable. In the second step of Hierarchical Linear Modeling, β0j is given by the following formula:

β0j = Y00 + Y01(class size)j + Y02(Discipline)j + U0j

In the second step of Hierarchical Linear Modeling, Y01 indicates the expected gain (or loss) in the science test score associated with a reduction in class size, and Y02 signifies the effect of the discipline policy implemented in the school.

According to Goldstein (1995) and Raudenbush and Bryk (1986), the statistical and computing techniques of Hierarchical Linear Modeling combine the multiple levels into a single model in which the regression analyses described in the two steps above are performed together. Hierarchical Linear Modeling estimates the parameters specified in the model with the help of iterative procedures.
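
As a rough illustration, here is a minimal sketch of fitting such a multi level model in a single step with statsmodels' MixedLM, using a random intercept for each school. The variables (science, sbg, male, class_size, school) and the simulated data are hypothetical, and the random-intercept model is a simplified stand-in for the full model described above:

```python
# A minimal sketch of a two-level model with a random intercept per school;
# all data and variable names below are made up for demonstration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n_schools, n_students = 20, 30
rows = []
for school in range(n_schools):
    school_effect = rng.normal(scale=2.0)          # school-level deviation
    class_size = rng.integers(20, 40)              # school-level predictor
    for _ in range(n_students):
        sbg = rng.normal()                         # student background
        male = rng.integers(0, 2)                  # binary sex indicator
        science = (50 + 3 * sbg - 1 * male - 0.2 * class_size
                   + school_effect + rng.normal(scale=5))
        rows.append((school, science, sbg, male, class_size))
df = pd.DataFrame(rows, columns=["school", "science", "sbg", "male", "class_size"])

# Student-level and school-level predictors, with a random intercept for school.
model = smf.mixedlm("science ~ sbg + male + class_size", df, groups=df["school"])
result = model.fit()
print(result.summary())
```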