Statistics Paper by Students of the Gezhi Ansheng American Curriculum Class
  

Project leader and editor

Oscar Wang 王沈昱

 

Participants

Kayla Chen 陈方仪

 

Advisor:

Jack Xu  徐明

 

Contents

1.      Introduction

2.      Description of Sampling Method

3.      Summary of Data

4.      Hypothesis Testing

5.      Conclusion

6.      Reference List

 

1.      Introduction

Concerned about the state of education in the mountain regions of Guizhou Province, southwest China, we decided to carry out this statistics project. We collect sample data in a regional school to find out whether a student’s family circumstances are associated with his or her view of the importance of education. The education department claims that the wealthier a student’s family is, the more important he or she judges education to be, so we want to assess whether students in this school follow this claim. In addition, families in mountain regions generally prefer boys to girls, and this study also examines how that preference may affect the difference between boys and girls. We organize the data in two separate two-way tables and then assess the association by conducting a chi-square test of independence between “family circumstances” and “importance of education” for boys and for girls.

The project is reported in five parts.

First, we introduce the sampling method we used, simple random sampling (SRS), and briefly describe its strengths and weaknesses. We then describe how we chose the samples from the given population and identify the main problems we met when collecting data.

Secondly, we design a questionnaire, collect data from a number of students in the given population by carrying out simple random sampling twice (once for boys and once for girls), and fill the collected data into the two-way tables.

Thirdly, we describe and analyze the two-way tables, present the results for both boys and girls, and compare the two groups.

Fourthly, we perform a suitable hypothesis test at the 5% significance level. We introduce the theory of hypothesis testing and define its key terms, carry out the chi-square test of independence, and state the answer the test gives.

Finally, we draw a conclusion from what we have done and answer our objectives on the basis of the results.

2.      Description of Sampling Method

This project surveys the relationship between “family circumstances” and “importance of education” using the simple random sampling method. The final sample contains 318 students: 152 boys and 166 girls. Below we describe how the data were collected and then discuss the problems we met.

Let us introduce the simple random sampling method first.

A simple random sample is a subset of individuals (a sample) chosen from a larger set (a population). Each individual is chosen randomly and entirely by chance, such that each individual has the same probability of being chosen at any stage during the sampling process, and each subset of k individuals has the same probability of being chosen for the sample as any other subset of k individuals (Yates, Moore & Starnes, 2008). In this project, we treat boys and girls as two separate populations. Within each population, we randomly select 180 students with the following steps.

The school told us that it has 1640 students in total: 962 boys and 678 girls. We use a random digit table (RDT) to choose the random sample within each population. First, we label the 962 boys with the numbers 001 to 962.

Second, we read the random digit table in groups of three digits.

Next, we keep each number between 001 and 962 and skip any number that has already appeared.

Finally, we take the boys corresponding to the first 180 selected numbers as the first group and survey them.

After that, we randomly select 180 girls using the same method, labelling them 001 to 678, and the students to be surveyed are determined.
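The selection steps above can be sketched in code. This is only an illustration: it swaps the printed random digit table for Python's pseudo-random generator, and the function name `draw_srs` and the seed values are our own assumptions, not part of the original survey.

```python
import random

def draw_srs(population_size, sample_size, seed=None):
    """Draw a simple random sample of labels 1..population_size without replacement."""
    rng = random.Random(seed)          # seed is a hypothetical value, used only for repeatability
    labels = range(1, population_size + 1)
    # random.sample never repeats a label, which mirrors "skip the repeated number"
    return sorted(rng.sample(labels, sample_size))

# Population sizes and sample size are the ones stated in the text
boys_sample = draw_srs(962, 180, seed=1)
girls_sample = draw_srs(678, 180, seed=2)

print(len(boys_sample), len(girls_sample))
```

Each printed label would then be matched to a student on the school list, as in the last step of the manual procedure.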

However, there are also some flaws, as follows. First, the sampling method inevitably complicates the survey process. Since it is very hard to find individual students in the school, we decided to announce the survey through the school broadcast, but gathering the students was still a big challenge. We carried out the survey at noon. Most of the girls (166 of 180) responded to the announcement, but finding the boys was very difficult: in the end we found only 156 of the 180 boys.

Second, we used a written questionnaire to collect the students’ answers. However, four of the selected students, from grade 1 and grade 2, were too young to take the survey and could not even understand the questions, so we excluded these 4 students, all of whom were boys, from the sample. Finally, 152 boys and 166 girls are in the sample.

Moreover, all data were collected anonymously, so the answers cannot be checked, and this may cause some problems. For example, a student who does not have a good family background may nevertheless choose “above average” when he meets the question. This leads to response bias in the data collection: students may be more inclined to choose “above average” from the choices. Therefore, the sample data, and the conclusion we draw from them, may be slightly different from what they should be.

3.      Summary of Data

Here are the two questions we designed for the questionnaire.

Sex ___________

(You can only choose ONE answer in both the questions below)

 

1. Your family circumstance can be described as (    )

A. Very bad   B. Bad   C. Above average  

 

2. In your opinion, how much influence does your education level have on your future life? (    )

A. No influence   B. A little influence   C. Strong Influence  

 

 

 

 

 

 

 

 


Two-way Table:

Importance of                   Family Circumstances
education             Y1            Y2            ……            Yc
X1                    N11           N12           ……            N1c
X2                    N21           N22           ……            N2c
……                    ……            ……            ……            ……
Xr                    Nr1           Nr2           ……            Nrc

$$\sum_{i=1}^{r} \sum_{j=1}^{c} N_{ij} = n$$

 

Table of data for boys/girls

Importance of                   Family Circumstances
education             Very Bad      Bad           Above Average
No influence
A little influence
Strong influence


Now we fill the collected data into the blank tables.

Table of data for boys

Importance of                   Family Circumstances
education             Very Bad      Bad           Above Average
No influence          10            8             12
A little influence    18            8             30
Strong influence      28            10            28

 

Table of data for girls

Importance of                   Family Circumstances
education             Very Bad      Bad           Above Average
No influence          6             8             8
A little influence    16            20            40
Strong influence      18            10            40

             

We will use the chi-square test of independence; the process is described in the next section.

To prepare for the test, we first work out the row totals and the column totals of the two tables. These totals are needed to calculate the expected counts in the testing process.

 

Table of data for boys

Importance of                   Family Circumstances
education             Very Bad      Bad           Above Average   Row Total
No influence          10            8             12              30
A little influence    18            8             30              56
Strong influence      28            10            28              66
Column Total          56            26            70              152

 

Table of data for girls

Importance of                   Family Circumstances
education             Very Bad      Bad           Above Average   Row Total
No influence          6             8             8               22
A little influence    16            20            40              76
Strong influence      18            10            40              68
Column Total          40            38            88              166
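The row and column totals can be checked with a few lines of code. This is a minimal sketch; the helper name `margins` and the list layout (rows are the three answers about importance, columns are the three family circumstances) are our own choices for illustration.

```python
# Observed counts from the two tables; rows: No / A little / Strong influence,
# columns: Very Bad / Bad / Above Average
boys = [[10, 8, 12], [18, 8, 30], [28, 10, 28]]
girls = [[6, 8, 8], [16, 20, 40], [18, 10, 40]]

def margins(table):
    """Return (row totals, column totals, grand total) of a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]  # zip(*table) walks the columns
    return row_totals, col_totals, sum(row_totals)

print(margins(boys))
print(margins(girls))
```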

 

 

4.      Hypothesis Testing

In this section, we first define the idea of a hypothesis. Then we perform a suitable hypothesis test at the 5% significance level for boys and for girls.

A hypothesis is “a statement about the values of the parameters of a probability distribution” (Montgomery, 2009, p. 112). A statistical hypothesis test is a method of making decisions using data (Fisher, 1925). It is important to state the null hypothesis and the alternative hypothesis precisely.

In hypothesis testing, the hypothesis that we want to test is known as the null hypothesis, H0, and the alternative hypothesis, Ha, is the statement that the null hypothesis is wrong (Downing & Clark, 2003). In a test of independence, we are interested in the relationship between two categorical variables and ask whether the two variables are independent of each other.

Null Hypothesis:

H0: A student’s family circumstances and his (for girls, her) view of the importance of education are independent.

Alternative Hypothesis:

Ha: A student’s family circumstances are associated with his (for girls, her) view of the importance of education.

To carry out the test, we take a random sample from the population and compute a test statistic. A test statistic is “considered as a numerical summary of a set of data that reduces the data to one or a small number of values” (Berger and Casella, 2001). The critical region of the test is the set of values of the test statistic for which we reject H0, and the boundary of that region is known as the critical value (Downing & Clark, 2003). We then find the value of the test statistic χ² and calculate the p-value to compare it with the significance level.

We first carry out the test for boys.

State the hypotheses:

H0: A boy’s family circumstances and his view of the importance of education are independent.

Ha: A boy’s family circumstances are associated with his view of the importance of education.

Check the assumptions:

1.      An SRS is chosen from the population: the SRS procedure is described in detail in Section 2.

2.      Large sample size: all expected counts are greater than 5.

 

In a chi-square test of independence, we need the Expected Counts (EC): the counts we would expect in each cell of the table if H0 were true. If every expected count is greater than 5, the χ² test can be used.

$$EC = \frac{\text{Row total} \times \text{Column total}}{n}$$

 

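Table of expected counts for boys

(rounded to three decimal places)

Importance of                   Family Circumstances
education             Very Bad      Bad           Above Average
No influence          11.053        5.132         13.816
A little influence    20.632        9.579         25.789
Strong influence      24.316        11.289        30.395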

 

 

So all the expected counts are larger than 5, and the chi-square test of independence is suitable.
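The expected counts for boys can be reproduced directly from the EC formula. The sketch below uses the boys' observed counts from Section 3; the function name `expected_counts` is ours, chosen only for illustration.

```python
# Observed counts for boys; rows: No / A little / Strong influence,
# columns: Very Bad / Bad / Above Average
boys = [[10, 8, 12], [18, 8, 30], [28, 10, 28]]

def expected_counts(table):
    """EC = (row total * column total) / n for every cell of a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

boys_expected = expected_counts(boys)
for row in boys_expected:
    print([round(value, 3) for value in row])

# Large-sample condition: every expected count must exceed 5
smallest = min(min(row) for row in boys_expected)
print("smallest expected count:", round(smallest, 3))
```

The smallest expected count is the cell for “No influence” and “Bad”, which is only just above 5, so the condition is met with little room to spare.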

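Then, we calculate the test statistic,

$$\chi^2 = \sum \frac{(OC - EC)^2}{EC}$$

OC means observed counts, which are the data collected in the sample (10, 18, 12…), and EC has been calculated. The sum runs over all nine cells of the table.

For boys: $\chi^2 = \sum \frac{(OC - EC)^2}{EC} = 4.120$, $df = (r-1)(c-1) = (3-1)(3-1) = 4$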
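As a check on the arithmetic, the test statistic for boys can be recomputed from the observed counts. This is a minimal sketch; `chi_square_statistic` is a name we introduce for illustration.

```python
# Observed counts for boys; rows: No / A little / Strong influence,
# columns: Very Bad / Bad / Above Average
boys = [[10, 8, 12], [18, 8, 30], [28, 10, 28]]

def chi_square_statistic(observed):
    """Sum of (OC - EC)^2 / EC over all cells, plus the degrees of freedom."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, oc in enumerate(row):
            ec = row_totals[i] * col_totals[j] / n   # expected count of cell (i, j)
            statistic += (oc - ec) ** 2 / ec
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df

boys_stat, boys_df = chi_square_statistic(boys)
print(round(boys_stat, 3), boys_df)
```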

 

 

Next, we calculate the p-value. In statistical significance testing, the p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. One often "rejects the null hypothesis" when the p-value is less than the predetermined significance level, which is often 0.05 or 0.01, indicating that the observed result would be highly unlikely under the null hypothesis (Goodman, 1999). If the p-value is smaller than 0.05, we reject H0; if the p-value is larger than 0.05, we do not reject H0.

 

For boys: $P(\chi^2_4 \ge 4.120) = 0.390 > 0.05$

 

The p-value is clearly larger than α=0.05, so we cannot reject H0 at level α=0.05.

Thus, there is not sufficient evidence that a boy’s family circumstances are associated with his view of the importance of education.
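The p-value for boys can be verified without statistical tables. The sketch below relies on a standard fact that is not stated in the paper: for an even number of degrees of freedom df = 2k, the upper-tail probability of the chi-square distribution is exp(-x/2) times the sum of (x/2)^i / i! for i = 0, …, k-1. The function name `chi_square_upper_tail` is our own, and the critical value 9.488 for df = 4 at the 5% level is the usual table value, quoted here as an assumption.

```python
import math

def chi_square_upper_tail(x, df):
    """P(chi-square with df degrees of freedom >= x); valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers positive even df")
    half = x / 2.0
    partial_sum = sum(half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * partial_sum

p_boys = chi_square_upper_tail(4.120, 4)
print(round(p_boys, 3))

# Sanity check against the usual 5% critical value for df = 4
p_at_critical = chi_square_upper_tail(9.488, 4)
print(round(p_at_critical, 3))
```

Since 4.120 lies well below the critical value, the statistic falls outside the critical region, which agrees with the decision not to reject H0.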

 

Now, we carry out the test for girls with the same method. (We abbreviate the steps already explained.)

 

State the hypotheses:

H0: A girl’s family circumstances and her view of the importance of education are independent.

Ha: A girl’s family circumstances are associated with her view of the importance of education.

Check the assumptions:

1.      An SRS is chosen from the population: the SRS procedure is described in detail in Section 2.

2.      Large sample size: all expected counts are greater than 5.

 

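Table of expected counts for girls

(rounded to three decimal places)

Importance of                   Family Circumstances
education             Very Bad      Bad           Above Average
No influence          5.301         5.036         11.663
A little influence    18.313        17.398        40.289
Strong influence      16.386        15.566        36.048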

 

So all the expected counts are larger than 5, and the chi-square test of independence is suitable.

Then, we calculate the test statistic,

For girls: $\chi^2 = \sum \frac{(OC - EC)^2}{EC} = 6.253$, $df = (r-1)(c-1) = (3-1)(3-1) = 4$

$P(\chi^2_4 \ge 6.253) = 0.181 > 0.05$

 

The p-value is larger than α=0.05, so we cannot reject H0 at level α=0.05. Thus, there is not sufficient evidence that a girl’s family circumstances are associated with her view of the importance of education.
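Because the steps for girls were abbreviated, it may help to see them run end to end. The sketch below puts the expected counts, the test statistic and the p-value into one function applied to the girls' table. The function name `chi_square_independence` is our own, and the p-value uses the closed form for even degrees of freedom, which is a standard result rather than something taken from the paper.

```python
import math

# Observed counts for girls; rows: No / A little / Strong influence,
# columns: Very Bad / Bad / Above Average
girls = [[6, 8, 8], [16, 20, 40], [18, 10, 40]]

def chi_square_independence(observed):
    """Return (statistic, df, p-value, smallest expected count) for a two-way table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    statistic, smallest_ec = 0.0, float("inf")
    for i, row in enumerate(observed):
        for j, oc in enumerate(row):
            ec = row_totals[i] * col_totals[j] / n
            smallest_ec = min(smallest_ec, ec)
            statistic += (oc - ec) ** 2 / ec
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    if df % 2 != 0:
        raise ValueError("the closed-form p-value below needs an even df")
    half = statistic / 2.0
    # Upper-tail probability of chi-square for even df (closed form)
    p_value = math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))
    return statistic, df, p_value, smallest_ec

stat, df, p_value, smallest_ec = chi_square_independence(girls)
print(round(stat, 3), df, round(p_value, 3), round(smallest_ec, 3))
```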

 

5.      Conclusion

The objective of the project is to find out whether a student’s family circumstances are associated with his or her view of the importance of education. The hypothesis tests show no evidence that “family circumstances” and “importance of education” are associated.

First, we described the sampling method and its strengths and weaknesses. We also described the problems we met, such as the difficulty of collecting the questionnaires from students and the fact that some of the selected students were too young to understand the questions.

Then, we chose 152 boys and 166 girls from the given population by simple random sampling carried out separately for boys and girls (in effect, stratifying by sex), and we surveyed them and collected the data.

Thirdly, we made two two-way tables, one for boys and one for girls, and recorded the observed counts in them. We also worked out the row totals and column totals, which are needed to calculate the expected counts.

Afterwards, we carried out the hypothesis tests at the 5% significance level. In both tests we did not reject H0, so we cannot say that a student’s family circumstances are associated with his or her view of the importance of education.

The hypothesis tests have helped us to reach an answer: the data are consistent with a student’s family circumstances and his or her view of the importance of education being independent.

Since we took our sample in a representative mountain region school in Guizhou Province, the conclusion may hold for most mountain region schools in Guizhou Province, but no conclusion can be drawn from our study for the city schools in Guizhou Province. So the conclusion of “independent” is restricted to mountain region schools in Guizhou Province and cannot represent the whole population of Guizhou schools.

 

Further Insight:

Although the conclusions are the same for boys and girls, we can still draw some further insight from the data.

We originally believed that families in the mountain region generally prefer boys to girls, so we expected boys to place more importance on education than girls do. But the tests do not show such a difference.

Now, we compare the two p-values. At the 5% significance level, neither the result for boys nor the result for girls is significant. However, the boys’ p-value is larger than the girls’ p-value (0.390>0.181). That is to say, the girls’ data give somewhat stronger evidence against independence than the boys’ data, though still not enough to be significant. A difference between two non-significant p-values cannot prove anything, but it is consistent with our initial belief that families may prefer boys to girls. If the relation for girls really is stronger, it would mean that, in poorer family circumstances, girls may be more inclined than boys to think that education is not important.

Next, we interpret the data in the two-way tables by graphing. We use the segmented bar graph. A segmented bar graph, also called a “stacked bar graph”, is a graph that is used to compare the parts to the whole. Each bar represents a total, and the bars are divided into categories.

So we can form the two graphs:

  


From the segmented bar graphs above, we were surprised to find that the bars for “Very bad” and “Above average” are similar, while the bar for “Bad” is distributed very differently. So we conjecture that if we took only “very bad” and “above average” into account, the result might be significant. We can put forward the following hypothesis: students from a fairly well-off family judge education as important because they may want to enrich their inner life and build up their wealth through education; students from a very bad family circumstance judge education as important because they see it as a lifeline out of their situation; however, students from a bad family circumstance may judge education as less important because they can maintain their living and education is not a necessity for them.

We have just found that the girls’ p-value is smaller than the boys’. The graphs above show the same pattern, especially for “strong influence”. Among girls, 45.5% of those in “above average” chose “strong influence”, against 26.3% of those in “bad”, so the former is nearly twice as large as the latter. Among boys, the two percentages are nearly the same (40.0% versus 38.5%). So the graphs agree with the comparison of p-values.
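The percentages behind the segmented bars are the column percentages of the two-way tables, that is, the share of each answer within one family circumstance. A minimal sketch follows; the helper name `column_percentages` is ours.

```python
# Observed counts; rows: No / A little / Strong influence,
# columns: Very Bad / Bad / Above Average
boys = [[10, 8, 12], [18, 8, 30], [28, 10, 28]]
girls = [[6, 8, 8], [16, 20, 40], [18, 10, 40]]

def column_percentages(table):
    """Percentage of each cell within its column (one column = one segmented bar)."""
    col_totals = [sum(col) for col in zip(*table)]
    return [[100.0 * cell / col_totals[j] for j, cell in enumerate(row)]
            for row in table]

boys_pct = column_percentages(boys)
girls_pct = column_percentages(girls)

# Last row = "Strong influence"; column index 1 = "Bad", index 2 = "Above Average"
print("girls:", round(girls_pct[2][2], 1), "vs", round(girls_pct[2][1], 1))
print("boys: ", round(boys_pct[2][2], 1), "vs", round(boys_pct[2][1], 1))
```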

Then, we use the two types of error to interpret the results. In statistics, if the result of the test corresponds with reality, then a correct decision has been made. However, if the result of the test does not correspond with reality, then an error has occurred. Due to the statistical nature of a test, the result is never, except in very rare cases, free of error. Two types of error are distinguished: Type I error and Type II error.

A Type I error occurs when the null hypothesis (H0) is true but is rejected. It is asserting something that is absent, a false hit. A Type I error may be compared with a so-called false positive (a result that indicates that a given condition is present when it actually is not present) in tests where a single condition is tested for (Shermer, 2002).

A Type II error occurs when the null hypothesis is false but erroneously fails to be rejected. It is failing to assert what is present, a miss. A Type II error may be compared with a so-called false negative (where an actual hit was disregarded by the test and seen as a miss) in a test checking for a single condition with a definitive result of true or false. A Type II error is committed when we fail to believe a truth. In terms of our study, we may mistakenly fail to believe that “family circumstances” and “importance of education” are associated when they are actually related (Shermer, 2002).

The probability of committing a Type II error is called β.

P( fail to reject H0 | Ha is true ) = β

From this project we learn that, if we want to reduce β and so increase the power of the test next time, we have two methods. First, we can increase the significance level α: H0 then becomes more likely to be rejected and the probability of committing a Type II error decreases, at the cost of a higher probability of committing a Type I error.

Second, the study can use a larger sample. The power will then increase and the probability of committing a Type II error will decline, while the probability of committing a Type I error stays fixed at α.
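For reference, the power of a test is the complement of β. The relation below is the standard definition, not a quantity computed in this project:

```latex
\text{power} = P(\text{reject } H_0 \mid H_a \text{ is true}) = 1 - \beta,
\qquad
\alpha = P(\text{reject } H_0 \mid H_0 \text{ is true})
```

Reducing β by either of the two methods is therefore the same as raising the power.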

Finally, we go back to where we began. The education department claimed that the wealthier a student’s family is, the more important he or she judges education to be. This may not be true. According to our project, it is possible that the importance given to education does not simply increase or decrease with family circumstances, but follows a more complicated, curved relationship (increasing, then decreasing, then increasing again, etc.). A further and more powerful study is required to find out. I hope that after entering college, I will gain the knowledge base to take my projects further.

 

6.      Reference List

1.      Montgomery, Douglas C. (2009). Introduction to Statistical Quality Control, 6th Edition.

2.      Yates, Daniel S.; Moore, David S.; Starnes, Daren S. (2008). The Practice of Statistics, 3rd Edition. Freeman. ISBN 978-0-7167-7309-2. Accessed at http://en.wikipedia.org/wiki/Simple_random_sample.

3.      Downing, Douglas & Clark, Jeffrey (2003). Business Statistics, 4th Edition.

4.      Fisher, R. A. (1925). Statistical Methods for Research Workers. Edinburgh: Oliver and Boyd, p. 43.

5.      Berger, R. L. and Casella, G. (2001). Statistical Inference, 2nd Edition. Duxbury Press, p. 374.

6.      Goodman, S. N. (1999). "Toward Evidence-Based Medical Statistics. 1: The P Value Fallacy". Annals of Internal Medicine 130: 995–1004. Accessed at http://en.wikipedia.org/wiki/P-value.

7.      Shermer, Michael (2002). The Skeptic Encyclopedia of Pseudoscience, 2 volume set. ABC-CLIO. p. 455. ISBN 1-57607-653-9. Retrieved 10 January 2011. Accessed at http://en.wikipedia.org/wiki/Type_I_error#Type_I_error.

8.      Definition of Stacked Bar Graph. Accessed at http://www.icoachmath.com/math_dictionary/Stacked_Bar_Graph.html
