Binomial Hypothesis Testing

Hypothesis testing requires you to state the variate X, the null hypothesis H0, the alternative hypothesis H1, the assumed distribution and the significance level of the test. You must accept or reject the null hypothesis (and write that down). Every test must be followed by a statement in lucid English that explains what the conclusion is, using the terms of the question.

In MEI question papers it has been the habit that the statement of information is worth a mark; sometimes the statement of H0 and H1 has been worth two.

Example: Mrs Clinton is standing for election and she thinks she has 60% of the electorate prepared to vote for her. You think she is overstating her case and you take a sample by asking twenty people who will vote which way they will vote. Eight say they will vote for her.

X = {the number of people in the sample who will vote for her},  X ~ Bin(20, 0.6)

The test will be a one-tail test (1tt) at the 5% level.

H0: p = 0.6      she says so

H1: p < 0.6      the case is overstated

The test uses the sample result and counts anything more extreme along with it. We are looking to see how probable it is that the given distribution will yield a result at least as extreme as the one in the sample. If that probability is less than 5% we will reject H0 and go for the alternative. The 5% level says that we will accept wrongly rejecting a true H0 up to 5% of the time. So we calculate

P(X ≤ 8 | X ~ Bin(20, 0.6)) = 0.0565. This is more than 5%, so we accept H0, even if reluctantly, because the result is not extreme enough to reject it. We go on to say that there is insufficient evidence to say she is overstating her case. You would expect 20 × 0.6 = 12 people to agree (on average, asking many sets of twenty people). That 8 or fewer agree will happen about 6% of the time. That 9 or fewer agree will happen about 13% of the time. If seven people (or fewer) had agreed you would have rejected H0 and accepted the alternative, that she is overstating her case. The critical region for this example (where we will reject H0) is X ≤ 7.
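The tail probability and the critical region can be recomputed directly from the binomial formula. What follows is a minimal sketch using only the Python standard library; the helper name `binom_cdf` and the variable names are ours, chosen for illustration.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Bin(n, p), summed term by term."""
    return sum(comb(n, r) * p**r * (1 - p)**(n - r) for r in range(k + 1))

n, p0, alpha = 20, 0.6, 0.05   # sample size, p under H0, significance level
observed = 8                   # number in the sample who agree

# Lower-tail probability of a result at least as extreme as the one observed.
p_value = binom_cdf(observed, n, p0)
reject = p_value < alpha

# Largest c with P(X <= c) below the significance level:
# the critical region is then X <= c.
critical = max(c for c in range(n + 1) if binom_cdf(c, n, p0) < alpha)

print(f"P(X <= {observed}) = {p_value:.4f}, reject H0: {reject}")
print(f"critical region: X <= {critical}")
```

Changing `observed` to 7 shows how close to the decision line this example sits.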

In an ideal hypothesis test, you establish H0, H1 and the significance level before you do any sampling. It is important that at the end of the question you interpret the solution in words. One should criticise any test like this: (i) Was the test set up before the data were known? (ii) Are the data used independent and random? (iii) Is the procedure testing the original claim?

It is quite important to interpret the answer you have in the words of the question, not just in Stats-speak. Did we disagree with Mrs Clinton? No, we weakly agreed with her; our data agreed, if not very well. You would expect any exam question to be near the decision line, whereas outside the sphere of exams you might expect the information to be a little more clear cut. In most exam questions you would actually like to repeat the sample. Note, though, that two samples of 20, each with nine agreements, are not the same as one sample of forty with 18. Each sample of 20 gives a probability of 0.12752; the larger sample includes more cases and gives a smaller tail probability, so it is stronger evidence against H0.
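The difference between a sample of 20 and a pooled sample of 40 can be seen by computing both tail probabilities. A sketch, assuming p = 0.6 under H0 as in the example and using a hand-written helper `binom_cdf` (our naming):

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Bin(n, p)."""
    return sum(comb(n, r) * p**r * (1 - p)**(n - r) for r in range(k + 1))

p0 = 0.6  # proportion claimed under H0

p_small = binom_cdf(9, 20, p0)    # one sample of 20 with nine agreements
p_large = binom_cdf(18, 40, p0)   # one sample of 40 with eighteen agreements

print(f"n = 20, P(X <= 9)  = {p_small:.4f}")
print(f"n = 40, P(X <= 18) = {p_large:.4f}")
```

The sample proportion is 45% in both cases, but the tail probability for the sample of 40 is the smaller of the two: the same proportion in a larger sample sits further out in the tail.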


1   Experience says that 70% of students pass their driving test at the first attempt. In one year, only ten of the first twenty sixth-formers at one school to take the test pass first time. Does this suggest that this class is less able to drive? Declare your hypotheses, test at the 5% level and state your conclusion.

2   Light-bulbs are supposed to have a lifetime of 1000 hours and the design is such that 80% should meet this specification. I think this is over optimistic, so I buy 20 and record when they fail. I find that 7 have lifetimes less than advertised and I complain to the manufacturer. Is my complaint justified at the 5% level? Explain your result.

3   A teacher claims that 60% of her pupils pass a test first time. If you test at the 5% level and assume she is over-estimating (the real probability is less), what is the critical region? If she is wrong the other way, what is the critical region? Could these two together represent just disagreeing with her at the 10% significance level?

4   A call-centre is expanding and offers 20 identical jobs to a large population. Of the applicants, three quarters are women. The call-centre finally chooses 8 men and 12 women. The disappointed applicants, both men and women, claim gender discrimination. If you test at an appropriate level, is either group justified in their claim?

5   A biologist is researching to see if pollution levels are affecting the local ecology. A certain sort of moth comes in two types, light and dark. In one particular village a previous piece of research showed that 25% were dark, the rest light. Our biologist catches a sample of 15 moths and counts the dark ones. At the 10% level, what numbers would say that the situation is changing?

6   When a certain language is written down, there are many uses (occurrences) of the letter X. Devise a test at the 10% level for someone who does not know the language to see if a short passage of 50 letters was written in this language. Comment on your results.

7   A business that grows flower seeds for sale advertises on the packet what germination rate can be expected. Thus they state the probability that any seed will grow successfully. They do this by keeping in contact with their customers and specialist growers and keeping good records of germination rates. These vary with the grower and the company uses the figures to predict what a typical customer’s success rate will be. Successful germination varies with the seed. The cost of seed production can be quite high, so the number of seeds in a packet can be an important factor.
One such firm claims that the germination rate of its bean seeds is 80% and puts 25 seeds in a packet.
(i)  How many seeds do you expect to germinate out of 25?
(ii)  What is the probability of exactly 12 germinating?
(iii)  What is the most likely number of seeds to germinate from any one packet?
(iv)  How many seeds must germinate to give reasonable grounds for complaint? Explain your reasoning carefully.
(v)  What might the company do to reduce the variability of the germination rate?
(vi)  How many seeds do you think you need to put in a packet to give 20 successes 95% of the time? Explain this thinking.

Hypothesis testing terminology

The possible occurrences are covered by the two hypotheses, H0 and H1. Suppose H0 says p = p0. Then our test is set up so that, for the sample we obtain, P(H0 is rejected | H0 is true) is small; this is our significance level. Rejecting H0 when it is in fact true is a “type I error”, so P(type I error) = significance level. (For a discrete distribution such as the binomial the critical region rarely gives exactly the chosen level, so in practice P(type I error) is at most the significance level.)

In a hypothesis test, a type I error occurs when the null hypothesis is rejected when it is in fact true; that is, H0  is wrongly rejected. For example, in a clinical trial of a new drug, the null hypothesis might be that the new drug is no better, on average, than the current drug; i.e.

H0 : there is no difference between the two drugs on average.

A type I error would occur if we concluded that the two drugs produced different effects when in fact there was no difference between them. A type II error is when we accept H0 but it is not true. Clearly the probability of being slightly off, p close to p0, is quite high and it would take a large sample to show a small difference.

The following table gives a summary of possible results of any hypothesis test:

                    Reject H0           Don't reject H0

Truth     H0        Type I error        Right decision

          H1        Right decision      Type II error

A type I error is often considered to be more serious, and therefore more important to avoid, than a type II error. The hypothesis test procedure is therefore adjusted so that there is a guaranteed ‘low’ probability of rejecting the null hypothesis wrongly; this probability is never 0. The exact probability of a type II error is generally unknown.

For any given set of data, type I and type II errors are inversely related; the smaller the risk of one, the higher the risk of the other.
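This trade-off can be illustrated with the election example, X ~ Bin(20, 0.6) under H0. A type II error probability can only be calculated for a specific alternative, so the sketch below assumes, purely for illustration, that the true value is p = 0.4; that figure is our assumption and does not come from the example. The helper name `binom_cdf` is also ours.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Bin(n, p)."""
    return sum(comb(n, r) * p**r * (1 - p)**(n - r) for r in range(k + 1))

n, p0 = 20, 0.6
p_true = 0.4   # hypothetical true value under H1 (assumed for illustration)

rows = []
for c in range(5, 10):                    # candidate critical regions X <= c
    alpha = binom_cdf(c, n, p0)           # P(reject H0 | H0 true): type I
    beta = 1 - binom_cdf(c, n, p_true)    # P(do not reject H0 | p = p_true): type II
    rows.append((c, alpha, beta))
    print(f"reject if X <= {c}: P(type I) = {alpha:.4f}, P(type II) = {beta:.4f}")
```

Widening the critical region raises the type I probability and lowers the type II probability, and narrowing it does the reverse.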

A type I error can be referred to as ‘an error of the first kind’, and a type II error one of the second kind.

A test statistic is a quantity calculated from our sample of data. Its value is used to decide whether or not the null hypothesis should be rejected in our hypothesis test. In general, the choice of a test statistic will depend on the assumed probability model and the hypotheses under question.

The critical value(s) for a hypothesis test is a threshold to which the value of the test statistic in a sample is compared to determine whether or not the null hypothesis is rejected. The critical value for any hypothesis test depends on the significance level at which the test is carried out, and whether the test is one-sided or two-sided.

The critical region CR, or rejection region RR, is a set of values of the test statistic for which the null hypothesis is rejected in a hypothesis test. That is, the sample space for the test statistic is partitioned into two regions; one region (the critical region) will lead us to reject the null hypothesis H0, the other will not. So, if the observed value of the test statistic is a member of the critical region, we conclude “Reject H0”; if it is not a member of the critical region then we conclude “Do not reject H0”.
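For a discrete test statistic, finding the critical region amounts to adding up probabilities from each end of the distribution until the allowance for that tail would be exceeded. Below is a sketch for a two-tail binomial test that splits the significance level equally between the tails. The function name `critical_region` and the values n = 20, p0 = 0.6 and a 10% level are assumptions chosen for illustration, and the equal split is only one common convention.

```python
from math import comb

def binom_pmf(r, n, p):
    """P(X = r) for X ~ Bin(n, p)."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

def critical_region(n, p0, level):
    """Return (lower, upper): reject H0 if X <= lower or X >= upper.

    Each tail is allowed at most level / 2. An entry is None if no
    value in that tail is extreme enough.
    """
    half = level / 2

    lower, cum = None, 0.0
    for r in range(n + 1):             # accumulate P(X <= r) from the bottom
        cum += binom_pmf(r, n, p0)
        if cum > half:
            break
        lower = r

    upper, cum = None, 0.0
    for r in range(n, -1, -1):         # accumulate P(X >= r) from the top
        cum += binom_pmf(r, n, p0)
        if cum > half:
            break
        upper = r

    return lower, upper

lower, upper = critical_region(20, 0.6, 0.10)
print(f"reject H0 if X <= {lower} or X >= {upper}")
```

Because the distribution is discrete, each tail usually holds less than its full allowance, so the actual probability of falling in the critical region is somewhat below the nominal level.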

The significance level of a statistical hypothesis test is a fixed probability of wrongly rejecting the null hypothesis H0, if it is in fact true. It is the probability of a type I error and is set by the investigator in relation to the consequences of such an error. That is, we want to make the significance level as small as possible in order to protect the null hypothesis and to prevent, as far as possible, the investigator from inadvertently making false claims. Usually, the significance level is chosen to be 0.05 (or equivalently, 5%).

1  P(X ≤ 10 | X ~ Bin(20, 0.7)) = 0.048     Reject H0.   They are less able. Are they independent?

2  P(X ≤ 13 | X ~ Bin(20, 0.8)) = 0.0867, where X is the number of bulbs meeting the specification (7 failures means 13 successes).    Accept H0. Was my sample random and independent?

3  A two tail test at 10% assumes 5% on each tail. The critical region at 10% is

4  X~ Bin (20,0.5) Accept H0. No claims valid at 10% level.

5  H1: p ≠ 0.25, 2tt. Critical region is X < 1 or X > 6. Very crude test; want bigger sample.

6  Critical region is X ≤ 3 or X ≥ 13 (that is, X < 4 or X > 12). Would like a bigger sample and an easier calculation.

© David Scoins 2017