12 Bootstrapping and Confidence Intervals

12.1 Learning objectives

By the end of this chapter, you should be able to:

understand why sample percentages differ from population percentages
calculate and interpret the standard error of a percentage
construct confidence intervals using the empirical rule
correctly interpret what a confidence interval does and does not mean
distinguish between point estimates and interval estimates

12.2 Motivation: Should the politician run?

A politician is deciding whether to enter an election. In a village of 100,000 eligible voters, a survey of 2,500 people finds that 1,328 support the candidate, which is about 53%.

Should the politician run?

At first glance, the answer seems obvious: 53% is greater than 50%. But this conclusion depends on how accurate the estimate is. The key question is:

How far might the sample percentage be from the true population percentage?

Key idea

A sample percentage is only an estimate. It is subject to chance error due to random sampling.

12.3 From samples to uncertainty

The sample percentage, 53%, is a point estimate of the population percentage. But different random samples would give slightly different results.

To quantify this uncertainty, we use the standard error (SE).

For a percentage, the standard error is:

\[ SE(\%) = \frac{SD(\text{box})}{\sqrt{n}} \times 100 \]

But there is a problem: we do not know the standard deviation of the box/population.

12.3.1 Estimating variability from the sample

We get around this problem by using the sample itself to estimate variability.

In this case:

53% of observations are 1 (support)
47% are 0 (do not support)

The standard deviation of such a 0–1 sample, which we denote \(SD^+\), is approximately:

\[ SD^+ = (1-0) \sqrt{0.53 \times 0.47} \approx 0.5 \]

Using this, the standard error becomes approximately:

\[ SE(\%) = \frac{SD^+}{\sqrt{n}} \times 100\% \approx \frac{0.5}{\sqrt{2500}} \times 100\% = 1\% \]

So the estimate of 53% is likely to be off by about 1 percentage point.
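The arithmetic can be checked in R. This is a minimal sketch using the survey counts from the motivating example; the variable names (`supporters`, `sd_plus`, `se_pct`) are illustrative and do not come from the chapter.

```r
# Survey counts from the motivating example
n <- 2500            # sample size
supporters <- 1328   # respondents who support the candidate

p_hat <- supporters / n                # sample proportion, about 0.53
sd_plus <- sqrt(p_hat * (1 - p_hat))   # SD of a 0-1 variable, about 0.5
se_pct <- sd_plus / sqrt(n) * 100      # standard error in percentage points

round(100 * p_hat, 2)
round(sd_plus, 3)
round(se_pct, 2)
```

Note that the SE divides by the square root of the sample size, so quadrupling the sample would only halve the standard error.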

Interpretation

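For the true support to be 50% or lower, the estimate would have to be off by 3 percentage points or more, which is about 3 standard errors. A chance error that large is unlikely.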

Conclusion: The candidate is very likely above 50% support and may reasonably decide to run.
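To put a rough number on "unlikely", one can use the normal curve. The sketch below assumes that the sample percentage is approximately normally distributed around the true percentage, which is reasonable for a sample this large, and it uses the SE estimated from the sample. The variable names are illustrative.

```r
# How many SEs is the estimate above the 50% threshold?
n <- 2500
p_hat <- 1328 / n
p_hat_pct <- 100 * p_hat                         # sample percentage
se_pct <- 100 * sqrt(p_hat * (1 - p_hat) / n)    # SE in percentage points

z <- (p_hat_pct - 50) / se_pct   # distance from 50% in standard errors
z

# Approximate chance of a chance error at least this large in one direction
tail_prob <- pnorm(-z)
tail_prob
```

Under these assumptions the tail probability is well below 1%, which is what the informal "3 standard errors" argument expresses.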

12.4 Confidence intervals

Instead of a single point estimate, we can report an interval estimate that accounts for uncertainty.

A confidence interval provides a range of plausible values for the population percentage.

Using the empirical rule:

Confidence Interval

\(\pm 1\) SE gives about 68% confidence
\(\pm 2\) SE gives about 95% confidence
\(\pm 3\) SE gives about 99.7% confidence
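These confidence levels are areas under the normal curve. As a quick check, assuming the sample percentage follows the normal curve closely enough, the areas within 1, 2, and 3 SEs of the center can be computed with `pnorm`:

```r
# Area under the standard normal curve within k SEs of the center
k <- 1:3
coverage <- pnorm(k) - pnorm(-k)
round(100 * coverage, 1)
# 68.3 95.4 99.7
```

The "95%" attached to the 2 SE interval is therefore a rounded figure; the exact normal-curve area is slightly higher.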

In our example, with an SE of about 1 percentage point, going 2 SEs either way gives:

\[ 53\% \pm 2\% = [51\%, 55\%] \]

In general, an approximate 95% confidence interval is:

\[ \text{estimate} \pm 2 \times SE \]
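The same interval can be computed in R from the survey counts. This is a minimal sketch with illustrative variable names. Because it uses the unrounded sample percentage (53.12%) rather than 53%, the endpoints differ slightly from the rounded interval in the text.

```r
n <- 2500
p_hat <- 1328 / n                          # sample proportion
se_hat <- sqrt(p_hat * (1 - p_hat) / n)    # estimated SE of the proportion

# Approximate 95% confidence interval: estimate plus or minus 2 SEs
ci_95 <- p_hat + c(-2, 2) * se_hat
round(100 * ci_95, 1)
```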

12.4.1 A common misunderstanding

It is tempting to say:

There is a 95% chance that the true percentage lies between 51% and 55%.

This is incorrect.

Common pitfall

The population percentage is fixed. It does not vary.

The interval varies because it depends on the sample.

The correct interpretation is:

If we repeatedly took samples and constructed confidence intervals, about 95% of those intervals would contain the true population percentage.

12.4.2 Confidence intervals as repeated sampling

Imagine drawing 100 different samples of 2,500 voters.

Each sample gives:

a different percentage
a different confidence interval

Some intervals will include the true value. Others will miss it.

Key idea

A 95% confidence level means that about 95 out of 100 such intervals will capture the true parameter.

12.4.3 Simulation: One poll

Let us simulate one poll where the true population support rate is 80%, and each sample contains 2,500 observations.

set.seed(123)

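n <- 2500        # sample size
p_true <- 0.80   # true population proportion

sample_draw <- rbinom(n, size = 1, prob = p_true)  # n draws of 0 (no) or 1 (support)
p_hat <- mean(sample_draw)                         # sample proportion
se_hat <- sqrt(p_hat * (1 - p_hat) / n)            # estimated SE of the proportion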

p_hat

[1] 0.8032

se_hat

[1] 0.007951598

The estimated percentage is:

round(100 * p_hat, 2)

[1] 80.32

The estimated standard error in percentage points is:

round(100 * se_hat, 2)

[1] 0.8

The approximate 95% confidence interval is:

lower <- p_hat - 2 * se_hat
upper <- p_hat + 2 * se_hat

round(100 * c(lower, upper), 2)

[1] 78.73 81.91

12.4.4 Simulation: 100 confidence intervals

Now let us simulate 100 independent samples, each of size 2,500, from a population where the true percentage is 80%.

  sample_id  p_hat      se_hat     lower     upper covers
1         1 0.8032 0.007951598 0.7872968 0.8191032   TRUE
2         2 0.8008 0.007987975 0.7848241 0.8167759   TRUE
3         3 0.8020 0.007969843 0.7860603 0.8179397   TRUE
4         4 0.8144 0.007775671 0.7988487 0.8299513   TRUE
5         5 0.7932 0.008100216 0.7769996 0.8094004   TRUE
6         6 0.7932 0.008100216 0.7769996 0.8094004   TRUE
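The table above appears to show the first rows of a data frame called `results`; the code that built it is not shown. One way such a data frame could be produced is sketched below. The seed and the vectorised construction are assumptions, so the exact numbers, including the count of covering intervals, may differ from those printed in this chapter.

```r
set.seed(123)  # assumed seed; the seed used for the chapter's table is not shown

n <- 2500        # sample size per poll
p_true <- 0.80   # true population proportion
n_sims <- 100    # number of simulated polls

# Each simulated poll: number of supporters out of n, converted to a proportion
p_hat <- rbinom(n_sims, size = n, prob = p_true) / n
se_hat <- sqrt(p_hat * (1 - p_hat) / n)

results <- data.frame(
  sample_id = seq_len(n_sims),
  p_hat = p_hat,
  se_hat = se_hat,
  lower = p_hat - 2 * se_hat,
  upper = p_hat + 2 * se_hat
)

# TRUE when the interval contains the true proportion
results$covers <- results$lower <= p_true & results$upper >= p_true

head(results)
```

Drawing the number of supporters directly with `size = n` is equivalent to summing `n` separate 0–1 draws, and it avoids an explicit loop.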

How many of these 100 intervals cover the true population percentage?

sum(results$covers)

[1] 98

What percentage of intervals cover the true value?

mean(results$covers) * 100

[1] 98

12.4.5 Plot: 100 confidence intervals

How to read the figure

Each horizontal line is a 95% confidence interval from one sample.

The vertical line marks the true population percentage, 80%.
Intervals that cross the vertical line capture the true value.
Intervals that do not cross it miss the true value.
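A figure of this kind can be drawn with base R graphics. The sketch below is one possible way to do it, not necessarily how the chapter's figure was made. It regenerates 100 simulated intervals under an assumed seed, and the colours and labels are arbitrary choices.

```r
set.seed(123)  # assumed seed, for reproducibility of this sketch only

n <- 2500
p_true <- 0.80
n_sims <- 100

p_hat <- rbinom(n_sims, size = n, prob = p_true) / n
se_hat <- sqrt(p_hat * (1 - p_hat) / n)
lower <- 100 * (p_hat - 2 * se_hat)   # interval endpoints in percent
upper <- 100 * (p_hat + 2 * se_hat)
covers <- lower <= 100 * p_true & upper >= 100 * p_true

# Empty plotting region: percentage on the x-axis, sample index on the y-axis
plot(NULL, xlim = range(lower, upper), ylim = c(1, n_sims),
     xlab = "Percentage", ylab = "Sample",
     main = "100 approximate 95% confidence intervals")

# One horizontal segment per interval; intervals that miss are drawn in red
segments(x0 = lower, y0 = seq_len(n_sims), x1 = upper, y1 = seq_len(n_sims),
         col = ifelse(covers, "grey40", "red"))

# Vertical dashed line at the true population percentage
abline(v = 100 * p_true, lty = 2)
```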

12.4.6 Why some intervals fail

Even when the method is correct, some intervals will fail to include the true value. That is not a mistake. It is exactly what a 95% confidence level implies: about 5% of intervals will miss the true value just by chance.

Important

A confidence procedure can be reliable without guaranteeing success in every single sample.
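The number of misses in a batch of 100 intervals is itself subject to chance. As a rough sketch, treat the 100 intervals as independent trials, each covering the true value with the normal-curve probability for 2 SEs. Both of these are simplifying assumptions, and the variable names are illustrative.

```r
n_intervals <- 100
coverage <- pnorm(2) - pnorm(-2)   # about 0.954 under the normal approximation

# Expected number of intervals that miss the true value
expected_misses <- n_intervals * (1 - coverage)
expected_misses

# Chance that 98 or more of the 100 intervals cover the true value
p_98_or_more <- pbinom(97, size = n_intervals, prob = coverage, lower.tail = FALSE)
p_98_or_more
```

Under these assumptions, seeing 98 of 100 intervals cover the true value, as in the simulation above, is not unusual, even though the long-run rate is closer to 95%.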

12.4.7 Deduction vs induction

There are two directions of reasoning in statistics:

Deduction: from population to sample
Induction: from sample back to population

Confidence intervals are a tool for inductive reasoning.

Key idea

Statistical inference is about learning about the population from a sample.

12.5 Chapter summary

Sample percentages vary due to chance error.
The standard error measures the typical size of this variation.
Confidence intervals provide a range of plausible values for the population parameter.
A 95% confidence interval captures the true value in about 95% of repeated samples.
The interval varies across samples; the parameter does not.

12.6 Exercises

1. Graduate earnings

A simple random sample of 100 graduates from a certain college found that 48 were earning Baht 50,000 per month or more.

Estimate the percentage of all graduates from that college who earn Baht 50,000 per month or more.
Attach a “give-or-take” number to your estimate by computing the standard error of the percentage.
Write one sentence interpreting your result in plain language.

2. Standard error from a box model

Suppose there is a box of 100,000 tickets, each marked either 0 or 1. In fact, 20% of the tickets are 1s.

A sample of 400 draws is taken at random.

What is the expected percentage of 1s in the sample?
Calculate the standard error of the percentage of 1s.
Explain briefly why the standard error becomes smaller when the sample size increases.

3. Internet use among subscribers

TRUE company serves 1,000,000 subscribers. As part of a market survey, a simple random sample of 3,600 people was taken. In the sample:

1,080 use the internet for banking
720 use the internet for reading the news
1,440 use the internet for online games

Answer the following:

What percentage of customers use the internet for:
- banking,
- reading the news,
- online games?
Compute the standard error for each percentage.
For internet banking, what is the approximate probability that the sample percentage lies between 28% and 32%?

4. Polling support for a political party

In a poll of 200 voters, 92 said that they would vote for the YY Party.

Estimate the percentage of all voters who support the YY Party.
Construct an approximate 95% confidence interval for the true population percentage.
Do you think this estimate is accurate enough to advise the YY Party to enter the election?
How could the accuracy of the estimate be improved?

5. Percentages in a town population

According to a census, a certain town has 100,000 people aged 18 and over. Of these:

60% are married
10% earn more than Baht 75,000 per month
20% have university degrees

A simple random sample of 1,600 people will be drawn from this population.

(a) Marriage

To study marriage, a box model is needed.

Should the box contain 1,600 tickets or 100,000 tickets? Explain.
What is the chance that 58% or fewer of the sample are married?

(b) High income

To study the proportion earning more than Baht 75,000 per month:

Should each ticket in the box show the person’s exact income? Explain.
What is the chance that 11% or more of the sample earn above Baht 75,000 per month?

(c) University degree

Find the chance that between 19% and 21% of the sample have a university degree.

6. Health insurance

A survey of 400 people finds that 80 have health insurance.

Estimate the percentage of the population with health insurance.
Construct a 90% confidence interval for the population percentage.
Explain what your interval means.

7. Consumer survey: televisions and VCRs

A utility company serves 50,000 households. As part of a survey of consumer attitudes, it takes a simple random sample of 750 households. The average number of television sets in the sample is 1.86, and the sample SD is 0.80.

(a) Average number of television sets

If possible, construct a 95% confidence interval for the average number of television sets in all 50,000 households.
If this is not possible, explain why.

(b) Percentage with VCRs

Out of the 750 households in the survey, 351 have VCRs.

Estimate the percentage of all 50,000 households with VCRs.
If possible, construct a 99.7% confidence interval for that percentage.

(c) Households without a television

Among those surveyed, 749 households have at least one television set.

Estimate the percentage of households with no television set.
If possible, construct a 68% confidence interval for the percentage of all 50,000 households with no television set.
If this is not possible, explain why.

8. Red and blue marbles

A box contains 1 red marble and 99 blue marbles. Ten marbles are drawn at random with replacement.

Find the expected number of red marbles in the 10 draws.
Find the standard error of the number of red marbles.
What is the chance of drawing fewer than 0 red marbles?
Use the normal curve to estimate this chance.
Does the probability histogram for the number of red marbles look like a normal curve? Explain.

12.6.1 Conceptual question

9. Point estimates and confidence intervals

For each statement below, say whether it is true or false, and explain briefly.

A sample percentage is usually equal to the population percentage.
A larger sample gives a smaller standard error.
A 95% confidence interval means there is a 95% chance that the true value is inside the interval.
If we repeatedly draw samples and construct 95% confidence intervals, about 95% of them will contain the true population value.

(a) Why do statisticians usually prefer a confidence interval to a point estimate when reporting survey results?

(b) Explain in your own words: Why is it incorrect to say “there is a 95% chance the true value lies in the interval”?