Assignment overview | ECON3500: Econometrics and Applications

Problem set 5

Mon, 06 Apr 2026 00:00:00 +0000

Welcome

Our final problem set 😭, covering Chapters 10 and 12.

See the exercises below, or you can download them as a pdf.

You should not need any textbook tables to complete this problem set. Any datasets, scans, articles, and formulas you need are linked directly in the questions below or included in the hints.

What do I submit?

Your written-up answers to exercise questions. If you work on a piece of paper, please scan using some sort of phone software (like Microsoft Lens or Adobe Scan) rather than just taking a picture.
A do-file that runs your Stata analysis.
A log file that includes the output from running your do-file.

Exercises

In 1985, neither Florida nor Georgia had laws banning open alcohol containers in vehicle passenger compartments. By 1990, Florida had passed such a law, but Georgia had not.

a. Suppose you collect random samples of the driving-age population in both states, for 1985 and 1990. Let $arrest$ be a binary variable equal to one if a person was arrested for drunk driving during the year. Without controlling for any other factors, write down a linear probability model that allows you to test whether the open container law reduced the probability of being arrested for drunk driving. Which coefficient measures the effect of the law?

b. Why might you want to control for other factors in the model? What might some of these factors be?

c. Now, suppose that you can only collect data for 1985 and for 1990 at the county level for the two states. The dependent variable would be the fraction of licensed drivers arrested for drunk driving during the year. How does this data structure differ from the individual-level data described in part (a)? What econometric method would you use?

For this exercise, use JTRAIN.dta to determine the effect of a job training grant on hours of job training per employee. The basic model for the three years is the following: $$\begin{split} hrsemp_{it} &= \beta_0 + \delta_1 d88_t + \delta_2 d89_t + \\ &\quad \beta_1 grant_{it} + \beta_2 grant_{i,t-1} + \beta_3 log(employ_{it}) + a_i + u_{it} \end{split}$$

a. Estimate the equation using first differencing. How many firms are used in the estimation? How many total observations would be used if each firm had data on all variables (in particular, $hrsemp$) for all three time periods?

b. Interpret the coefficient on $grant$, and comment on its significance.

c. Is it surprising that $grant_{-1}$ is insignificant? Explain.

d. Do larger firms train their employees more or less, on average? How big are the differences in training?

Use CRIME4.dta for this exercise, and see Example 13.9 in this poor-quality scanned upload.

a. Replicate the results in Example 13.9.

b. Re-estimate the unobserved effects model for crime in Example 13.9, but use fixed effects rather than differencing. Are there any notable sign or magnitude changes in the coefficients? What about statistical significance?

c. Add the logs of each wage variable in the data set and estimate the model by fixed effects. How does including these variables affect the coefficient on the criminal justice variables in part (b)?

d. Do the wage variables in part (c) have the expected sign? Are they jointly significant?

SW-12.6 In an instrumental variable regression model with one regressor, $X_i$, and one instrument, $Z_i$, the regression of $X_i$ onto $Z_i$ has $R^2 = 0.05$ and $n = 100$. Is $Z_i$ a strong instrument?¹ Would your answer change if $R^2 = 0.05$ and $n = 500$?

SW-12.9 A researcher is interested in the effect of military service on human capital. She collects data from a random sample of 4000 workers aged 40 and runs the OLS regression $Y_i = \beta_0 + \beta_1X_i + u_i$, where $Y_i$ is a worker’s annual earnings and $X_i$ is a binary variable equal to 1 if the person served in the military and is equal to 0 otherwise.

a. Explain why the OLS estimates are likely to be unreliable. (Hint: Which variables are omitted from the regression? Are they correlated with military service?)

b. During the Vietnam War there was a draft in which priority for the draft was determined by a national lottery. The days of the year were randomly re-ordered 1 through 365. (Those whose birthdays were ordered first were drafted before those with birthdays ordered second, and so forth.) Explain how the lottery might be used as an instrument to estimate the effect of military service on earnings. For more about this issue, see Joshua D. Angrist’s paper “Lifetime Earnings and the Vietnam Era Draft Lottery: Evidence from Social Security Administration Records,” American Economic Review, June 1990: 313–336.

SW-E12.2 Does viewing a violent movie lead to violent behavior? If so, the incidence of violent crimes, such as assault, should rise following the release of a violent movie that attracts many viewers. Alternatively, movie viewing may substitute for other activities, such as alcohol consumption, that lead to violent behavior, so that assaults should fall more when more viewers are attracted to the cinema. Use the data file Movies.dta, which contains data on the number of assaults and movie attendance for 516 weekends from 1995 through 2004.² A detailed description is given here. The data set includes weekend US attendance for strongly violent movies (such as Hannibal), mildly violent movies (such as Spiderman), and non-violent movies (such as Finding Nemo). The data also includes the count of the number of assaults for the same weekend in a subset of counties in the United States. Finally, the data set includes indicators for year, month, whether the weekend is a holiday, and various measures of the weather.

a. Regress the logarithm of the number of assaults ($ln\_assaults = \ln(assaults)$) on the year and month indicators. Is there evidence of seasonality in assaults? That is, do there tend to be more assaults in some months than others? Explain.

b. Now, regress total movie attendance ($attend = attend_v + attend_m + attend_n$) on the year and month indicators. Is there evidence of seasonality in movie attendance? Explain.

c. Regress $ln\_assaults$ on $attend_v$, $attend_m$, $attend_n$, the year and month indicators, and the weather and holiday control variables available in the data set.

1. Based on the regression, does viewing a strongly violent movie increase or decrease assaults? By how much? Is the estimated effect statistically significant?
2. Does attendance at strongly violent movies affect assaults differently than attendance at moderately violent movies? Differently than attendance at non-violent movies?
3. A strongly violent blockbuster movie is released and weekend attendance at strongly violent movies increases by 6 million; meanwhile, attendance falls by 2 million for moderately violent movies and by 1 million for non-violent movies. What is the predicted effect on assaults? Construct a 95% confidence interval for the change in assaults.³

d. It is difficult to control for all the variables that affect assaults and that might be correlated with movie attendance. For example, the effect of the weather on assaults and movie attendance is only crudely approximated by the weather variables in the data set. However, the data set does include a set of instruments, $pr\_attend\_v$, $pr\_attend\_m$, and $pr\_attend\_n$, that are correlated with attendance but are (arguably) uncorrelated with weekend-specific factors such as the weather that affect both assaults and movie attendance. These instruments use historical attendance patterns, not information on a particular weekend, to predict a film’s attendance in a given weekend. For example, if a film’s attendance is high in the second week of its release, then this could be used to predict that attendance was also high in the first week of its release. The details of the construction of these instruments are available in the Dahl and DellaVigna paper. Run the regression from part c, including year, month, holiday, and weather controls, but now using the instruments for attendance. Use this regression to re-answer the questions from part c: c(1)–c(3).

e. Based on your analysis, what do you conclude about the effects of violent movies on short-run violent behavior?

¹ Hint: Use the first-stage F-statistic and the usual rule of thumb that instruments with $F < 10$ are weak.
² These are aggregated versions of data provided by Gordon Dahl and Stefano DellaVigna, used in their paper, “Does Movie Violence Increase Violent Crime?”
³ Hint: Review section 7.3 and material surrounding equations 8.7 and 8.8.

Problem set 4

Fri, 13 Mar 2026 00:00:00 +0000

Welcome

Chapters 8 and 9 problems! Enjoy!

See the exercises below, or you can download them as a pdf.

What do I submit?

Your written-up answers to exercise questions. If you work on a piece of paper, please scan using some sort of phone software (like Microsoft Lens or Adobe Scan) rather than just taking a picture.
A do-file that runs your Stata analysis (for question 8).
A log file that includes the output from running your do-file (for question 8).

Exercises

The following equation describes the median housing price in a community in terms of amount of pollution ($nox$ for nitrous oxide) and the average number of rooms in houses in the community ($rooms$):

$log(price) = \beta_0 + \beta_1log(nox) + \beta_2rooms + u$

a. What are the probable signs of $\beta_1$ and $\beta_2$? What is the interpretation of $\beta_1$? Explain.

b. Why might $nox$ [or more precisely, $log(nox)$] and $rooms$ be negatively correlated? If this is the case, does the simple regression of $log(price)$ on $log(nox)$ produce an upward or a downward biased estimator of $\beta_1$?

c. Using data, the following equations were estimated:

$\widehat{log(price)} = 11.71 - 1.043 log(nox)$, $n = 506$, $R^2 = 0.264$

$\widehat{log(price)} = 9.23 - 0.718 log(nox) + 0.306 rooms$, $n = 506$, $R^2 = 0.514$

Is the relationship between the simple and multiple regression estimates of the elasticity of $price$ with respect to $nox$ what you would have predicted, given your answer in part (b)? Does this mean that $-0.718$ is definitely closer to the true elasticity than $-1.043$?

Read the box “The Return to Education and the Gender Gap” in Section 8.3 of your textbook (Stock & Watson).

a. Consider a man with 16 years of education and 2 years of experience. Use the results from column (4) of Table 8.1 and the method in Key Concept 8.1 to estimate the expected change in the logarithm of average hourly earnings (AHE) associated with an additional year of experience.

b. Explain why your answer to (a) does not depend on the region he is from.

c. Repeat (a), assuming 10 years of experience.

To answer this question, refer to Table 8.3: Nonlinear Regression Model of Test Scores in your textbook:

a. A researcher suspects that % Eligible for subsidized lunch has a nonlinear effect on test scores. In particular, he conjectures that increases in this variable from 10% to 20% have little effect on test scores but that changes from 50% to 60% have a much larger effect. i. Describe a nonlinear specification that can be used to model this form of nonlinearity. ii. How would you test whether the researcher’s conjecture was better than the linear specification in column (7) of Table 8.3?

b. A researcher suspects that the effect of income on test scores is different in districts with small classes than in districts with large classes. i. Describe a nonlinear specification that can be used to model this form of nonlinearity.

Labor economists studying the determinants of women’s earnings discovered a puzzling empirical result. Using randomly selected employed women, they regressed earnings on the women’s number of children and a set of control variables (age, education, occupation, and so forth). They found that women with more children had higher wages, controlling for these other factors. Explain how sample selection might be the cause of this result. (Hint: Notice that women who do not work outside the home are missing from the sample.) [This empirical puzzle motivated James Heckman’s research on sample selection that led to his 2000 Nobel Prize in Economics. See Heckman (1974)]

This question uses directed acyclic graphs (DAGs), which we will cover in class. You may also find it helpful to read Huntington-Klein, The Effect, Chapter 8: Causal Paths and Closing Back Doors, especially Sections 8.3–8.5.

Consider the relationship between a woman’s number of children and her earnings from question 4.

a. Draw a DAG that includes the following variables: Earnings, Number of Children, Decision to Work Outside the Home, and Ability/Motivation. Add arrows representing plausible causal relationships. For each arrow, write one sentence explaining why you included it.

b. Is “Decision to Work Outside the Home” a confounder, a collider, or a mediator on the path between Number of Children and Earnings? Explain.

c. When researchers study only employed women, they are conditioning on “Decision to Work.” Using your DAG, explain why this could produce a spurious positive relationship between number of children and earnings — even if children have no direct causal effect on earnings.

The demand for a commodity is given by $Q = \beta_0 + \beta_1 P + u$, where $Q$ denotes quantity, $P$ denotes price, and $u$ denotes factors other than price that determine demand. Supply for the commodity is given by $Q = \gamma_0 + \gamma_1P + v$, where $v$ denotes factors other than price that determine supply. Suppose $u$ and $v$ both have a mean of 0, have variances $\sigma^2_u$ and $\sigma^2_v$, and are mutually uncorrelated.

a. Solve the two simultaneous equations to show how Q and P depend on u and v. (Hint: In equilibrium, quantity supplied equals quantity demanded. Set the two equations equal and solve for P in terms of u and v. Then substitute back to find Q.)

b. Derive the means of P and Q. (Hint: Use your answers from part (a) and the fact that $E(u) = E(v) = 0$.)

c. (Optional) Derive the variance of P, the variance of Q, and the covariance between Q and P.

Revisit the box “The Return to Education and the Gender Gap” in Section 8.3 of your textbook (Stock & Watson). Discuss the internal and external validity of the estimated effect of education on earnings.

Use the dataset CollegeDistance.dta (described in Empirical Exercise AEE 4.3) to answer the following questions.

a. Run a regression of $ED$ on $Dist$, $Female$, $Bytest$, $Tuition$, $Black$, $Hispanic$, $Incomehi$, $Ownhome$, $DadColl$, $MomColl$, $Cue80$, and $Stwmfg80$. If $Dist$ increases from 2 to 3 (that is, from 20 to 30 miles), how are years of education expected to change? If $Dist$ increases from 6 to 7 (that is, from 60 to 70 miles), how are years of education expected to change?

b. Run a regression of $ln(ED)$ on $Dist$, $Female$, $Bytest$, $Tuition$, $Black$, $Hispanic$, $Incomehi$, $Ownhome$, $DadColl$, $MomColl$, $Cue80$, and $Stwmfg80$. If $Dist$ increases from 2 to 3 (from 20 to 30 miles), how are years of education expected to change? If $Dist$ increases from 6 to 7 (from 60 to 70 miles), how are years of education expected to change?

c. Run a regression of $ED$ on $Dist$, $Dist^2$, $Female$, $Bytest$, $Tuition$, $Black$, $Hispanic$, $Incomehi$, $Ownhome$, $DadColl$, $MomColl$, $Cue80$, and $Stwmfg80$. If $Dist$ increases from 2 to 3 (from 20 to 30 miles), how are years of education expected to change? If $Dist$ increases from 6 to 7 (from 60 to 70 miles), how are years of education expected to change?

d. Do you prefer the regression in (c) to the regression in (a)? Explain.

e. Add the interaction term $DadColl \times MomColl$ to the regression in (c). What does the coefficient on the interaction term measure?

f. Mary, Jane, Alexis, and Bonnie have the same values of $Dist$, $Bytest$, $Tuition$, $Female$, $Black$, $Hispanic$, $Incomehi$, $Ownhome$, $Cue80$, and $Stwmfg80$. Neither of Mary’s parents attended college. Jane’s father attended college, but her mother did not. Alexis’s mother attended college, but her father did not. Both of Bonnie’s parents attended college. Using the regressions from (e): i. What does the regression predict for the difference between Jane’s and Mary’s years of education? ii. What does the regression predict for the difference between Alexis’s and Mary’s years of education? iii. What does the regression predict for the difference between Bonnie’s and Mary’s years of education?

g. Is there any evidence that the effect of $Dist$ on $ED$ depends on the family’s income?

h. After running all these regressions (and any others that you want to run), summarize the effect of $Dist$ on years of education.

Problem set 3

Tue, 17 Feb 2026 00:00:00 +0000

Welcome

Note that this is your last problem set before the next exam!

See the exercises below, or you can download them as a pdf. You can download the data file you need for question 6 here, along with information on the variable definitions here.

What do I submit?

Your written answers to exercise questions. If you work on a piece of paper, please scan using some sort of phone software (like Microsoft Lens or Adobe Scan) rather than just taking a picture.
A do-file that runs your Stata analysis (for questions 6 and 7).
A log file that includes the output from running your do-file (for questions 6 and 7).

Exercises

Suppose that we want to estimate the effects of alcohol consumption ($alcohol$) on college grade point average ($colGPA$). In addition to collecting information on alcohol consumption and grade point averages, we also obtain attendance information (say, percentage of lectures attended, $attend$). A standardized test score (say, $SAT$) and high school GPA ($hsGPA$) are also available.

a. Should we include $attend$ along with alcohol as explanatory variables in a multiple regression model? What would be the interpretation of $\beta_{alcohol}$ if we did?

b. Should $SAT$ and $hsGPA$ be included as explanatory variables? Explain.

A researcher plans to study the causal effect of police on crime, using data from a random sample of U.S. counties. She plans to regress the county’s crime rate on the (per capita) size of the county’s police force.

a. Explain why this regression is likely to suffer from omitted variable bias. Which variables would you add to the regression to control for important omitted variables?

b. Use your answer to (a) and the expression for omitted variable bias (from the slides or textbook) to determine whether the regression will likely over- or underestimate the effect of police on the crime rate. (That is, is $\hat{\beta}_1 > \beta_1$ or $\hat{\beta}_1 < \beta_1$?)

Critique each of the following proposed research plans. Your critique should explain any problems with the proposed research and describe how the research plan might be improved. Include discussion of any additional data that needs to be collected, and the appropriate statistical techniques for analyzing those data.

a. A researcher is interested in determining whether a large aerospace firm is guilty of gender bias in setting wages. To determine potential bias, the researcher collects salary and gender information for all of the firm’s engineers. The researcher then plans to conduct a “difference in means” test to determine whether the average salary for women is significantly less than the average salary for men.

b. A researcher is interested in determining whether time in prison has a permanent effect on a person’s wage rate. He collects data on a random sample of people who have been out of prison for at least 15 years. He collects similar data on a random sample of people who have never served time in prison. The data set includes information on each person’s current wage, education, age, ethnicity, gender, tenure (time in current job), occupation, and union status, as well as whether the person has ever been incarcerated. The researcher plans to estimate the effect of incarceration on wages by regressing wages on an indicator variable for incarceration, including in the regression the other potential determinants of wages such as education, tenure, union status, and so on.

Consider a dataset that contains information on 4700 full-time full-year workers. The highest educational achievement for each worker was either a high school diploma or a bachelor’s degree. The workers’ ages ranged from 25 to 45 years. The data set also contains information on the region of the country where the person lived, marital status, and number of children. See below for variable definitions.

a. Is the college-high school earnings difference estimated from this regression statistically significant at the 5% level? Construct a 95% confidence interval of the difference.

b. Do there appear to be important regional differences in hourly earnings? Use an appropriate hypothesis test to explain your answer.

| Variable | Definition |
| --- | --- |
| AHE | average hourly earnings (in 2005 dollars) |
| College | 1 if college, 0 if high school |
| Female | 1 if female, 0 if male |
| Age | age (in years) |
| Ntheast | 1 if Region = Northeast, 0 otherwise |
| Midwest | 1 if Region = Midwest, 0 otherwise |
| South | 1 if Region = South, 0 otherwise |
| West | 1 if Region = West, 0 otherwise |

Consider the regression results below and do the following:

a. Construct the $R^2$ for each of the regressions.

b. Construct the homoskedasticity-only $F$-statistic for testing $\beta_3 = \beta_4 = 0$ shown in column (5). Is the statistic significant at the 5% level?

c. Construct a 99% confidence interval for $\beta_1$ for the regression in column (5).
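
For part (b), the homoskedasticity-only $F$-statistic can be computed from the $R^2$ values of the restricted and unrestricted regressions. As a reminder, the standard textbook formula is sketched below; the subscript labels are illustrative notation only.

$$ F = \frac{\left(R^2_{unrestricted} - R^2_{restricted}\right)/q}{\left(1 - R^2_{unrestricted}\right)/\left(n - k_{unrestricted} - 1\right)} $$

Here $q$ is the number of restrictions being tested ($q = 2$ for $\beta_3 = \beta_4 = 0$), $n$ is the sample size, and $k_{unrestricted}$ is the number of regressors in the unrestricted regression.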

Download the dataset growth.dta, which contains data on average growth rates from 1960 through 1995 for 65 countries, along with variables that are potentially related to growth. A detailed description of all variables is available here. For all questions, exclude Malta, which has an extremely high trade share.

a. Write the population model for a regression of growth on tradeshare, yearsschool, rev_coups, assassinations, and rgdp60. Then estimate it using OLS with heteroskedasticity-robust standard errors.

b. What is the value of the coefficient on rev_coups? Interpret the value of this coefficient. Is it large or small in a real-world sense?

c. Use the regression to predict the average annual growth rate for a country that has average values for all regressors.

d. Test whether the political variables rev_coups and assassinations, taken as a group, can be omitted from the regression. What is the p-value of the F-statistic?

e. After running your regression, pick one country in your sample. Report its actual value of growth, its fitted (predicted) value, and its residual. In one sentence, what does that residual mean?

f. Under what assumptions is the OLS estimator BLUE? For this regression, which of those assumptions are likely to hold, which are likely violated, and for which would you need more information? (One short sentence per assumption is enough.)

Consider the regression in Question 6: growth on tradeshare, yearsschool, rev_coups, assassinations, and rgdp60.

a. Give an example of a variable that is likely to be in the error term and would not violate the zero conditional mean assumption. Explain in one sentence.

b. Give an example of a variable that is likely to be in the error term and would violate the zero conditional mean assumption. Explain in one sentence.

Lab 8: Instrumental variables

Tue, 14 Apr 2026 00:00:00 +0000

Print-friendly pdf

It’s our final lab of the semester!

Materials

voucher.dta
Do-file template econ3500_lab_template.do

Objectives

Today we’re going to work with voucher.dta, a dataset of student performance from Rouse (1998). She measures the impact of private school vouchers on student achievement.

By the end of this lab, you should be able to complete the following tasks in Stata:

Estimate instrumental variable specifications and interpret them.
Output regression results using outreg2.

Why instrumental variables?

In Labs 6 and 7, we dealt with endogeneity — situations where our key independent variable is correlated with the error term, usually because of omitted variables, measurement error, or reverse causality. Fixed effects (Lab 7) solve this when the problem comes from time-invariant confounders.

Instrumental variables (IV) offer another approach: find a variable (the “instrument”) that affects $Y$ only through $X$. This instrument provides a source of exogenous variation in $X$ that we can use to estimate the causal effect. The two requirements for a valid instrument are:

Relevance: The instrument must be correlated with the endogenous variable ($X$).
Exclusion restriction: The instrument must affect $Y$ only through $X$ (not directly).
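
In symbols, for a simple model $Y = \beta_0 + \beta_1 X + u$ with a single instrument $Z$ (this notation is assumed here for illustration and is not tied to the lab data), the two conditions and the resulting IV estimand can be sketched as:

$$ \text{Relevance: } Cov(Z, X) \neq 0 \qquad \text{Exclusion: } Cov(Z, u) = 0 $$

$$ \beta_1 = \frac{Cov(Z, Y)}{Cov(Z, X)} $$

The second line follows from taking the covariance of both sides of the model with $Z$, which gives $Cov(Z, Y) = \beta_1 Cov(Z, X) + Cov(Z, u)$, and then applying the two conditions.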

Data context

The data come from an evaluation of the Milwaukee Parental Choice Program, which randomly offered school vouchers to students via a lottery. The final measure of student performance is mnce, their math test score in 1994 (after up to four years in a private school). We also have baseline performance: their math test score in 1990 (mnce90). The variable choiceyrs is the number of years actually enrolled in a private school, and selectyrs is the number of years a student was selected (via lottery) to receive a voucher.

The lottery creates a natural instrument: being selected for a voucher (which is random) affects the number of years enrolled in a private school, but shouldn’t directly affect test scores through any other channel.

Variables we’ll use

| variable | meaning | notes |
| --- | --- | --- |
| `mnce` | math score in 1994 | outcome variable |
| `mnce90` | math score in 1990 | baseline performance |
| `choiceyrs` | years enrolled in a choice school | endogenous variable |
| `selectyrs` | years selected to receive a voucher | instrument |
| `choiceyrs1`–`choiceyrs4` | dummies for years in choice school | used in Q9 |
| `selectyrs1`–`selectyrs4` | dummies for years selected for voucher | used in Q9 |
| `black` | Black indicator | |
| `hispanic` | Hispanic indicator | |
| `female` | female indicator | |

Key commands

| command | description |
| --- | --- |
| `ivregress 2sls y (x = z) controls, robust` | IV regression using two-stage least squares |
| `ivregress 2sls y (x = z) controls, robust first` | Same, reporting first-stage results |
| `predict yhat, xb` | Generate predicted values from the previous regression |
| `testparm varname` | Test significance of a coefficient (F-statistic) |
| `outreg2 using file.xls, replace` | Export regression results to Excel (first column) |
| `outreg2 using file.xls, append` | Add a column to an existing results table |

Conducting IV regressions with `ivregress`

General form:

ivregress estimator depvar [varlist1] (varlist2 = varlist_iv) [if] [in] [weight] [, options]

estimator is where we will type 2sls
depvar is your dependent variable
You can include other explanatory variables before or after the parentheses, [varlist1]
In the parentheses, write your endogenous variable(s) ($x$), then an equals sign, then your instrument(s) ($z$). Both sides can be lists!
The rest of it is just as you’re used to

Example:

To estimate the following two-stage least squares equation: $$ rent = \beta_0 + \beta_1 \widehat{hsngval} + \beta_2 pcturban + u$$ where $\widehat{hsngval}$ is predicted from the following first-stage equation $$ hsngval = \alpha_0 + \alpha_1 faminc + \alpha_2 pcturban + v $$

webuse hsng2
ivregress 2sls rent (hsngval = faminc) pcturban, robust

You can add , first to report the first-stage results:

ivregress 2sls rent (hsngval = faminc) pcturban, robust first
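
As an alternative to reading the first-stage table by eye, Stata's built-in `estat firststage` postestimation command summarizes instrument strength after `ivregress`. The sketch below reuses the hsng2 example from above; it illustrates the command and is not part of the lab's required steps.

```stata
// Sketch: same example data as above (downloaded by webuse)
webuse hsng2, clear

// IV regression: hsngval is endogenous, faminc is the instrument
ivregress 2sls rent (hsngval = faminc) pcturban, robust

// Report first-stage summary statistics, including the F-statistic
// for the excluded instrument(s)
estat firststage
```

The F-statistic reported here is the number to compare against the rule-of-thumb threshold of 10.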

Outputting your results with `outreg2`

We are very good at reading raw Stata output. But raw Stata output has no place in our papers. How do we make it pretty? There are lots of ways, including putexcel, which lets you create customizable Excel tables with your outputs (good for descriptive statistics), and estout, which does the same thing but is more regression oriented.

Personally, I like outreg2, because it’s easy to set up and use. So that’s what we’ll use!

Installation required: outreg2 is a user-created package, which means you have to install it first:

ssc install outreg2

You only need to do this once per computer. If you get an error that outreg2 is already installed, that’s fine — just keep going.

You’ll run outreg2 after estimating a regression. It takes your results and saves them to a table. You can run it multiple times and generate columns of results within the same Excel sheet, which is pretty handy! The general format of outreg2 is this:

// You can copy and paste this into Stata, and it should work!
// Note that it will save to your working directory
sysuse auto, clear
// Specification 1
regress mpg foreign weight headroom trunk length turn displacement
outreg2 using myfile.xls, replace
// Specification 2 (add on)
regress mpg foreign weight headroom trunk length turn displacement, robust
outreg2 using myfile.xls, append

You can customize with lots of options! (See help outreg2, or check out these resources)

What sort of things?

Export directly to Word
- outreg2 using myfile, word replace
Add notes
- outreg2 using myfile, addnote(Dummy variables not shown)
Report only some variables
- outreg2 using myfile, keep(mpg foreign)
Modify number of decimal places
- outreg2 using myfile, dec(5)
You can use a loop to make a whole set of columns!

An example:


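sysuse auto, clear
// Start with "replace" so the first pass creates the file; later passes append columns
local r "replace"
forval num = 1/5 {
    // One regression per repair-record category
    regress mpg weight headroom if rep78 == `num'
    // Mean of the dependent variable for the same subsample
    sum mpg if rep78 == `num'
    local mean = `r(mean)'
    outreg2 using myfile.xls, `r' keep(headroom) title("Sample Graph") nocons addtext("Rep78", `num') addstat("Mean", `mean') auto(2) bracket
    local r "append"
}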

Workflow overview

Load voucher.dta and start your log file.
Explore the data (summarize, describe).
Estimate OLS regressions (naive estimates).
Run the first stage and check instrument relevance.
Estimate IV models (by hand and with ivregress).
Compare OLS and IV results.
Create a summary table with outreg2.
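
The steps above can be sketched on Stata's hsng2 example data rather than on voucher.dta. This is only an outline under stated assumptions: the log file name `lab8_sketch.log` is made up for illustration, and the built-in `estimates table` stands in for the final outreg2 step so that the sketch runs without any user-written packages.

```stata
// Steps 1-2: start a log, load the data, and explore
capture log close
log using lab8_sketch.log, replace text
webuse hsng2, clear
describe
summarize rent hsngval faminc pcturban

// Step 3: naive OLS estimate
regress rent hsngval pcturban, robust
estimates store ols

// Step 4: first stage (endogenous regressor on instrument and controls)
regress hsngval faminc pcturban, robust

// Step 5: IV estimate in one command
ivregress 2sls rent (hsngval = faminc) pcturban, robust
estimates store iv

// Step 6: compare OLS and IV coefficients and standard errors side by side
estimates table ols iv, b(%9.4f) se(%9.4f)

log close
```

Storing each set of results with `estimates store` keeps them available for comparison after later regressions overwrite the active estimates.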

Lab 8 Worksheet

What do I submit?

Your written-up answers to exercise questions (1)–(10). This can be typed or written out then scanned (or photographed), in any reasonable format.
The do-file you created that runs this analysis.
A log file that contains the results from this exercise.
A table with your regression results (six columns, from outreg2). Include this with your written answers.

Use robust standard errors in all regressions.

Questions

1. In your do-file, start a log and open voucher.dta.
2. Summarize your data. Of the 990 students in the sample, how many were never awarded a voucher? How many had a voucher for all four years? How many actually attended a choice school for four years?

Hint: tab selectyrs and tab choiceyrs will show you the distribution.
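
If you want a count for one specific value instead of reading it off the table, `count if` is a handy companion to `tab`. A minimal sketch using Stata's built-in auto data, where `rep78` is just a stand-in for a variable with a handful of categories:

```stata
// Sketch on the auto example data (not the lab data)
sysuse auto, clear

// Full distribution, including missing values
tabulate rep78, missing

// Number of observations in one category; the result is stored in r(N)
count if rep78 == 5
display "Observations with rep78 == 5: " r(N)
```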

3. Predict the relationship between choice school attendance and math scores by regressing math scores mnce (dependent variable) on number of years enrolled in a choice school choiceyrs (independent variable). What do you find? Is this what you expect? What happens if you add in the variables black, hispanic, and female? Write your results in equation form.
4. Why might choiceyrs be endogenous? Explain.
5. Now, estimate a regression of choiceyrs (dependent variable) on selectyrs (independent variable), including race/ethnicity and gender controls. Why is this a reasonable choice of an instrument? What is the F-statistic on selectyrs?

Hint: Use testparm selectyrs after the regression to get the F-statistic. A rule of thumb is that the F-statistic should be at least 10 for the instrument to be considered strong enough.
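
The same pattern can be sketched on the hsng2 example data, where `faminc` plays the role of the instrument. The variable names below belong to that example dataset, not to voucher.dta.

```stata
// Sketch: first-stage regression and instrument-strength check
webuse hsng2, clear

// First stage: endogenous regressor on the instrument plus controls
regress hsngval faminc pcturban, robust

// Test the instrument; testparm stores the F-statistic in r(F)
testparm faminc
display "First-stage F = " r(F)
```

With a single instrument, this F-statistic is the square of the instrument's t-statistic in the first stage.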

6. Based on the previous regression, use the predict command to generate a predicted $\widehat{choiceyrs}$. Estimate the regression of mnce on $\widehat{choiceyrs}$, including race/ethnicity and gender controls. Write the estimated equation. How does your result compare to your OLS estimate?

Reminder: The predict command generates fitted values from the most recently estimated regression. Run it immediately after the Q5 regression — before running anything else:

predict choiceyrs_hat, xb

Then use choiceyrs_hat as your independent variable in the second-stage regression.
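
Written out, the two manual steps amount to the following pair of equations. The symbols $\pi_j$, $v$, and $u$ are notation introduced here for illustration and do not come from the lab handout.

```latex
\[
\begin{aligned}
\text{First stage:}\quad choiceyrs &= \pi_0 + \pi_1\, selectyrs + \pi_2\, black + \pi_3\, hispanic + \pi_4\, female + v \\
\text{Second stage:}\quad mnce &= \beta_0 + \beta_1\, \widehat{choiceyrs} + \beta_2\, black + \beta_3\, hispanic + \beta_4\, female + u
\end{aligned}
\]
```

The fitted value $\widehat{choiceyrs}$ from the first line is what predict stores in choiceyrs_hat.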

Re-estimate a regression of mnce (dependent variable) on choiceyrs (independent variable) using selectyrs as an instrument for choiceyrs. This time, estimate the equation in one command line using ivregress 2sls. How do your results change, if at all?

Example syntax:

ivregress 2sls mnce (choiceyrs = selectyrs) black hispanic female, robust

Important: The coefficients from Q6 and Q7 should be the same, but the standard errors will differ. That’s because the manual approach (Q6) doesn’t correctly account for the fact that $\widehat{choiceyrs}$ is a generated regressor. ivregress adjusts the standard errors automatically — always use it in practice.
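
As a sketch of how to check instrument strength without the manual first stage, estat firststage can be run right after ivregress 2sls. This assumes the same variable names as the example syntax above.

```stata
* 2SLS in one step, with robust standard errors
ivregress 2sls mnce (choiceyrs = selectyrs) black hispanic female, robust

* First-stage summary statistics, including the F-statistic on the instrument
estat firststage
```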

Repeat your IV analysis, but this time include a control for baseline achievement by adding mnce90. Write the results in equation form below. Do you find these results convincing? Explain.

Heads up: mnce90 is missing for many students — your sample will drop from 990 to about 328 observations. This is expected. Think about what it means for your results.
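
One way to think about the smaller sample is to compare students with and without a baseline score. The sketch below assumes voucher.dta is open; the variable name has_mnce90 is made up for this illustration.

```stata
* Flag students with a non-missing baseline score (has_mnce90 is a hypothetical name)
gen byte has_mnce90 = !missing(mnce90)

* How many observations are kept versus dropped?
tab has_mnce90

* Compare average characteristics across the two groups
tabstat selectyrs choiceyrs black hispanic female, by(has_mnce90) statistics(mean n)
```

Large differences in these means would suggest that the students with baseline scores are not representative of the full sample.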

We can also use multiple instruments for multiple endogenous variables. The variables choiceyrs1, choiceyrs2, etc. are dummy variables for the number of years a student attended a choice school: choiceyrs1 = 1 if the student attended for exactly 1 year, choiceyrs2 = 1 for exactly 2 years, and so on. The variables selectyrs1–selectyrs4 are defined analogously for lottery selection.

Estimate the following equation using IV: $$\begin{split} mnce &= \beta_0 + \beta_1 choiceyrs_1 + \beta_2 choiceyrs_2 + \beta_3 choiceyrs_3 + \beta_4 choiceyrs_4 \\ &\quad + \beta_5 black + \beta_6 hispanic + \beta_7 female + \beta_8 mnce90 + u \end{split}$$

Hint: Put all the endogenous variables on the left of the = and all the instruments on the right:

ivregress 2sls mnce ///
(choiceyrs1 choiceyrs2 choiceyrs3 choiceyrs4 = ///
selectyrs1 selectyrs2 selectyrs3 selectyrs4) ///
black hispanic female mnce90, robust

Finally, go back through your regressions in your do-file. After each regression (there should be six: OLS without controls, OLS with controls, IV by hand, IV using ivregress, IV with mnce90, and IV with multiple instruments), add a line of code to output the results to a Word or Excel file using outreg2.

Include a table with your results with your submission — there should be six columns in one table.

Hint: Use replace for the first regression and append for each subsequent one:

regress mnce choiceyrs, robust
outreg2 using lab8_results.xls, replace
regress mnce choiceyrs black hispanic female, robust
outreg2 using lab8_results.xls, append

// … continue for remaining regressions

Submission checklist

Answers to questions (1)–(10)
Do-file with comments for each question
Log file that matches your do-file commands
outreg2 table (six columns)
Make sure your do-file includes log close at the end

References: Rouse, Cecilia Elena (1998), “Private School Vouchers and Student Achievement: An Evaluation of the Milwaukee Parental Choice Program,” The Quarterly Journal of Economics 113(2), 553-602.

Research Paper: Presentation

Thu, 09 Apr 2026 00:00:00 +0000

Due date is rolling: your slides are due the day of your assigned presentation. Because we’re running out of class time, we can’t do extensions!

Overview

Let’s share our work! You’ll prepare and deliver a 6-8 minute presentation of your paper, with accompanying slides. This will be one presentation per paper. If you are working with a partner, it should be one set of slides, with both members contributing to the creation and delivery.

Our objectives here are the following:

To communicate the key elements of your paper clearly and concisely (which, in turn, will help advance your paper)
To share and receive feedback on areas of improvement before finalizing your papers
To share with your peers and learn what each of you has been working on!

Guidelines for presentation

Your presentation should cover the main elements of your paper:

Introduction
Research questions
Background/motivation/related literature
Data
Empirical strategy
Results
Limitations/discussion
Conclusion

It doesn’t need to be scripted, but you should have practiced, such that your presentation flows smoothly.
Slides should serve as a guide, not a substitute for your narration, so you should avoid reading directly off your slides
It should be between 6 and 8 minutes. I will cut you off at 8 minutes. I will do it.

Example

📥 Download this week's slides

Additional student examples are available on Brightspace (see Gated Resources)

Deliverables and due dates

Your due date is the day you signed up for via Doodle poll: April 28 or April 30 at 1:15pm

On Brightspace: Submit a copy of your slides before your presentation.

Presentation rubric

Printer-friendly PDF

Research Paper: Final Submission

Thu, 09 Apr 2026 00:00:00 +0000

Research papers are fairly formulaic, and that’s a good thing - it helps readers know where to look for information, depending on what they want to get out of it.

What should I submit?

Your paper is due at 11:59pm May 04. I can accept extensions only up to May 05, as there are external grading deadlines I need to meet.

You should submit the following (see rubric for details):

Final paper in pdf or docx format (must include an AI attribution statement — see rubric)
Stata do-file with all analysis you conducted
Stata log file with results for analysis conducted in your do-file.

I will grade your papers following the rubric. If you would like me to share comments, you must opt in by filling out the feedback survey. If you do not fill it out, you will not receive feedback!

Review the research paper checklist for lots of suggestions.

Rubric

Total: 102 marks	100 = Excellent	80 = Adequate	60 = Marginal	40 = Poor
Motivation/Literature (18 marks)
Introduction	Introduction provides complete overview of paper, motivates research question using sources	Introduction provides some overview of paper, motivation clear with limited sources	Introduction vague; motivation minimal	Incomplete introduction, no motivation
Research question	Research question well identified, specific	Research question stated, not specific	Research question vague, not answerable	Cannot identify research question in paper
Literature	Important literature discussed and linked to topic	Important literature included, not linked to research question/paper	Scattered lit. discussion, poorly linked to topic (missing or irrelevant papers)	Sparse literature, not linked to topic

Methodology/Analysis (30 marks)
Data	Clear discussion of data sources and any data cleaning; data cleaned appropriately	Data sources referenced but incomplete discussion; some data issues overlooked	Limited discussion of data	No discussion of data sources or cleaning
Empirical methods	Methodology discussed and empirical methods applied correctly	Methodology generally correct, with some issues overlooked	Major errors in empirical methods	Fundamental misunderstanding of empirical methods/no microdata used
Discussion of results	Results discussed and interpreted clearly	Results discussed, but inadequate interpretation	Results presented without interpretation	Poor discussion of results, no interpretation
Choice of evidence	Presented evidence addresses research question, is well utilized	Presented evidence related, only partially addresses research question	Evidence related, but not directly relevant to research question.	Evidence does not address research question
Figures and tables	Figures and tables appropriate to analysis, easy to interpret	Appropriate figures/tables included, difficult to interpret	Irrelevant figures/tables included or key figures/tables missing	Insufficient figures/tables, poorly presented

Limitations	Limitations discussed and minimized through analysis	Limitations discussed, few steps to minimize	Incomplete discussion of limitations	No discussion of limitations
Conclusions/interpretation (18 marks)
Conclusions	Clear presentation of conclusions, qualifications, consequences, and contributions	Conclusions established, limited discussion of implications and contributions	Fails to make clear conclusions, limited discussion of interpretation/contributions	Cannot discern conclusions
Critical thinking	Demonstrates independent and critical thinking	Demonstrates some independent and critical thinking	Limited evidence of independent and critical thinking	No evidence of independent and critical thinking
Argumentation	Assertions are qualified and well supported	Most assertions are qualified and well supported	Assertions are overly strong or unsupported	Assertions made in contrast to evidence or without evidence
Written presentation (24 marks)
Organization	Well organized, easy to understand	Good organization, some parts out of place	Unclear organization	Disorganized, impedes understanding
Writing style	Clear and easy to read	Awkward or wordy writing, clear planning	Readable but difficult to follow	Difficult to understand
Grammar	Few grammatical and typographical errors	Some grammatical and typographical errors, but do not impede understanding	Moderate grammatical errors/typos	Frequent errors impede understanding
Formatting	Meets all formatting requirements	Minor deviation from formatting requirements	Exceeds page limit/major deviation from formatting requirements	Formatting requirements completely disregarded
Replication code (10 marks)
Do-files and log	Well-documented, easy to read	Detailed documentation, somewhat confusing	Unclear documentation	Little to no documentation

AI Attribution (2 marks)
AI use statement	Statement identifies all AI tools used and their specific purposes (or explicitly states no AI was used)	—	—	No statement included, or statement does not identify tools and purposes

Lab 7: Difference in differences

Mon, 06 Apr 2026 00:00:00 +0000

Print-friendly pdf

Materials

banks.dta
nsly_marijuana.dta
Do-file template econ3500_lab_template.do

Objectives

There are two separate parts to this lab — a set of data for working with difference-in-differences models, and another set for working with fixed-effects models.

By the end of this lab, you should be able to complete the following tasks in Stata:

Estimate and interpret difference-in-differences models
Estimate panel data models using dummy variables
Interpret panel data models

What is panel data?

Up to now, we’ve worked with cross-sectional data — one observation per person (or state, or county) at a single point in time. In this lab, we’ll work with panel data (also called longitudinal data), where we observe the same individuals or units across multiple time periods.

Panel data lets us control for characteristics of each unit that don’t change over time — even ones we can’t directly measure — by comparing each unit to itself over time. This is the key idea behind fixed effects models.
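
One way to see why fixed effects remove time-invariant characteristics is the within transformation, sketched here for a single regressor. The notation ($y_{it}$, $x_{it}$, $a_i$, $u_{it}$) is generic and chosen for illustration.

```latex
\[
\begin{aligned}
y_{it} &= \beta_0 + \beta_1 x_{it} + a_i + u_{it} \\
\bar{y}_i &= \beta_0 + \beta_1 \bar{x}_i + a_i + \bar{u}_i \\
y_{it} - \bar{y}_i &= \beta_1 \left(x_{it} - \bar{x}_i\right) + \left(u_{it} - \bar{u}_i\right)
\end{aligned}
\]
```

Subtracting each unit's own time average (second line) from the original equation (first line) eliminates $a_i$, whether or not it is observed.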

What is difference-in-differences?

Difference-in-differences (DiD) is a method for estimating causal effects when one group is exposed to a treatment and another is not. The idea: compare how the outcome changed over time for the treatment group vs. the control group. The first difference removes time-invariant characteristics of each group; the second difference removes common time trends. What’s left is the estimated treatment effect — if the two groups would have trended the same way absent the treatment.
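
The two differences described above can be written compactly. In this sketch, $T$ and $C$ denote the treatment and control groups, $\bar{Y}$ is a group average, and $e$ is an error term; all of this notation is introduced here for illustration.

```latex
\[
\hat{\delta}_{DiD} = \left(\bar{Y}_{T,\,post} - \bar{Y}_{T,\,pre}\right) - \left(\bar{Y}_{C,\,post} - \bar{Y}_{C,\,pre}\right)
\]
\[
Y = \alpha + \beta\, treat + \gamma\, post + \delta\,(treat \times post) + e
\]
```

In the regression form, the coefficient $\delta$ on the interaction term is the difference-in-differences estimate.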

Key commands

command	description
`xtset panelvar timevar`	Declare your data as a panel (e.g., `xtset id year`)
`xtreg y x, fe`	Panel regression with fixed effects on `panelvar`
`xtreg y x, fe cluster(panelvar)`	Same, with clustered standard errors
`i.varname`	Add fixed effects for every value of `varname`
`xi: reg y i.varname`	Same as above, but works with string variables
`areg y x, absorb(varname)`	Absorb fixed effects (estimated but not reported)

Using `xtset` and `xtreg`

The xtset command tells Stata that you have panel data. For example, if you have individual and year data, then you would enter xtset id year, or whatever the appropriate variable names are.

General format: xtset panelvar timevar

After declaring your panel with xtset:

Use xtreg instead of regress for panel regression. Everything else proceeds as normal.
Add ,fe to estimate a fixed effects model, where the fixed effects are the panelvar variable you declared.
Add cluster(panelvar) to cluster standard errors at the panel level (accounts for correlation within units over time).

For example: xtreg income education i.year, fe cluster(id) regresses income on education with individual fixed effects (from xtset) and year fixed effects (from i.year), clustering standard errors at the individual level.

Adding other fixed effects

You can add fixed effects to a model more generally with the i. prefix or areg. A few examples:

xi: reg income i.educ i.bpl, robust
reg income i.educ i.bpl, robust
areg income i.educ, robust absorb(bpl)

xi: — this prefix is necessary for adding i. variables if the variables are in string form. You can also use it to do fancier interactions with fixed effects, like xi: reg income i.educ*i.bpl, robust
You can exclude the prefix and just do i.var to create indicator variables so long as your variable is numeric
You can use areg to “absorb” a set of fixed effects: they are estimated but not reported in your output. The coefficients match those from xtreg with the fe option, but the standard errors can differ because the two commands adjust for degrees of freedom differently.
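
To see how the two approaches line up, here is a sketch that estimates the same individual fixed effects model both ways. It assumes a panel with the variables id, year, income, and marij (the names used in Part B of this lab).

```stata
* Individual fixed effects via xtreg (requires xtset first)
xtset id year
xtreg income marij i.year, fe cluster(id)

* Same fixed effects, absorbed rather than reported
areg income marij i.year, absorb(id) vce(cluster id)
```

Both commands should report the same coefficient on marij; any difference in the standard errors reflects how each command adjusts for degrees of freedom.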

Workflow overview

Load a dataset and start your log file.
Explore the data structure (describe, browse, tab).
For Part A: Calculate the DiD estimator by hand, then estimate it as a regression.
For Part B: Declare your panel data and estimate fixed-effects models.
Compare results across specifications and interpret.
Answer the worksheet questions.

Lab 7 Worksheet

What do I submit?

Your written-up answers to exercise questions (1)–(18). This can be typed or written out then scanned (or photographed), in any reasonable format.
The do-file(s) you created that run this analysis
A log file that contains the results from this exercise.

Part A: Difference-in-differences

This part looks at a simple difference-in-differences model based on Richardson and Troost (2009).¹

Data context

Mississippi is split between two Federal Reserve Districts. During the early years of the Great Depression, each district took a different approach to bank runs. The Sixth District increased lending, while the Eighth District responded by restricting lending to threatened banks. We look at the impact of these policies on bank survival rates using difference-in-differences.

Each row in banks.dta represents a Federal Reserve district in a given year. The dataset is small — use browse to see the full thing.

Variables (Part A)

variable	meaning	notes
`district`	Federal Reserve district	6 or 8
`year`	year
`bib`	number of banks in business	outcome variable

Tip: use describe and browse to confirm the variable names in your dataset.

Questions

Use robust standard errors in all regressions.

Start a new do-file and change directory to your working directory.
In your do-file, start a log and open banks.dta.
Using pencil & paper or electronic means of your choosing (you don’t need to do this in Stata), plot a graph of the number of banks in business, by district, by year.
- Plot number of banks in business on the y-axis and year on the x-axis.
- Include only the years 1930 and 1931.
- Draw separate lines for the numbers of banks in District 6 and District 8.
- Draw a dotted “counterfactual” line based on your understanding of the change in bank policies.
- Mark all four actual values clearly.

Hint: The counterfactual line shows what would have happened to District 8 if it had followed the same trend as District 6. To draw it: start from District 8’s 1930 value and apply the same change that District 6 experienced between 1930 and 1931.
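
In symbols, the counterfactual endpoint described in the hint can be sketched as follows, where the superscript $cf$ is notation introduced here for the counterfactual value.

```latex
\[
bib^{cf}_{8,1931} = bib_{8,1930} + \left(bib_{6,1931} - bib_{6,1930}\right)
\]
\[
bib_{8,1931} - bib^{cf}_{8,1931} = \left(bib_{8,1931} - bib_{8,1930}\right) - \left(bib_{6,1931} - bib_{6,1930}\right)
\]
```

The gap between District 8's actual 1931 value and this counterfactual value is the difference-in-differences estimate.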

First, we’re going to calculate a difference-in-difference estimator by hand between 1930 and 1931. Using the browse command, fill in $x$ values from the following table:

Number of banks in business

District	1930	1931	1931-1930
District 6	x	x	x
District 8	x	x	x
District 8 - District 6	x	x	x

What is the difference-in-difference estimator?

Hint: Use browse or list if year == 1930 | year == 1931 to see the values you need.
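
As an alternative to browse, the four cell values can be displayed in a two-way layout. This is a sketch that assumes banks.dta is open with the variables district, year, and bib.

```stata
* Show the number of banks in business for each district-year cell
tabulate district year if inlist(year, 1930, 1931), summarize(bib) means
```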

Now, generate the following variables:
- treat: a binary variable equal to 1 for District 8 and 0 otherwise
- post: a binary variable equal to 1 for the year 1931 or greater
- treatXpost = treat*post

Hint: Use tab district and tab year to check the values before generating your variables. For example:

gen treat = district == 8
gen post = year >= 1931
gen treatXpost = treat * post

Using the above variables, estimate the impact of looser lending restrictions on the number of banks using a difference-in-difference estimator, restricting the sample to 1930 and 1931. Write your estimates in equation form.

Reminder: You can restrict the sample within a regression using if without dropping data:

regress bib treat post treatXpost if year == 1930 | year == 1931, robust
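
The same specification can also be sketched with factor-variable notation, which builds the interaction for you. This assumes treat and post have already been generated as above.

```stata
* Equivalent specification: ## adds both main effects and their interaction
regress bib i.treat##i.post if inlist(year, 1930, 1931), robust
```

The coefficient labeled 1.treat#1.post corresponds to the coefficient on treatXpost.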

Now estimate the same regression (same variables), but remove the sample restriction so all years are included. What is the overall impact of looser lending restrictions on bank survival? Write your estimates in equation form.
State clearly the assumption needed to interpret these difference-in-difference estimators as causal.

Part B: Fixed effects

Next, we’re going to look at the relationship between marijuana use and income using the National Longitudinal Survey of Youth 1997 Cohort (NLSY97).

Data context

Each row in nsly_marijuana.dta is an individual-year observation from the NLSY97 — the same people surveyed across multiple years. This is panel data: we observe the same individuals over time, which lets us control for time-invariant individual characteristics (like innate ability or family background) using fixed effects.

Variables (Part B)

variable	meaning	notes
`id`	individual identifier	use with `xtset`
`year`	survey year (1997–2011)	use with `xtset`
`income`	total wage and salary income
`marij`	used marijuana in past year	1 = yes, 0 = no
`gender`	gender	1 = male, 2 = female
`race`	race/ethnicity	4 categories (use `tab race` to see labels)

Questions

Now switch to the second dataset. Open nsly_marijuana.dta in your do-file.
If starting a new do-file, set your working directory and start a log. (You can also continue in the same do-file from Part A.)
How many individuals are in the data? How many years are they observed?

Hint: Try codebook id to see the number of unique individuals, and tab year to see which years are in the data.
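
Another way to answer this, sketched below, is to declare the panel and ask Stata to describe its structure. It assumes nsly_marijuana.dta is open.

```stata
* Declare the panel, then summarize its structure
xtset id year
xtdescribe
```

The xtdescribe output lists the number of individuals, the span of years, and the most common patterns of participation.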

Estimate a regression of income on marijuana use (marij), with no additional controls. Report your results in equation form.
Estimate the same regression, but add any controls you deem important from the relatively limited selection available (use describe to see what’s there). There is no single correct answer, so use your judgment and explain your choices. How do the results change? Report your results in equation form.
One way to estimate fixed effects models is to use xtreg with the ,fe option. Use xtset to tell Stata you have panel data, then estimate a fixed-effects regression of whether marijuana use affects income.

Your model should include:
- Individual-level fixed effects (these come from xtreg ... , fe)
- Year-level fixed effects (add i.year to your regression)
- Clustered standard errors at the individual level

Step by step:

xtset id year
xtreg income marij i.year, fe cluster(id)

Clustering standard errors at the id level accounts for the fact that observations from the same person across years are not independent.

What is the coefficient on marij? What is the interpretation?
After adding fixed effects, should you include controls for gender and race/ethnicity to reduce omitted variable bias? Why or why not?

Think about it: What happens to a variable that never changes within an individual when you include individual fixed effects?

How do your results in question 14 using fixed effects compare to your results in questions 12 and 13? Why do they differ?
Name one specific factor that would create omitted variable bias in the pooled OLS regressions (questions 12–13) but is controlled for by fixed effects.
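
To compare specifications side by side, one option is to store each set of estimates and tabulate them. This is a sketch under the assumption that the variable names match the Part B table; the stored names pooled and fixed are made up for this illustration.

```stata
* Pooled OLS with robust standard errors
regress income marij, robust
estimates store pooled

* Individual and year fixed effects, clustered by person
xtset id year
xtreg income marij i.year, fe cluster(id)
estimates store fixed

* Coefficient and standard error on marij from each model
estimates table pooled fixed, keep(marij) b se
```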

¹ Based on Chapter 5 of Mastering ’Metrics.

Research Paper: Referee Report

Wed, 01 Apr 2026 00:00:00 +0000

Purpose:

To practice generating constructive criticism for imperfect work
To link what we’ve learned in class to an original analysis
To generate useful feedback for your peers

Key Elements

Summary of the paper – to convey to the editor that you understand the paper.
Major comments

Big-picture changes that the author could make to improve the paper
2-4 major comments is sufficient
It is not enough just to criticize!

Minor comments:

Clarifying questions, areas that are unclear
Small changes the author could make to improve the paper – add an additional covariate, try adjusting the specification, etc.
Depending on paper, you could have just a few, or you could have a lot
Don’t copy edit the paper!

What should this look like?

All the components above
- At least 2 major comments (substantial suggestions)
- At least 3 minor comments (minor suggestions)
Will probably be 1-2 pages
- 3 pages = overkill; < 1 page = dig in deeper!
Written in collegial tone

Submission

Submit your referee report as a Word or PDF document
And send directly to your partner

Note that I’m assigning referee partners at an individual level, not at a paper level. If you are working with a partner, each of you will complete a referee report, and your paper will receive two reviews.

Research Paper: Rough Draft

Wed, 01 Apr 2026 00:00:00 +0000

This is optional! If you would like feedback on your rough draft, submit it to me and I’ll get back to you within a few days.

I cannot accept extensions on the rough draft. However, partial drafts are very welcome — submit whatever you have and I’ll give you feedback on it.

Have Stata/coding questions? Struggling with framing? Rae is very available for email support — don’t hesitate to reach out to them!

Lab 6: Internal validity and LPM

Fri, 20 Mar 2026 00:00:00 +0000

Print-friendly pdf

Materials

acs2024_4pct.dta
Do-file template econ3500_lab_template.do

Objectives

Today we’re going to work with acs2024_4pct.dta, which contains information from the 2024 American Community Survey.

By the end of this lab, you should be able to complete the following tasks in Stata:

Think about sample selection issues
Estimate and interpret linear probability models
Reason about omitted variable bias and measurement error

Data context

Each row in acs2024_4pct.dta is an individual from the 2024 ACS microdata. The file includes demographics, education, labor-force status, work hours, and earnings variables. We will restrict our sample to married adults and explore the gender wage gap, labor force participation, and how sample selection affects our estimates.

Tip: use describe, codebook, and tab ... , nolabel to check labels and coding for any variables you plan to use.

Variables we’ll use

variable	meaning	notes
`incwage`	wage and salary income	check for topcodes (999999 = N/A)
`sex`	sex	1 = Male, 2 = Female
`age`	age in years
`marst`	marital status	1 = married spouse present, 2 = married spouse absent
`labforce`	labor force status	0 = N/A, 1 = not in LF, 2 = in LF
`uhrswork`	usual hours worked per week	0 = N/A (did not work last year)
`wkswork1`	weeks worked last year	0 = did not work

Key commands

command	description
`codebook var1`	Look at key details for `var1`
`clonevar var1 = var2`	Make a new variable, `var1` that duplicates `var2` (including labels)
`_pctile var1, per(99)`	Calculate the 99th percentile of `var1` and store it in `r(r1)`
`ret list`	Show the results stored by the last command (handy!)

Linear Probability Models

What happens when our dependent variable is binary? We can use it anyway! Using OLS with a binary dependent variable is called a linear probability model. There is plenty of debate about whether (and when) this is an okay idea, as it can lead to predictions that are below zero or greater than 1, and it violates homoskedasticity assumptions. We can fix the latter by estimating heteroskedasticity-robust standard errors, and the general consensus seems to be that usually, we’re okay using an LPM. (Though we can do better!)

What about interpretation? We interpret coefficients in percentage points (not percents!)

Consider the following:

$Married_i = \beta_0 + \beta_1 age_i + \beta_2 educ_i + u_i$

$\beta_1$ means that a 1-year increase in age is associated with a 100*$\beta_1$ percentage-point change in the probability of being married. So if $\beta_1$ is 0.05, that means that being one year older is associated with a 5 percentage point increase in the likelihood of being married.
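
A short worked version of this arithmetic may help separate percentage points from percents. The baseline probability of 0.50 below is a hypothetical number chosen only for illustration.

```latex
\[
\Delta \Pr(Married_i = 1) = \beta_1 \,\Delta age_i = 0.05 \times 1 = 0.05 \quad \text{(5 percentage points)}
\]
\[
\text{From a baseline of } 0.50:\quad 0.50 \rightarrow 0.55, \qquad \frac{0.05}{0.50} = 0.10 \quad \text{(a 10 percent increase)}
\]
```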

LPM in Stata

A linear probability model looks exactly like a typical OLS regression — but your dependent variable is binary (0/1):

regress lf female, robust

The coefficient on female tells you the change in the probability (in decimal form) of being in the labor force associated with being female. Multiply by 100 to express in percentage points.
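
The concern about predictions outside the 0 to 1 range can be checked directly. The sketch below assumes lf and female have already been generated (as in the worksheet) and adds age as a control; the variable name p_lf is made up for this illustration.

```stata
* Linear probability model with one added control
regress lf female age, robust

* Fitted probabilities from the LPM
predict p_lf, xb

* Count fitted values that fall outside the unit interval
count if (p_lf < 0 | p_lf > 1) & !missing(p_lf)
```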

For great slides on this (and a deeper dive), check out this resource!

Lab Video

**Note that this video is from an earlier version of the lab that used 2016 data from the Current Population Survey. Details may vary, but the implementation is the same!**

Workflow overview

Load the dataset and start your log file.
Restrict the sample (married adults only).
Inspect and clean variables (codebook, tab, replace N/A codes).
Generate binary indicators (female, lf).
Run regressions, adding controls sequentially and interpreting results.
Construct new variables (hourly wages, log wages) and analyze outliers.
Answer the worksheet questions about internal validity throughout.

Lab 6 Worksheet

What do I submit?

Your written-up answers to exercise questions (1)–(18). This can be typed or written out then scanned (or photographed), in any reasonable format.
The do-file you’ve created that runs this analysis
A log file that contains the results from this exercise.

Use robust standard errors in all regressions.

Example:

regress incwage female, robust

Questions

Open Stata, start a new do-file (or use the template). Make sure you add code to start (and end) a log.
Open acs2024_4pct.dta and restrict the sample to adults (age 18+) who are married (spouse present or absent). Use tab marst, nolabel to identify the correct codes. Confirm that you have 59,039 observations.
Check work hours (uhrswork), weeks of work (wkswork1), and wage income (incwage) for any N/A codes. In this dataset, uhrswork == 0 means “did not work last year” (N/A) — replace these with missing. Also check whether incwage has any topcode values (999999). Use the codebook command to help (e.g. codebook uhrswork). Ensure you have the correct means and number of observations:


Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
wkswork1 | 59,039 30.70257 24.59739 0 52
uhrswork | 37,796 39.18065 12.56177 1 99
-------------+---------------------------------------------------------
incwage | 59,039 50505.14 84753.23 0 907000
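
The restriction and cleaning steps might look something like the sketch below. It assumes acs2024_4pct.dta is open and that the marst codes for married individuals are 1 and 2, as listed in the variable table; confirm this with tab marst, nolabel before relying on it.

```stata
* Keep married adults (marst codes 1 and 2 are an assumption from the variable table)
keep if age >= 18 & !missing(age) & inlist(marst, 1, 2)
count

* Recode the N/A value of usual hours to missing
replace uhrswork = . if uhrswork == 0

* Check for the 999999 code in wage income
count if incwage == 999999

* Compare against the reference table
summarize wkswork1 uhrswork incwage
```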

Generate a binary variable female equal to one if sex == 2. Estimate the impact of female on wage income (incwage) among your sample of married individuals. What is the interpretation of the coefficient?
If our objective is to measure the impact of gender on wage income among married people, is sample selection bias likely to be important? Why or why not? Is measurement error likely to be important? Why or why not? If so, what is the likely impact of measurement error on your estimated coefficients?
Create a binary variable lf equal to 1 if an individual is in the labor force (labforce == 2), and 0 otherwise. Estimate the impact of gender on labor force status. What is the interpretation of the coefficient?

Reminder: This is a linear probability model! Your dependent variable (lf) is binary, so interpret the coefficient in percentage points.
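
A sketch of how the two binary indicators could be generated, assuming sex and labforce are coded as in the variable table. The if qualifier leaves the indicator missing whenever the source variable is missing.

```stata
* Binary indicators; left missing when the source variable is missing
gen byte female = (sex == 2) if !missing(sex)
gen byte lf = (labforce == 2) if !missing(labforce)

* Quick check: share in the labor force by gender
tab female lf, row
```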

What is the impact of being in the labor force on wage income? Based on this and the previous question, what is the implication for the direction of omitted variable bias when you estimated $incwage = \beta_0 + \beta_1 female + u$ without controlling for labor force participation status?
Re-estimate the previous regression, including a control for lf: $incwage = \beta_0 + \beta_1 female + \beta_2 lf + u$. Was your prediction in part (7) correct?
Now, add your cleaned variable for usual hours worked to estimate $incwage = \beta_0 + \beta_1 female + \beta_2 lf + \beta_3 uhrswork + u$. What is the interpretation of each coefficient?
Why does your regression not include all 59,039 people? What type of bias might this introduce?
Is measurement error likely to be important in the previous regression, and if so, for which variables? What is the likely impact of measurement error on your estimated coefficients?
Generate a new variable uhrsNZ that recodes all missing work hours values as zeros. You can expedite this with the clonevar command, which retains variable labels. Re-estimate the impact of gender, labor force status and uhrsNZ on wage income (incwage). That is, you’re replacing uhrswork with uhrsNZ. What is the interpretation on each coefficient? Why did it change?
Now, re-estimate but exclude lf: $incwage = \beta_0 + \beta_1 female + \beta_3 uhrsNZ + u$. How do your results change? Conditional on including female and uhrsNZ, does it make sense to include lf?
Create a new variable that estimates log wages: gen l_incwage = log(incwage). Estimate the impact of gender on logged wage income, including a control for uhrswork. How does the sample size change, and why? What is the interpretation of each coefficient?
Using the cleaned variables, calculate hourly wages: gen hourwage = incwage / (uhrswork * 50). We assume that people work 50 weeks in one year. What are mean hourly wages for men and women?
Estimate the impact of gender on hourly wages for those with positive hourly wages, controlling for usual hours worked (uhrswork). Then, replace missing hourly wages with 0 for those who worked but earned no wages, and re-estimate. How does the impact of gender on earnings compare between the two regressions? Why does the sample size change?
Do outliers affect your results? Exclude observations that exceed the 99th percentile in wages based on incwage, and re-estimate both equations from the previous question. How do your results change?

Hint:

_pctile incwage, per(99)
ret list

This stores the 99th percentile value, which you can use to filter observations.
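
One way to use the stored percentile is to copy it into a scalar before another command overwrites it. This is a sketch: the scalar name p99 is made up for this illustration, and it assumes hourwage and female have already been generated.

```stata
* Compute the 99th percentile of wage income
_pctile incwage, percentiles(99)
return list

* Save the stored result before another command overwrites r()
scalar p99 = r(r1)

* Re-estimate without the top 1 percent of earners
regress hourwage female uhrswork if incwage <= p99 & hourwage > 0 & !missing(hourwage), robust
```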

Is measurement error likely to affect your dependent variable, hourwage? Why or why not? If so, what are the implications?

Submission checklist

Answers to questions (1)-(18)
Do-file with comments for each question
Log file that matches your do-file commands
Make sure your do-file includes log close at the end

Research Paper: Research Proposal

Tue, 17 Mar 2026 00:00:00 +0000

Objective

The goal of this submission is for you to translate your research idea and data set into the outline of a workable paper. You can think of your research proposal as a “baby paper”: a summary of what your question is, why it matters, and how you intend to solve it.

For this assignment, only some basic data work is necessary. However, you may find it helpful to explore some analyses to get a better sense of what empirical specifications are feasible.

Components

While our first two assignments were fairly informal, this is a formal paper. That means that I will pay close attention to not only what ideas you present, but also how you present them.

Your proposal should be at least 1200 words (excluding references) and include the following components:

Introduction that contains (1) a clearly stated research question. What hypotheses are you testing? (2) Motivation — why is this important/interesting? Include at least 2 sources.
Literature review: discussion of related literature and how your paper fits in. Include at least 4 peer-reviewed sources that are distinct from your introduction sources (i.e., at least 6 unique sources total across the introduction and literature review).
Data description: Description of data set — make sure you include the sources! You will need to have loaded and explored your data enough to produce the summary statistics table below.
Summary statistics table that includes the basic descriptive statistics that will be relevant to your analysis. Ultimately, this table will probably be the first table in your final paper. It should be formatted nicely (i.e., not copied directly out of Stata) with easy-to-interpret variable names and column headers, and it will likely include the following:
1. Number of observations
2. Means and standard deviations of your dependent variable(s)
3. Means and standard deviations of your key independent variable(s)
4. If you are comparing two (or more) groups, you will want to report means separately for each group.
5. Any other details that might be relevant to your data (e.g., number of states, number of years, number of households)
Empirical specification: You must include the empirical specification of the regression(s) you are estimating in equation form, along with a clear description of what each variable is.
Planned analysis: How will your results answer your research question? Make sure your assertions are qualified and well-supported.
Threats and limitations: What challenges will you face in interpreting your results? Discuss potential threats such as omitted variable bias, reverse causality, measurement error, or other violations of our assumptions, and how you might address them.
Bibliography — for any references cited in your proposal plus any data sources
(Optional) Outline of tables: In as much detail as possible, outline the tables you plan to include (no numbers necessary)
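
For the summary statistics table, one possible starting point is tabstat, which reports the number of observations, means, and standard deviations separately by group. This is only a sketch on Stata's built-in auto data; price, mpg, weight, and foreign are stand-ins for your own dependent variable, key independent variables, and grouping variable:

```stata
* Illustration only: swap in your own variables and grouping variable
sysuse auto, clear
tabstat price mpg weight, by(foreign) statistics(n mean sd) ///
    columns(statistics) format(%9.2f)
```

The output still needs to be re-typed or exported into a nicely formatted table with readable variable names before it goes in your proposal.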

Your annotated bibliography will help you push forward your motivation and the literature review.

Keep in mind that the more detail you include, the better the feedback you’ll receive! A classmate will provide a peer review of your proposal, providing feedback to help you turn your proposal into a final paper.

Rubric

This assignment is worth 38 points. Each criterion is scored on a 5-level scale (does not meet → fully meets), with proportional points at each level.

Component	Criteria	Points
	Introduction	6
1	Research question is clearly stated, specific, and answerable	2
1	Question is well-motivated using at least 2 sources	4
	Literature review	4
2	Important literature discussed and linked to research question; at least 4 peer-reviewed academic sources (distinct from introduction sources)	4
	Data & summary statistics	6
3	Data set(s) clearly identified, appropriate, and cited	2
4	Summary statistics table includes key variables; clearly and accurately conveys info in a formatted table	4
	Empirical strategy	10
5	Empirical specification clearly stated (equation form), variables defined	4
6	Empirical strategy and proposed analysis shows critical thinking	4
6	Assertions are qualified and well-supported (e.g., avoid overstating causal claims)	2
	Threats & limitations	2
7	Discusses potential threats to interpretation and potential solutions	2
	Presentation & formatting	10
8	Cited references included in properly-formatted bibliography	2
—	Meets formatting requirements (length, paragraphs, etc.)	4
—	Writing style clear and easy to read; few grammatical or typographical errors	4
	Total	38

Submission requirements

Your research proposal should be written in paragraph form (i.e. complete sentences, not bullet points), copy-edited for grammatical/spelling errors, and submitted as a Word or PDF document.

Examples

See Brightspace for example proposals. These are not perfect, but they are all of high quality.

Lab 5: Merging and hypothesis tests

Mon, 09 Mar 2026 00:00:00 +0000

Print-friendly pdf

Materials

acs2024_4pct.dta
Do-file template econ3500_lab_template.do
BLS county unemployment data laucnty24.xlsx (or download from BLS)

Objectives

Today we’re going to work with acs2024_4pct.dta, which contains information from the 2024 American Community Survey. Note that this is a different version from what we have been using! It has a few more variables and also a larger sample.

We’re going to merge county-level unemployment rates from the Bureau of Labor Statistics.

By the end of this lab, you should be able to complete the following tasks in Stata:

Import data from Excel
Merge data sets
Test hypotheses for linear combinations of coefficients
Test the general significance of a regression

Data context

Each row in acs2024_4pct.dta is an individual from the 2024 ACS microdata. The file includes demographics, education, labor-force status, earnings, and geographic identifiers at the state and county level. The BLS county unemployment file (laucnty24.xlsx) contains 2024 annual average labor force statistics for every U.S. county.

We will merge the two datasets by county, matching on state and county FIPS codes.

Variables we’ll use

ACS data (acs2024_4pct.dta)

variable	meaning	notes
`inctot`	total personal income	9999999 = N/A; replace before analysis
`educ`	educational attainment	numeric categories; check labels with `tab educ, nolabel`
`labforce`	labor force status	2 = in labor force (check with `tab labforce, nolabel`)
`age`	age
`statefip`	state FIPS code	used for merging
`countyfip`	county FIPS code	0 = not identified; used for merging

BLS data (laucnty24.xlsx)

column	meaning	notes
State FIPS Code	2-digit state code	imported as string; needs `destring`
County FIPS Code	3-digit county code	imported as string; needs `destring`
County Name/State Abbreviation	county name
Labor Force	total county labor force
Employed	county employed count
Unemployed	county unemployed count
Unemployment Rate (%)	county unemployment rate

Key commands

command	description
Importing data
`import excel using "file.xlsx", firstrow clear`	Import an Excel file. `firstrow` uses row 1 as variable names. `clear` erases existing data.
`import excel using "file.xlsx", cellrange(A2) firstrow clear`	Same, but start reading from cell A2 (useful when row 1 is a title, not data).
Identifying duplicates
`duplicates list var1 var2`	List any observations that are duplicates on the listed variables.
`duplicates tag var1 var2, gen(d1)`	Generate a new variable, `d1`, that indicates which observations are duplicates for `var1` and `var2`.
Merging datasets
`merge 1:1 var1 var2 using file2`	One-to-one merge on `var1` and `var2`. No duplicates allowed in either dataset.
`merge m:1 var1 var2 using file2`	Many-to-one merge on `var1` and `var2`. Duplicates OK in master data (like merging county data into individual data) but not in using data.
Converting between string and numeric variables
`destring var1, gen(newvar)`	Convert a string variable to numeric, saving as `newvar`.
`destring var1, replace`	Convert a string variable to numeric, replacing the original.
`tostring var2, gen(string_var)`	Convert a numeric variable to string, saving as `string_var`.
Statistical tests
`test var1 = var2`	Run after a regression. Tests whether the coefficient on `var1` equals the coefficient on `var2`.
`testparm var1 var2 ...`	Run after a regression. Tests whether all listed variables are jointly equal to zero.
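
A short sketch of the two test commands in action, using Stata's built-in auto data rather than the ACS (the variables here are stand-ins, not the lab's variables):

```stata
* Illustration only: built-in data, not the lab data
sysuse auto, clear
regress price mpg weight length, robust
test weight = length        // H0: the coefficients on weight and length are equal
testparm weight length      // H0: both coefficients are zero
```

Both commands use the most recent regression, so they need to run right after the regress line they refer to.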

A note on temporary files (optional)

This exercise works by having two data sets stored on your hard drive, then running a merge command to unite them. You might notice that the workflow feels clunky and generates extra files — open a data set, save it, open another data set, then merge in the first data set.

You can use temporary files to speed things up! Basically, Stata saves the file in a temporary location that it cleans up automatically when your do-file finishes, and you call that file the same way we called local variables. Everything has to be run in the do-file, in one go, for this to work.

A short example (you can paste this in a do-file and run it, as it uses built-in Stata files):


tempfile tempauto // Declare tempfile (needs to run before you try to save)
webuse autosize, clear
save `tempauto', replace // save to temp file
webuse autoexpense, clear
merge 1:1 make using `tempauto' // call tempfile
tab _merge // check out merge
list
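
The same pattern works for a many-to-one merge, which is the shape of the county merge in this lab. A minimal sketch on the built-in auto data, where a group-level file (one row per value of foreign) stands in for the county file. The names groupmeans and mean_price are made up for illustration:

```stata
* Illustration only: a two-row group-level file merged into car-level data
sysuse auto, clear
collapse (mean) mean_price = price, by(foreign)  // one row per group
tempfile groupmeans
save `groupmeans'
sysuse auto, clear                               // back to one row per car
merge m:1 foreign using `groupmeans'             // many cars per group
tab _merge
drop _merge
```

Every car picks up the mean price of its own group, just as every ACS respondent picks up the unemployment rate of their county.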

Workflow overview

Import the BLS county unemployment data from Excel.
Clean variables and save as a Stata data file.
Open the ACS data and restrict the sample.
Merge in county-level unemployment by state and county FIPS codes.
Create education indicators and run regressions.
Conduct hypothesis tests.

Lab Video

Lab 5 Worksheet

What do I submit?

Your written up answers to the exercise questions. This can be typed or written out then scanned (or photographed), in any reasonable format. Note: Question 21 is optional.
The do-file you’ve created that runs this analysis
A log file that contains the results from this exercise.

Exercises

Part 1: Import and prepare unemployment data

Visit https://www.bls.gov/lau/tables.htm to access 2024 annual county-level unemployment rates. Download the appropriate table as an Excel file.¹

a. Open the file in Excel or another spreadsheet program. Notice that the first row contains a title and the actual column headers start in the second row.

b. You do not need to edit the file — we’ll handle everything in Stata.

Open Stata and start a new do-file using the template. Update the file paths and add code to start (and end) a log.
Import your unemployment Excel file into Stata. Because the first row is a title (not column headers), use the cellrange option to start reading from row 2:
```
import excel using "laucnty24.xlsx", cellrange(A2) firstrow clear
```
Run describe to see the variable names Stata assigned. How many observations (counties) are there?
The FIPS code variables were imported as strings (text), not numbers. Convert them to numeric variables so they match the ACS data:
```
destring StateFIPSCode, gen(statefip)
destring CountyFIPSCode, gen(countyfip)
```
(If Stata named your variables differently, check with describe and adjust accordingly.)
Check for duplicates on statefip and countyfip. Are there any? (There shouldn’t be — each county should appear exactly once.)
Save your unemployment data as a Stata file:
```
save "unemp_2024.dta", replace
```

Part 2: Merge with ACS data

Open acs2024_4pct.dta and restrict the sample to adults (age 18+).
Before merging, take a look at the county identifier in the ACS data. Tabulate countyfip. What do you notice about the value 0?²

Now, merge your unemployment data into the ACS by county:
```
merge m:1 statefip countyfip using "unemp_2024.dta"
```
a. Why do we use m:1 (many-to-one) instead of 1:1?

b. Tabulate the _merge variable. What share of observations successfully merged?³

Drop any unmatched observations (you can use drop if) and drop the _merge variable. What is the average unemployment rate for the sample — why would this be different than taking an average of county unemployment rates from your Excel file?

Part 3: Education variables and regression

Why can’t we use educ directly as a linear variable in a regression?
Generate three dummy variables. These three variables should be mutually exclusive, and they should not be missing for any observations.
- lesshs, a variable equal to one if a person completed less than a high school diploma
- hsgrad, a variable equal to one if a person completed at least a high school diploma but less than a Bachelor’s degree
- colgrad, a variable equal to one if a person completed a Bachelor’s degree or higher
Note: Education is coded with labels, which means that it is numeric data with a description of what each number means on top. These show up as blue in the Stata browser. To see the underlying codes: tab educ, nolabel.
What is the mean of lesshs, hsgrad, and colgrad?
Before running a regression, check inctot (total personal income) for N/A codes. Replace any N/A values as missing.⁴ Then estimate a regression of total personal income on education, using the binary variables you just created. Omit lesshs. Use robust standard errors.

Part 4: Hypothesis tests

Set up a hypothesis test for whether both hsgrad and colgrad are jointly significant. Report the null hypothesis, alternative hypothesis, test statistic, and conclusion.
Set up a hypothesis test for whether the returns to being a high-school graduate are the same as the returns to being a college graduate. Report the null hypothesis, alternative hypothesis, test statistic, and conclusion.
Is this regression significant overall? Explain how you know.

Part 5: Adding unemployment

Now add county-level unemployment rate to the previous equation. What is the interpretation of the coefficient on unemployment? Is it statistically significant?
Estimate the same equation by regressing total personal income on the education binary variables and county-level unemployment, restricting to those who are currently in the labor force. How does this change the coefficient on unemployment?
Identify three state or county-level variables that are likely to cause omitted variable bias if you want to know whether unemployment affects individual income.
(Optional) For one of the variables you listed above, find the data online, import into Stata, and merge it in. Regress total personal income on the education binary variables, county-level unemployment, and the new variable you found. Restrict your sample to those who are currently in the labor force. How does the inclusion of your new variable affect the coefficient on unemployment?

If you have trouble accessing the BLS website, you can use the file provided in the lab materials above. ↩︎
In IPUMS data, countyfip = 0 means the county is not identified — the Census Bureau withholds county identifiers for small counties to protect confidentiality. These observations cannot be matched to BLS data. ↩︎
Expect roughly 40–60% of observations to match. The main reason for non-matches is that many ACS respondents have countyfip = 0 (county not identified). ↩︎
Use summarize inctot to check for suspicious values. In IPUMS data, 9999999 typically means N/A. ↩︎

Lab 4: Multivariate Regression

Thu, 19 Feb 2026 00:00:00 +0000

Print-friendly pdf

Materials

acs2024_2pct.dta
Do-file template econ3500_lab_template.do
Looping exercise loop_example.do

Objectives

Today we’re going to work with acs2024_2pct.dta, which contains information from the 2024 American Community Survey. We used this in Lab 02!

By the end of this lab, you should be able to complete the following tasks in Stata:

Estimate and interpret multiple linear regression in levels, using continuous and binary independent variables, and use heteroskedasticity-robust standard errors.
Interpret the results of multivariate linear regressions in terms of statistical and economic significance
Practice generating binary variables from categorical measures
Set up basic loops
Use xi and i. prefix to include a lot of binary indicator variables at once.

Data context

Each row in acs2024_2pct.dta is an individual from the 2024 ACS microdata. The file includes demographics, education, labor-force status, and earnings variables. We will focus on variables like incwage, educ, labforce, statefip, race, hispan, and age.

Tip: use describe and codebook to check labels and coding for any variables you plan to use.

Variables we’ll use

variable	meaning	notes
`incwage`	wage and salary income	check labels for topcodes or missing values
`educ`	educational attainment	numeric categories; check labels
`labforce`	labor force status	values like 0/1/2 (check labels)
`race`	race code	use to build indicators
`hispan`	Hispanic origin	use to build indicators
`age`	age	used to construct year of birth
`statefip`	state	use with `i.statefip`

👁️ Tip: codebook race is a quick way to check variable labels for race! 👁️

Key commands

command	description
`regress var1 var2...`	Estimate a regression, with `var1` as the dependent variable and all others as the independent variable(s)
`tabulate var1,nolabel`	Tabulate variables without labels
`replace var1 = . if var1 == 999999`	Replace `var1` as missing (using a dot) if `var1` is equal to 999999. Can be replaced with any other values or logical expressions.
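
A tiny sketch of why the replace step matters, using a made-up four-row data set (the values are invented, and 999999 is simply the N/A code used in the table above):

```stata
* Illustration only: invented data with one N/A code
clear
input id incwage
1 52000
2 999999
3 0
4 31000
end
summarize incwage                          // mean is inflated by the N/A code
replace incwage = . if incwage == 999999
summarize incwage                          // N/A row is now excluded
```

Always check the actual N/A code for your variable with tabulate or summarize before replacing anything.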

Creating binary variables

Recall that there are two easy ways to make binary variables out of categorical or continuous variables. Consider the variable race, where 1 = White, 2 = Black, 3 = Native American, etc. Suppose you want to generate a binary indicator for whether a person is White.

gen white = race == 1: generates a variable equal to 1 if race is 1, and 0 otherwise
gen white = 1 if race == 1: generates a variable equal to 1 if race is 1, and missing otherwise. To complete this you need two lines of code:
gen white = 1 if race == 1
replace white = 0 if race != 1
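
One thing to watch with either approach is missing values in the original variable. A short sketch on the built-in auto data, where rep78 has some missing values (rep5_a and rep5_b are made-up names):

```stata
* Illustration only: how missing values flow into a new binary variable
sysuse auto, clear
gen byte rep5_a = rep78 == 5                     // 0 even when rep78 is missing
gen byte rep5_b = rep78 == 5 if !missing(rep78)  // missing when rep78 is missing
tab rep5_a rep5_b, missing
```

The first version quietly counts cars with missing rep78 as zeros, while the second keeps them missing. Which one you want depends on your question, so tabulate before and after.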

Working with loops

Loops can help us (1) avoid errors and (2) code super fast!

I’ve uploaded a sample as loop_example.do

Stata has two types of looping setups, using the forval or foreach command. The first is simpler, and the second is more versatile. Recall that you can always use help forval or help foreach if your code isn’t working or if you have a vision you’re not sure how to realize.

Looping with `forval`

forvalues lname = range {
Stata commands referring to `lname'
}

What does each component mean?

forvalues: this is the command. You can abbreviate it as forval.
lname: this is a variable you make up. Often, people will just use i, because we’re just counting. It will take on the values in range as it increments through the loop. It is a local variable, meaning that you have to call it using `lname' and not as lname (need those punctuation marks!), and that it is only saved as long as your do-file is running.
range: this is the set of values that the local variable will iterate over
Braces: The open brace needs to be on the same line as the forval command. The close brace needs to be on its own line.

forval i = 0/2{
gen labfor`i' = labforce == `i'
}

What does this do? It creates a loop for which local variable `i' is first 0, then 1, then 2. Within the loop, it generates labfor0, which is equal to 1 if labforce equals 0 (not in universe), it generates labfor1, which is equal to 1 if labforce equals 1 (not in labor force), and it generates labfor2, which is equal to 1 if labforce equals 2 (in labor force).

Applied ACS example: create race indicators in a loop.

foreach r in 1 2 3 {
gen race_`r' = race == `r'
}

Use tab race, nolabel to see the codes you want to include.

The choice of ranges can be done in other ways:

forval i = 0/10: hits every integer between 0 and 10 - 0, 1, 2, … 10
forval i = 1(10)100: starts at 1, then increments by 10, stopping at 100: 1, 11, 21, 31, … 91
forvalues k = 5 10 to 300: starts at 5, then increments by 5 (the gap between the first two numbers) until 300: 5, 10, 15, … 300

See help forval for more options
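
If you are not sure what a range will do, a quick way to check is to display the loop variable. A small sketch (the counter n is just a made-up local for illustration):

```stata
* Print the values each range produces
local n = 0
forvalues i = 1(10)100 {
    display `i'
    local n = `n' + 1          // count the iterations
}
display "number of iterations: `n'"

forvalues k = 5 10 to 30 {
    display `k'
}
```

Run the whole block together from a do-file, since the locals disappear once the run ends.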

Looping with `foreach`

This command lets you loop through number lists (like above), but also through sets of variables, values, names, etc. You can approach it two ways:

Do not specify the type of list, use in: foreach lname in list:
Specify the type of list (listtype), use of: foreach lname of listtype list

This is confusing until we see examples:

foreach x in rice wheat corn rye barley oats {
display "`x'"
}

This will start with x equal to the string “rice”. Then, it will run with x equal to “wheat”, etc.

foreach num of numlist 1 4/8 13(2)21 103 {
display `num'
}

This will loop over 1, 4, 5, 6, 7, 8, 13, 15, 17, …

You can loop over variable names too!

foreach var of varlist inc* {
summarize `var',d
}

This summarizes (with detail) each variable that starts with inc
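
Loops over variable names are also handy for creating new variables. A minimal sketch on the built-in auto data, where the _dm suffix is a made-up naming choice:

```stata
* Illustration only: create a demeaned copy of several variables
sysuse auto, clear
foreach var of varlist price mpg weight {
    quietly summarize `var'
    gen `var'_dm = `var' - r(mean)   // deviation from the sample mean
}
summarize price_dm mpg_dm weight_dm
```

Inside the loop, `var' is replaced by each variable name in turn, so the single gen line creates price_dm, mpg_dm, and weight_dm.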

Working with binary independent variables

When you are representing a categorical variable with a set of binary variables, there is a slow way and a fast way to integrate them.

Slow way: generate the binary variables you want, and include them. This is good when you want to be precise about your omitted category, or when you want to create complicated binary categories

gen white_nh = race == 1 & hispan == 0
gen black_nh = race == 2 & hispan == 0
gen hisp = hispan != 0
gen other = white_nh == 0 & black_nh == 0 & hisp == 0
regress incwage black_nh hisp other

Here, white, non-Hispanic is the omitted “reference” category.

Fast way: tell Stata to create a binary variable for each value of a categorical variable.¹ This is good when you aren’t trying to do anything complicated and when you want to be quick - very useful if you want something like state-level dummies.

regress incwage i.race

Note that this will work only if your categorical variable is numeric. If it’s a string you’ll get an error. You can fix it by adding a xi: prefix, like so:

xi: regress incwage i.race

When we include a dummy variable for every value of a categorical variable, like above, we call those “fixed effects.” We’ll talk about these more soon.
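
With the fast way, Stata picks the omitted category for you (the lowest value), but you can change it with the ib#. prefix. A short sketch on the built-in auto data, with rep78 standing in for a categorical variable like race:

```stata
* Illustration only: same model, two different reference categories
sysuse auto, clear
regress price i.rep78, robust      // omitted category: rep78 == 1
regress price ib3.rep78, robust    // omitted category: rep78 == 3
```

The fit of the model is the same in both runs. Only the reference group changes, and with it the interpretation of each coefficient.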

Reading regression tables (reminder!)

Labeled Stata output

Workflow overview

Load the dataset and start your log file.
Inspect variables and coding (describe, tab, tab ... , nolabel).
Create binary indicators needed for your analysis.
Run baseline regressions with robust standard errors.
Add controls or fixed effects and compare coefficients.
Interpret results and answer the worksheet questions.

Lab 4 Worksheet

What do I submit?

Your written up answers to exercise questions (1) - (17). This can be typed or written out then scanned (or photographed), in any reasonable format.
The do-file you’ve created that runs this analysis
A log file that contains the results from this exercise.

Questions

Download the do-file template and data files. Personalize the file paths so that you can run it and open your acs2024_2pct.dta file. You can also work with a blank do-file if you’re more comfortable - just make sure you remember to include commands to start and close your log file.

Use robust standard errors in all regressions

Example:

regress incwage educ labforce, robust

Let’s practice with loops! Download loop_example.do and paste the code into your sample. Run it and look at the output. In your do-file, write comments that describe what each loop is doing.
Now, go back to your acs2024_2pct.dta file and do-file template. Adjust your do-file template so that it loads acs2024_2pct.dta and starts a log.
Restrict your sample to individuals ages 25-54.
Create a new variable, birthyr, equal to each individual’s year of birth: gen birthyr = 2024 - age. Is there any potential imprecision or error in this variable?
Then, write a loop to generate a dummy variable for each possible value of birth year.²

Look through the available list of data (note, IPUMS has full documentation of all variables). Based on this data, think of a research question for your lab of the form, “What is the relationship between … and …?”. Pick a dependent variable that is continuous. (Because a later question asks you to explore race/ethnicity controls, please do not use a race/ethnicity variable for $X$.)

Research question:

Dependent variable ($Y$):

Key independent variable ($X$):
Using the data available, write a reasonable population model, including your key independent variable along with a set of likely relevant independent variables (somewhere between 2 and 5 additional variables). Before estimating your regression, you should tabulate each variable to make sure you are interpreting it correctly.
In words, what exactly will your estimated regression tell us?
What do you hypothesize the answer to your research question is? (e.g., strong positive, weak negative, none)
Before you estimate your model, make sure you don’t have any N/A values coded. For example, if incwage is not applicable, it is coded as 999999. Tabulate or summarize your data to check for any values like this. Replace any values as missing if they are equal to some N/A code (see above).
Estimate the relationship between $X$ and $Y$ using simple linear regression (excluding any other covariates). Write your results in equation form and report the $R^2$. How many degrees of freedom do you have?
Estimate the relationship between $X$ and $Y$ using multiple linear regression (including other covariates). That is, estimate the population model you wrote earlier. Write your results in equation form and report the $R^2$. How many degrees of freedom do you have?
Using your multivariate linear regression from the previous step, set up a hypothesis test for your parameter of interest, the $\beta$ associated with your key independent variable, $X$. What do you find? What is the p-value? What is the interpretation?
Besides your key independent variable, which other variables are statistically significant at the five-percent level?
A lot of student research papers will look at differences in outcomes by gender and by racial/ethnic groups. U.S. surveys like the CPS, ACS, and Census treat race and ethnicity a little strangely, and it can take some practice to get comfortable.

There are two variables commonly used to identify a person’s race and ethnicity: the race and the hispan variable.
1. What share of the sample is White, non-Hispanic?
2. What share of the sample is Hispanic/Latino?
3. A common way to summarize the racial/ethnic make-up of the U.S. is the following categories:
  - White, non-Hispanic
  - Black, non-Hispanic
  - Hispanic/Latino
  - Asian, non-Hispanic
  - Other
  Make a table that shows the distribution of the population into these five groupings.
Estimate your multiple linear regression model from earlier, but include the race/ethnicity variables that you created in the previous part. How does the inclusion of these factors affect your estimates of the relationship between $Y$ and $X$?
Now, add the “birth-year fixed effects” that you generated earlier to your regression. Because there is a set of binary 0/1 variables, one for each year of birth, they will essentially pull out any mean differences in your dependent variable at the birth-year level, so if your outcome variable differs on average for people born in 1971 vs. 1972, these variables will take care of it. What is the omitted birth year? How does the inclusion of these factors affect your estimates of the relationship between $Y$ and $X$?

Video

A note on “well-behaved” residuals

There are three characteristics of “well-behaved” residuals:

The residuals “bounce randomly” around the 0 line. This suggests that the assumption that the relationship is linear is reasonable.
The residuals roughly form a “horizontal band” around the 0 line. This suggests that the variances of the error terms are equal.
No one residual “stands out” from the basic random pattern of residuals. This suggests that there are no outliers.

We don’t want to overweight the importance of this, but it can be a helpful diagnostic for spotting outliers and strange patterns.
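
A minimal sketch of how to look at residuals in Stata, using the built-in auto data as a stand-in for your own regression (ehat is a made-up variable name):

```stata
* Illustration only: residual checks after a regression
sysuse auto, clear
regress price mpg weight
rvfplot, yline(0)            // residuals vs. fitted values, with a line at 0
predict ehat, residuals      // save the residuals as a new variable
summarize ehat, detail       // look for extreme values in the tails
```

In the plot, you are looking for the three features listed above: random bouncing around 0, a roughly horizontal band, and no single point that stands far apart.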

i.race adds a dummy for every race category and estimates effects relative to the omitted category. Don’t manually include a full set of dummies with an intercept, or you’ll run into perfect multicollinearity (the “dummy variable trap”). ↩︎
There is a faster way to do this, using xi i.birthyr, but we’re learning about loops, so just go with it. ↩︎

Research Paper: Annotated Bibliography

Mon, 16 Feb 2026 00:00:00 +0000

Print-friendly PDF

Objective

The goal of this submission is to help you narrow and refine your question while situating your work in the broader economics literature. This will make writing your research paper much easier as well!

What is an annotated bibliography?¹

A bibliography is a list of sources (books, journals, Web sites, periodicals, etc.) one has used for researching a topic. Bibliographies are sometimes called “References” or “Works Cited” depending on the style format you are using. A bibliography usually just includes the bibliographic information (i.e., the author, title, publisher, etc.).

An annotation is a summary and/or evaluation. Therefore, an annotated bibliography includes a summary and/or evaluation of each of the sources.

What do I need to do

Pick the idea you proposed that is most promising. You may refine it based on feedback, further reflection, etc.

Based on that idea, identify and annotate six sources that are relevant to your project.

At least four must be peer-reviewed, academic journal articles.
At least two must be from the list of economic journals included below.

For each one, include the following:

Full bibliographic information, following MLA, APA, or Chicago style.
The annotations, written as a paragraph or as bullet points. These will include a few things:
a. Nature of source: peer-reviewed academic journal (what discipline), working paper, white paper (i.e., reports from major organizations), other
b. Key findings or arguments of the source: It’s in your interest to be quite detailed here (I like to use these to draw on when I write my paper)
c. Assessment: How does it compare to other sources (do the findings support or contrast)? Is the source biased or objective? What is the goal of the source?
d. Reflection: Is this useful to your question? How does it help you shape your argument? How can you use this source in your project? (Here I will sometimes add sample sentences I will write)

After this, write an expanded version of your idea proposal (just the one idea you’ve chosen) that states your refined research question and describes how your planned project fits into the literature you found. This will be 2-3 paragraphs.

Consult the grading rubric for additional guidance!

Acceptable economics journals

At least two articles must come from a relevant economics journal in the top 200 from the following RePEc list. If your topic is very specific, the top 400 is also acceptable, but you need to get prior approval from me.

Submission requirements

Submit the annotated bibliography plus summary as one document on Brightspace.

If you are working in pairs, submit one bibliography for two people.

Tips

Search for your topic using EconLit. If you are at home, you can select “Connect Off Campus” from the main library page.
When determining if an article might be useful, start by focusing on the abstract only. Among those that pass your abstract test, then just read the introduction to see if they are still going to help.
When you’ve found an article or two that are useful, you can search forward and backward to find more!
- Check the references section to find articles that were cited in your paper
- Use Google Scholar to find articles that cite your paper
Note that working paper series are not peer-reviewed journal articles. NBER Working Papers, for example, are an excellent resource but not peer-reviewed. Most reports from large organizations are not peer-reviewed.

Rubric

You will receive up to 50 points on this assignment:

Each source is worth 5 points (30 total), with one point per element listed above.
The idea summary is worth 10 points, with full credit granted if you present your research question, describe in words how you will answer it, and then describe how what you plan to do fits in with the literature you’ve reviewed.
The final 10 points are for meeting the source selection criteria (4 peer-reviewed, 2 top 200 economics journals).

Pulled from Purdue OWL ↩︎

Problem set 2

Thu, 05 Feb 2026 00:00:00 +0000

Welcome

Make sure you submit your assignment on Brightspace by the deadline ☝️ ☝️

See the exercises below, or you can download them as a pdf. You can download the data file you need for questions 4 and 5 here.

What do I submit?

Your written up answers to exercise questions. If you work on a piece of paper, please scan using some sort of phone software (like Microsoft Lens or Adobe Scan) rather than just taking a picture. You can also integrate your written answers into your do-file, just be clear about it.
A do-file that runs your Stata analysis (for questions 4 and 5).
A log file that includes the output from running your do-file (for questions 4 and 5).

Exercises

The following table shows, for eight vintages of delicious wine, purchases per buyer ($y$) and the wine buyer’s rating ($x$) in a given year:

Vintage	1	2	3	4	5	6	7	8
$x$	3.6	3.3	2.8	2.3	2.7	2.9	2.0	2.4
$y$	24	21	22	20	18	13	9	16

a. Estimate by hand the regression of purchases per buyer on the buyer’s rating.

b. Interpret the slope of the estimated regression line.

c. Interpret the intercept of the estimated regression line.

d. Use your estimated regression line to compute the predicted purchases for a wine with rating $x=2.8$. Then compute the residual for observation (3) with $x=2.8$ and $y=22$.

Suppose that a random sample of 200 20-year-old men is selected from a population and that these men’s height and weight are recorded. A regression of weight (measured in pounds) on height (measured in inches) yields

$$\widehat{Weight}=-99.41 + 3.94 Height$$

$R^2 = 0.81$; $SER = 10.2$

a. What is the predicted weight for someone who is 72 inches tall? 66 inches tall?

b. One 20-year-old man has a late growth spurt and grows 1.5 inches over the course of the year. What is the regression’s prediction for the increase in his weight?

c. Suppose that you want to translate the results of this equation into centimeters and kilograms. What are the regression estimates from this new regression? Give all results, including estimated coefficients, $R^2$, and $SER$.

d. Interpret the $R^2$ value. Does it indicate anything about whether these estimates are likely to be biased? Explain.
Consider the savings function:

$$sav = \beta_0 + \beta_1 inc + u, \quad u = e\sqrt{inc}$$

where $e$ is a random variable with $E(e) = 0$ and $Var(e) = \sigma^2_e$. Assume that $e$ is independent of $inc$.

a. Show that $E(u|inc)=0$, so that the key zero conditional mean assumption is satisfied. [Hint: If $e$ is independent of $inc$, then $E(e|inc) = E(e)$]

b. Show that $Var(u|inc) = \sigma^2_e \cdot inc$, so that the homoskedasticity assumption is violated. In particular, the variance of $sav$ increases with $inc$. [Hint: $Var(e|inc) = Var(e)$ if $inc$ and $e$ are independent!]

c. Why might it be reasonable to assume that the variance of savings increases with family income?
The data file collegedistance.dta contains data from a random sample of high school seniors interviewed in 1980 and re-interviewed in 1986.¹ Use these data to investigate the relationship between the number of completed years of education for young adults and the distance from each student’s high school to the nearest four-year college. (Proximity to college lowers the cost of education, so that students who live closer to a four-year college should, on average, complete more years of higher education.)

a. Run a regression of years of completed education ($ED$) on distance to the nearest college ($Dist$), where $Dist$ is measured in tens of miles. (For example, $Dist=2$ means that the distance is 20 miles.) You can regress a dependent variable y on an independent variable x with the command regress y x. Write the equation you estimated in the form $\widehat{ED} = \hat{\beta_0} + \hat{\beta_1} Dist$.

b. How does the average value of years of completed schooling change when colleges are built close to where students go to high school?

c. Bob’s high school was 20 miles from the nearest college. Predict Bob’s years of completed education using the estimated regression. How would the prediction change if Bob lived 10 miles from the nearest college?

d. Does distance to college explain a large fraction of the variance in educational attainment across individuals? Explain.

e. Provide an example of a factor that might cause this model to violate the zero conditional mean assumption. Explain your reasoning.

f. What is the value of the standard error of the regression?² What are the units for the standard error (meters, grams, years, dollars, cents, or something else)?

g. Is the estimated regression slope coefficient statistically significant at the 10% level? What is the p-value associated with the coefficient’s t-statistic?

h. Construct a 90% confidence interval for the slope coefficient.

i. Construct a 90% confidence interval for the intercept.

j. Estimate a regression that restricts the sample to men, and calculate a 90% confidence interval for the slope. Do the same, restricting the sample to women. Does it look like the effect of distance on completed years of education is different?³
Re-estimate the regression from 4a using heteroskedasticity robust standard errors.

a. Report the slope estimate, robust standard error, t-statistic, p-value, and a 90% confidence interval for the slope.

b. Compare your robust results to the results that assume homoskedasticity from 4g–4h. Do your conclusions about statistical significance at the 10% level change? Why might the standard errors differ?

c. In practice, which set of standard errors should you report, and why?

These data were provided by Professor Cecilia Rouse of Princeton University and were used in her paper “Democratization or Diversion? The Effect of Community Colleges on Educational Attainment,” Journal of Business and Economic Statistics, April 1995, 12(2): 217–224. ↩︎
There are a few ways to find it in Stata’s output. The easiest is to note that “Root MSE” is the SER: it is the square root of the mean squared error. ↩︎
Note that we cannot make claims about whether they are statistically different because the estimates come from two different samples! A hypothesis test here would be awesome, but we need to build a few more skills to do this. ↩︎

Lab 3: Regression

Wed, 04 Feb 2026 00:00:00 +0000

Lab Content

Print-friendly pdf

Materials

graduation.dta
Do-file template econ3500_lab_template.do

Download these and save in your lab folder (perhaps you named it something like econ3500/labs?)

👁️ If your do-file opens in a browser tab, you may want to instead right click and select “Save Link As” 👁️

Before you start

Set your working directory in Stata to the folder where you saved the data and template.
Start a log file right away: log using lab3.log, replace
Make sure you can open the dataset with use graduation.dta, clear.

Objectives

By the end of this tutorial you should be able to complete the following tasks in Stata:

Estimate and interpret a simple (two-variable) linear regression in levels, using continuous and binary variables, and use heteroskedasticity-robust standard errors.
Identify $\hat{\beta_0}$, $\hat{\beta_1}$, standard errors, $SST$, $SSE$, $SSR$, and $R^2$ in Stata output and interpret them
Calculate predicted values and residuals
Create scatter plots
Estimate a multivariate linear regression

Key commands

command	description
Estimation commands
`regress var1 var2`	Estimate a regression, with `var1` as the dependent variable and `var2` as the independent variable(s)
`regress var1 var2, robust`	Estimate a regression with heteroskedasticity-robust standard errors
`correlate var1 var2 ... varn`	Calculate correlation coefficients of all listed variables, from `var1` to `varn`.
`graph twoway scatter var1 var2`	make a scatter plot with `var1` on the y-axis and `var2` on the x-axis.
Post-estimation commands¹
`predict newvar, xb`	Use estimated regression coefficients to predict $\widehat{y}$. It will generate `newvar`²
`predict newvar, residuals`	Use estimated regression coefficients to predict residuals, generating `newvar`³
Working with data, missing values
`count if var1 == 1`	count observations if the expression `var1 == 1` is true
`count if !missing(var1)`	count observations if `var1` is not missing
`drop if missing(var1)`	drop all observations where `var1` is missing
`tab var1, missing`	Include missing values in tabulation
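
The missing-value commands above can be tried out before touching the lab data. The lines below are a minimal sketch that assumes Stata's built-in auto dataset (loaded with sysuse auto), in which the variable rep78 has some missing values; neither the dataset nor the variable is part of this lab.

```stata
* Illustration with Stata's built-in auto data, not the lab data
sysuse auto, clear

* rep78 (repair record) is missing for some cars
count if missing(rep78)
count if !missing(rep78)

* Show the missing category in the tabulation
tab rep78, missing

* Drop the incomplete observations, then confirm the new sample size
drop if missing(rep78)
count
```

The final count should equal the number reported by count if !missing(rep78) before the drop.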

Reading regression tables

Quick reminders

The coefficient estimates do not change when you add , robust.
The standard errors do change when you add , robust.
Run predict right after the regression you want predictions from. If you run another estimation command in between, Stata will overwrite the stored results.
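
These reminders can be checked with a short sketch. It assumes Stata's built-in auto dataset and its variables price and mpg, which are stand-ins and not the lab data; price_hat and price_resid are hypothetical names chosen for the illustration.

```stata
sysuse auto, clear

* Same coefficients in both regressions, different standard errors
regress price mpg
regress price mpg, robust

* predict uses the most recent estimation results (here, the robust regression)
predict price_hat, xb
predict price_resid, residuals

* OLS residuals average to zero, so price and price_hat have the same mean
summarize price price_hat price_resid
```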

Lab 3 Exercise

What do I submit?

Your written answers to exercise questions (1) - (13). This can be typed or written out then scanned (or photographed), in any reasonable format.
The do-file you’ve created that runs this analysis
A log file that contains the results from this exercise.

Questions

Today, we’re going to look around at the graduation data set that we discussed in class, graduation.dta.

Download the do-file template and data files. Personalize the file paths so that you can run it and open your graduation.dta file. You can also work with a blank do-file if you’re more comfortable; just make sure you remember to include commands to start and close your log file.
Take a look at graduation.dta. How many observations are there? What is the distribution of treatment arms?⁴
There are six continuous food security variables⁵. You can look for them with lookfor fs. Pick one variable and write out a population model to determine the relationship between assignment to the graduation program and food security. For the rest of this lab, I refer to the variable you chose as foodsecurity. If that’s going to irritate you, you can rename your variable like this: rename fsec5 foodsecurity, using the variable name that you’ve chosen in place of fsec5.
Tabulate your food security value and check for missing observations. Drop any observations for which you have missing values of foodsecurity (see above for how to do this). How many observations are remaining?

Hint After you drop missing values, run count to confirm your new sample size. Keep that number consistent for the rest of the lab.

Make a scatter plot of the relationship between your chosen food security variable and graduation (Include this in your submitted problem set). Is this easy to interpret? Calculate and report the associated correlation coefficient.
Conduct a t-test of whether the mean of foodsecurity is different between those who did and did not receive the graduation program⁶
Estimate the relationship between your chosen food security variable, foodsecurity, and assignment to the graduation program, graduation, using simple linear regression with conventional (homoskedasticity-only) standard errors. How does your t-statistic compare to what you found in the previous t-test? What was the impact of assignment to the graduation program on food security, based on your regression?
Re-estimate your regression, and this time adjust your standard errors to be heteroskedasticity-robust. Fill in the chart below with your estimates.

Variable	Estimate	Variable	Estimate
$\hat{\beta_0}$		$\hat{\beta_1}$
$R^2$		$TSS$
$ESS$		$SSR$
d.f.		$SER$

After that regression estimate, generate a new variable, predict_fs equal to the predicted value of your food security variable. Generate a second variable, resid_fs equal to the residual.
What is the mean of each variable? How does the mean of predict_fs compare to mean of foodsecurity in your sample?⁷
Examine the predicted value of your food security variable, predict_fs, for the youngest person in your sample.⁸ What is its residual?
When we estimate a linear regression with no independent variables, sometimes we’ll say we are “regressing on a constant.” Regress foodsecurity only on a constant. What is $\hat{\beta_0}$, and how does it compare to the overall mean of foodsecurity?
For this final step, I’d like you to play around with the data. Pick one continuous dependent variable and one continuous or binary independent variable.⁹ You can look at the correlation between two variables, or you can look at the impact of one of the program dimensions (group coaching, group livelihood, etc) on a continuous outcome of interest.

a. Write a population model you want to estimate.

b. Estimate it using OLS, adjusting your standard errors to be heteroskedasticity-robust. Write an equation that reflects your estimated model in the form $\hat{y}=\hat{\beta_0} + \hat{\beta_1}x$, replacing $y$ and $x$ with your chosen variables and replacing $\hat{\beta_0}$ and $\hat{\beta_1}$ with your estimates.

c. In 1-2 sentences, what do your results tell you, collectively?

Submission checklist

Answers file (with your scatter plot and any tables you used)
Do-file with comments for each question
Log file that matches your do-file commands
log close at the end

Video Recording

Post-estimation commands use the results of the most recent estimation command, which Stata keeps in memory (in e()) until you estimate another model. Run them before you run another regression. ↩︎
Here, newvar equals $\widehat{newvar_i} = \widehat{y_i} = \widehat{\beta_0} + \widehat{\beta_1}x_i$ ↩︎
Here, newvar equals $\widehat{newvar_i} = \widehat{u_i} = y_i - \left(\widehat{\beta_0} + \widehat{\beta_1}x_i\right)$ ↩︎
There are a few variables here, including treatment_arm ↩︎
Not fsec7, which is categorical, or fsec which is always equal to 1 ↩︎
Hint: ttest var1, by(var2) will run a t-test of whether the mean of var1 is equal for two groups determined by var2. ↩︎
If they differ, you should make sure you have dropped all missing values of foodsecurity! Try sum predict_fs foodsecurity to see if the sample sizes are the same ↩︎
Now is a good time to try out lookfor age ↩︎
Categorical variables that take on just a few values, like the identity of your head of household, won’t work here. You’ll need to tabulate the variables to see what you’re working with. ↩︎

Lab 2: Do-files

Tue, 27 Jan 2026 00:00:00 +0000

Lab Content

Print-friendly pdf

Materials

The data file acs2024_2pct.dta
Do-file template econ3500_lab_template.do

Download these and save in your lab folder (perhaps you named it something like econ3500/labs?)

👁️ If your do-file opens in a browser tab, you may want to instead Right click and select “Save Link As” 👁️

Objectives

By the end of this lab, you should be able to complete the following tasks in Stata:

Create, run, and save a do-file
Explore variables and generate new ones
Be able to find help with Stata issues - find new commands, check and debug your work, etc.

You will begin by loading the template do-file, dropping unused variables, and reporting how many variables remain. Then we are going to look at sample characteristics before excluding everyone under 23 and keep that restricted sample for the rest of the lab. Then, we will compare income and wages across age and gender groups and construct a post-secondary education indicator to investigate how the gender wage gap interacts with educational attainment.

Before you start typing commands, skim the dataset we will work with by opening it in Stata. We now use acs2024_2pct.dta, a 2.5% subsample of the 2024 American Community Survey.

Key commands

command	description
Viewing data
`tab var1`	tabulate one variable, `var1`
`tab var1, missing`	tabulate `var1`, include missing values
`tab var1, nolabel`	tabulate `var1`, show values rather than labels (if applicable)

Summarizing data
`tabstat var1`	calculate mean of `var1`
`tabstat var1,by(var2)`	calculate mean of `var1` separately for each value of `var2`
`tabstat var1,by(var2) stat(mean count p25 p50 p75)`	calculate mean of `var1` separately for each value of `var2`, with added statistics
Changing your data
`gen newvar = var1`	generate a new variable, `newvar`, and set it equal to values of `var1`
`gen newvar = 1 if var2 == [exp]`	generate a new variable, `newvar`, and set it equal to 1 if `var2` equals some expression, and missing otherwise
`gen newvar = var2 == [exp]`	generate a new variable, `newvar`, and set it equal to 1 if `var2` equals some expression, and 0 otherwise
`drop var1 var2`	drop the variables `var1` and `var2`.
`drop if [exp]`	drop observations for which `exp` is true
`keep var1 var2`	drop everything but `var1` and `var2`.
`keep if [exp]`	keep observations only if `exp` is true
Displaying your data
`graph twoway histogram var1`	make a histogram for `var1`. Check help files for more options
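
A few of these commands are easier to understand side by side. The sketch below assumes Stata's built-in auto dataset rather than the lab data; expensive and good_repair are hypothetical variable names, and the cutoffs 6000 and 3 are arbitrary.

```stata
sysuse auto, clear

* Labels versus the underlying numeric values
tab foreign
tab foreign, nolabel

* Mean, count, and quartiles of price for each value of foreign
tabstat price, by(foreign) stat(mean count p25 p50 p75)

* Indicator equal to 1 if price exceeds 6000, and 0 otherwise
gen expensive = price > 6000

* Stata treats missing as larger than any number, so rep78 > 3 would be
* true for missing values; the if condition leaves those rows missing
gen good_repair = rep78 > 3 if !missing(rep78)
tab good_repair, missing
```

The mean of a 0/1 indicator such as expensive is the share of observations for which the condition holds.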

Looking for more examples? Check out these Stata Cheat Sheets

Suppose I asked you to recreate your analysis from Lab 01. How long would it take you? If you used a do-file, you would just have to click a button, because your analysis would be replicable. We’re going to learn about the glory of do-files and a few other descriptive statistics tricks.

The instant gratification of the Command window is tempting, but getting comfortable with do-files will save you lots of time, make collaboration easier, and reduce errors!

Aside: Bad documentation, big problems

For an economist, the five most terrifying words in the English language are: I can’t replicate your results. But for economists Carmen Reinhart and Ken Rogoff of Harvard, there are seven even more terrifying ones: I think you made an Excel error.

– Matthew O’Brien, The Atlantic (18 April 2013)

A summary from The Conversation (22 April 2013):

Reinhart and Rogoff’s work showed average real economic growth slows (a 0.1% decline) when a country’s debt rises to more than 90% of gross domestic product (GDP) – and this 90% figure was employed repeatedly in political arguments over high-profile austerity measures…

The most serious was that, in their Excel spreadsheet, Reinhart and Rogoff had not selected the entire row when averaging growth figures: they omitted data from Australia, Austria, Belgium, Canada and Denmark.

In other words, they had accidentally only included 15 of the 20 countries under analysis in their key calculation.

When that error was corrected, the “0.1% decline” data became a 2.2% average increase in economic growth.

So the key conclusion of a seminal paper, which has been widely quoted in political debates in North America, Europe, Australia and elsewhere, was invalid.

Excel error (Business Insider)

Do-files and the do-file editor

You can get pretty far in Stata relying on the Command and Review window, but we may want a record of the commands we want to run for our analysis. One thing that makes Stata different from a program like Excel is that you can create do-files, essentially small programs that will run your analysis again and again, in exactly the same way. For econometric analysis this is CRUCIAL.

A do-file can be written in any text editor and then saved with the extension .do, but we’ll use the do-file editor. You can start a new do-file by clicking on the do-file button. Or, you can open the do-file template.

The do-file editor is where we will write our programs, and it has some nice color coding to help us avoid mistakes. For your problem sets and papers, you must ALWAYS submit a do-file along with your results. Some people will like to practice in the Command window and then copy the commands they’re satisfied with to the do-file, while others will prefer to work entirely in the do-file. It’s your call, though the second one is a little less risky.

Comment, comment, comment

Do-files are used to record your past work and possibly to share your work with others. It’s important to properly document your work using comments. There are three ways to comment:

Comment the whole line with an asterisk
Comment the whole line or part of a line with two forward slashes (//)
Use slash-asterisk to open (/*) and close (*/) a comment section

The do-file editor will turn all your comments green so you don’t get confused.
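
The three comment styles look like this in a do-file. This is a small sketch using Stata's built-in auto dataset as a stand-in. Note that // needs a space before it when it follows a command, and that the // and /* */ styles are meant for do-files, not for the Command window.

```stata
* This whole line is a comment
sysuse auto, clear   // load a built-in example dataset

/* This block comment
   can span several lines */
summarize price      // everything after the two slashes is ignored
```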

Programming tips

Put everything in a do-file! An important feature of any good research project is that the results should be reproducible. For Stata the easiest way to do this is to create a text file that lists all your commands in order, so anyone can re-run all your Stata work on a project anytime. Such text files that are produced within Stata or linked to Stata are called do-files, because they have an extension .do (like intro_exercise.do). These files feed commands directly into Stata without you having to type or copy them into the command window.

Imagine you’re just about done with the analysis for your research paper. While working on the final regression, you discover that one of your variables wasn’t cleaned properly, and you need to drop some outliers from the data. Do you correct it and redo everything from scratch? Could you even do that? How long would it take?

With a set of do-files, all you have to do is correct the variable early in the code, and re-run everything. If your code is quick, it will take just a few minutes. Easy!

An added bonus is that having do-files makes it very easy to fix your typos, re-order commands, and create more complicated chains of commands that wouldn’t work otherwise. You can now quickly reproduce your work, correct it, adjust it, and build on it.
Log your results. Maintaining logs can help you quickly retrieve results and serve as a record of past work in case you accidentally overwrite commands. Logs contain the commands and the results.
Never overwrite your original files. A good do-file structure starts with your original, raw data, then cleans and analyzes it to get your final results. A “master” do-file can piece all these together.
Replicability is key. Your code should be replicable to someone else who picks up your raw files and code.
Comment, comment, comment! Clear commenting is essential to help others understand your code and to remember what you did.
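
One way to put these tips together is sketched below. It is only an illustration: the file names example_analysis.log and auto_clean.dta are hypothetical, and Stata's built-in auto dataset stands in for your raw data.

```stata
* example_analysis.do (hypothetical file name)
clear all
capture log close              // close any log left open by an earlier run
log using example_analysis.log, replace

* 1. Load the raw data (never overwrite the original file)
sysuse auto, clear

* 2. Clean: create new variables instead of changing the originals
gen price_thousands = price / 1000

* 3. Analyze
summarize price_thousands
regress price_thousands mpg, robust

* 4. Save the cleaned data under a new name, then close the log
save auto_clean.dta, replace
log close
```

Because every step is in the file, fixing a mistake in step 2 only requires editing that line and running the do-file again.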

Finding new commands

One of the strengths of Stata is that complicated processes can be completed with simple commands. One of its weaknesses is that it’s not always obvious what those specific commands are. In our problem sets and your research paper, you will (I promise) have to calculate or estimate something in a way we haven’t covered.

Stata help file: help command
Search Stata documentation: findit keyword
Google/ChatGPT the thing you are trying to do
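
As a quick sketch, the built-in options look like this when typed in a do-file. The keywords are arbitrary examples, and lookfor (which searches the variable names and labels of the data in memory) is included because it is often the fastest way to find a variable.

```stata
help regress                // help file for a command you already know
search robust               // keyword search of Stata's help resources
findit heteroskedasticity   // broader search, same as: search heteroskedasticity, all
lookfor age                 // variables whose name or label contains "age"
```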

Lab Exercise 2

What do I submit?

Your written-up answers to exercise questions (1) - (11). This can be typed or written out, then scanned (or photographed). If scanning, please upload as a .pdf, not a .jpg or .png!
- Please put your answers in a separate file rather than your do-file. This makes it easier for us! Also, you’ll need to include at least one figure, which you cannot paste into a do-file.
The do-file you’ve created that runs this analysis
A log file that contains the results from this exercise.

Questions

If you haven’t yet done so, download our dataset, acs2024_2pct.dta, and the do-file template econ3500_lab_template.do. Move them to your labs folder
Open econ3500_lab_template.do and run it. Does it work? Probably not! Fix it until you can run the file from start to finish with no errors.
Drop some variables we don’t need right now: gq, serial, and hhwt. How many variables remain?
What is the age distribution of the sample? Specifically, report the mean, median, minimum, and maximum age of the sample.
Because very young workers might still be in school, drop anyone in your sample who is less than 23 years old (maintain this sample restriction for the rest of the lab). How many people are left in your sample?
Generate a new variable, lt35, that is equal to one if a person is less than 35 years old and 0 otherwise. What is the mean of lt35 and what is its interpretation?
Using the tabstat command, find the average income and wages for those under age 35 and those at least age 35. How does it compare to the median income and wages for each group?
Using the tabstat command, find the average income and wages for men and women.
There are several reasons why men might earn more than women. Suppose you hypothesized that men have completed more education than women, and workers with higher education levels earn more. We will test this in two ways.

a. First, generate a variable equal to one if a person has completed at least some post-secondary education, and zero otherwise. What is the mean of this variable?

b. What share of men have at least some post-secondary education? What about women?

c. We can also see if gender-wage gaps are bigger for lower vs. higher-educated workers. For those without post-secondary education, what is the average wage gap? For those with post-secondary education, what is the average wage gap?

d. Use the lt35 indicator you already created to compare the gender wage gap for younger workers (under 35) and older workers (35 and over). Does the gap appear larger in one age group? What might that tell you about experience or life-cycle effects?
Name two additional reasons that may explain why men’s income is higher than women’s income on average. How would you test each one? You do not have to actually do this test, just describe in as much detail as possible. You can assume you have additional data beyond what is provided here.
Make two histograms, one of the income distribution for men and one of the income distribution for women. Make sure the y-axis indicates the “fraction” of individuals, not the density. Copy and paste it into your responses.

Video Recording

Problem set 1

Fri, 16 Jan 2026 00:00:00 +0000

Welcome

Welcome to Problem Set 1! There are some “problem” exercises and one extended Stata exercise. You’ll need to submit two things on Brightspace: a problem set and log file. If you have trouble with the Stata basics, head back to Lab 1.

Tip: If, after doing these problems, you still want more practice, the odd-numbered Exercises (not Empirical Exercises) in Chapters 2 and 3 of Stock and Watson are quite useful, and solutions are available online.

Note that there are superscripted numbers¹ throughout the page that provide additional information/suggestions to help you out.

Exercises


X	-1	0	1	4
$P(X=x)$	0.25	0.30	0.40	0.05

Consider the above random variable, $X$, with its associated probability distribution:

a. Draw the probability distribution function and the cumulative distribution function. (That is, you should make a figure/graph!)

b. What is the expected value of X? That is, what is $E[X]$?

c. What is the variance of X?
The following table gives the joint probability distribution between employment status and college graduation among those either employed or looking for work (unemployed) in the working-age U.S. population for 2017:

	Unemployed ($Y=0$)	Employed ($Y=1$)	Total
Non-college grads ($X = 0$)	0.026	0.576	0.602
College grads ($X = 1$)	0.009	0.389	0.398
Total	0.035	0.965	1.000

a. Compute $E[Y]$.

b. The unemployment rate is the fraction of the labor force that is unemployed. Show that the unemployment rate is given by $1 - E[Y]$.

c. Calculate $E[Y|X=1]$ and $E[Y|X=0]$.

d. Calculate the unemployment rate for college graduates and for non-college graduates

e. A randomly selected member of this population reports being unemployed. What is the probability that this worker is a college graduate? A non-college graduate?

f. Are educational achievement and employment status independent? Explain.
Compute the following probabilities²:

a. If $Y$ is distributed $N(1,4)$, find $Pr(Y \leq 3)$

b. If $Y$ is distributed $N(3,9)$, find $Pr(Y >0 )$

c. If $Y$ is distributed $N(50,25)$, find $Pr(40 \leq Y \leq 52)$

d. If $Y$ is distributed $N(5,2)$, find $Pr(6 \leq Y \leq 8)$
For a randomly selected county in the United States, let $X$ represent the proportion of adults over age 65 who are employed (the elderly employment rate). Then, $X$ is restricted to a value between zero and one. Suppose that the cumulative distribution function for $X$ is given by $F(x) = 3x^2 - 2x^3$ for $0 \leq x \leq 1$.

a. What is the probability that the elderly employment rate is at least 0.5 (50%)?

b. What is the probability that the elderly employment rate is between 0.4 (40%) and 0.6 (60%)?

In any year, the weather can inflict storm damage to a home. From year to year, the damage is random. Let $Y$ denote the dollar value of damage in any given year. Suppose that in 95% of the years, there is no damage ($Y=0$), but that in 5% of the years, $Y = 20000$.

a. What are the mean and standard deviation of the damage in any year?

b. Consider an “insurance pool” of 100 people whose homes are sufficiently dispersed so that, in any year, the damage to different homes can be viewed as independently distributed random variables. Let $\bar{Y}$ denote the average damage to these 100 homes in a year. (i) What is $E[\bar{Y}]$? (ii) What is the probability that $\bar{Y}$ exceeds \$2000?

Grades on a standardized test are known to have a mean of 1000 for students in the United States. The test is administered to 453 randomly selected students in Florida; in this sample, the mean is 1013 and the standard deviation ($s$) is 108.

a. Construct a 95% confidence interval for the average test score for Florida students

b. Is there statistically significant evidence that Florida students perform differently than other students in the United States? How do you know?

For the following question, make sure you submit your log-file alongside your answers!

Download countymurders.dta to answer this question.³ The variable $murders$ is the number of murders reported in the county. The variable $execs$ is the number of executions that took place of people sentenced to death in the given county. Most states in the United States have the death penalty, but several do not.

a. Keep only data from the year 1996. How many counties are there in the data set? Of these, how many have zero murders? What percentage of counties have zero executions?

b. What is the largest number of murders in a county? What is the largest number of executions in a county?

c. Compute the correlation coefficient $r$ between murders and execs and describe what you find.⁴ Estimate the correlation coefficient between murdrate and execrate. Why do the two coefficients differ so much?

d. What are two characteristics in the data that are highly correlated with county murder rates?⁵ What are their correlation coefficients?

e. What is median real per-capita income?⁶

f. Generate a variable, highinc that is equal to 1 if a county has above-median real per capita income, and 0 otherwise. What is $E[rpcpersinc | highinc =0]$? What is $E[rpcpersinc | highinc =1]$?

g. Consider a two-sided hypothesis test of whether murder rates are different between counties with high (above median) vs low (below median) real per-capita personal income. Assume the two samples are independent, with equal variances. (i) First, write a null and alternative hypothesis. (ii) Use Stata to conduct the hypothesis test. What is the relevant t-statistic?⁷ (iii) Can you reject the null hypothesis at the 5% level?

h. Generate a variable, perc1029, that is equal to the share of the population between the ages of 10 and 29. What is the median share of the population by county that is ages 10-29?

i. Generate a variable, young, that is equal to 1 if a county has an above-median share of the population that is age 10-29, and 0 otherwise. What is $E[perc1029| young = 0]$? What is $E[perc1029| young =1]$?

j. Consider a two-sided hypothesis test of whether murder rates are different between counties with a high (above-median) share of the population ages 10-29 versus a low share. Assume the two samples are independent, with equal variances. (i) First, write a null and alternative hypothesis. (ii) Use Stata to conduct the hypothesis test. What is the relevant t-statistic? (iii) Can you reject the null hypothesis at the 5% level?

Sources

countymurders.dta

Source: Compiled by J. Monroe Gamble for a Summer Research Opportunities Program (SROP) at Michigan State University, Summer 2014. Monroe obtained data from the U.S. Census Bureau, the FBI Uniform Crime Reports, and the Death Penalty Information Center.

See?! Neat! :) ↩︎
Remember that we conventionally write $N(\mu,\sigma^2)$, so the second term is the variance, not the standard deviation. ↩︎
Remember to move this file into your working directory! ↩︎
Remember, you can use correlate var1 var2 to look at the correlation between two variables. ↩︎
If you want to look at the correlation between lots of variables, you can use correlate var1 var2 ... var99. If you want to refer to a lot of variables, an asterisk (*) can act as a “wildcard.” So if you use correlate var*, you’ll receive a correlation matrix of every variable with a name that starts with “var.” If you use correlate *var*, it will give you a correlation matrix of every variable with the letters “var” somewhere in the name. ↩︎
tabstat is your friend! ↩︎
The help file for ttest will be useful. Here we are conducting a two-sample t-test using groups. You will want to use the highinc variable you generated earlier. ↩︎
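The commands mentioned in these hints can be tried out on a dataset you already have. Here is a minimal sketch, assuming Stata's built-in auto example dataset (not countymurders.dta); the variables price, mpg, trunk, turn, and foreign belong to that example dataset, not to the problem set.

```stata
* Load the example dataset that ships with Stata (an assumption; not the problem set data)
sysuse auto, clear

* Correlation between two variables
correlate price mpg

* Wildcard: every variable whose name starts with "t" (here, trunk and turn)
correlate t*

* Median of a single variable
tabstat price, statistics(median)

* Two-sample t-test using groups; equal variances are assumed by default
ttest price, by(foreign)
```

If you do not want to assume equal variances, ttest also accepts the unequal option.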

Lab 1: Introduction to Stata

Tue, 13 Jan 2026 00:00:00 +0000

Download print-friendly version (pdf)

Materials

driving_2004.dta

Objectives¹

By the end of this tutorial you should be able to complete the following tasks in Stata:

Identify key areas of the Stata interface
Open a data file
Understand what a working directory is
Summarize and tabulate data
Make a variable
Create and save a log file
Open, view, and save a data file
Ask Stata for help

If you need more help, check out Stata Resources.

For the hardcore R users in this class who prefer to use R throughout, you may complete this lab in R. But, it could also be fun to learn a little Stata!

General command structure

do {something} ... with {variable(s) x}...if {something is true..}, options

Note: In this lab, you may type commands directly into the Command window. Later in the course, we will use do-files, which allow you to save and rerun your code. For now, your log file will serve as a record of your work.
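To make the template concrete, here is a minimal sketch that maps each piece onto one real command. It assumes Stata's built-in auto example dataset rather than the lab data, so price and foreign are variables from that example file.

```stata
* Load the example dataset that ships with Stata (an assumption; not the lab data)
sysuse auto, clear

* do {something}        -> summarize
* with {variable(s) x}  -> price
* if {something is true} -> foreign == 1
* , options             -> detail
summarize price if foreign == 1, detail
```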

Key commands

command	description
`log using logfile1.log`	open and log using `logfile1.log`
`log close`	close log
`use dataset.dta, clear`	open dataset `dataset.dta` , clear out old one
`describe var1 var2 ...`	characteristics of `var1`, `var2`, etc.
`browse var1 var2 ...`	open data browser, display `var1, var2 ..`
`lookfor text1`	search for text1 in variable names/descriptions
`tabulate var1`	make a frequency table of `var1`.
`tabulate var1 var2`	make a cross-tabulation of `var1` and `var2`.
`summarize var1`	descriptive statistics for `var1`.
`summarize var1 , detail`	detailed descriptive statistics for `var1`.
`gen var1 = binexp`	generates a variable `var1` equal to 1 if binary expression true, 0 otherwise
`replace var1 = 0 if binexp`	sets `var1` to 0 if binary expression true, leaves it unchanged otherwise
`help command`	open help files for `command`.

Logic statements

These are some common logical statements

operation	command
and	&
or	| (vertical bar, on same key as “\”)

equal to	==
not equal to	!=

greater than	>
less than	<
greater than or equal to	>=
less than or equal to	<=

tab bac10 if gdl==1 & sl70plus == 0
- Tabulates the variable bac10 but only if gdl equals one and sl70plus equals 0
tab bac10 if year >=2000
- Tabulates the variable bac10 for the years 2000, 2001, 2002, etc.
tab bac10 if year !=2000
- Tabulates the variable bac10 for every year but 2000
tab bac10 if year < 2008 & year > 2005
- Tabulates the variable bac10 for 2006 and 2007
tab bac10 if year < 2008 | year > 2005
- Tabulates the variable bac10 if year is less than 2008 OR greater than 2005 (all years!)

Also: you can use parentheses to group terms appropriately. For example, if you want to tabulate states where the speed limit is 55 or 65 AND the blood alcohol limit is 0.10, then this is wrong:

tab state if sl55 == 1 | sl65 == 1 & bac10 == 1

But this is correct!

tab state if (sl55 == 1 | sl65 == 1) & bac10 == 1

Thanks, parentheses!
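Why do the parentheses matter? Stata evaluates & before |, so the version without parentheses is read as sl55 == 1 | (sl65 == 1 & bac10 == 1). The sketch below shows the difference on four made-up observations (the data are invented for illustration; only the variable names come from the lab).

```stata
* Toy data, invented for illustration: four hypothetical states
clear
input sl55 sl65 bac10
1 0 1
1 0 0
0 1 1
0 0 0
end

* No parentheses: & is evaluated first, so row 2 (sl55 == 1, bac10 == 0) is counted
count if sl55 == 1 | sl65 == 1 & bac10 == 1

* Parentheses: the "or" is evaluated first, so row 2 is dropped
count if (sl55 == 1 | sl65 == 1) & bac10 == 1
```

The first count returns 3 and the second returns 2; the extra observation is the state with a 55 mph limit but no 0.10 blood alcohol limit.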

Guided instructions

Hey, Stata. It’s nice to meet you

Start by opening Stata. You should have a window that looks something like this (on a PC):

You should now have the Stata window open. There is a set of pull down menus as well as 4 smaller windows: Review, Variables, Results, and Command.

Also especially helpful are the following buttons:

Log files

If you want to record anything that you do in a Stata session so that you can look at results or commands later, you need to open a log-file. A log-file is simply a record of all the commands you enter into Stata and the output from those commands. The key is to make sure you have a log file open at the beginning of a Stata session, and to close it once you have finished, and before you close Stata.

There are three ways you can open a log file:

Go to the FILE dropdown menu, choose Log, choose Begin. You should see a “Begin Logging Stata Output” dialog box. Browse to a directory where you can store your log file and type in the following file name in the File Name space: lab1.log
Click on the log icon at the top of the Stata workspace (right of the print button). When you click on the log button, the “Begin Logging Stata Output” dialog box pops up. Name your log file as above.
You can open a log file by typing the following in the Stata command window: log using lab1.log, replace

The , replace is optional. If you add it as an option, your new file will overwrite your old one. Or, you can add , append to add it to the bottom of your old log file.

Tip: Use extension .log, NOT the default .smcl. This will make it easier for you to edit, cut and paste your log in any text editor.
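Put together, a logged session might look like the sketch below. The display line is only a stand-in for whatever commands you run.

```stata
* Start a plain-text log; replace overwrites an existing lab1.log
log using lab1.log, replace

* Your commands go here; their output is recorded in the log
display "Hello, Stata"

* Close the log before you exit Stata
log close
```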

Now that you have a log file open, we can start our Stata session.

Working directories (important!)

Stata looks for data files in its working directory. To see your current working directory, type:

pwd

If your data file is not located there, Stata will not find it. You can change your working directory using:

cd "/path/to/your/folder"

Once you run this command, Stata will make this working directory its starting point, but only for the rest of the session. The next time you open Stata, you may need to repeat the process.

I recommend creating a folder for this class (e.g., econ3500/labs/) and saving both your data and log files there.

Opening data files

Stata data files end with the extension .dta, and they can only be read by Stata. You can import text files and Excel files into Stata, and you can export .dta files into text files or Excel files, but we’ll cover this later.

There are three ways to open a data file:

Outside Stata, double click on the data file you want to open
Use the FILE/OPEN drop down menu in Stata and open the data set that you copied into your folder. Note that in the command window, the use command appears. We’ll use that one later.
Type use filename.dta, clear into the command window within Stata. The option , clear tells Stata to remove any data currently in memory. Stata can only hold one dataset at a time.²

Download driving_2004.dta and open it. I recommend moving it to your brand new class folder first. It is a file of driving laws, vehicle accidents, and fatalities in the United States in 2004.

You should now see the list of variables appear in the Variables window, with the variable name, variable label, and some other information.

Looking at data

Let’s take a more detailed look at the variables in the dataset.

In the command window type: describe

At the top of the output you will see some overall features of the file, including the number of variables. Below that you will see a list of every variable, including the variable name, the “storage type” (byte, float, int, etc.) and the variable label. If you see –more– at the bottom of your screen, press the space bar to continue scrolling.³

To learn more about the variables and the organization of the data, use the browse command. Type: browse (or click on the “browse” button).

Another approach is to add a variable list to the browse command. Type the following:

browse year sl70plus bac10 bac08 gdl

Again, note that you can also double click on the variable names so you don’t have to type them all!

This command directs you to a spreadsheet inside Stata where the data appears. This looks a lot like an Excel spreadsheet!

Note the following:

Each observation appears on a separate row of the spreadsheet, which represents data from a certain year and a certain state. For example the first row is for state 1 (Alabama) in 1980. If you move along the row, you can see other characteristics about Alabama in 1980.
Each variable appears in a separate column, and the name of the variable is at the column heading.

How many observations are there? What type of data set is this?

Examining variables

Let’s look at the variables that are included in the data set. There is an efficient way to find the names of variables you are interested in. Suppose you are interested in a variable related to alcohol laws. Type in:

lookfor alcohol

This will give you a list of all the variables that have “alcohol” in either their variable name or variable label. In this case, two variables appear - bac10 and bac08.

You can also experiment with all possible combinations of the col, row, and cell options, and add the nofreq option to suppress the number of observations. Use help for details:

help tab

When you are analyzing variables, you will want to think carefully about whether you should be looking at row percentages, column percentages, or cell percentages.
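To see how the three kinds of percentages differ, here is a minimal sketch on six made-up observations (the data are invented; gdl and sl70plus are used only as familiar variable names).

```stata
* Toy data, invented for illustration
clear
input gdl sl70plus
1 0
1 0
1 1
0 1
0 0
0 1
end

* Row percentages: within each value of gdl, the split across sl70plus
tabulate gdl sl70plus, row nofreq

* Column percentages: within each value of sl70plus, the split across gdl
tabulate gdl sl70plus, column nofreq

* Cell percentages: each cell as a share of all observations
tabulate gdl sl70plus, cell nofreq
```

Row percentages sum to 100 across each row, column percentages down each column, and cell percentages over the whole table.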

Creating new variables

You can create new variables using the generate command. For example:

gen highfatal = fatal_rate > 1.5

This creates a variable equal to 1 when the condition is true, and 0 otherwise.

You could create the same variable in a slightly different way:

gen highfatal = 1 if fatal_rate > 1.5
replace highfatal = 0 if fatal_rate <= 1.5
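One thing to watch for when generating variables from a comparison: Stata treats a missing value (.) as larger than any number, so fatal_rate > 1.5 is true when fatal_rate is missing. A minimal sketch on made-up values (invented for illustration) shows the issue and one way to guard against it.

```stata
* Toy data, invented for illustration; the last observation is missing
clear
input fatal_rate
1.2
1.5
2.0
.
end

* The missing observation is coded as 1, because . > 1.5 is true
gen highfatal = fatal_rate > 1.5

* Restricting to non-missing values leaves the new variable missing instead
gen highfatal2 = fatal_rate > 1.5 if !missing(fatal_rate)

list
```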

Videos

Lab 1 Overview

Lab Exercise 1

First, work through the above steps. Then, work through the 7 questions below.

What do I submit?

Your written up answers to exercise questions (1) - (7). This can be typed or written out then scanned (or photographed), in any reasonable format
A log file that contains the results from the steps prior to the exercise and the exercise itself.

If you struggled or explored, this might get excessively long! You have three choices: (1) submit it anyway, (2) open it in a text editor and manually delete the nonsense, or (3) close your log file and start a new one, and this time run through your code with less backtracking. Option (1) is completely fine.

Questions

How many states have graduated drivers license laws (GDLs)? How many states have speed limits of 70 mph or higher (including no speed limit)?
What percentage of states with GDLs and with low speed limits (below 70 mph) have blood-alcohol limits of 0.10 (the more lenient level)? Note that some states have a blood-alcohol limit in place for only a fraction of a year. If so, consider having a limit of 0.10 in place for part of the year as having a limit.
What is the mean fatality rate per 100 million miles across all states? What is the standard deviation?
What was the fatality rate (deaths per 100 million miles) in Vermont? (Vermont is state 46)
Generate a variable $Y$ equal to one if a state has a fatality rate per 100 million miles that is above the mean, and zero otherwise. What is $E(Y)$?
Write a joint probability distribution table for the following two random variables: $X$, a random variable equal to one if a state has a speed limit of 70 or greater and zero otherwise (see sl70plus), and $Y$, the random variable developed in the previous part.
Look up the command correlate in the help files: What is the correlation coefficient between nighttime fatalities per 100,000 population and weekend accidents per 100,000 population? Why might this correlation be so strong?

This lab draws heavily on Anne Fitzpatrick’s (UMass-Boston) excellent materials. ↩︎
Yes, this is a pain. ↩︎
If you are tired of dealing with the “more” issue, you can type set more off into the command window to enable continuous scrolling for your session. If you’re just done with it, try set more off, perm to enable continuous scrolling for this and all future sessions. ↩︎

Research Paper: Idea Proposal

Sun, 11 Jan 2026 00:00:00 +0000

The assignment

Come up with three research ideas and give me a bit of detail. It should include the following, and will likely span 2-3 paragraphs per research idea:

A research question
A hypothesis/hypotheses about that research question
Proposed data set (within the IPUMS universe).
For example: “I will use the 1990 and 2000 U.S. Census, focusing on the Northeast.” But not: “CPS.”
Happy for you to propose a non-IPUMS research question, but you must include at least one IPUMS-based question.
A rough plan of analysis. How will you answer your research question? What key variables will be important?

All ideas should meet the basic criteria in our assignment overview (relevant to economic theory, answerable using data, use cross-sectional or panel data)

Coming up with an idea

For some of you, this may be the best. For me, being told, “think of an idea … go!” is the WORST. That’s okay!

So here’s some advice.

Open up IPUMS and start digging through variables and datasets. What looks interesting?
Read Nick Huntington-Klein’s excellent chapter on research questions. (The rest of it is great, too!)
Jot down every question you can think of. If you need inspiration, open up a newspaper of your choice. I highly recommend the NYTimes Upshot, which has lots of data-driven, economics-linked questions.

Start with some suggested research ideas and iterate from there.

Rubric

You will be graded on four criteria for each question (8 points per idea, 24 points total):

	Partially meets	Fully meets
Research question is clearly stated, specific, and answerable	1	2
Hypothesis proposed and explained	1	2
Feasible data set(s) explicitly identified	1	2
Rough description of how will test hypothesis to answer research question	1	2

Research Paper: Overview

Tue, 06 Jan 2026 00:00:00 +0000

One main product of the course is an original research paper you’ll produce that incorporates econometric data using the methods we’ve learned in class.

You can work alone or in pairs. (No groups of three!)

Learning objectives

Develop clear, answerable research questions and link them to economic theory.
Identify and apply appropriate econometric methods to answer research question, recognizing necessary assumptions and limitations
Conduct and interpret original data analysis using Stata
Strengthen written and oral communication skills

Overview of requirements

All the nitty-gritty is below, but here’s a general sense of what I’ll ask you to do:

You will write a journal-style paper in which you ask and answer an economic question, relying on regression analysis that you conduct with cross-sectional or panel data.
You’ll apply the various econometric techniques we’ve worked on throughout the semester.

The assignment specifications depend on whether you are working alone or in pairs:

—	Alone	Pairs
Word count	2500-4500	3500-5500
Tables	3-4	4-5

If you work in pairs, you will submit all assignments jointly, except for your referee report.

Topic selection

Select a research question that is interesting to you and answerable with data that you can obtain. Your question should accomplish the following.

It must have clear relevance to economic theory.
It must be answerable using data (with a sample size of at least 100, ideally much (much) higher!)
It may not be an exact replication of previous work. It may, however, be an extension.
It must use cross-sectional or panel data. There are lots of interesting time-series questions, but we will not cover these topics in ECON3500. I highly, highly recommend that you work with data from IPUMS. I am open to other large data sets ($n>100$, ideally $n>1000$), but these require prior approval.
A strong paper will make a reasonable attempt at causal identification.

Where folks get into trouble: The biggest problem I see is that people get very excited about an interesting question, but they don’t necessarily have the data they need. In order for us to be able to have reasonable standard errors and good asymptotic properties, I recommend questions that (a) have a data set you can access and (b) have at least 100 observations. I would strongly discourage you from any country or state cross-sections that don’t also have a time component. I also would be wary about studies predicting athlete performance - our methods work well, but the link to economics is often tenuous.

A quick note on GenAI

Now is a good time to refer to our class Gen AI policy! TLDR; ChatGPT can be your friend, but no more.

Research paper process

Research ideas

Prepare a set of 3 research ideas. An “idea” only needs to consist of about 2-3 paragraphs, which should include a research question, a hypothesis, a proposed data set, and a rough plan of analysis for testing your hypothesis.

Annotated bibliography

To make sure your question ties closely to the economic literature, you’ll prepare an annotated bibliography that identifies useful papers and summarizes them in relation to your question.

Research proposal

From the list of topics, choose and develop one research idea for your research proposal. You’ll write up a research proposal of at least 1200 words. This proposal should provide as much detail as possible to help me and your classmates assess your plan and provide useful feedback.

Peer review

A classmate will provide a peer review of your proposal, with feedback to help you turn your proposal into a final paper.

Rough draft (optional)

You may submit one rough draft to me for comments. This is optional, but I highly recommend you do it, because the early deadline can help you stay on track, and you’ll have a chance to get an early sense of how things are going.

Presentation

You’ll make a brief (6-8 minute) presentation of your paper in the final week of class. I will provide specifics later.

Final submission

Your final draft will be due on May 04. Please make sure to review all the requirements carefully!

Paper components

A number of excellent guides can help you put together an effective and interesting research paper. I’ve provided a set of paper resources.

Your paper should include the following elements.

Abstract & Title

You’ll need a title and an abstract

Descriptive title
Abstract that summarizes the paper and findings in 250 words or fewer

Introduction

In an economics paper the introduction stands alone!

That is, a busy (or tired) person could read the introduction and understand what you did, what you found, and why it matters. Our papers are not mystery novels—there’s no need for a plot twist on page 8!

I recommend following this introduction formula, which is written for folks writing a longer academic paper, but the principles are still solid.

Guidelines and structure

Introduction reads like an academic article. Motivates, describes what you do and what you find. (Almost like a mini-paper!)
- Reader can infer all main points of paper just from introduction
States your research question clearly
Explains what economic theory says about the potential answers to your questions, and/or defines clear hypotheses that you test
Describes why your topic is important
Describes what you do
Describes what you find
Describes how it contributes to our knowledge

Background/Literature Review

What you include here depends on topic. Sometimes the reader needs to know how your question links to economic theory. Sometimes it’s more important to know specific context first, and then to turn to the literature. Sometimes it’s most important to summarize what the literature already knows. Your call.
At the back of your mind, when motivating your paper, ask “what is the link to economics”?
- If studying discrimination, what does economic theory tell us about why discrimination exists/persists
- If studying stock market returns, what do economic models tell us about our ability to predict returns?
Includes papers that have answered your research question (or similar research questions)
Research results described in present tense (“Smith finds,” not “Smith found”)
Papers are put in context. That is, rather than just listing paper A and finding, paper B and finding, etc, you link each one (or group) to their contribution (as relates to your research question)

Methodology/Data

Describe the data you use, where did it come from? If you didn’t create it, cite it
What is the unit of observation? Is it people, households, states, etc? Make sure the unit is appropriate to your question
If you’re working with individual-level data, what is the age range you want in your sample? What years of data do you need?
If dealing with labor force variables, do you want all people of working age, all those who are in the labor force, or all who are employed?
Describe your methodology. Are you estimating a model using OLS? If so, say so.
Clarify whether we are looking at causal estimates or something else. What are the estimated parameters of interest? What do they mean?
Correct standard errors: robust? Clustered? Something else?
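In Stata, the choice of standard errors is made with the vce() option of regress. Here is a minimal sketch, assuming the built-in auto example dataset; clustering on rep78 is purely illustrative (it has far too few groups to be a sensible clustering variable in practice).

```stata
* Example dataset that ships with Stata (an assumption; not your paper's data)
sysuse auto, clear

* Heteroskedasticity-robust standard errors
regress price weight mpg, vce(robust)

* Standard errors clustered on a grouping variable (illustrative choice only)
regress price weight mpg, vce(cluster rep78)
```

Which option is appropriate depends on your data structure, so say in the paper which one you used and why.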

Population model

Write out your population model!

If you’re using Word, use equation editor. Make it look nice.
Don’t forget the error term!
Use proper equation notation ($\beta$, $u$, etc)
Use appropriate subscripts ($i$, $t$, $y$, etc)
All relevant variables explained/defined
Use descriptive variable names when possible (ie use $female$ for women, not $w1$)
Make sure your variables are written correctly - an equation like $wage = \alpha_0 + \alpha_1 race$ doesn’t make sense - race isn’t continuous!
If you are using a lot of categorical variables and find it awkward to write them out, you can simplify:
- Showing that you have state fixed effects:
$$y_{st} = \beta_0 + \beta_1 X_{st} + \beta_2 Z_t + \dots + f_s + u_{st}$$

and then in the text, “…where $f_s$ is a vector of state fixed effects”
- Including a set of occupational dummy variables $$y_{st} = \beta_0 + \beta_1 X_{st} + \beta_2 Z_t + \dots + \sum^K_{k=1}\delta_k D_k + u_{st}$$
and in the text, “…where $D_k$ is a dummy variable for occupation $k$, for $k = 1, \dots, K$” (or something in that general spirit)
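When you estimate a model like this in Stata, you do not need to create each dummy by hand. A minimal sketch, assuming the built-in auto example dataset and treating rep78 as a stand-in for a categorical variable such as state or occupation:

```stata
* Example dataset that ships with Stata (an assumption; not your paper's data)
sysuse auto, clear

* The i. prefix adds one dummy per category of rep78,
* leaving out a base category (by default the lowest value)
regress price weight i.rep78
```

The category that Stata leaves out is the omitted category you will need to name when you interpret the coefficients.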

Results

When using categorical/dummy variables, what is your omitted category? Make sure you know and that it’s clear.
What are the units of your measures? Is that percent or percentage points?
Discuss using a reasonable number of decimal places (usually only 1 or 2)

Limitations or Discussion

Include as a separate section or integrate into results
What might keep us from making causal interpretations about our coefficient of interest?
- Omitted variable bias?
- Reverse causality?
- Measurement error?
Are the results externally valid?
What other considerations are important?

Conclusion

Brief summary of paper (yes, another summary)
Limitations (summary of limitations/discussion section)
Implications for policy (if relevant)
Implications for future research

Tables

You will need the following tables:

Descriptive statistics: This will present key information about your data set that we will need to understand your context. Choose relevant variables to describe, including key dependent and independent variables
Main regression results: This will be a table of your key specifications. You may have the results from a few regressions in the same table. It’s this table that would be the “takeaway” table
Secondary regression results: Results that help dig deeper, consider subgroups, consider related hypotheses or outcomes, etc.

How you arrange regressions between (2) and (3) will depend on how you structure your argument.

Additional tables (especially two-person papers) will extend your analysis through other modeling approaches, other dependent variables, additional displays of robustness.

You may also include figures, but they would not substitute for the required tables unless the figures themselves present new results.

Please embed tables near the place where they are referenced (rather than at the end)
Tables should be properly formatted. That is, they should be made in Excel (or LaTeX) and NEVER copied and pasted out of Stata raw output
Variables should be described using real words. That is, “number of children,” not “numchld.”
Tables and figures should be numbered (Table 1, Table 2, etc… Figure 1, Figure 2, etc.) and should also be given a title. Refer to tables by their numbers in the text.
Regression tables include standard errors. Use stars to indicate statistical significance. (The Stata package outreg2 is a big help!)
In most contexts, about 3 places past the decimal point is right, but it depends on the magnitudes. If you really want to be precise, set and stick to a reasonable number of significant digits. There’s no place for a number like 0.05403823 or 0.0000000 in your tables.

References

You’ll use outside sources in your introduction and background/literature review, at a minimum. Make sure that you have (1) at least 5 academic sources (published academic journals), and (2) at least 8 sources total (could also include working papers, newspaper articles, policy papers, etc.)
Make sure to cite your data (does not count for totals above)
Use footnotes, not endnotes
At the end of your paper, include list of references cited
You can format using APA, MLA, or Chicago style, but it must be a consistent style
- Citation Owl or Google Scholar will do it for you
- Microsoft Word’s bibliography management system can be hard to work with. Beware!
In-text, cite with author and year (Author, Year; Author, Year) or (Author Year, Author Year)

Replication package

You must also submit materials such that I can replicate your analysis easily. How you do this exactly is up to you, but it must include the following:

Stata do-file that replicates your analysis
- This should be “push-button.” That is, I should be able to load it, adjust the main file path, and run it once to replicate all your results.
- The do-file should therefore declare one file path at the beginning of the file rather than using local paths throughout
Raw data file.
Log file that shows your results from running your do-file
If anything might seem confusing to me, include a Readme file to tell me what to do!
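One way to set up a push-button do-file is sketched below. Everything named here is hypothetical: the folder path is a placeholder, and results.log and rawdata.dta are made-up file names that you would replace with your own.

```stata
* master.do: hypothetical skeleton for a push-button replication file
clear all
set more off

* The ONE path a replicator needs to edit (placeholder, not a real folder)
global root "/path/to/replication/folder"

* Close any log left open, then start a fresh plain-text log
capture log close
log using "$root/results.log", replace text

* Load the raw data (hypothetical file name)
use "$root/rawdata.dta", clear

* Data cleaning, descriptive statistics, and regressions go here

log close
```

Because every file reference is built from $root, changing that single line is enough to run the whole analysis on another computer.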

These do not count toward your page limit, and they should be submitted as separate files.

How to submit?

Ideally, zip your files and then upload directly to Brightspace.
If the file is too large, use UVM File Transfer
I’m open to other options depending on your preferences - Github repository, shared Google folder, etc.

Style

Use present tense and first-person active voice! (I estimate a regression, NOT “A regression is estimated”)
- Single-authored paper first person singular, “I.” (You’re not the queen!)
- Joint-authored paper first person plural, “we.”
- Don’t believe me? Check out any economics paper published in the past 20 years. There’s some variation in I vs. we, but a lot of active voice.
Divide paper into numbered, labeled sections.
A research paper is not an essay!
- Personal opinions don’t have a place
- Sources should be primarily academic (peer-reviewed journals, working papers, etc.), maybe some non-academic sources for motivation only
- Clear, labeled sections

Deadlines

See course schedule for deadlines. Submit materials by 11:59pm on the deadline unless otherwise specified. Submit all assignments via Brightspace. (Late assignments without an extension will be penalized 10% per day, and they may not receive detailed feedback.)

Grading

I will provide formal or informal grading rubrics for each component, so you have a clear idea of how you’ll be graded. As the syllabus shows, your total research paper score will account for 35% of your final grade.


Process			30%
	Research ideas	5%
	Annotated bibliography	5%
	Research proposal	10%
	Peer review	5%
	Rough draft	0%
	Presentation	5%
Final draft			70%

FAQ

How does my group size affect grading? The grading rubric is the same regardless of your group sizes. However, I expect that in a larger group, your analysis will go deeper, your review of the literature will be more comprehensive, you’ll have additional robustness or placebo tests, etc. See the page requirements for a guide. If you have questions, feel free to talk with me in more detail.
Can I turn in a paper with 10 pages of text and 3 tables, or 10 pages of tables and 5 pages of text? So long as the word and exhibit count are met, there’s no “right” place to be in that! What matters most is that your paper clearly addresses your research question.
My results aren’t statistically significant, should I start over? NO. Remember that our goal here isn’t to find statistically significant relationships, it’s to answer questions. Let the data speak for itself about what relationships are or are not there.
How should I format my citations and bibliography? Consistently. APA or Chicago is fine. MLA is not.
How much data analysis do I need to do? You should incorporate data analysis to answer your research question or test your hypotheses. You may also use data to provide some descriptive statistics, however that alone would not be sufficient. Exactly how much analysis is involved will depend on the question you pose and your approach to answering it.
Do I have to use Stata? You can use an alternative programmable language like R or Python. Your analysis should not be conducted in Excel. My ability to support your programming in languages besides Stata is more limited.
Do I have to use IPUMS data? I’m open to other possibilities, but it must be approved by me first.
What if I’ve worked on this topic for another class? This can work, but first talk to me so we can figure out a plan that ensures you’re building beyond what you’re already doing.

Recommendations

See paper resources for dataset and topic suggestions.


Version: Spring 2018
EC200 Econometrics and Applications

Problem Set 3

The following table shows, for eight vintages of select, delicious wine, purchases per buyer ($y$) and the wine buyer’s rating ($x$) in a given year:

$x$ 3.6 3.3 2.8 2.6 2.7 2.9 2.0 2.6

$y$ 24 21 22 22 18 13 9 6
1. Estimate by hand the regression of purchases per buyer on the buyer’s rating.
2. Interpret the slope of the estimated regression line.
3. Interpret the intercept of the estimated regression line.
(Stock and Watson 4.2) Suppose that a random sample of 200 20-year-old men is selected from a population and that these men’s height and weight are recorded. A regression of weight (measured in pounds) on height (measured in inches) yields

$$\widehat{Weight}=-99.41 + 3.94 Height$$

$R^2 = 0.81$; $SER = 10.2$
1. What is the predicted weight for someone who is 70 inches tall? 65 inches tall?
2. One 20-year-old man has a late growth spurt and grows 1.5 inches over the course of the year. What is the regression’s prediction for the increase in his weight?
3. Suppose that you want to translate the results of this equation into centimeters and kilograms. What are the regression estimates from this new regression? Give all results, including estimated coefficients, $R^2$, and SER.
4. Interpret the $R^2$ value. Does it indicate anything about whether these estimates are likely to be biased? Explain.
(Stock and Watson 5.2) Suppose that a researcher, using wage data on 250 randomly selected male workers and 280 randomly selected female workers, estimates the following OLS regression:

$$\begin{aligned} \widehat{Wage}=&12.52 + &2.12 Male\\ &(0.23) & (0.36)\end{aligned}$$

$R^2 = 0.06$; $SER = 4.2$

where $Wage$ is measured in dollars per hour and $Male$ is a binary variable equal to 1 if a person is male and 0 if female. Define the wage gender gap as the difference in mean earnings between men and women.
1. What is the estimated gender gap?
2. Is the estimated gender gap significantly different from zero?
3. Construct a 95% confidence interval for the gender gap
4. In the sample, what is mean wage of women? Of men?
5. Another researcher uses these data, but regresses $Wage$ on $Female$, a variable equal to 1 if the person is female and 0 if the person is male. What are the regression estimates from this regression? (Include the coefficients, $R^2$, and $SER$.)
  
  $$\begin{aligned} \widehat{Wage}=&___ + ___ ( Female)\end{aligned}$$
  
  $R^2 = ___$; $SER = ___$


Version: Spring 2017
EC200 Econometrics and Applications

Problem Set 4

Stock and Watson 6.6
Stock and Watson 7.4
Stock and Watson 7.8 (skip part c)
Stock and Watson, Additional Empirical Exercise 6.1
Stock and Watson, Additional Empirical Exercise 5.3
Suppose that average worker productivity at manufacturing firms ($avgprod$) depends on two factors: average hours of training ($avgtrain$) and average worker ability ($avgabil$)

$$avgprod = \beta_0 + \beta_1 avgtrain + \beta_2 avgabil + u$$

Assume this equation satisfies the Gauss-Markov assumptions. If grants have been given to firms whose workers have less than average ability, so that $avgtrain$ and $avgabil$ are negatively correlated, what is the likely bias on $\widetilde{\beta_1}$ obtained from the simple regression of $avgprod$ on $avgtrain$?
Finish and submit Lab 4.


Version: Spring 2018
EC200 Econometrics and Applications

Problem Set 4

Suppose that $(Y_i,X_i)$ satisfy the three key least squares assumptions and, in addition, $u_i$ is $N(0,\sigma^2_u)$ and is independent of $X_i$. A sample of size $n = 30$ yields

$$\begin{aligned} \widehat{Y} = {} & 43.2 & {}+{} & 61.5\, X \\ & (10.02) & & (7.4) \end{aligned}$$

$R^2 = 0.54$; $SER = 1.52$

where the numbers in parentheses are the homoskedastic-only standard errors for the regression coefficients.
1. Construct a 95% confidence interval for $\beta_0$.
2. Construct a 90% confidence interval for $\beta_1$.
3. Test $H_0: \beta_1=55$ against $H_1: \beta_1 \neq 55$ at the 5% level.
4. Test $H_0: \beta_1=55$ against $H_1: \beta_1 > 55$ at the 5% level.
5. Explain briefly why the test of $H_0: \beta_1=55$ against $H_1: \beta_1 < 55$ is trivial. You can use a picture if it helps make things clearer.
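
Because the errors are assumed normal and homoskedastic, the $t$-statistics in this exercise follow a Student $t$ distribution with $n - 2 = 28$ degrees of freedom, so critical values and p-values can be read from Stata instead of a textbook table. The lines below are a sketch only: the scalar `tstat` is a hypothetical placeholder, not the answer to any part.

```stata
* Critical value for a two-sided 5% test (or a 95% interval), 28 df
display invttail(28, 0.025)

* Critical value for a two-sided 10% test (or a 90% interval),
* which is also the one-sided 5% critical value
display invttail(28, 0.05)

* Two-sided p-value for a t-statistic stored in a scalar
scalar tstat = 1.5    // placeholder value; replace with your own statistic
display 2*ttail(28, abs(tstat))
```

`invttail(df, p)` returns the value with upper-tail probability `p`, and `ttail(df, t)` returns the upper-tail probability at `t`.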
In the 1980s, Tennessee conducted an experiment in which kindergarten students were randomly assigned to “regular” and “small” classes and given standardized tests at the end of the year. (Regular classes contained approximately 24 students, and small classes contained approximately 15 students.)
Suppose that, in the population, the standardized tests have a mean score of 925 points and a standard deviation of 75 points. Let $SmallClass$ be a binary variable equal to 1 if the student is assigned to a small class and equal to 0 otherwise. A regression of $TestScore$ on $SmallClass$ yields

$$\begin{aligned} \widehat{TestScore} = {} & 918.0 & {}+{} & 13.9\, SmallClass \\ & (1.6) & & (2.5) \end{aligned}$$

$R^2 = 0.01$; $SER = 74.6$

where the numbers in parentheses are the standard errors for the regression coefficients.
1. Do small classes improve test scores? By how much? Is the effect large? Explain.
2. Is the estimated effect of class size on test scores statistically significant? Carry out a test at the 5% level.
3. Do you think that the regression errors are plausibly homoskedastic? Explain.
4. $SE(\widehat{\beta_1})$ was computed using the initial formula for standard errors (based on equations 5.3 and 5.4 in Stock and Watson). Would having heteroskedastic errors and using this formula affect the validity of your hypothesis tests? What if the errors are actually homoskedastic? Explain.
Visit the Stock and Watson webpage (here: http://wps.aw.com/aw_stock_ie_3/178/45691/11696959.cw/index.html) and click on the “Additional Empirical Exercises” tab. Complete Additional Empirical Exercise 5.3 using the data set CollegeDistance. Note that you can download this data set from the Additional Empirical Exercises page.
Finish Lab 3 - include do-file, log-file, and answers to questions.


Version: Spring 2018
EC200 Econometrics and Applications

Problem Set 7

In 1985, neither Florida nor Georgia had laws banning open alcohol containers in vehicle passenger compartments. By 1990, Florida had passed such a law, but Georgia had not.
1. Suppose you collect random samples of the driving-age population in both states, for 1985 and 1990. Let $arrest$ be a binary variable equal to one if a person was arrested for drunk driving during the year. Without controlling for any other factors, write down a linear probability model that allows you to test whether the open container law reduced the probability of being arrested for drunk driving. Which coefficient measures the effect of the law?
2. Why might you want to control for other factors in the model? What might some of these factors be?
3. Now, suppose that you can only collect data for 1985 and for 1990 at the county level for the two states. The dependent variable would be the fraction of licensed drivers arrested for drunk driving during the year. How does this data structure differ from the individual-level data described in part 1? What econometric method would you use?
For this exercise, use JTRAIN.dta to determine the effect of a job training grant on hours of job training per employee. The basic model for the three years is the following:

$$\begin{split} hrsemp_{it} &= \beta_0 + \delta_1 d88_t + \delta_2 d89_t + {} \\ &\quad \beta_1 grant_{it} + \beta_2 grant_{i,t-1} + \beta_3 \log(employ_{it}) + a_i + u_{it} \end{split}$$
1. Estimate the equation using first differencing. How many firms are used in the estimation? How many total observations would be used if each firm had data on all variables (in particular, $hrsemp$) for all three time periods?
2. Interpret the coefficient on $grant$, and comment on its significance.
3. Is it surprising that $grant_{-1}$ is insignificant? Explain.
4. Do larger firms train their employees more or less, on average? How big are the differences in training?
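
A possible starting point for the do-file is sketched below. The variable names (`fcode`, `year`, `hrsemp`, `grant`, `grant_1`, `lemploy`, `d89`) are assumptions based on the usual Wooldridge version of JTRAIN.dta, so confirm them with `describe` before running the rest.

```stata
use JTRAIN.dta, clear

* Check that the assumed variable names exist in your copy of the data
describe fcode year hrsemp grant grant_1 lemploy d88 d89

* Declare the panel so the D. (first-difference) operator can be used
xtset fcode year

* First-differenced equation: a_i drops out. A constant plus the d89
* dummy is an equivalent reparameterization of the differenced year effects.
regress D.hrsemp d89 D.grant D.grant_1 D.lemploy

* Count the distinct firms that contribute to the estimation sample
egen byte firmtag = tag(fcode) if e(sample)
count if firmtag == 1
```

The observation count in the `regress` output, together with the firm count, is what part 1 asks you to compare against the total you would have with complete data.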
Use CRIME4.dta for this exercise, and see the scanned upload for Example 13.9.
1. Replicate the results in Example 13.9.
2. Re-estimate the unobserved effects model for crime in Example 13.9, but use fixed effects rather than differencing. Are there any notable sign or magnitude changes in the coefficients? What about statistical significance?
3. Add the logs of each wage variable in the data set and estimate the model by fixed effects. How does including these variables affect the coefficients on the criminal justice variables in part 2?
4. Do the wage variables in part 3 have the expected sign? Are they jointly significant?
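
The fixed effects parts of this exercise could be set up roughly as follows. This is a sketch under assumptions: the variable names (`county`, `year`, `lcrmrte`, `lprbarr`, `lprbconv`, `lprbpris`, `lavgsen`, `lpolpc`, the year dummies, and the log wage variables in the local macro `wages`) are taken from the usual Wooldridge version of CRIME4.dta and may differ in your copy, so check them with `describe` first.

```stata
use CRIME4.dta, clear
describe

* Declare the county-year panel
xtset county year

* Fixed effects (within) estimator with year dummies
xtreg lcrmrte d83 d84 d85 d86 d87 ///
    lprbarr lprbconv lprbpris lavgsen lpolpc, fe

* Add the log wage variables (names assumed; edit to match describe output)
local wages lwcon lwtuc lwtrd lwfir lwser lwmfg lwfed lwsta lwloc
xtreg lcrmrte d83 d84 d85 d86 d87 ///
    lprbarr lprbconv lprbpris lavgsen lpolpc `wages', fe

* Joint significance test of the wage variables
test `wages'
```

Because `wages` is a local macro, run these lines together from the do-file rather than one at a time from the command window.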
Finish and submit Lab 6.


Version: Spring 2018
EC200 Econometrics and Applications

Problem Set 8

9.3, 9.5, 9.6, 9.10 (odd-numbered answers are online, but think through them carefully!)
Additional Empirical Exercises 9.1 and 10.1 - make sure you include a do-file and log-file that reflect your analysis!
