10 Basic Rules for Data Analysis

Last updated on March 8, 2022

  1. Always save a clean version of your data that you do not touch (especially if accessing your data is difficult). Never save over your only data file with recoded variables.
  2. Always use syntax, so you have a record of what you are doing
  3. Keep your syntax nicely organized
    • Use notes liberally in your syntax; in SPSS and Stata, this means starting a line with *
    • Organize syntax into sections, e.g.
      • recodes for the dependent variable(s); then
      • recodes for main independent variables; then
      • other control variables; then
      • descriptive statistics; then
      • t-tests (or chi-square or ANOVA or whatever you’re using to test differences between groups); then
      • main analyses; then
      • sensitivity tests
  4. Always recode your variables (even if you are not transforming them), for the following reasons:
    • you get a better feel for the data: how they are coded, where values are missing, and the shape of the distributions
    • new variables tend to end up at the end of the list of variables, which makes them easier to find if you’re looking at your list of variables
    • you can give them names that are more intuitively memorable than QA_E84C or whatever they start out as
      • When making dummies, DO NOT name them after the original variable; name them after the category you have assigned a value of 1 to. For example, if you turn gender into a dummy, don’t call it “dumgender” if 1=male and 0=female; call it “dummale” instead
  5. Always label variables and values with meaningful words that make sense
  6. Always check your recodes (usually with cross-tabs)
  7. Always check your missing data
    • How many cases are missing?
      • Less than 10% missing: listwise deletion is probably okay
      • More than 10% missing: perhaps explore multiple imputation or other variables
    • Are they missing at random or is there some kind of pattern to the missing cases?
  8. If a result seems contrary to the literature, question your result before you question the literature.
    • Did you make an error in the coding?
    • Are you interpreting the coding correctly (do high values actually indicate high levels of what you are measuring)? For example, if a scale from 1-5 runs from strongly agree (1) to strongly disagree (5), higher scores mean stronger disagreement. Make sure variables are coded in the direction in which you are interpreting them.
  9. After recoding all of your variables, save the syntax without saving the data file. Then reopen a clean data file and run your syntax file again to make sure that you haven’t accidentally recoded something during your testing stage that got saved incorrectly.
  10. Depending on the number of variables and how big your data set is, you might want to have one syntax file for recoding and then a separate syntax file for analysis. This way you can delete all the variables you’re not using and start the analysis with a fresh, fully recoded data set, without having to run the recode syntax every time.
    • Pros: Quicker data runs and easier to find the variables you need
    • Cons: If you realize later that there’s a variable you need to include, you have to go through everything again.
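
To make rules 4, 6 and 7 concrete, here is a minimal sketch in Python rather than SPSS or Stata syntax. Everything in it is made up for illustration: the records, the item name Q7, the assumed coding of QA_E84C (1 = male, 2 = female), and the helper functions. It recodes into new variables without touching the originals, cross-tabs a recode against its source, and reports how much data is missing.

```python
from collections import Counter

# Hypothetical raw records; None marks a missing answer.
raw = [
    {"QA_E84C": 1, "Q7": 1},
    {"QA_E84C": 2, "Q7": 5},
    {"QA_E84C": 1, "Q7": None},
    {"QA_E84C": None, "Q7": 2},
    {"QA_E84C": 2, "Q7": 4},
]


def recode(row):
    """Return a copy of the row with new variables appended; originals stay as they are."""
    out = dict(row)
    gender = row["QA_E84C"]
    # Name the dummy after the category coded 1 (assumed coding: 1 = male, 2 = female).
    out["dummale"] = None if gender is None else int(gender == 1)
    # Assumed item: 1 = strongly agree ... 5 = strongly disagree.
    # Reverse it so that higher values mean stronger agreement.
    q7 = row["Q7"]
    out["agree"] = None if q7 is None else 6 - q7
    return out


def crosstab(rows, old, new):
    """Count every (old value, new value) pair to check a recode."""
    return Counter((r[old], r[new]) for r in rows)


def share_missing(rows, var):
    """Proportion of cases with a missing value on one variable."""
    return sum(r[var] is None for r in rows) / len(rows)


def listwise_share_dropped(rows, variables):
    """Proportion of cases lost if any listed variable is missing."""
    return sum(any(r[v] is None for v in variables) for r in rows) / len(rows)


data = [recode(r) for r in raw]

# Rule 6: each old value should map onto exactly one new value.
check = crosstab(data, "QA_E84C", "dummale")
for (old, new), n in sorted(check.items(), key=str):
    print(old, new, n)

# Rule 7: how much is missing, per variable and listwise?
print(share_missing(data, "agree"))
print(listwise_share_dropped(data, ["dummale", "agree"]))
```

Because the recodes are written to a new list, the raw records are never overwritten, so rerunning the script from the untouched data (rule 9) always reproduces the same recoded set.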
