Final Research Project

Overview

The course is structured around working with a dataset of your choice. We will begin with setting up a repo for your data and R scripts, then move to tidying the data, then to analyzing and plotting the data, and finally to interpreting the results based on those data. In other words, having an appropriate dataset as early as possible is essential. You will get the most out of this class if you work with your own data for a project you will continue to be invested in after the end of the quarter, like a BA or MA thesis, grant proposal, or manuscript intended for publication. If you are not currently working on this type of project, you should consider taking this class next year (or whenever you do have data). Whether or not you use your own data, you are responsible for providing your dataset.

Undergraduates must receive instructor approval of an existing dataset before receiving consent to enroll.

Dataset

What constitutes an appropriate, ready dataset is hard to comprehensively define. But there are a few core features that the dataset should have for the purposes of this course (and for quantitative/automated analysis and open, replicable science in general).

Complete (at least a complete subpart)

In order to think through the steps from raw data to analysis-integrated manuscript, you need a dataset that covers all aspects of your analysis. That is easy enough if you have completed all data collection and preparation for a study. But what if data collection is ongoing, or even just planned? In that case, you should plan to have a mini version of your complete dataset, whether that is a complete subset (e.g., pilot data prepared with all the features of the real data to come) or a simulated version of your dataset (i.e., to be substituted later with the real data).

Let’s say, for example, you are planning to run an experiment with the hypothesis that you will see a difference between children of different ages in two conditions. You will at least need (a) real or simulated behavioral/response data for a few participants in each condition and (b) metadata about the participants (e.g., their ages) and experimental sessions (e.g., the condition they were in, the order of the items, etc.). For many people that information might be spread across multiple different files, e.g., one output file from the experiment software for each participant plus a spreadsheet with participant metadata and a spreadsheet with experiment session metadata. For some people that information might already be integrated into one document. Either way is fine, but we need some complete picture of the intended dataset, even if it’s a mini one.
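
If your data collection is still planned, a simulated stand-in can be built by hand in a few lines of R. Here is a minimal sketch of what that might look like; the column names and values are hypothetical placeholders and should mirror the structure of the real data you expect:

    # Load the tidyverse (install.packages("tidyverse") if needed)
    library(tidyverse)

    # Make the simulated values reproducible
    set.seed(123)

    # Participant metadata: one row per participant
    participants <- tibble(
      participant_id = c("P01", "P02", "P03", "P04"),
      age_months     = c(36, 42, 48, 54),
      condition      = c("A", "B", "A", "B")
    )

    # Simulated responses: one row per participant per trial
    responses <- tibble(
      participant_id = rep(participants$participant_id, each = 5),
      trial          = rep(1:5, times = 4),
      rt_ms          = round(rnorm(20, mean = 800, sd = 150))
    )

    # Join metadata and responses into one complete (mini) dataset
    mini_data <- left_join(responses, participants, by = "participant_id")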

Consistent in labeling and structure

Computers are wonderful, but stupid. If you have a data file for one participant with the field/column header “Name”, another called “name”, and another called “ name”, the computer will treat them as being as different as “apples”, “oranges”, and “bananas”. If you have any inconsistencies in how you have labeled or structured your data, you will surely come across some of them in the process of cleaning up your data for analysis. The scarier part is that you probably won’t come across all of them, leaving dreaded “silent” errors in your analysis. So, early on, you should brainstorm where these sources of inconsistency could arise in your dataset, list them out, and do as much double- and triple-checking as needed to satisfy yourself that you have, to the best of your abilities, eliminated these issues.

This kind of cleaning will be covered in the first few weeks of the course, but you should ensure that your dataset is as clean as possible before the course begins. You don’t want to discover in Week 4 that you actually have a bunch of missing data because the experiment software sometimes recorded the participant ID as “P01” and sometimes as “P1”, or that the “Name” column you thought was the same in three different spreadsheets actually refers to participant names in one, experimenter names in another, and some kind of weird unidentifiable names in the third.
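
A few quick checks in R can surface many of these inconsistencies before they turn into silent errors. A sketch, assuming your data are already loaded into a data frame called my_data and have a participant ID column (referred to here as participant_id); the janitor package is one common option for standardizing headers:

    library(tidyverse)
    library(janitor)  # install.packages("janitor") if needed

    # Standardize headers to lowercase snake_case, so "Name", "name",
    # and " name" read in from different files all become "name"
    my_data <- clean_names(my_data)

    # List every distinct participant ID to spot "P01" vs. "P1" problems
    my_data %>% count(participant_id) %>% print(n = Inf)

    # Count missing values in each column
    colSums(is.na(my_data))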

Plain text

R works with plain text. Though it may not seem like it to us human users, documents like Word and Excel files contain a lot of extra information beyond the text content (e.g., cell and border colors, conditional formatting, “smart” quotes, etc.). While there are R packages built specifically to read and write common proprietary formats (like Excel), R will still convert the contents to a plain text representation and work with them in that way. For that reason, you should consider whether your dataset can be easily converted to a plain text format. If you aren’t sure, check with Dr. Dowling. We may be able to program a custom solution for you. If not, though, your dataset may be unsuitable for direct analysis in R.

Most of the programs you are likely to be currently using to collect and store your data (e.g., Excel, SPSS, Qualtrics, etc.) have an option to export your data as a plain text file (usually a .csv). You should use this option to export your data before the course begins. If you can’t figure out how to do that, your data may not be appropriate for this course.
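
Once you have a plain text export, reading it into R is usually a single line. A sketch, assuming a hypothetical file at data/participants.csv (the readxl package is one option for reading an Excel workbook directly if an export really isn’t possible):

    library(readr)   # part of the tidyverse
    library(readxl)  # install.packages("readxl") if needed

    # Preferred: read a plain text (.csv) export
    participants <- read_csv("data/participants.csv")

    # Fallback: read directly from an Excel workbook
    participants_xl <- read_excel("data/participants.xlsx", sheet = 1)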

Tabular

Your data should be organized into a table (i.e., with rows and columns). The canonical format in the tidyverse is that every column is a variable and every row is an observation, but you might have good reasons for organizing your data the other way around or in some other fashion. As long as your data are convertible to a tabular format, you’ll be fine!

In practice this usually means starting from a spreadsheet (like Excel or Google Sheets) and converting to a plain text tabular format (like .csv or .tsv).
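
If your spreadsheet happens to be organized the other way around (say, one column per participant), the tidyr package can reshape it once it is read in. A small sketch with hypothetical column names:

    library(tidyverse)

    # A wide table: one row per item, one column per participant
    wide_data <- tibble(
      item = c("item1", "item2"),
      P01  = c(450, 520),
      P02  = c(610, 580)
    )

    # Pivot to the canonical tidy shape: one row per observation
    long_data <- wide_data %>%
      pivot_longer(cols = -item,
                   names_to = "participant_id",
                   values_to = "rt_ms")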

Structured around a research question

In principle, you could just download and play around with any dataset that meets the above criteria. But in order for this course to be useful to you in learning about the process of generating a scientific report, you should choose a dataset for which (a) you have a motivated research question and (b) the contents are structured in such a way that you are able to conduct a study to address your research question.

As stated above, the students who get the most out of the course are those who enter with an active research project. On the first day of class you should be able to clearly articulate at least 2 interesting research questions that are functionally answerable with your dataset.

Anonymized/anonymizable

In this course we’ll be learning how to develop a replicable scientific report via GitHub. Because GitHub records everything you ever commit in its history, you want to be 100% certain that you only ever commit anonymized data. For some students, anonymizing the data is as simple as making sure that the participant metadata is anonymous (e.g., using random strings of letters for participant IDs with a set of log files from experiment software). For other students, it might take a bit more effort to anonymize the data (e.g., comprehensively scanning and inserting pseudonyms where necessary or selectively removing parts of the files used). If you’re unsure how to go about this process, try to think through some options and then ask Dr. Dowling to brainstorm with you about what would be practical.
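
For the simpler cases, the anonymization itself can be scripted so that the mapping from real names to pseudonymous IDs never enters the repository. A sketch with hypothetical file and column names; note that the raw file and the key should live outside the GitHub repo (or at least be listed in .gitignore):

    library(tidyverse)

    # Raw data with identifying names, stored outside the repo
    raw <- read_csv("private/raw_with_names.csv")

    # Build a lookup table mapping each real name to an arbitrary ID
    key <- raw %>%
      distinct(name) %>%
      mutate(participant_id = sprintf("P%02d", row_number()))

    # Keep the key private; write a de-identified version for the repo
    write_csv(key, "private/id_key.csv")
    raw %>%
      left_join(key, by = "name") %>%
      select(-name) %>%
      write_csv("data/responses_anonymized.csv")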

If you’re working on a thesis or otherwise collaborating with an advisor/lab, you should also check with them about what level of anonymization is necessary.

Assessment

The final research project is worth up to 40 points of your final grade. Up to 30 points are earned by demonstrating the unique learning objectives (one point each), and up to 10 points are earned through engagement. Drafts can be submitted up to 4 times throughout the quarter, and your grade on the project is that of your final submission.

Learning objectives

You can earn up to 30 objective points on the final research project, one for each unique objective you demonstrate. The final draft of the project should demonstrate all 30 assessed standards.

The data-driven research project will be an APA7 scientific report created in RStudio with Quarto and developed via a GitHub repository.

Your priority should be demonstrating mastery of the course objectives rather than meeting precise component requirements. Specifics will vary for each student, but in general these projects will include:

  1. Organized GitHub repository with README and .gitignore
  2. Dataset(s) and R scripts for data read-in, pre-processing, and analysis
  3. Reproducible Quarto document (.qmd file) using the apaquarto template (see the sketch after this list):
    1. Uses YAML header, markdown syntax, and R code chunks
    2. Presents data with tables and figures
    3. Conducts and reports multiple kinds of simple statistical analyses
    4. Includes in-text citations and a reference list in APA7 format with BibTeX
    5. Contains at least 1500 words of narrative text in at least four subsections (Introduction, Methods, Results, Discussion)
    6. Dynamically references variables and includes inline R code
  4. Final PDF output of the Quarto document, following APA7 formatting
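
To make the Quarto piece more concrete, here is a minimal sketch of what such a .qmd file might look like. The YAML fields, citation key, and file names are placeholders, and the exact format name (e.g., apaquarto-pdf) may differ depending on your version of the apaquarto extension:

    ---
    title: "My Final Project"
    author: "Student Name"
    bibliography: references.bib
    format: apaquarto-pdf
    ---

    # Introduction

    Prior work suggests an effect of age on response time [@smith2020].

    # Results

    ```{r}
    #| label: fig-rt
    #| fig-cap: "Response time by condition."
    library(tidyverse)
    mini_data <- read_csv("data/responses_anonymized.csv")
    ggplot(mini_data, aes(condition, rt_ms)) + geom_boxplot()
    ```

    The mean response time was `r round(mean(mini_data$rt_ms))` ms.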

You can view the full list of objectives and recommendations for how to meet them in the final project assessment document.

Engagement

You can earn up to 10 engagement points on the final project. Broadly, these points acknowledge the effort you put into the project, the creativity you demonstrate, and your investment in the work beyond the class. This can include things like:

  • Creating many figures and tables, or particularly complex or creative ones
  • Impressively thoughtful and thorough narrative writing in your literature review or discussion section
  • Employing sophisticated statistical techniques in your analysis
  • Making excellent use of markdown features to create a polished final product
  • Having a maximally reproducible and dynamic manuscript
  • Fully committing to best practices for version control and GitHub integration/organization

FAQ

Visit the FAQ page for answers to common questions about the final project and other course expectations.