Notes: You can't do data science in a GUI

Challenge of DS: Taking a vague question and making it precise enough that you can answer and use it qualitatively

Import data -> make it tidy -> Job as DS is to understand the data

Understand is a loop of:

  1. Transforming: Create new variable & summaries
  2. Visualizing: Surprises but doesn’t scale
    1. You can start with a vague question and the viz will help you
    2. You can see something you don’t expect
    3. Don’t scale because there is a human in the loop
  3. Model: Scales but doesn’t fundamentally surprises
    1. Makes assumptions that cannot break, so it cannot surprise you

After understanding you want to:

  1. Communicate - with a supervisor or whatever
  2. Automate - how to deploy

Why program if i want to do DS?

Two extremes:

  1. Excel: you can see what you can do but you are constrained
  2. You don’t see whats possible (blink cursor) but you are free to

Programming languages are languages - you can express your thoughts in it It can be hard to express thoughts (code/text makes it easier)

Coding is just text:

As data changes code just need to run again and everything feels into place.

In a GUI: You live in fear of clicking the wrong thing and making it permanent since its hard to roll back

Why use R

R is a vector language

Missing values included

Data Frame (table) included

Functional programming


Allows you to create DSLs that allows to express thing in different ways for different tasks (e.g. plotting and cleaning data)

No matter how complex and polished the individual operation are, it is often the quality of the glue that most directly determines the power of the system - Hal Abelson


Some things are hard to express in code (or maybe just a painful): like changing variables names UI plugins that are tied into code (or produce code) are useful for this can of stuff

Autocomplete is key


  1. Huge advantages to code
  2. R provides great environment - Doesn’t mean Python sucks! :)
  3. DSLs help express your thoughts
  4. Code should be the primary artifact (but might be generated other than typing)