Slides for Stockholm

October 30, 2015 Cathy O'Neil, mathbabe

I’ve been busy preparing the data science tutorial I’m giving next week in Stockholm, and I thought I’d share my prezi slides with you. Almost everything in these slide decks is stolen from the web, and the more I worked on my presentation the more I realized how much of a tool the web itself has become for learning and explaining things.

The tutorial will be divided up into three parts. The first part I call “Data,” and it takes 2.5 hours. In that time I introduce the kind of data used in various fields of data science, how to get the data, how to store it, and how to do basic exploratory data analysis, cleaning, and basic statistics. Here’s the slide deck.

The second part is called “Models,” also 2.5 hours, and during that section I discuss the modeling process, including defining success, finding proxies, understanding information, choosing algorithms, understanding results through visualization, the problem of overfitting, and how to avoid it. The slide deck for Models is here.

In the final part, which is 1.5 hours, I am calling my presentation Product, and it addresses the various ways data science projects are published, whether through production code in higher level languages, or academic journals, or data journalism. Here I address end-product visualizations, keeping models updated with new data, building in feedback loops, and documentation. I’m not quite done with this one but close enough. That slide deck is here.

Tell me if you think I’m missing something!

Comments (14)

David Chauvin

October 30, 2015 at 9:15 am

Very aesthetically pleasing to the eye. This is outside my field, so I cannot comment on the content. I am amazed at how much you get done in one day.

Perhaps your next blog post should cover your advice on productivity and personal time management.

Keep up your good work!

- Cathy O'Neil, mathbabe
  
  October 30, 2015 at 9:16 am
  
  No, no! I have been working on this for 3 weeks. But yes, that would have been amazing. 🙂
  
Zathras

October 30, 2015 at 9:21 am

I like the slides a lot. The one place where I see a gap is in understanding the data. The slides do a good job of explaining how data can be understood by letting the data talk, but there is a lot to be understood on the “soft skills” side. The kinds of things that are important here are understanding (1) how the data was created; (2) how the data was modified before I got my hands on it; and (3) how the data is used. This is particularly important in the “business data” bucket, where there are typically a lot of surprises in each of these categories. In these cases, the data is silent; instead you have to talk to people who know the data to answer these questions.

- Cathy O'Neil, mathbabe
  
  October 30, 2015 at 9:24 am
  
  Great point. What the slides don’t show is a series of examples I’m working out in IPython notebooks, where we’ll be able to see this kind of thing.
  
  - Zathras
    
    October 30, 2015 at 2:46 pm
    
    I’d be very interested in seeing your examples for this. I have found these issues very difficult to communicate to others who do not have the experience with them.
    
    - Guest2
      
      October 31, 2015 at 11:22 pm
      
      I agree with the points, but want to broaden it a bit.
      It shouldn’t be hard — data is a construct, first and foremost. It is the product of social interaction at various levels, organizational and institutional environments, and it always has multiple contexts. Science studies, and history of statistics, take this into account.
      
      Just remember that all data has a context, and this helps to minimize what Donna Haraway calls the “god trick” that statistical realists seem to be trapped in.
      https://en.wikipedia.org/wiki/Donna_Haraway
      
amc

October 30, 2015 at 10:36 am

Agree with excellent visualization and organization. The flow between sessions carries through very well (from reading the slides).

Three comments for your consideration:

1) in Data/Data Storage: From the bullets, it looks like MR, Spark + Pig are for data storage, not frameworks to access data. I realize this is just from reading, not hearing the talk, so it might be a moot point.

2) in Data: sometimes the bubbles in the background blend with the words on my screen. This might be different on a projection or other display.

3) in Models/Decision Trees: “sophistated”. Didn’t know if that was a term I wasn’t familiar with or if it was a typo for sophisticated.

Best of luck!

ax42

October 30, 2015 at 11:35 am

Is there a way to see the slides if you don’t have Flash installed on your computer?

- Cathy O'Neil, mathbabe
  
  October 30, 2015 at 11:48 am
  
  No: https://prezi.com/support/article/troubleshooting/system-requirements-for-prezi/?lang=en
  
  - Cathy O'Neil, mathbabe
    
    October 30, 2015 at 11:50 am
    
    Thanks so much, I’ll incorporate those excellent comments!
    
- An old geezer engineer named Dan
  
  November 4, 2015 at 5:57 am
  
  Use Google’s Chrome browser. The Flash viewer built into Chrome is far safer than Adobe’s and just worked reasonably well on my Mac to look at them.
  
  And a slightly snarky comment to our good Mathbabe: Ordinary trees are just green and brown blobs??? A guy whose concepts were quite important in my own past work – the late Benoit Mandelbrot – might have disagreed with you. Then there are Banyan trees…
  
Olof F

November 5, 2015 at 3:23 pm

And the slides worked well during the tutorial, despite the screen saver 🙂
Thank You for an interesting day.

Joshua

November 17, 2015 at 5:56 am

These are really great presentations. A couple of questions/observations:
* What is the relationship between the data adjectives “clean” and “noisy”? I suspect I’ve often used these as antonyms.
* Love the explicit recognition of “non-existent data.” Just because we don’t have something doesn’t mean it isn’t (conceptually) important.

- Cathy O'Neil, mathbabe
  
  November 17, 2015 at 6:36 am
  
  In the context of finance, clean means you can trust the data, for the most part, as opposed to dirty data, where you have to watch out.
  
  Noisy means the signal is faint to non-existent. Depending on what signal you’re looking for, of course.
  
  -C
  