Opening up the data, and science, in data science

One interesting trend we've noted recently is the proliferation of papers, articles and blog posts about data science that don't just tell the result--they include data and code that allow anyone to repeat the analysis. It's far from universal (for a timely counterpoint, read this article), but we seem to be moving toward a new normal where data science conclusions are expected to be shown, not just told.

Auto-generating websites with deep learning

We've already talked about neural nets in some detail in past episodes, and in particular we've been blown away by the way that image recognition from convolutional neural nets can be fed into recurrent neural nets that generate descriptions and captions of the images. Our episode today tells a similar tale: we're talking about a blog post in which the author fed wireframes of a website design to a neural net and asked it to generate the HTML and CSS that would actually build a website matching the wireframes. If you're a programmer who thinks your job is challenging enough that you're automation-proof, guess again...

The Further Case for "The Case for Learned Index Structures"

Last week we started the story of how you could use a machine learning model in place of a data structure, and this week we wrap up with an exploration of Bloom filters and hash maps. Just like last week, when we covered B-trees, we'll walk through both the "classic" implementations of these data structures and how a machine learning model could provide the same functionality.
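
As a refresher on the classic side of the story, here's a minimal Bloom filter sketch in Python (the size and number of hash functions are arbitrary illustrative choices): it's just a bit array plus a few hash functions, so it can return false positives but never false negatives.

    import hashlib

    class BloomFilter:
        def __init__(self, size=1000, num_hashes=3):
            self.size = size
            self.num_hashes = num_hashes
            self.bits = [False] * size  # the classic structure: just a bit array

        def _indices(self, item):
            # derive num_hashes different hash values from a single item
            for i in range(self.num_hashes):
                digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
                yield int(digest, 16) % self.size

        def add(self, item):
            for idx in self._indices(item):
                self.bits[idx] = True

        def might_contain(self, item):
            # False means definitely absent; True means "probably present"
            return all(self.bits[idx] for idx in self._indices(item))

    bf = BloomFilter()
    bf.add("linear digressions")
    print(bf.might_contain("linear digressions"))  # True
    print(bf.might_contain("other podcast"))       # almost certainly False

Roughly speaking, the learned version swaps the hash functions for a model that predicts membership, with a small classic filter behind it to catch the model's false negatives.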

The Case for "The Case for Learned Index Structures"

Jeff Dean and his collaborators at Google are turning the machine learning world upside down (again) with a recent paper about how machine learning models can be used as surprisingly effective substitutes for classic data structures. In this first part of a two-part series, we'll go through a data structure called the B-tree. The structure of a B-tree makes it efficient for searching, but if you squint at one and look at it a little sideways, the search functionality starts to look like a regression model--hence the relevance of machine learning. If this sounds kind of weird, or we lost you at B-tree, don't worry--lots more details in the episode itself.
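
To make the regression analogy concrete, here's a toy sketch in Python (synthetic keys, and max_err is a made-up error bound; the actual paper tracks the model's true worst-case error): on a sorted array, a B-tree effectively maps a key to its position, a simple fitted line can approximate that mapping, and a local search mops up the model's error.

    import numpy as np

    # synthetic sorted keys; a B-tree would map each key to its position
    rng = np.random.default_rng(0)
    keys = np.sort(rng.uniform(0, 1_000_000, size=10_000))
    positions = np.arange(len(keys))

    # "squint" at the B-tree: fit position as a linear function of key
    slope, intercept = np.polyfit(keys, positions, deg=1)

    def learned_lookup(key, max_err=200):
        guess = int(slope * key + intercept)   # model's predicted position
        lo = max(0, guess - max_err)           # search window around the guess
        hi = min(len(keys), guess + max_err)
        return lo + np.searchsorted(keys[lo:hi], key)

    print(learned_lookup(keys[1234]), "vs true position", 1234)

The paper replaces the single fitted line with a hierarchy of small models, but the squint is the same.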

Challenges with Using Machine Learning to Classify Chest X-Rays

Another installment in our "machine learning might not be a silver bullet for solving medical problems" series. This week, we discuss a high-profile blog post that has been making the rounds for the last few weeks, in which a neural network trained to visually recognize various diseases in chest x-rays is called into question by a radiologist with machine learning expertise. As it seemingly always does, the trouble comes down to the dataset used for training--medical records assume a lot of context that may or may not be available to the algorithm, so it's tough to build something that actually helps predict disease that wasn't already diagnosed.

Fourier Transforms

The Fourier transform is one of the handiest tools in signal processing for dealing with periodic time series data. Using a Fourier transform, you can break apart a complex periodic function into a bunch of sine and cosine waves, and figure out what the amplitude, frequency and offset of those component waves are. It's a really handy way of re-expressing periodic data--you'll never look at a time series graph the same way again.
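
As a quick illustration (the frequencies and amplitudes here are made up), numpy's FFT recovers the components of a signal built from two sine waves:

    import numpy as np

    fs = 1000                          # sampling rate (Hz)
    t = np.arange(0, 1, 1 / fs)        # one second of samples
    # a 5 Hz wave with amplitude 1 plus a 12 Hz wave with amplitude 0.5
    signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 12 * t)

    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1 / fs)
    amplitudes = 2 * np.abs(spectrum) / len(signal)

    # the spectrum peaks at exactly 5 Hz and 12 Hz, amplitudes ~1.0 and ~0.5
    print(freqs[amplitudes > 0.1])     # [ 5. 12.]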

De-Biasing Word Embeddings

When we covered the Word2Vec algorithm for embedding words, we mentioned parenthetically that the word embeddings it produces can sometimes be less than ideal--in particular, gender bias from our society can creep into the embeddings and give results that are sexist. For example, the occupational word "doctor" ends up more closely aligned with "man," while "nurse" aligns with "woman"--a problem, because these word embeddings are used in algorithms that help people find information or make decisions. However, a group of researchers has released a new paper detailing ways to de-bias the embeddings, so that legitimate gender information (for example, "king" vs. "queen") is retained while the bias is corrected.
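
The core "neutralize" step in this line of work is simple linear algebra; here's a sketch using toy 3-dimensional vectors standing in for real embeddings (the numbers are invented for illustration):

    import numpy as np

    def neutralize(v, gender_direction):
        # remove v's component along the gender direction
        g = gender_direction / np.linalg.norm(gender_direction)
        return v - np.dot(v, g) * g

    # toy vectors, purely illustrative (real embeddings have hundreds of dims)
    he = np.array([1.0, 0.2, 0.1])
    she = np.array([-1.0, 0.2, 0.1])
    doctor = np.array([0.6, 0.8, 0.3])   # pretend this leans toward "he"

    g = he - she                          # an estimate of the gender direction
    doctor_fixed = neutralize(doctor, g)
    print(np.dot(doctor_fixed, g / np.linalg.norm(g)))  # ~0: gender component gone

Legitimately gendered pairs like "king"/"queen" are left alone (or equalized) so that meaningful gender information survives the de-biasing.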

Maximal Margin Classifiers

Maximal margin classifiers are a way of thinking about supervised learning entirely in terms of the decision boundary between two classes, and defining that boundary in a way that maximizes the distance from the closest points to the boundary. It's a neat way to think about statistical learning and a prerequisite for understanding support vector machines, which we'll cover next week--stay tuned!
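
For a concrete picture, here's a sketch using scikit-learn on toy data: a linear SVM with a very large penalty parameter behaves like a hard maximal margin classifier, and the margin width falls out of the learned weights.

    import numpy as np
    from sklearn.svm import SVC

    # two linearly separable toy blobs
    X = np.array([[1, 1], [2, 1], [1, 2],
                  [5, 5], [6, 5], [5, 6]])
    y = np.array([0, 0, 0, 1, 1, 1])

    # a huge C approximates the hard (maximal) margin classifier
    clf = SVC(kernel="linear", C=1e6).fit(X, y)

    w = clf.coef_[0]
    # distance between the two margin hyperplanes is 2 / ||w||
    print("margin width:", 2 / np.linalg.norm(w))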

DBSCAN

DBSCAN is a density-based clustering algorithm for doing unsupervised learning. It's pretty nifty: with just two parameters (a neighborhood radius and a minimum number of points), you can specify what counts as a "dense" region in your data and grow those regions out organically to find clusters. In particular, it can fit irregularly-shaped clusters, and it can also identify outlier points that don't belong to any of the clusters. Pretty cool!
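
Here's a quick sketch with scikit-learn (the parameter values are picked by eye for this particular toy dataset):

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons

    # two crescent-shaped clusters that centroid-based methods handle badly
    X, _ = make_moons(n_samples=300, noise=0.08, random_state=0)

    # eps: neighborhood radius; min_samples: points needed to call a region dense
    labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

    print("clusters found:", len(set(labels) - {-1}))
    print("outliers (labeled -1):", int(np.sum(labels == -1)))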

Kaggle's "State of Data Science" Survey

Want to know what's going on in data science these days? There's no better way than to analyze a survey with over 16,000 responses that was recently released by Kaggle. Kaggle asked practicing and aspiring data scientists about themselves, their tools, how they find jobs, what they find challenging about their jobs, and many other questions. Then Kaggle released an interactive summary of the data, as well as the anonymized dataset itself, to help data scientists understand the trends in the data. In this episode, we'll go through some of the survey toplines that we found most interesting and counterintuitive.

Machine Learning Technical Debt

This week, we've got a fun paper by our friends at Google about the hidden costs of maintaining machine learning workflows. If you've worked in software before, you're probably familiar with the idea of technical debt: the inefficiencies that crop up in code when you're trying to go fast. You take shortcuts, hard-code variable values, skimp on the documentation, and generally write not-that-great code in order to get something done quickly, and then end up paying for it later on. That's technical debt, and it's particularly easy to accrue in machine learning workflows--which is the premise of this episode's paper.

Improving Upon a First-Draft Data Science Analysis

There are a lot of good resources out there for getting started with data science and machine learning, where you can walk through starting with a dataset and ending up with a model and a set of predictions. Think something like the homework for your favorite machine learning class, or your most recent online machine learning competition. However, if you've ever tried to maintain a machine learning workflow (as opposed to building one from scratch), you know that taking a simple modeling script and turning it into clean, well-structured and maintainable software is way harder than most people give it credit for. And if you're a professional data scientist (or want to be one), it's one of the most important skills you can develop.

In this episode, we'll walk through a workshop Katie is giving at the Open Data Science Conference in San Francisco in November 2017, which covers building a machine learning workflow that's more maintainable than a simple script.  If you'll be at ODSC, come say hi, and if you're not, here's a sneak preview!
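
As one small, hypothetical example of the kind of refactoring this is about (a generic sketch, not Katie's actual workshop material): moving from a pile of one-off preprocessing steps to a single pipeline object that can be tested and reused.

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # preprocessing and modeling travel together, so the exact same steps
    # are guaranteed to run at training time and at prediction time
    model = Pipeline([
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    model.fit(X_train, y_train)
    print("held-out accuracy:", model.score(X_test, y_test))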

Survey Raking

It's quite common for survey respondents not to be representative of the larger population from which they're drawn. But if you're a researcher, you need to study the larger population using data from your survey respondents, so what should you do? Reweighting the survey data, so that things like demographic distributions look similar between the survey and the general population, is a standard technique. In this episode we'll talk about survey raking, a way to calculate survey weights when there are several distributions of interest that all need to be matched: the weights are adjusted to match one marginal distribution at a time, cycling through the margins repeatedly until they all line up.
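
Here's a bare-bones sketch of that iteration on a made-up toy survey (respondents, variables, and target shares all invented for illustration):

    import numpy as np

    # toy survey: each respondent has a gender and an age group
    gender = np.array(["m", "m", "m", "f", "f", "f", "f", "f"])
    age    = np.array(["young", "old", "old", "young", "young", "old", "old", "old"])

    # known population margins the weighted survey should match
    targets = {"gender": {"m": 0.49, "f": 0.51},
               "age":    {"young": 0.40, "old": 0.60}}
    columns = {"gender": gender, "age": age}

    weights = np.ones(len(gender))
    for _ in range(50):  # iterate until all margins converge
        for var, shares in targets.items():
            col, total = columns[var], weights.sum()
            # rescale each category so its weighted share hits the target
            factor = {lvl: s * total / weights[col == lvl].sum()
                      for lvl, s in shares.items()}
            weights *= np.array([factor[v] for v in col])

    # the weighted margins now match the targets (to numerical precision)
    for var, shares in targets.items():
        for lvl in shares:
            share = weights[columns[var] == lvl].sum() / weights.sum()
            print(var, lvl, round(share, 3))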

Re-release: Kalman Runners

In honor of the Chicago Marathon this weekend (and due in large part to Katie recovering from running it...), we have a re-release of an episode about Kalman filters, which is part algorithm, part elaborate metaphor for figuring out how fast you're going when you're running a race but don't have a watch.
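
For the code-inclined, here's a bare-bones sketch of the idea (the noise levels and the runner's true pace are made up): a constant-velocity Kalman filter that only ever sees noisy position readings, like mile markers, and infers speed from them.

    import numpy as np

    dt = 1.0                      # minutes between position readings
    F = np.array([[1.0, dt],      # state transition: position += speed * dt
                  [0.0, 1.0]])
    H = np.array([[1.0, 0.0]])    # we observe position only, never speed
    Q = np.diag([1e-4, 1e-4])     # process noise (assumed small)
    R = np.array([[0.01]])        # measurement noise variance (assumed)

    x = np.array([0.0, 0.0])      # initial guess: at the start, standing still
    P = np.eye(2)                 # initial uncertainty

    true_pace = 0.11              # miles per minute (roughly a 4-hour marathon)
    rng = np.random.default_rng(0)
    for t in range(1, 40):
        z = true_pace * t * dt + rng.normal(0, 0.1)  # noisy position reading
        x = F @ x                                    # predict one step forward
        P = F @ P @ F.T + Q
        y = z - H @ x                                # measurement residual
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)               # Kalman gain
        x = x + (K @ y).ravel()                      # correct the prediction
        P = (np.eye(2) - K @ H) @ P

    print("estimated pace (miles/min):", round(x[1], 3))  # ~0.11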

Katie's Chicago race report:

  • miles 1-13: light ankle pain, lovely cool weather, the most fun imaginable
  • miles 13-17: no more ankle pain but quads start getting tight, it's a little more effort
  • miles 17-20: oof, really tight legs but still plenty of gas in the tank
  • miles 20-23: it's warmer out now, legs hurt a lot but running through Pilsen and Chinatown is too fun to notice
  • mile 24: ugh, cramp, everything hurts
  • miles 25-26.2: awesome crowd support, really tired and loving every second

Final time: 3:54:35