We like learning on vacation. And we're on vacation, so we thought we'd re-air this episode about how to learn.
Original Summary: If you're anything like us, you a) always are curious to learn more about data science and machine learning and stuff, and b) are usually overwhelmed by how much content is out there (not all of it very digestible). We hope this podcast is a part of the solution for you, but if you're looking to go farther (who isn't?) then we have a few new resources that are presenting high-quality content in a fresh, accessible way. Boring old PDFs full of inscrutable math notation, your days are numbered!
We're on vacation on Mars, so we won't be communicating with you all directly this week. Though, if we wanted to, we could probably use this episode to help get started.
Original Episode: http://lineardigressions.com/episodes/2017/3/19/space-codes
Original Summary: It's hard to get information to and from Mars. Mars is very far away, and expensive to get to, and the bandwidth for passing messages with Earth is not huge. The messages you do pass have to traverse millions of miles, which provides ample opportunity for the message to get corrupted or scrambled. How, then, can you encode messages so that errors can be detected and corrected? How does the decoding process allow you to actually find and correct the errors? In this episode, we'll talk about three pieces of the process (Reed-Solomon codes, convolutional codes, and Viterbi decoding) that allow the scientists at NASA to talk to our rovers on Mars.
We're on vacation, so we hope you enjoy this episode while we each sip cocktails on the beach.
Original Episode: http://lineardigressions.com/episodes/2017/6/18/anscombes-quartet
Original Summary: Anscombe's Quartet is a set of four datasets that have the same mean, variance and correlation but look very different. It's easy to think that having a good set of summary statistics (like mean, variance and correlation) can tell you everything important about a dataset, or at least enough to know if two datasets are extremely similar or extremely different, but Anscombe's Quartet will always be standing behind you, laughing at how silly that idea is.
Anscombe's Quartet was devised in 1973 as an example of how summary statistics can be misleading, but today we can even do one better: the Datasaurus Dozen is a set of twelve datasets, all extremely visually distinct, that have the same summary stats as a source dataset that, there's no other way to put this, looks like a dinosaur. It's an example of how datasets can be generated to look like almost anything while still preserving arbitrary summary statistics. In other words, Anscombe's Quartets can be generated at-will and we all should be reminded to visualize our data (not just compute summary statistics) if we want to claim to really understand it.
Now that hurricane season is upon us again (and we are on vacation), we thought a look back on our hurricane forecasting episode was prudent. Stay safe out there.
Original Episode: http://lineardigressions.com/episodes/2017/9/17/hurricane-forecasting
Original Summary: It's been a busy hurricane season in the Southeastern United States, with millions of people making life-or-death decisions based on the forecasts around where the hurricanes will hit and with what intensity. In this episode we'll deconstruct those models, talking about the different types of models, the theory behind them, and how they've evolved through the years.
In this episode, we talk about some of the potential ramifications of GRPD in the world of data science.
If you're a data scientist, chances are good that you've heard of git, which is a system for version controlling code. Chances are also good that you're not quite as up on git as you want to be--git has a strong following among software engineers but, in our anecdotal experience, data scientists are less likely to know how to use this powerful tool. Never fear: in this episode we'll talk through some of the basics, and what does (and doesn't) translate from version control for regular software to version control for data science software.
Data science and analytics are hot topics in business these days, but for a lot of folks looking to bring data into their organization, it can be hard to know where to start and what it looks like when they're succeeding. That was the motivation for writing a whitepaper on the analytics maturity of an organization, and that's what we're talking about today. In particular, we break it down into five attributes of an organization that contribute (or not) to their success in analytics, and what each of those mean and why they matter.
Shapley values in machine learning are an interesting and useful enough innovation that we figured hey, why not do a two-parter? Our last episode focused on explaining what Shapley values are: they define a way of assigning credit for outcomes across several contributors, originally to understand how impactful different actors are in building coalitions (hence the game theory background) but now they're being cross-purposed for quantifying feature importance in machine learning models. This episode centers on the computational details that allow Shapley values to be approximated quickly, and a new package called SHAP that makes all this innovation accessible.
As machine learning models get into the hands of more and more users, there's an increasing expectation that black box isn't good enough: users want to understand why the model made a given prediction, not just what the prediction itself is. This is motivating a lot of work into feature important and model interpretability tools, and one of the most exciting new ones is based on Shapley Values from game theory. In this episode, we'll explain what Shapley Values are and how they make a cool approach to feature importance for machine learning.
- A previous Linear Digressions episode on LIME, another feature importance algorithm
- A Unified Approach to Interpreting Model Predictions
- Game Theory Attribution: The Model You've Probably Never Heard Of (good introductory explanation to Shapley values)
- One Feature Attribution Method to (Supposedly) Rule Them All: Shapley Values
If you were a machine learning researcher or data scientist ten years ago, you might have spent a lot of time implementing individual algorithms like decision trees and neural networks by hand. If you were doing that work five years ago, the algorithms were probably already implemented in popular open-source libraries like scikit-learn, but you still might have spent a lot of time trying different algorithms and tuning hyperparameters to improve performance. If you're doing that work today, scikit-learn and similar libraries don't just have the algorithms nicely implemented--they have tools to help with experimentation and hyperparameter tuning too. Automated machine learning is here, and it's pretty cool.
A huge part of the ascent of deep learning in the last few years is related to advances in computer hardware that makes it possible to do the computational heavy lifting required to build models with thousands or even millions of tunable parameters. This week we'll pretend to be electrical engineers and talk about how modern machine learning is enabled by hardware.
Last episode we talked conceptually about capsule networks, the latest and greatest computer vision innovation to come out of Geoff Hinton's lab. This week we're getting a little more into the technical details, for those of you ready to have your mind stretched.
- Understanding Hinton's Capsule Networks. Part 1: Intuition
- Understanding Hinton's Capsule Networks. Part 2: How Capsules Work
- Understanding Hinton's Capsule Networks. Part 3: Dynamic Routing Between Capsules
- Understanding Hinton's Capsule Networks. Part 4: CapsNet Architecture
- A curated list of awesome resources related to capsule networks
Convolutional nets are great for image classification... if this were 2016. But it's 2018 and Canada's greatest neural networker Geoff Hinton has some new ideas, namely capsule networks. Capsule nets are a completely new type of neural net architecture designed to do image classification on far fewer training cases than convolutional nets, and they're posting results that are competitive with much more mature technologies.
In this episode, we'll give a light conceptual introduction to capsule nets and get geared up for a future episode that will do a deeper technical dive.
If you've done image recognition or computer vision tasks with a neural network, you've probably used a convolutional neural net. This episode is all about the architecture and implementation details of convolutional networks, and the tricks that make them so good at image tasks.
It's been a nasty flu season this year. So we were remembering a story from a few years back (but not covered yet on this podcast) about when Google tried to predict flu outbreaks faster than the Centers for Disease Control by monitoring searches and looking for spikes in searches for flu symptoms, doctors appointments, and other related terms. It's a cool idea, but after a few years turned into a cautionary tale of what can go wrong after Google's algorithm systematically overestimated flu incidence for almost 2 years straight.
This week's episodes is for data scientists, sure, but also for data science managers and executives at companies with data science teams. These folks all think very differently about the same question: what should a data science team be working on? And how should that decision be made? That's the subject of a talk that I (Katie) gave at Strata Data in early March, about how my co-department head and I select projects for our team to work on.
We have several goals in data science project selection at Civis Analytics (where I work), which can be summarized under "balance the best attributes of bottom-up and top-down decision-making." We achieve this balance, or at least get pretty close, using a process we've come to call the Idea Factory (after a great book about Bell Labs). This talk is about that process, how it works in the real world of a data science company and how we see it working in the data science programs of other companies.
Autoencoders are neural nets that are optimized for creating outputs that... look like the inputs to the network. Turns out this is a not-too-shabby way to do unsupervised machine learning with neural nets.
After all the back-patting around making data science datasets and code more openly available, we figured it was time to also dump a bucket of cold water on everyone's heads and talk about the things that can go wrong when data and code is a little too open.
In this episode, we'll talk about two interesting recent examples: a de-identified medical dataset in Australia that was re-identified so specific celebrities and athletes could be matched to their medical records, and a series of military bases that were spotted in a public fitness tracker dataset.
A few weeks ago, we podcasted about a neural network that was being touted as "better than doctors" in diagnosing pneumonia from chest x-rays, and how the underlying dataset used to train the algorithm raised some serious questions. We're back again this week with further developments, as the author of the original blog post pointed us toward more developments. All in all, there's a lot more clarity now around how the authors arrived at their original "better than doctors" claim, and a number of adjustments and improvements as the original result was de/re-constructed.
Anyway, there are a few things that are cool about this. First, it's a worthwhile follow-up to a popular recent episode. Second, it goes *inside* an analysis to see what things like imbalanced classes, outliers, and (possible) signal leakage can do to real science. And last, it raises a really interesting question in an age when computers are often claimed to be better than humans: what do those claims really mean?