# Video transcript
:::{note}
This "draft" transcript has been downloaded from YouTube and needs to be polished.
:::
Video: [Introduction to Probabilistic Programming with PyMC](https://youtu.be/Qu6-_AnRCs8)
Speakers: Austin Rochford (AR), Reshama Shaikh (RS)
**RS:** Hello, everyone. Welcome to Data Umbrella's webinar. I'm going to do a brief introduction about Data Umbrella, and then Austin will do his talk. During the talk you are free to ask any questions in the chat, and we will answer them whenever it's a good time, either every 10 minutes or so or towards the end. This webinar is being recorded; it will be placed on our YouTube channel, typically within 24 hours, and I will post the link to our YouTube in the chat once I finish my brief presentation.
A little bit about Data Umbrella. We are a community for underrepresented persons in data science. About me: I am a statistician slash data scientist. I'm based in New York City, as you can hear from the sirens; that is quite normal here. You can find me on Twitter, LinkedIn and GitHub at reshamas. We have a code of conduct, and thank you for helping make this a welcoming, friendly, professional community for all of us. There are various ways that you can support Data Umbrella -- the first and foremost is to follow our code of conduct and contribute to making it a welcoming, collaborative space where people want to return.
We also have a Discord you can join to ask and answer general questions and share events or information. We're also on Open Collective, so if you would like, feel free to donate. If you work for a corporation that uses Benevity, a company match platform, you can also find us there. We have various video playlists on YouTube. One of them is contributing to open source, and you can see here that we have videos for NumPy, core Python, scikit-learn and pandas.
We also have a playlist on career advice. We've had four fabulous speakers so check that out if you are looking for that and these are all the playlists that we have. We have career advice, data visualization, data science beginner friendly videos, contributing to open source, scikit-learn sprint. And we have just started a new playlist for the PyMC series and this will be added to that one and this is just a sampling of some of the videos that we have on our YouTube.
We also have a lot of resources on our website related to using inclusive language, allyship, burnout, and AI ethics. You can check those out on your own. We are on all social media platforms as Data Umbrella. The ones that I have highlighted, Twitter and LinkedIn, are the most active. Meetup is the place to join to find out about upcoming events and updates. We have a blog as well and a monthly newsletter, so I will share the link to that. Feel free to sign up. We only send the newsletter once a month and we promise not to spam you in any way, so if you'd like to sign up for that, that's also helpful.
We use BigMarker for our webinar platform, and there is live captioning available: if you go to the very top, there should be a button that says CC, and you can turn on the live captioning. From the feedback that I've heard, the English live captioning is pretty good. You can try the other languages; I think they're not quite as accurate.
Our upcoming event is "Automating Workflows with GitHub Actions." That's going to be on January 18th. Oops, lost my slides. Hang on just a minute, sorry about that. All right, there we go. Yeah, sorry, I don't know what's going on.
Today's talk is Introduction to Probabilistic Programming with PyMC, with Austin Rochford. Austin is the chief data scientist at Kibo Commerce. He is a recovering mathematician who is passionate about math education, Bayesian statistics and machine learning, and you can find Austin on LinkedIn, Twitter and GitHub at austinrochford. I don't know if I'm pronouncing the name right, but Austin will let us know when he speaks. Feel free to tweet about the event; our Twitter is at DataUmbrella. And with that, I am going to turn off my mic and camera and hand it over to Austin.
**AR:** Thank you, Reshama, for that introduction, and thank you for organizing this event together with all of the other Data Umbrella folks on your team; it takes a lot of work to put an event together, to organize it and promote it. And thanks to the folks on the PyMC side who have been coordinating with Reshama and her team at Data Umbrella, Oriol and everyone else contributing to that effort. Thank you for asking me to speak here today; I'm excited to share something I care a lot about with all of you.
My goal today is to introduce the idea of probabilistic programming and how it works in PyMC, starting from a knowledge of Python and a little bit of NumPy and working up to an actually interesting application, to give you an idea of what probabilistic programming is and why we at PyMC find it exciting.

About this talk: as you can tell, this talk is a Jupyter notebook. The slide rendering made the code difficult to see, so I'm just going to scroll through the notebook; apologies for that. It's available on the PyMC Data Umbrella sprint GitHub repo at this link, and I think Oriol and I have both shared links in the chat with access to this notebook, so please do load it up on your own, play with some of the examples, and modify them for yourself.
In terms of an outline, where are we going for the next 40 or 45 minutes? We're going to talk about probabilistic programming from two different perspectives, philosophical and mathematical; that's a little bit of motivation. We're going to see how exactly probabilistic programming plays out in PyMC, using two somewhat simple examples to illustrate the basic concepts. One is one of my favorite little probability exercises, the Monty Hall problem, which you've probably seen or heard about before, and the second is robust regression, which will show how we can easily iterate on a model using probabilistic programming, one of its real strengths. Then we'll talk about why PyMC: why should you actually care about PyMC as opposed to coding these inferences from scratch or using some other system? So we'll talk a little bit about the strengths of Hamiltonian Monte Carlo and Aesara, which is kind of PyMC's peer project and enables a lot of the exciting things that PyMC does. Then I'm going to give you a taste of a near-real analysis I've done around Lego pricing. As you can tell from my background, I collect Legos: Star Wars, space, dinosaurs, a smattering of other ones. So I'm very interested in what drives the prices of Lego sets; if you go to my website, which is just austinrochford.com, I've written about this a fair amount. I'll show how to analyze those prices using PyMC, a real-world application, and then provide some resources and next steps. I think Reshama will ask, maybe, Meenal to say what the next steps are with the sprint that we're enabling here, but if you want to learn more about Bayesian statistics, probabilistic programming, or PyMC, I have some resources linked at the end as well. In the interest of time, let's dive right in.
I take two different perspectives on probabilistic programming. The first is a little philosophical. A lot of the time, people talk about data science as doing inference to tell stories: data science is all about telling stories with data. I've included one of my two favorite images here. This is the size of Napoleon's army as he marches to, and then back from, Moscow during his ill-fated invasion of Russia. What's the story this data visualization tells you? I think it tells you not to invade Russia during the winter. But it's very much data goes to stories; that's the direction the arrow flows. You do some calculations on your data, and you use those to tell a story.

One of the real strengths of probabilistic programming, and how I think about it, is that it turns that arrow around. Probabilistic programming says: I'm going to tell the story of how my data is generated, encode that, and then use it to perform inference about unknown quantities, whether that's the size of Napoleon's army at a certain point in time, the probability that a coin is fair, or the effect that the number of pieces has on the price of a Lego set. There's some unknown quantity in the world; I'm going to tell a story about how it relates to the data I see, and then I'm going to use the data I've seen to infer some things about that unknown quantity. That is how probabilistic programming turns the arrow around.

And if you put your mathematical hat on a little bit and think about turning that around: if I have a story and I get data from that story, how do I turn that conditional probability around? Well, that's Bayesian statistics. Bayesian statistics is all about how I go from the probability of B given A to the probability of A given B; it's all about reversing conditional probabilities. It's the same idea here: going from data to storytelling and turning that around is what really motivates me.
So that's the philosophical level; I could say a lot more there, but I'm not going to dwell too deeply. Let's dive into the mathematical level. At the mathematical level, this all plays out in something called Monte Carlo methods; we're going to be talking about advanced Monte Carlo methods here. What's a simple example of a Monte Carlo method? I'll give you one right here: let's generate 5,000 points that are uniformly distributed in the square that ranges from negative one to one, which is shown here. (And I apologize, some of these things will be duplicated, because these were built to be rendered as slides but the formatting wasn't making the code legible, so we're being a little bit flexible here.)
We've got our 5,000 points in this square from negative one to one. Let's count the number of them that are inside the circle with radius one: a point is inside that circle if x squared plus y squared is less than one. This plot highlights the ones inside the circle of radius one; the green points are inside the circle, the orange ones aren't. What proportion of these should we expect to be inside the circle? Well, it has to do with the area of the circle. The circle has radius one, so its area is pi, and the area of the square is four. So if we take the number of green points inside the circle and divide by the total number of points, which is 5,000, we should get the ratio of the area of the circle to the area of the square, which is pi over four. And sure enough, if we add up the number that are in the circle, divide by 5,000, and multiply by four, we get a not-awful approximation to pi.
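In NumPy terms, the experiment is essentially the following (a minimal sketch of mine, not the notebook's exact code):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# 5,000 points distributed uniformly over the square [-1, 1] x [-1, 1]
x, y = rng.uniform(-1.0, 1.0, size=(2, 5_000))

# A point lies inside the unit circle when x**2 + y**2 < 1
in_circle = x**2 + y**2 < 1

# (points inside circle) / (total points) ≈ (area of circle) / (area of square) = pi / 4
pi_approx = 4 * in_circle.mean()
print(pi_approx)  # roughly 3.14
```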
So this is what a Monte Carlo method is: a Monte Carlo method is all about using random numbers cleverly to approximate the quantity you're interested in. We'll get into how that connects to probabilistic programming and Bayesian inference in a second.
How does this connect to Bayesian inference? Well, what we've done here is actually approximate an integral. If you think back to calculus, this integral relationship is true because the square root of one minus x squared defines this quarter circle, so the integral gives you the area of the quarter circle; multiply that by four and you get the area of the whole circle, which is pi. So instead of doing this integral through, say, a trigonometric substitution or whatever you might have learned in calculus, we approximated its value through random number generation.
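The integral being referred to is the standard quarter-circle area relationship, written out here for reference:

$$
\frac{\pi}{4} = \int_0^1 \sqrt{1 - x^2}\, dx .
$$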
And Bayes' theorem very often involves intractable integrals. One way to restate Bayes' theorem, as I showed above, is in terms of theta, your unknown parameters, and D, your data. If we want to know about our unknown parameters given our data, it tells us to work with the probability of the data given the unknown parameters. Probabilistic programming is all about specifying that quantity; specifying the prior is not that hard either. But calculating the integral in the denominator is intractable, not possible, provably impossible, for a lot of interesting models, and if we restrict ourselves to just the models where we can calculate that quantity, life becomes a lot less interesting; we close off a lot of potential applications. That's what gets us to consider probabilistic programming using Monte Carlo methods, in this case with PyMC, although there are many different packages out there to do it with.
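For reference, the standard statement of Bayes' theorem in this parameter/data form, with the typically intractable integral in the denominator, is:

$$
P(\theta \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \theta)\, P(\theta)}{\int P(\mathcal{D} \mid \theta)\, P(\theta)\, d\theta} .
$$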
All of that was great theory; let's get down to practice now and talk about the Monty Hall problem. If you're not familiar with it: there's a famous game show in the United States, going back 50 years or probably even longer, called Let's Make a Deal, and the host for a number of years was named Monty Hall; that's where the name comes from. One of the games contestants would play on Let's Make a Deal was that Monty would show them three closed doors. Behind two doors were goats, and behind one door was a sports car; obviously, in the end you want to open the door that has the sports car behind it. How the game would play out is that Monty would ask the contestant to choose a door, without opening it yet. Say you chose door number one, the first door. What Monty would then do is open one of the other doors, in this example door three, but it could be either one, to show you a goat. And then this is where the drama comes in: he would ask you, do you want to switch your choice of door, yes or no?

This is one of the first examples where conditional probability becomes a little counterintuitive. As probably a lot of you know, the answer is yes, you should switch your choice of door. After he's opened a door to show you a goat, there's a one-third chance that your door contains the sports car and a two-thirds chance that the other unopened door contains the sports car, so you should switch in order to maximize the chance of choosing the prize you actually want. We could work this out mathematically; I'm not going to run through it because it's tedious. The point is that probabilistic programming will let us write some code to avoid doing this, because working it out by hand is possible in this situation but not particularly interesting, and in a lot of situations it won't be possible at all. It's always good, when you're learning a new numerical method, to apply it to a problem where you can get the answer with pen and paper and make sure they agree; that gives you confidence when you move to situations where you can't just use pen and paper. You should be able to recover a known answer in a situation you can work out explicitly, so that you build confidence for the situations where you can't.
So let's see how we solve the Monty Hall problem in PyMC. (That was supposed to be a hidden cell; thank you for bearing with me.) What does the PyMC solution look like? Initially we have no information about which door the prize is behind. I'm importing PyMC, I'm creating a PyMC model with this context manager, normal Python stuff, and then I'm saying: the prize comes from a discrete uniform distribution. All that means is that each number among 0, 1, and 2 has the same chance of being the door with the prize: a one-third chance it's behind door zero, the first door, a one-third chance it's behind the second door, and a one-third chance it's behind the third door. Here I'm telling the story of how my data is generated: before I get any information, I don't know which door the prize is behind, so I should consider each door equally likely.
Then we need to think about things from the host's perspective: which door is he going to open, based on the door we have chosen? We've chosen the first door, so he's not going to open that door; there would be no drama there. If he opened it and there were a car, we'd just stick with the open door; if he opened it and there were a goat, we'd switch doors and it wouldn't really matter which one we chose. So he's not going to choose the door we chose, which is door one. Now, if the prize is behind door one (this is the first row), he won't open door one, as we've already said, but he could open door two or door three; they both contain goats, so that's fine. If the prize is behind door two, he's not going to open door one, again because that destroys the drama, and he's not going to open door two either, because he doesn't want to show us the car (we would just switch to the door with the car), so he's forced to open door three. Similarly, if the prize is behind door three, he's forced to open door two. This is how Monty decides which door to open. So we can see: if the prize is behind door one, there's a fifty percent chance he opens door two and fifty percent door three; if the prize is behind door two, there's a hundred percent chance he opens door three; and if the prize is behind door three, there's a hundred percent chance he opens door two.
In the spirit of probabilistic programming, we want to encode these chances that he opens each door, given where the prize is, as code. I'm going to import this package named Aesara (I'll talk a little more about Aesara in a second, so just bear with me for the moment), and I'm going to write p_open, the probability that he opens each door, as a switch statement; this is basically the Aesara version of an if statement. It says: if the prize is behind the first door, there's a zero percent chance he opens the first door, fifty percent he opens the second, and fifty percent he opens the third; that corresponds to this line. If it's not behind the first door, a nested if statement: if it's behind the second door, there's a zero percent chance he opens door one, a zero percent chance he opens door two, and a hundred percent chance he opens door three; that corresponds to this row. And then the final case: if it's not behind the first door and not behind the second door, it has to be behind the third door, so there's a hundred percent chance he opens the second door.

Why have I written this in terms of aet.switch (aesara.tensor.switch) and aet.eq instead of regular Python if/then/else statements? We'll answer that question in a little bit; it's all about increasing the efficiency of our sampling algorithms through calculus. Aesara lets us differentiate expressions automatically, without having to do any calculus with pen and paper, in a way that normal Python if/then/else statements do not, and that's the benefit we get by writing this, which totally could have been an if/then/else statement, in this somewhat awkward syntax. We'll see that it's really worthwhile once we move toward more complex examples.
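As a standalone illustration (my own minimal sketch, not the notebook's cell), the nested switch expression can be evaluated directly to check that it encodes the table above:

```python
import aesara.tensor as aet

# Probability Monty opens each of doors 0, 1, 2, given where the prize is,
# assuming we chose door 0 (doors are zero-indexed, as in the talk)
prize = aet.lscalar("prize")
p_open = aet.switch(
    aet.eq(prize, 0), [0.0, 0.5, 0.5],               # prize behind our door: he opens door 1 or 2
    aet.switch(aet.eq(prize, 1), [0.0, 0.0, 1.0],    # prize behind door 1: he must open door 2
                                 [0.0, 1.0, 0.0]),   # prize behind door 2: he must open door 1
)

print(p_open.eval({prize: 1}))  # [0. 0. 1.]
```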
So: we've said our prior belief about which door the prize is behind, and we've said what the chances are that Monty opens each door given where the prize is. Now, we know he opened the last door, the third door, and showed us a goat. Which door he opened is a categorical random variable: it's either the first, second, or third door; the probability of each door being opened is that p_open we just defined through those switch statements; and we know he opened the third door, which, since this is zero-indexed, corresponds to the value two. So we've said how the location of the prize influences the likelihood that he opens each door, and what data we've observed. Now we perform inference, and inference in PyMC is done through sampling; this is where the Monte Carlo inference comes into play. We sample from the posterior distribution, to use the Bayesian terminology, and we get an xarray array of samples: in this case 20,000 samples from the posterior distribution of the door the prize is behind, in 20,000 different simulations. Just like when we were using Monte Carlo methods to approximate pi, where we had 5,000 samples, a certain proportion of them were inside the circle, and we used that proportion to determine pi, here we're going to draw 20,000 random numbers, see how many of them are zero (saying the prize is behind the first door) and how many are one (saying the prize is behind the second door), and those proportions are what tell us the probability that determines whether or not we should switch doors. So we get these traces, these samples.
I'm going to check the posterior on the prize: this says how often the prize is behind the first door, and it says that one third of the time the prize is behind the first door, given the information we've put in, which means that two thirds of the time it's behind the second door. So we should switch doors. This is the well-known result that we could derive with pen and paper, as I did on a previous slide.
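The notebook has the actual cells; put together, the whole model is roughly along these lines (a sketch of mine using PyMC v4-style imports, not a verbatim copy):

```python
import pymc as pm
import aesara.tensor as aet

with pm.Model() as monty_model:
    # Prior: before any door is opened, each of doors 0, 1, 2 is equally likely
    prize = pm.DiscreteUniform("prize", 0, 2)

    # Probability Monty opens each door, given where the prize is, when we chose door 0
    p_open = aet.switch(
        aet.eq(prize, 0), [0.0, 0.5, 0.5],
        aet.switch(aet.eq(prize, 1), [0.0, 0.0, 1.0],
                                     [0.0, 1.0, 0.0]),
    )

    # Observed data: Monty opened the third door (index 2) and showed us a goat
    opened = pm.Categorical("opened", p_open, observed=2)

    # Inference by sampling: 4 chains of 5,000 draws gives 20,000 posterior samples
    trace = pm.sample(draws=5_000, chains=4)

# Fraction of posterior samples in which the prize is behind our door (door 0): about 1/3,
# so switching to the other unopened door wins about 2/3 of the time
print(float((trace.posterior["prize"] == 0).mean()))
```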
So, good: we've recreated this result using probabilistic programming. By telling the story of how our data was generated, we have inferred the chances that the sports car is behind the first or second door, and we know the correct answer is to switch doors.
Let's step back a bit and talk about the components I've just flashed in front of you. We've talked about two so far. First, PyMC distributions: we saw discrete uniform and categorical distributions there. Almost any probability distribution you can think of from statistics has been implemented in PyMC, and if it hasn't, new distributions are actually quite straightforward to add; those make good first contributions. There's a lot of boilerplate code with some small tweaks, and it's really good and important work. Given that I'm a Lego collector, I like to think of these as the Lego bricks of probabilistic programming: distributions are the bricks, and PyMC lets you put those bricks together in different ways to perform inference. In the case above we used the discrete uniform and categorical distributions; I've just chosen to highlight two to show the flexibility. Obviously there are normal distributions, there are zero-inflated Poisson distributions; if you click on this link it takes you to the documentation, which shows you many dozens, if not hundreds, of distributions available to build your statistical models from.
And what were those aet.switch and aet.eq things doing? They were invoking Aesara, which is PyMC's tensor computation backend. You can see a little blurb from the Aesara documentation here; I'm not going to read it out loud. You can think of Aesara as filling the same niche that PyTorch or TensorFlow fill in various deep learning ecosystems: it will optimize your numerical computational graph, and it will also let you calculate derivatives of that computational graph without having to do the calculations yourself. That is really key, and it's what PyMC really leans on Aesara for, to enable efficient inference in high dimensions, which we'll talk about in a little bit. It's very much worth the cost of wrapping p_open in these Aesara switch and equality statements to get automatic differentiation and optimization for free. Once you start thinking in that way, it's not hard to port a lot of the common code you would write into that style, and the benefits are great.

I'm going to walk through another fairly simple example to illustrate a couple more key concepts and ingredients of PyMC,
or dependencies perhaps, before we ramp up to our more realistic Lego example. We're going to take a data set from Anscombe's quartet, an interesting, instructional data set of four different groups of data that have similar summary statistics, and I want to talk a little bit about how we can use PyMC to do robust regression. Let's focus on the third entry in the quartet, which is almost exactly linear except for one outlier. I'm going to show you how you can really powerfully use these kinds of swappable distributions in PyMC to capture the linear trend here and ignore the outlier, in a toy example, but one that's instructive for real-life applications. So: ordinary least squares.
Let's first do something we know will be prone to responding to this outlier. What are the mathematical assumptions behind your good old ordinary least squares regression? They are that all values of the slope m and the intercept b, and of the noise scale sigma, are equally likely a priori. So I'm going to build a PyMC model that encodes these things. Flat here is the distribution, the Lego brick, that says all real numbers are equally likely to be the value of m, and all real numbers are equally likely to be the value of b; HalfFlat means all positive real numbers are equally likely to be the value of sigma. Then you write out some math: y is m times x plus b, plus some noise. That's one way to view it; statistically it's equivalent to saying y has a normal distribution with mean m times x plus b and variance sigma squared, and that's how this is encoded in PyMC, really simple. And here you'll see again we're saying: I observed this data, y3, the y-coordinates of the data points in the third entry of the quartet.
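A minimal sketch of such a model (my reconstruction, with variable names of my choosing; the x3/y3 values are the standard Anscombe quartet III data):

```python
import numpy as np
import pymc as pm

# Third entry of Anscombe's quartet: nearly perfectly linear apart from the outlier at x = 13
x3 = np.array([10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0])
y3 = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])

with pm.Model() as ols_model:
    # Flat priors: every slope and intercept is equally likely a priori
    m = pm.Flat("m")
    b = pm.Flat("b")
    # HalfFlat: every positive noise scale is equally likely a priori
    sigma = pm.HalfFlat("sigma")

    # y ~ Normal(m * x + b, sigma) is the probabilistic restatement of least squares
    y_obs = pm.Normal("y_obs", m * x3 + b, sigma, observed=y3)

    ols_trace = pm.sample()
```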
We can come down and sample from this model. Again, you see PyMC doing some sampling for us, and it's warning us about some things: there were five divergences after tuning. This is great: PyMC is not only trying to infer the values of m and b here, it's also going to warn us when it's not doing a great job, when our Monte Carlo methods may be failing. That's a really powerful and user-friendly thing about PyMC. How can we visualize these divergences, where inference is failing? We can use one of the other important components of the probabilistic programming ecosystem, ArviZ, a kind of sibling project to PyMC, to plot the values of m, b, and sigma that are causing these divergences. The idea is that ArviZ is a library containing a lot of pre-built visualizations and statistical routines that help you understand the results of the inference you did with PyMC, and visualize and debug those results.
So, two more key components that this uncovers. One is ArviZ, and ArviZ is really cool because it's agnostic about the probabilistic programming library you use to conduct the inference. The plots of posterior distributions, or divergences, or many other things you want to make, don't care whether PyMC is your probabilistic programming package of choice, or Stan, or NumPyro; there are many probabilistic programming languages out there, and they all want similar visualizations. ArviZ exists to say: if you provide the results of inference in a standard format, we'll build those visualizations for you, no matter what upstream inference package you use. So ArviZ is really awesome, and I strongly encourage everyone to take a look at it.
When we look at the results of inference: we said pm.sample and stored the result in the ordinary least squares trace, and if we look at the type of that, it's an ArviZ InferenceData object. You just need to encode the samples from your Monte Carlo approximations in this object to get ArviZ to work on them. If we look at the posterior component of that trace, it's an xarray Dataset. xarray is a really awesome library; I think about it almost as pandas in more than two dimensions. It's a great way to define flexible, labeled arrays in arbitrarily many dimensions, and that's exactly what you want when you're doing Bayesian inference and visualizing the results of Bayesian inference. We will see where those labels are very powerful.
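Continuing the sketch above, inspecting the trace and highlighting the divergences might look something like this (plot_pair with divergences marked is one option among several; the real notebook may use a different plot):

```python
import arviz as az

# ols_trace from the sketch above is an ArviZ InferenceData object
print(type(ols_trace))      # arviz.InferenceData
print(ols_trace.posterior)  # an xarray Dataset with labeled chain/draw dimensions

# Pair plot of m and b with the divergent transitions highlighted
az.plot_pair(ols_trace, var_names=["m", "b"], divergences=True)
```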
There's a question: why do we assume that the noise is normally distributed in the ordinary least squares model? That's a great question. You can motivate the ordinary least squares model a bunch of different ways; usually it's viewed as minimizing the mean squared error, with a bunch of restrictions applied, and that turns out to be equivalent to placing a normal likelihood on your data. So it's just one of the assumptions that goes in here; there are a bunch of different equivalent ways to frame it.
Great question. So, what do we want to do? We want to compare our inferred values of m and b to the robust values we would get if we ignored the outlier, and the outlier here occurs where x equals 13. We're going to use NumPy's polynomial fit of degree one, which is equivalent to ordinary least squares regression, to get the slope and intercept, and compare those to the posterior distributions for m and b that we got using PyMC. You can see the reference values here: the robust slope is a little toward the low end of our posterior credible interval, so we've slightly overestimated the slope and underestimated the intercept. The details of this visualization aren't that important; the point is that all of this came out of essentially one line of ArviZ, because ArviZ is that powerful: it shows the true value I'm trying to recover with inference against the posterior distribution that resulted when I fit my model.
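Roughly, and again using the names from the sketches above, that comparison might look like:

```python
import numpy as np
import arviz as az

# "Robust" slope and intercept: a least-squares fit that simply drops the outlier at x = 13
mask = x3 != 13
robust_m, robust_b = np.polyfit(x3[mask], y3[mask], 1)

# Posterior distributions of m and b with the robust values drawn as reference lines
az.plot_posterior(ols_trace, var_names=["m", "b"], ref_val=[robust_m, robust_b])
```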
We can also plot the lines that come out of this inference. The blue line is the posterior expected value you get using PyMC, and the red line is what you get when you calculate the same thing using NumPy, just to show that they agree when you match up your assumptions. That's good, but you can see this line has definitely been pulled upward by the outlier: we're underestimating the true intercept and overestimating the true slope because of it. How can we swap out some of those Lego bricks, those PyMC distributions, to make
this less sensitive to the outlier? Well, we can start by regularizing a little bit: let's move from ordinary least squares to ridge regression, for those of you familiar with the machine learning concept. What is ridge regression? It's really just normal priors on your coefficients. So instead of saying m has a flat distribution, we swap that out and say it's normally distributed with mean 0 and standard deviation 1. How to choose these parameters is a subject in and of itself; I've just chosen these here for convenience, and you could change them.
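Swapping the bricks in the earlier sketch, the change is just this (same caveats as before; the hyperparameters 0 and 1 are the ones mentioned in the talk):

```python
import pymc as pm

# x3 and y3 are the Anscombe data from the earlier sketch
with pm.Model() as ridge_model:
    # Normal priors on the coefficients: the Bayesian analogue of ridge regularization
    m = pm.Normal("m", 0.0, 1.0)
    b = pm.Normal("b", 0.0, 1.0)
    sigma = pm.HalfFlat("sigma")

    y_obs = pm.Normal("y_obs", m * x3 + b, sigma, observed=y3)

    ridge_trace = pm.sample()
```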
Reshama has asked: what if y were distributed as a binomial? Yes, you could absolutely build, say, a logistic or probit regression model this way. There are distributions for any kind of typical GLM you've seen out there; I've written up examples on my blog, austinrochford.com, around negative binomial regression using PyMC, and even ordinal regression to see what factors drive Lego set star ratings. The set of distributions is really flexible; it really is Legos. The goal is to enable you to build whatever models you want using these building bricks. Here we're sticking with normal observed likelihoods just for simplicity, since this is an introductory talk, but this can get as complex as you like; you can make a zero-inflated binomial model or something similar with the building blocks that PyMC provides. A great question, Reshama.
So we're going to put these normal priors on the slope and intercept; this is equivalent to ridge regression (I'd have to do some math to determine which regularization penalty it corresponds to, but it is equivalent). We perform inference again and get a couple fewer warnings, but there are still some divergences. We check our plots again, and our true values, the robust values of the slope and intercept, are actually even further from what we've inferred. Huh. Well, this makes sense: regularization of this sort tends to be more useful when your x values are extreme, not your y values. We've added robustness to extreme x values but not to extreme y values. Actually, Reshama, this speaks to your next question, which is how we can change the
observational likelihood to be more robust against this extreme y value. To get robust regression out of this, we can use the fact that Student's t distribution has fatter tails than a normal distribution. Here the blue line is the standard normal distribution and the orange line is a Student's t distribution with two degrees of freedom; you can see the tails are fatter. Because of those fatter tails, this likelihood will be less sensitive to outlying y values. So we'll specify the model for m, b, and sigma the same as we did before, but we'll change y_obs from a normal random variable to a Student's t distribution. We have to specify a prior on the degrees of freedom; I say it's uniform between one and ten, because as the degrees of freedom goes to infinity the t distribution becomes a normal distribution, and it's less interesting if the degrees of freedom is huge. We expect it to be small here because there is an outlier. Otherwise it's the same: the mean is m times x plus b, sigma is sigma, and we perform inference.
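A sketch of that swap, keeping the normal priors from the ridge version (again my own reconstruction, not the notebook's exact cell):

```python
import pymc as pm

# x3 and y3 are the Anscombe data from the earlier sketch
with pm.Model() as robust_model:
    m = pm.Normal("m", 0.0, 1.0)
    b = pm.Normal("b", 0.0, 1.0)
    sigma = pm.HalfFlat("sigma")

    # Small degrees of freedom give the fat tails that let the model ignore the outlier
    nu = pm.Uniform("nu", 1.0, 10.0)

    # Swap the normal likelihood for a Student's t likelihood
    y_obs = pm.StudentT("y_obs", nu=nu, mu=m * x3 + b, sigma=sigma, observed=y3)

    robust_trace = pm.sample()
```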
In our posterior plots we've captured the true robust values of these quantities quite well, and we can visualize the resulting lines: the green line goes right through the non-outlier data points, and the outlier has been ignored. This is a slightly silly example, because there's no noise in the data outside of the one outlier; that's why these posterior distributions look really wacky. You could add a little noise to make it more realistic. The point is that these distributions are easily swappable.
Chandra asks: can we model using Tweedie regression? I believe we can, and if not, it's possible to add; it just depends on whether the necessary distribution has been pre-built. How do we choose hyperparameters, and can we go wrong by choosing a distribution? This is just one tool in your statistical toolbox out of many, and you can always go wrong choosing your statistical tools. There are tons of ways to choose these prior distributions and to evaluate the goodness of your choice that are really outside the scope of this talk; at the end I will give some references that include discussions of these sorts of things, but that's a great question. So yes, absolutely, you can make bad decisions if you don't choose the appropriate tools, as with any statistical method, and how to decide whether you are in a good spot or not is something outside the scope of this talk, but there are many references out there that I will share.
So, why are we using Aesara again? I'm going to move quickly here to make sure we have time for Q&A at the end, so apologies, but hopefully it will still be somewhat clear. Why are we using Aesara, why are we using that aet.switch and aet.eq? Under the covers, when we write pm.Normal, there are a bunch of Aesara operations happening when we call this class constructor; PyMC is just providing a nice layer over those for you, so you don't have to think about them unless you want to.
The reason we use Aesara is to do something called Hamiltonian Monte Carlo. What makes it Hamiltonian is that you want to take into account the geometry of the model you're specifying, and geometry, in the sense of differential geometry, is all about curvature, and curvature is about derivatives. In a past life I was a differential geometer, and the way I think about it is that curvature is always about the failure of partial derivatives to commute: if you differentiate with respect to x and then with respect to y, versus differentiating with respect to y and then with respect to x, those two things are the same in flat space, on the plane, but they are in general different on a curved surface. That is how geometers, or at least a certain set of geometers, think about curvature, and it's actually really important for doing this kind of inference in high-dimensional models, the models we're really interested in as accurate representations of reality, due to the curse of dimensionality, which you're probably familiar with from machine learning.
The way the curse of dimensionality is reflected geometrically is that the volume of the unit sphere goes to zero exponentially fast as the number of dimensions increases. You can think of the unit circle in two dimensions, the unit sphere in three dimensions, then a hypersphere in four, five, six, seven dimensions: what's that volume? It drops to zero really fast as the number of dimensions increases even mildly; somewhere between 100 and 1,000 dimensions it's already down around 10 to the minus 196.
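The drop-off comes from the standard formula for the volume of the unit ball in $n$ dimensions (not spelled out in the talk; added here for reference):

$$
V_n = \frac{\pi^{n/2}}{\Gamma\!\left(\frac{n}{2} + 1\right)}, \qquad
V_2 = \pi \approx 3.14, \quad V_{10} \approx 2.55, \quad V_{100} \approx 2.4 \times 10^{-40}.
$$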
What this means is that typical points near the origin (there's a precise definition of "typical" that I won't get into here) are hard to find in high dimensions if we just treat the space as flat. When we take into account the curvature of the space, they become a lot easier to find, and Aesara is what lets us do that, by automating the derivatives we need to quantify curvature.
So where does Aesara really help us? As a toy example, let's show, using Aesara, that the derivative of x cubed is 3x squared, familiar to all of us who have taken calculus. We define x as a scalar and y as x cubed, then I ask Aesara to pretty-print the gradient of y with respect to x, which should be this derivative. Let's walk through how to read it. It says: fill x cubed with ones, which is just one here; one times three gives this coefficient of 3; and then we have times x to the (3 minus 1). So we've got the 3 from this term and x squared from this term. Beautiful.
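A minimal version of that toy check (my sketch; the pretty-printed output shown in the comment is approximate):

```python
import aesara
import aesara.tensor as aet
from aesara.printing import pprint

# Symbolically differentiate y = x**3 with respect to x; the result should be 3 * x**2
x = aet.scalar("x")
y = x**3

print(pprint(aesara.grad(y, x)))
# Something like: (fill((x ** 3), 1.0) * 3) * (x ** (3 - 1))
```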
So Aesara is doing calculus under the covers to quantify the curvature of space, to make sure that we are not dealing with the absurdly small numbers we would run into if we didn't acknowledge that geometry. This is all a bit hand-wavy right now; there are great references here, and there's no way to go super deep on that theory in an introductory talk. I just want to wave at why it's important.
Let me take about five minutes to run through this final example, where the shape of the posterior does become important. In that Tweedie regression, or that binomial, ordered logit, or probit regression, this becomes a lot more important than in our simple univariate linear model with two or three parameters; it becomes a lot more important as your model gets more complex and realistic. So let's talk briefly about a Bayesian analysis of Lego prices. On the left here you have a small sampling of my Lego collection (to my left here there is some of it as well). I came at this with a motivating example: here's Darth Vader's Meditation Chamber, a set Lego released last year, in 2021. Is this model worth the $70 they were charging for it or not? How can we answer that question with data? I went out and scraped a bunch of data from Brickset, which is a Lego fan site that references all sorts of data; you can see it here, and I've made it available online, so you can find the data if you're interested in that sort of thing. You'll see there are about 6,400 sets from 1980 to 2021.
I'm going to move quickly here because I know we have some other things to cover at the end and I want to leave time for Q&A. There's a log-linear relationship between the piece count of a set and the price of the set. That's not surprising: Lego has both fixed and variable costs in manufacturing its sets, and larger sets will obviously incur larger variable costs, so Lego is going to charge more to cover those costs plus whatever profit; that makes perfect sense. We can see where Darth Vader's Meditation Chamber fits into this. To make the model a little more interesting and realistic, we can look at how these prices have varied over time: how has the recommended retail price per piece varied over time, and where does Darth Vader's Meditation Chamber fit in there? We can also break this down by theme, so we can compare Star Wars, Creator, Harry Potter, and Disney.
We can build a model (I don't know why my math is not rendering here; apologies for the last-minute change) that says, basically, that each component, the intercept and the price per piece, has a time-varying component, a Gaussian random walk (you can come into the notebook and see how this is built up from our pm.Normal Lego bricks), and a theme component, which is a hierarchical normal random variable; you can see how that's built up here too. It's not a ton of code to specify a fairly flexible model with both time and theme effects.
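To make the structure concrete, a heavily simplified sketch of a model along these lines might look as follows; every shape, prior, and piece of fake data here is a placeholder of mine, not the notebook's actual model:

```python
import numpy as np
import pymc as pm

# Placeholder data: each set has a year index, a theme index, and a log piece count
n_years, n_themes, n_sets = 42, 10, 6_400
rng = np.random.default_rng(0)
year_idx = rng.integers(n_years, size=n_sets)
theme_idx = rng.integers(n_themes, size=n_sets)
log_pieces = rng.normal(5.0, 1.0, size=n_sets)
log_price = 0.75 * log_pieces + rng.normal(1.0, 0.2, size=n_sets)  # stand-in response

with pm.Model() as lego_model:
    # Time-varying intercept and price-per-piece slope: Gaussian random walks over years
    beta0_t = pm.GaussianRandomWalk("beta0_t", sigma=0.1, shape=n_years)
    beta_pieces_t = pm.GaussianRandomWalk("beta_pieces_t", sigma=0.1, shape=n_years)

    # Hierarchical (partially pooled) theme offsets around a shared mean
    mu_theme = pm.Normal("mu_theme", 0.0, 1.0)
    sigma_theme = pm.HalfNormal("sigma_theme", 1.0)
    beta_theme = pm.Normal("beta_theme", mu_theme, sigma_theme, shape=n_themes)

    mu = (beta0_t[year_idx]
          + beta_theme[theme_idx]
          + beta_pieces_t[year_idx] * log_pieces)
    sigma = pm.HalfNormal("sigma", 1.0)
    obs = pm.Normal("obs", mu, sigma, observed=log_price)
```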
If I try to sample this without Hamiltonian Monte Carlo, using a Metropolis step, which is the simple, naive thing to do, I get a ton of errors. If I calculate the R-hat statistics, I see they're enormous, up above three for some of these quantities; R-hat statistics of one indicate convergence, so I shouldn't trust these results. This is one of the ways PyMC tries to help you decide whether your results are trustworthy or not. You can see this took about a minute, and there are 351 parameters here, so there are a lot of them. If instead I use the more advanced sampler that PyMC provides, the Hamiltonian Monte Carlo sampler NUTS (which stands for No-U-Turn Sampler), it takes nine or ten minutes, but it produces much better R-hat values that are close to one, indicating no problems with convergence, and therefore much more trustworthy posterior inferences.
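Schematically, the comparison is something like this (again a sketch against the placeholder model above, not the notebook's code):

```python
import arviz as az
import pymc as pm

with lego_model:
    # Naive Metropolis sampling: fast per draw, but falls prey to the curse of dimensionality
    metropolis_trace = pm.sample(step=pm.Metropolis())

    # Adaptive Hamiltonian Monte Carlo (NUTS), PyMC's default for continuous models
    nuts_trace = pm.sample()

# R-hat near 1 indicates convergence; values well above 1 mean the draws are untrustworthy
print(az.rhat(metropolis_trace).max())
print(az.rhat(nuts_trace).max())
```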
The top traces use the Metropolis step, which falls prey to the curse of dimensionality: you can see that these samples are wild, all over the place, and I have no reason to expect a true posterior distribution to have this form, so this shows inference problems. The Hamiltonian Monte Carlo traces, on the other hand, have this beautiful, much more plausible shape. So even though it took about ten times as long to sample (the sampling time is an order of magnitude longer for adaptive Hamiltonian Monte Carlo; and yes, Daniel, to your question, NUTS is a form of HMC), the sampling efficiency, in terms of effective sample size per second, is an order of magnitude larger. To get the same effective sample size with the Metropolis step you'd actually have to sample that model for 25 hours; 25 hours versus 10 minutes, I'll take
10 minutes any day of the week. So we can start to answer some of our questions pretty easily using PyMC. We can look at the posterior distribution of what we would expect Darth Vader's Meditation Chamber to cost based on the fact that it's a Star Wars set released in 2021 with however many pieces it has (I think it's about 700). Our model's posterior expected value was $79, and we see that Lego's $69.99 price is just below the median, so it's pretty reasonably priced. We can also assess pretty easily the effect of different set types on the price of Legos: Creator sets are a little less expensive, Star Wars sets are a little more expensive, et cetera.
Here I've included two books that are available online for free (the one on the right is also for purchase). The one on the left is a classic reference. The one in the middle is a great reference written in R with Stan, but we have posted notebooks implementing all of its examples and exercises in Python with PyMC; it's a really great textbook to help you think about Bayesian inference. And on the right is a textbook about Bayesian inference written by Osvaldo, Ravin, and Junpeng, who are three PyMC core contributors; it just came out, has an online version, and is also available for purchase in print.
Thank you for your attention. Again, I thank Reshama and the Data Umbrella team for their hard work putting this together, and Oriol and Meenal, excuse me, for their coordination. At this point I will hand it back to Reshama to perhaps say a little bit more about the sprints this talk has been intended to introduce, and then I can take any Q&A after Reshama fills that in. Thank you all for your time, I really appreciate it.

So, yes: Meenal now is going to be speaking a little bit about the upcoming sprint, so I'll wait. If you're able to share your camera, that would be great.
Right? No... Yep, we can hear you. So I'll share my screen. Okay. Oh, I think my screen is visible? Yes, it's loading. Yep, now we can see the slides.
Yep. So, we're going to be organizing a sprint with Data Umbrella, with the entire PyMC team, on the 18th and 19th of February, over that weekend. We haven't finalized the exact time slots yet, but we're hoping to do two two-hour slots over the weekend so we can accommodate more time zones.

Two of these webinars have already happened and two more will be happening, and I really think, if you can make the time, you should either attend them or watch the recorded versions if you want to contribute to PyMC. Austin gave a wonderful introduction on how to use PyMC and why you should be interested in probabilistic programming. Next we have a talk that Ricardo will give, which is more about how you can contribute to PyMC, how you can help out with the source code or other things, and then Martina will be giving a talk on how to contribute to the documentation, say if you're not really interested in contributing to code but you know a bit of statistics, or maybe you just want to contribute to the technical writing part, which is again important. So these talks will help you out with that. And finally, a month from now, we have the sprint; a couple of us on the PyMC team will always be around to help you through pull request reviews, and I think it will be really fun. If you're new to open source, I would really encourage you to join in. Yeah, that's that, and thanks to Reshama for putting together this whole thing; it's been so much work.
Yes, let me turn on my camera. I want to thank the entire PyMC team, because they initiated this sprint, and there's a series of webinars: you can see that we've had two of them already, which are on our YouTube, and we'll have two more upcoming, which will also be available on our YouTube. The Meetup group is the best place to sign up, because that's where you're going to find out about the upcoming events. So, Meenal, I have a couple of questions for you. For instance, for the sprint, how much of a background do people need to have to be able to participate?
I think people's backgrounds can be really varied; you can contribute in various areas. If you know some statistics, you can contribute to documentation. If you've already contributed to open source but you don't know much about probabilistic programming, there are still housekeeping issues you can contribute to, a bug here, a typo there, anything. So as long as you're slightly comfortable using GitHub and you have the local setup, you'll be able to make at least some meaningful contribution, and all of us will be there, so we'll try to help you out according to whatever you can best help with. So I don't think background will be much of a problem: with any background, as long as you have your GitHub setup and everything, which is something we want you to do beforehand, it's all good from there.
Great. If anybody has any questions on the upcoming sprint, feel free to post them in the chat. It will also be helpful for us, because it means we'll be able to include that information in the event; if you have questions, other people probably have the same questions as well.
Oh, by the way, Austin, we were speaking before, and the question that I asked you, which would be great to share with the group, is: would you talk just a little bit about the new version of PyMC, which is a major release, for people who weren't here before?

Sure, sure. If you go and look around on the internet, even on my website for example, you'll see a lot of materials that reference PyMC3, and it's been quite an evolution to get from PyMC3 to what we're currently calling PyMC, which under the covers is the fourth version of PyMC. The saga has been that
I talked about Aesara here for a while. A previous version of Aesara was supported by a research group, I believe out of the University of Montreal; it was called Theano, one of the first tensor libraries for deep learning, and you can think of it as an intellectual precursor to your TensorFlows and your PyTorches and your Keras and all that. A couple of years back, that lab announced that Theano would no longer be supported. PyMC was casting around for a new tensor backend; it dabbled in PyTorch and MXNet and TensorFlow for a while, and eventually some great PyMC folks decided they would rather just take over Theano and continue to maintain it, renaming it Aesara to show that it's a new project, and continue to evolve it not just to meet the needs of PyMC, although that's a big part, but to make it an interesting, modern tensor package in and of itself. We've taken that opportunity to call this PyMC v4 and to introduce some modernizing changes, around a focus on ArviZ and a focus on xarray, that were not present ten years ago or whenever PyMC3 was started. We took that backend change as an opportunity to make some lightly breaking changes
to make PyMC a little more modern. And I do say lightly breaking, because by and large most of the code that I have on my website that runs on PyMC3 will run on PyMC now if you just change Theano to Aesara in a few of the right places. So it was more an opportunity to modernize when changing backends than anything else, I would say, and it's really exciting stuff to work with. The Aesara folks are doing really great and interesting work, not only to support PyMC but in tensor computation in general: you can do things like use C as a backend, you can use JAX as a backend, and I think there may be a Numba backend for your Aesara code; there are all sorts of interesting, lower-level, programming-language-focused things they're doing there. So it is the foundation of PyMC v4 for sure, but it's so much more, and worth thinking
about in its own right.

I just want to read a comment that Oriol, who is a maintainer for PyMC, put in regarding the sprint: "As an example of how wide the range is, our docs have a glossary of statistical terms to which you can contribute with no Python or PyMC knowledge, only stats." I think what's really unique about this PyMC sprint is that people from all different backgrounds, in terms of what knowledge they have, whether it's statistics or Python or whatever, can contribute to it. You don't have to be a PyMC expert to learn how to contribute to PyMC. I'm just going to turn my camera off and I'll let Austin answer a few of the questions that are in the chat.
All right, folks, let me scroll up a little bit. PyMC3 is not being continued; I forget to what extent it's going to get bug fixes. I think that is addressed either on the PyMC website or on the PyMC Discourse, which is quite an active forum we have for answering questions. But no, most of the active development is happening on PyMC v4, and, like I said, most models written in PyMC3 require only some very light porting, like five minutes of work, to get them running on v4. It's all designed to be as seamless as possible, but we can't promise it's 100 percent seamless.
Meenal shared something to answer Ariel's question, I believe. Then, Claudia: does the definition of custom probabilities change substantially? Yes, that is one of the things that does change under the covers. I've talked a lot about the user-facing interface, which has not changed a ton, a little bit with the advent of ArviZ and the emphasis on xarray, but the implementation of PyMC distributions under the covers has changed a fair bit. If you've done it before, it's not that hard to learn the new way, but it is not exactly what it was like in v3, so if you've implemented a custom distribution, there's a new pattern to follow. I would say it's a better pattern, now that Aesara has taken what Theano did and made it more suitable for PyMC, so there's a newer, better pattern to follow there, but there is a little bit of work if you've done something like implement a custom distribution.

And, as you say, hopefully it will become better documented; that's something that can definitely be helped in this sprint. My experience contributing new distributions is that a lot of it is finding a distribution that's similar, looking at the patterns there, adapting them to your situation, and testing. I think documentation for adding new distributions is a place where our docs could definitely use some improvement, so it would be a great sprint activity to contribute to.
It would also be a great Discourse topic. Benjamin has asked: are there noteworthy, significant differences between probabilistic programming packages, a reason why you would choose PyMC over Stan? I'll answer this on a couple of different levels. Mathematically, they're all striving toward the similar ideal of adaptive HMC. When you really get down to the nuts and bolts of it, there are going to be some algorithmic differences in how they implement it, but by and large those should not matter to 95 or 99.9 percent of people; I certainly couldn't tell you what they are off the top of my head, and I use PyMC quite a bit. Stan versus PyMC depends: do you want an all-Python interface to your probabilistic programming language? Is that important for your application? If so, PyMC might be a better fit for you. Stan has its own model definition language. The strengths and weaknesses of each as a software library come down to things like the language you use to specify models, whether that's pure Python for PyMC or Stan's domain-specific language, which is kind of a subset of C++; neither is better than the other. Stan is great software; we are actually great friends with the folks on the Stan team and get dinner with them all the time when I'm in New York. It just depends on your application: where you need to run this code and what kind of tasks you're doing with it. I tend to live in Jupyter notebooks, I like everything to be concentrated in Python, and I like to be able to make my contributions in Python and not in C or C++, so I've chosen PyMC3, and PyMC as it's now renamed. But I don't fault anyone who chooses to go down the Stan rabbit hole either; I think they both have their strengths and weaknesses, and both are NumFOCUS-supported projects. There's a lot of cross-pollination of ideas, particularly manifested in ArviZ: ArviZ is a visualization and diagnostic library intended to be shared between PyMC, Stan, and several other inference engines. So it really gets down to the nitty-gritty of how you choose any software package: what your environment is like, what your needs are.
Chandra asks about simulating conditional probability: can I share resources about that? There's no resource that comes to mind; it's more about taking those Lego building blocks and putting them together. If you had a specific question in mind, say high-school conditional probability problems, probably involving colored balls in urns or something like that, I'd say the best way to get help is to take a stab at it yourself. Maybe you solve it, great; maybe you don't, no worries, there's a learning curve here. Bring your partially worked example to Discourse; there's an entire group of really friendly people there, which sometimes includes me when I have the time, and that community is very helpful in getting your questions answered. I don't know that there's a definitive reference for those sorts of problems; my advice in general would be to take a stab at it and then turn to Discourse with your partial progress, and the folks there are extremely friendly and will help you drive it to resolution. And Daniel points out, yes, Allen Downey's book, Think Bayes I think is the one you're referencing, is an excellent reference for those sorts of problems.
Austin, there are a couple of questions that I just put in the Q&A chat. One is from Luke, and he writes: hi Austin, great talk; have you played around with the infinite Monty Hall problem with PyMC? So, infinite doors rather than just three of them.
Well, PyMC does not deal well with infinities, just like you can't specify an infinitely long NumPy vector; if you think of Aesara as fancy NumPy (that's not technically correct, but it's spiritually correct), it's hard to answer that question computationally. And mathematically, when I put my mathematician hat on, because I am trained as a mathematician, it depends on what kind of limiting process you're using to get to infinitely many doors and what the right answer is there. So the answer is no, I have not done much thinking about it. When you start to deal with things that are infinite in PyMC, or in any probabilistic programming language (this isn't an inherent limitation of PyMC; you run into the same problems with Stan), for example when you try to implement things like Gaussian processes or Dirichlet processes where there are potentially infinitely many parameters, it generally takes a lot of hard work to find good approximations or truncations. So I think that's a subtle question, and the answer is: not really.
really um and then uh
there's sorry there's another question in the q and a tab maybe you answered it but i'm not sure does the best yes i i i
think i did get get to that one so if i come back i probably have time for about one more chandra's final question maybe
from the chat so pomegranate mc pyro tf probability how is it different
from pimc or stan yeah so these are all different things that let you specify
basically likelihood functions right and some of them are just specifying likelihood functions some are
performing inference on all on likelihood functions um so
they are all adjacent right if you think about pi mc it's a way to specify a model
perform inference on that model and then rvs is a way to visualize and quantify the uh the
inference you've done on that model uh each of these hits some or all of those
things that pi mc does and has flexibility more or less flexibility
in some of those places right there's no one probabilistic programming package to rule them all right so there's a couple
different things you need to specify your priors you need to specify your observational likelihood you need to
perform inference you need to draw conclusions from that inference if you think of those things as kind of layers
in getting value out of any probability model uh each of these packages speaks
to one or more of those layers it does then it does each of those layers differently
that has strengths and weaknesses in terms of completeness and flexibility that vary by those layers and so
uh not really possible to give an exhaustive answer here but cover some of the same ground is the
answer does some things more or less flexibly does some things more or less user
friendly right one thing that i love to buy mc is when you look at i said m is equal to pm.normal you know m as a
string zero one that's quite close to the mathematical notation you would use to write that model down very user
friendly is what pi mc goes for things like uh mc make you write a log
likelihood function on your own tf.probability a little less user friendly
um so just you know depends which trade-off is right for you but they play in similar areas and you can assemble
some of those other tools together to cover the same use cases as you would imc here it all depends on where you
need flexibility and where you need need or want simplicity and ease of use
i think that that's probably a uh a good place to call it a day and thank you again reishma and data
umbrella team and meanwhile and oriole on the pi mc side for organizing things again it's been a a real pleasure to
speak to all of you here today
Oh, Reshama, I think you're muted.

I wanted to thank you, Austin, for your presentation.