HOMEWORK #3 – V3 Solution

  1. Nature’s rejection sampler, 30 points. A light source emits photons whose angle $x$ is normally distributed around $\theta$:

 

$p(x \mid \theta) = \mathcal{N}(x \mid \mu = \theta,\ \sigma^2 = 4^2)$

 

Then the photons pass through a grating which absorbs photons with a probability that depends on their angle. We denote the event that a photon makes it through the grating by saying that the Bernoulli variable $g = 1$. For this particular grating, $p(g \mid x, \theta) = \mathrm{Ber}(g \mid f(x, \theta))$ where

(1.1)  $f(x, \theta) = \dfrac{\sin^2\!\big(5(x - \theta)\big)}{25 \sin^2(x - \theta)}$

 

The resulting unnormalized probability $p(x, g = 1 \mid \theta = 0)$ is shown below as a function of $x$.
[Figure: plot of the unnormalized density $p(x, g = 1 \mid \theta = 0)$ against $x$.]
  • [4 points] Use numerical integration (i.e. a brute-force sum over many values) to estimate the fraction of photons that get absorbed on average, $p(g = 0 \mid \theta = 0)$. That is, estimate $\int p(x, g = 0 \mid \theta = 0)\,dx$ by summing over 10,000 values of $x$, equally spaced from $-20$ to $20$.
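One possible sketch of this brute-force sum, assuming NumPy and SciPy are available; the grid and the handling of the $0/0$ limit of $f$ (which equals 1 wherever $\sin(x - \theta) = 0$) are choices, not part of the assignment:

```python
import numpy as np
from scipy.stats import norm

def f(x, theta):
    """Grating transmission probability f(x, theta); its limit is 1 wherever sin(x - theta) = 0."""
    d = np.asarray(x - theta, dtype=float)
    num = np.sin(5.0 * d) ** 2
    den = 25.0 * np.sin(d) ** 2
    return np.divide(num, den, out=np.ones_like(num), where=den > 1e-12)

theta = 0.0
xs = np.linspace(-20.0, 20.0, 10_000)           # 10,000 equally spaced values of x
dx = xs[1] - xs[0]
p_x = norm.pdf(xs, loc=theta, scale=4.0)        # p(x | theta = 0), sigma = 4
p_g0 = np.sum(p_x * (1.0 - f(xs, theta))) * dx  # approximates the integral of p(x, g = 0 | theta = 0)
print(f"estimated p(g = 0 | theta = 0): {p_g0:.4f}")
```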

 

  • [4 points] Implement a rejection sampler to sample from $p(x \mid g = 1, \theta = 0)$. Hint: since it’s a pmf, $p(g \mid x, \theta = 0) \le 1$. Another hint: doing ancestral sampling in the above model forms a rejection sampler. Plot a histogram of 10,000 accepted samples, and print the fraction of accepted samples.
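A minimal sketch of the ancestral-sampling-as-rejection-sampling idea, assuming matplotlib and reusing the `f` helper defined in the previous sketch:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
theta = 0.0
samples, n_proposed, n_accepted = [], 0, 0
while n_accepted < 10_000:
    x = rng.normal(theta, 4.0, size=1_000)     # ancestral step 1: x ~ p(x | theta)
    g = rng.random(1_000) < f(x, theta)        # ancestral step 2: g ~ Bernoulli(f(x, theta))
    samples.append(x[g])                       # a sample is accepted exactly when g = 1
    n_proposed += 1_000
    n_accepted += int(g.sum())

samples = np.concatenate(samples)[:10_000]
print(f"fraction of accepted samples: {n_accepted / n_proposed:.4f}")
plt.hist(samples, bins=100, density=True)
plt.xlabel("x")
plt.title("Samples from p(x | g = 1, theta = 0)")
plt.show()
```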

 

  • [4 points] Implement a self-normalized importance sampler to estimate the fraction of all photons that get absorbed, $p(g = 0 \mid \theta = 0)$. Use $p(x \mid \theta = 0)$ as a proposal distribution. As a reminder, the formula for a self-normalized importance sampler for the expectation of a function $f$ over an (unnormalized) target distribution $\tilde{p}(z)$ and proposal distribution $q(z)$ is:

 

 

(1.2)  $\hat{e}(x_1, x_2, \ldots, x_K) = \dfrac{1}{K} \displaystyle\sum_{i=1}^{K} f(x_i)\, \dfrac{\tilde{p}(x_i)/q(x_i)}{\frac{1}{K}\sum_{j=1}^{K} \tilde{p}(x_j)/q(x_j)}, \qquad x_i \overset{\text{iid}}{\sim} q(x)$

 

If you can derive a simpler expression by canceling terms, you don’t have to implement the entire formula. Print the estimate of the fraction of photons that are absorbed, using 1000 samples.
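One possible sketch, again reusing `f` from the first sketch. With the proposal equal to $p(x \mid \theta = 0)$ the importance weights cancel, so the self-normalized estimator reduces to a plain Monte Carlo average of $1 - f(x, 0)$ (one reading of the “simpler expression” hint):

```python
import numpy as np

rng = np.random.default_rng(1)
theta, K = 0.0, 1_000
xs = rng.normal(theta, 4.0, size=K)   # x_i ~ q(x) = p(x | theta = 0)
weights = np.ones(K)                  # w_i = p~(x_i) / q(x_i) = 1 when the proposal is the prior over x
estimate = np.sum(weights * (1.0 - f(xs, theta))) / np.sum(weights)
print(f"SNIS estimate of p(g = 0 | theta = 0): {estimate:.4f}")
```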

 

(d) [4 points] A physicist is trying to estimate the location $\theta$ of the center of a light source. She has a Cauchy prior $p(\theta) = \dfrac{1}{10\pi\left(1 + (\theta/10)^2\right)}$. She observes a photon emitted from the particle at position $x = 1.7$. Plot the unnormalized density $p(x = 1.7, g = 1, \theta)$ as a function of $\theta$.
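A minimal plotting sketch, assuming SciPy/matplotlib, reusing `f` from above, and taking the Cauchy prior to have location 0 and scale 10 as written; the plotting range for $\theta$ is an arbitrary choice:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import cauchy, norm

x_obs = 1.7
thetas = np.linspace(-15.0, 15.0, 2_000)
prior = cauchy.pdf(thetas, loc=0.0, scale=10.0)       # p(theta), Cauchy with scale 10
likelihood = norm.pdf(x_obs, loc=thetas, scale=4.0)   # p(x = 1.7 | theta)
transmit = f(x_obs, thetas)                           # p(g = 1 | x = 1.7, theta)
plt.plot(thetas, prior * likelihood * transmit)       # unnormalized p(x = 1.7, g = 1, theta)
plt.xlabel("theta")
plt.ylabel("unnormalized density")
plt.show()
```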

 

  • [10 points] Write a Metropolis-Hastings sampler to sample from $p(\theta \mid x = 1.7, g = 1)$. For the proposal distribution, use a Gaussian centered around the current sample. Fiddle with the proposal distribution and the number of iterations of your sampler until the histogram of samples looks roughly like the true unnormalized density. Plot the histogram of samples from your chain.
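One possible Metropolis-Hastings sketch with a Gaussian random-walk proposal; the proposal width, chain length, and burn-in below are placeholders to fiddle with, and the `f` helper from the first sketch is reused:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import cauchy, norm

def log_unnorm_post(theta, x_obs=1.7):
    """log p(theta) + log p(x_obs | theta) + log p(g = 1 | x_obs, theta), up to an additive constant."""
    lp = cauchy.logpdf(theta, loc=0.0, scale=10.0)
    lp += norm.logpdf(x_obs, loc=theta, scale=4.0)
    lp += np.log(f(np.atleast_1d(x_obs), np.atleast_1d(theta))[0] + 1e-300)
    return lp

rng = np.random.default_rng(2)
n_iters, step = 50_000, 2.0                       # knobs to tune by eye
chain = np.empty(n_iters)
cur, cur_lp = 0.0, log_unnorm_post(0.0)
for t in range(n_iters):
    prop = cur + step * rng.normal()              # Gaussian proposal centred on the current sample
    prop_lp = log_unnorm_post(prop)
    if np.log(rng.random()) < prop_lp - cur_lp:   # Metropolis accept/reject (symmetric proposal)
        cur, cur_lp = prop, prop_lp
    chain[t] = cur

plt.hist(chain[1_000:], bins=100, density=True)   # drop a little burn-in
plt.xlabel("theta")
plt.title("MH samples from p(theta | x = 1.7, g = 1)")
plt.show()
```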

 

  • [4 points] The scientist wants to know how confident she should be, based on her observation of the photon, that her instrument is calibrated to within 3 degrees around zero. Use samples from your MH chain to estimate the posterior probability $p(-3 < \theta < 3 \mid x = 1.7, g = 1)$.
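Given the `chain` array from the sketch above, one way to estimate this probability is the fraction of post-burn-in samples that fall in the interval:

```python
burned = chain[1_000:]
post_prob = np.mean((burned > -3.0) & (burned < 3.0))
print(f"estimated p(-3 < theta < 3 | x = 1.7, g = 1): {post_prob:.3f}")
```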

 

  2. Gradient estimators, 20 points. All else being equal, it’s useful for a gradient estimator to be unbiased. The unbiasedness of a gradient estimator guarantees that, if we decay the step size and run stochastic gradient descent for long enough (see Robbins & Monro), it will converge to a local optimum.

 

The standard REINFORCE, or score-function, estimator is defined as:

 

(2.1)  $\hat{g}_{\mathrm{SF}}[f] = f(b)\,\dfrac{\partial}{\partial\theta}\log p(b \mid \theta), \qquad b \sim p(b \mid \theta)$

 

  • [5 points] First, let’s warm up with the score function. Prove that the score function has zero expectation, i.e. $\mathbb{E}_{p(x \mid \theta)}\!\left[\nabla_\theta \log p(x \mid \theta)\right] = 0$. Assume that you can swap the derivative and integral operators. (A brief sketch of the standard argument appears after this list.)

 

  • [5 points] Show that REINFORCE is unbiased: $\mathbb{E}_{p(b \mid \theta)}\!\left[f(b)\,\frac{\partial}{\partial\theta}\log p(b \mid \theta)\right] = \frac{\partial}{\partial\theta}\,\mathbb{E}_{p(b \mid \theta)}[f(b)]$.
  • [5 points] Show that REINFORCE with a fixed baseline is still unbiased, i.e. show that $\mathbb{E}_{p(b \mid \theta)}\!\left[\left(f(b) - c\right)\frac{\partial}{\partial\theta}\log p(b \mid \theta)\right] = \frac{\partial}{\partial\theta}\,\mathbb{E}_{p(b \mid \theta)}[f(b)]$ for any fixed $c$.

 

  • [5 points] If the baseline depends on $b$, then REINFORCE will in general give biased gradient estimates. Give an example where $\mathbb{E}_{p(b \mid \theta)}\!\left[\left(f(b) - c(b)\right)\frac{\partial}{\partial\theta}\log p(b \mid \theta)\right] \neq \frac{\partial}{\partial\theta}\,\mathbb{E}_{p(b \mid \theta)}[f(b)]$ for some function $c(b)$, and show that it is biased.
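For part (a), a brief sketch of the standard argument, using the assumed derivative/integral swap:

$$
\mathbb{E}_{p(x \mid \theta)}\!\left[\nabla_\theta \log p(x \mid \theta)\right]
= \int p(x \mid \theta)\,\frac{\nabla_\theta\, p(x \mid \theta)}{p(x \mid \theta)}\,dx
= \nabla_\theta \int p(x \mid \theta)\,dx
= \nabla_\theta\, 1
= 0 .
$$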

 

The takeaway is that you can use a baseline to reduce the variance of REINFORCE, but not one that depends on the current action.

 

  3. Comparing variances of gradient estimators, 25 points. If we restrict ourselves to consider only unbiased gradient estimators, then the main (and perhaps only) property we need to worry about is the variance of our estimators. In general, optimizing with a lower-variance unbiased estimator will converge faster than with a high-variance unbiased estimator. However, which estimator has the lowest variance can depend on the function being optimized. In this question, we’ll look at which gradient estimators scale to large numbers of parameters, by computing their variance as a function of dimension.

 

For simplicity, we’ll consider a toy problem. The goal will be to estimate the gradients of the expectation of a sum of $D$ independent one-dimensional Gaussians. Each Gaussian has unit variance, and its mean is given by an element of the $D$-dimensional parameter vector $\theta$:

 

(3.1)  $f(x) = \displaystyle\sum_{d=1}^{D} x_d$

(3.2)  $L(\theta) = \mathbb{E}_{x \sim p(x \mid \theta)}[f(x)] = \mathbb{E}_{x \sim \mathcal{N}(\theta, I)}\!\left[\displaystyle\sum_{d=1}^{D} x_d\right]$

 

  • [4 points] As a warm-up, compute the variance of a single-sample simple Monte Carlo estimator of the objective $L(\theta)$:

 

(3.3)  $\hat{L}_{\mathrm{MC}} = \displaystyle\sum_{d=1}^{D} x_d, \qquad \text{where each } x_d \overset{\text{iid}}{\sim} \mathcal{N}(\theta_d, 1)$

That is, compute $\mathbb{V}\!\left[\hat{L}_{\mathrm{MC}}\right]$ as a function of $D$.
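A one-line sketch for this warm-up, using only the independence of the $x_d$ and their unit variance:

$$
\mathbb{V}\!\left[\hat{L}_{\mathrm{MC}}\right]
= \mathbb{V}\!\left[\sum_{d=1}^{D} x_d\right]
= \sum_{d=1}^{D} \mathbb{V}[x_d]
= D .
$$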

 

 

(b) [5 points] The score-function, or REINFORCE, estimator with a baseline has the form:

 

(3.4)  $\hat{g}_{\mathrm{SF}}[f] = \left[f(x) - c(\theta)\right]\dfrac{\partial}{\partial\theta}\log p(x \mid \theta), \qquad x \sim p(x \mid \theta)$

 

Derive a closed form for this gradient estimator for the objective defined above as a deterministic function of $\epsilon$, a $D$-dimensional vector of standard normals. Set the baseline to $c(\theta) = \sum_{d=1}^{D} \theta_d$. Hint: when simplifying $\frac{\partial}{\partial\theta}\log p(x \mid \theta)$, you shouldn’t take the derivative through $x$, even if it depends on $\theta$. To help keep track of what is being differentiated, you can use the notation $\frac{\partial}{\partial\theta} g(\theta, \theta)$ to denote taking the derivative only w.r.t. the second $\theta$.

  • [8 points] Derive the variance of the above gradient estimator. Because gradients are $D$-dimensional vectors, their covariance is a $D \times D$ matrix. To make things easier, we’ll consider only the variance of the gradient with respect to the first element of the parameter vector, $\theta_1$. That is, derive the scalar value $\mathbb{V}\!\left[\hat{g}^{\mathrm{SF}}_1\right]$ as a function of $D$. Hint: the third moment of a standard normal is 0, and the fourth moment is 3. As a sanity check, consider the case where $D = 1$.

 

  • [8 points] Next, let’s look at the gold standard of gradient estimators, the reparameterization gradient estimator, where we reparameterize $x = T(\theta, \epsilon)$:

 

(3.5)  $\hat{g}_{\mathrm{REPARAM}}[f] = \dfrac{\partial f}{\partial x}\dfrac{\partial x}{\partial \theta}, \qquad \epsilon \sim p(\epsilon)$

 

In this case, we can use the reparameterization $x = \theta + \epsilon$, with $\epsilon \sim \mathcal{N}(0, I)$. Derive this gradient estimator for $\nabla_\theta L(\theta)$, and give $\mathbb{V}\!\left[\hat{g}^{\mathrm{REPARAM}}_1\right]$ as a function of $D$.
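One way to sanity-check the derivations in parts (b)–(d) numerically (pure NumPy; the helper name `per_coordinate_gradients` and the sample sizes are just for illustration). It simulates both estimators for the first coordinate, with the baseline $c(\theta) = \sum_d \theta_d$ from part (b), and prints their empirical variances as $D$ grows:

```python
import numpy as np

rng = np.random.default_rng(3)

def per_coordinate_gradients(theta, n_samples=100_000):
    """Per-sample gradient estimates w.r.t. theta_1 for the score-function and reparam estimators."""
    D = theta.shape[0]
    eps = rng.normal(size=(n_samples, D))
    x = theta + eps                            # x ~ N(theta, I) via the reparameterization
    f_x = x.sum(axis=1)                        # f(x) = sum_d x_d
    baseline = theta.sum()                     # c(theta) = sum_d theta_d
    score_1 = x[:, 0] - theta[0]               # d/dtheta_1 of log N(x | theta, I)
    g_sf_1 = (f_x - baseline) * score_1        # equation (3.4), first coordinate
    g_rep_1 = np.ones(n_samples)               # equation (3.5): df/dx_1 * dx_1/dtheta_1 = 1
    return g_sf_1, g_rep_1

for D in (1, 10, 100):
    theta = rng.normal(size=D)
    g_sf_1, g_rep_1 = per_coordinate_gradients(theta)
    print(f"D = {D:3d}   Var[g_SF_1] ~ {g_sf_1.var():8.3f}   Var[g_REPARAM_1] ~ {g_rep_1.var():.3f}")
```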

 

  4. Complete course quiz, 5 points. We’re planning to update the prerequisites for this course next year. To help us plan, go to this course’s page on quercus, and fill out the short quiz on which courses you took before this one, and how well they prepared you for the different parts of this course.

 
