Writing your own fits

bayesmsd can fit any parametric shape of MSD curves; however, utilizing that full flexibility requires a little bit of coding. This tutorial will walk you through the implementation of a new fit model, TwoLocusRouseFit. Note that a production version of this class is available in bayesmsd.lib; after working through the—somewhat streamlined—implementation in this example, you can check its source code to see the last finishing touches.

For this example we will implement a fit to the following functional expression: \begin{equation}\label{eq:MSD} \text{MSD}(\Delta t) = 2\sigma^2 + 2\Gamma\sqrt{\Delta t}\left(1-\text{e}^{-\frac{\tau}{\Delta t}}\right) + 2J\,\text{erfc}\,\sqrt{\frac{\tau}{\Delta t}}\,,\tag{1} \end{equation} where \(\tau\equiv\frac{1}{\pi}\left(\frac{J}{\Gamma}\right)^2\).

If \(x_1(t)\) and \(x_2(t)\) are the spatial positions of two loci at fixed positions along a polymer, the above expression is the MSD we should expect to see for the relative position \(y(t)\equiv x_2(t) - x_1(t)\), assuming the polymer dynamics follow the Rouse model.

So how do we fit something like this? bayesmsd provides the whole fitting machinery in form of an abstract base class Fit, which we can subclass to implement a new fit model. Upon doing so, our main task is to override the method params2msdm(), which gives the functional form of the MSD dependent on the parameters. Of course we also have to specify which parameters we need and how they relate to each other. Let’s get started!

Class definition and constructor

import numpy as np
from scipy.special import erfc

import bayesmsd
from noctiluca.analysis import MSD

class TwoLocusRouseFit(bayesmsd.Fit):
    def __init__(self, data, motion_blur_f=0):
        super().__init__(data)
        self.motion_blur_f = motion_blur_f
...

So far, not much happened; we called the constructor of Fit, which will take care of input handling, and also populates the attributes

  • self.d: number of spatial dimensions in the dataset,

  • self.T: length of the longest trajectory.

The motion_blur_f argument will be used below to take into account finite imaging exposure times; for now we just carry it along.

Now we have to specify some details about this fit. Specifically: set self.ss_order, populate self.parameters, and potentially self.constraints.

...     self.ss_order = 0

With the MSD given above, we are always in a steady state of order 0; for other fitting schemes, this attribute might be determined from the input.

...
        self.parameters = {
            'log(σ²)' : bayesmsd.Parameter(bounds=(-np.inf, np.inf)),
            'log(Γ)'  : bayesmsd.Parameter(bounds=(-np.inf, np.inf)),
            'log(J)'  : bayesmsd.Parameter(bounds=(-np.inf, np.inf)),
            'log(τ)'  : bayesmsd.Parameter(bounds=(-np.inf, np.inf)),
        }
...

We initialize all the parameters that appear in the expression \eqref{eq:MSD} and provide their domains; since we work in log-space, all parameters have the whole real line as domain; so we just set the bounds to infinity.

Having defined the parameters, we can now fix any pre-determined relationships between them:

...
        def fix_tau(params):
            return 2*(params['log(J)']-params['log(Γ)']) - np.log(np.pi)
        self.parameters['log(τ)'].fix_to = fix_tau
...

Note that in this example we could also have omitted \(\tau\) from the parameter list and just hardcoded this identity, since \(\tau\) should never be independent. Sticking with this implementation, we would at least move fix_tau() to be a static method, instead of defining it at runtime (c.f. summary below).

Finally, we can list any inequality constraints that the fit parameters should satisfy by default.

...     self.constraints = []

For the Rouse MSD in this example there are no additional constraints to be satisfied. Note that if you do not reset it, self.constraints defaults to [self.constraint_Cpositive]. This constraint function is provided by the Fit baseclass and ensures positive definiteness of the covariance matrix. It is costly to evaluate, so if positive definiteness is guaranteed by the shape of the MSD (as it is here) you can omit it. Other notable examples:

  • a powerlaw MSD with \(\alpha\in[0, 2)\) is positive definite

  • for a spline interpolation we have no way of ensuring positive definiteness, so use constraint_Cpositive

MSD definition

Onwards to the key part: how to calculate an MSD from the parameters we set up above.

...
    def params2msdm(self, params):
        sigma2 = np.exp(params['log(σ²)'])
        Gamma  = np.exp(params['log(Γ)'])
        J      = np.exp(params['log(J)'])
        tau    = np.exp(params['log(τ)'])

        @bayesmsd.deco.MSDfun
        @bayesmsd.deco.imaging(noise2=sigma2,
                               f=self.motion_blur_f,
                               alpha0=0.5,
                              )
        def msd(dt, G=G, J=J, tau=tau):
            return (  2*G*np.sqrt(dt)*(1-np.exp(-tau/dt))
                    + 2*J*erfc(np.sqrt(tau/dt))   )

        return self.d*[(msd, 0)]
...

That’s already the full definition of this function. After assigning local variable names to the parameters (mainly for readability) we define the MSD function and then return it. Let’s take a look at some details, starting from the end:

  • params2msdm should return a separate MSD function for each of the self.d dimensions in the data set. In this simplified implementation of TwoLocusRouseFit we assume that all dimensions are identical, so we can just copy the same values to all dimensions.

  • Beyond the MSD, params2msdm should also return the first moment for the Gaussian process, which corresponds to a constant offset (for ss_order = 0, i.e. stationary processes) or a drift term (for ss_order = 1, i.e. increment stationary processes). For the most part this should just be zero; this is why we return (msd, 0) tuples instead of just the msd functions for each dimension.

  • In defining the msd function

    def msd(dt, G=G, J=J, tau=tau)
    

    we pass all the parameters as default arguments. This is important to get the scoping right; default arguments are evaluated at definition time, whereas references to external variables (i.e. if we did not use this default argument construction) are evaluated at execution time. By then the values might have changed from what we wanted to use in the definition.

  • The bayesmsd.deco.imaging() decorator adds imaging artifacts to the MSD, specifically motion blur and localization error. Motion blur is determined from the fractional exposure \(f\in[0, 1]\)—this is the ratio of shutter time over frame time, i.e. \(f=0\) is ideal stroboscopic illumination, \(f=1\) is continuous illumination—and the effective short time MSD scaling \(\alpha_0\). The latter is just the log-slope of the “true” MSD (no motion blur or localization error) at a time lag of 1 frame, which for our Rouse model example is always 0.5.

  • Finally, the bayesmsd.deco.MSDfun decorator extends the validity of the time lag dt to the real line. Note how we did not have to take any precautions against division by 0 or square roots of negative numbers in defining the msd() function. Indeed, the MSDfun decorator guarantees that the dt argument is always a numpy array with strictly positive values: anything else will be patched using the analytical identities \(\text{MSD}(-\Delta t)\equiv \text{MSD}(\Delta t)\) and \(\text{MSD}(\Delta t) = 0\).

Initial values

A Fit should be able to come up with a rough initial guess for parameters from the data. This can be more or less sophisticated, but of course the better the initial values, the faster the fit will converge, so it’s always good to give a best effort estimate here. We will resort to eye-balling the empirical MSD (exactly the methodology that bayesmsd is set to improve upon).

...
    def initial_params(self):
        e_msd = MSD(self.data) / self.d

        J = np.nanmean(np.concatenate([traj[:]**2 for traj in data],
                                      axis=0,
                                     ))
        G = np.nanmean(e_msd[1:5]/np.sqrt(np.arange(1, 5)))
        sigma2 = e_msd[1]/2

        return {
            'log(σ²)' : np.log(sigma2),
            'log(Γ)'  : np.log(G),
            'log(J)'  : np.log(J),
        }
...

Note that this implementation is not safe against any of the first 5 MSD points missing. While this is fine in most realistic situations, the library implementation adds a few additional checks here.

Prior

bayesmsd being a Bayesian method, we have to specify a prior over the parameter space. This does not have to be a normalized probability distribution, i.e. can be an improper prior. By default, Fit assumes a uniform prior over the specified parameter space:

...
    def logprior(self, params):
        return 0
...

Since we are using log parameters, this translates to a \(\frac{1}{x}\) prior over \(\sigma^2\), \(\Gamma\), and \(J\), which is appropriate for (a priori) scale free, positive parameters. Often, it is more convenient to transform the parameter space and use a uniform prior, than putting some complicated prior on the parameters.

Summary: full implementation

Collecting all the bits from the previous paragraphs, here is the full implementation of TwoLocusRouseFit. Note that the library implementation in bayesmsd.lib.TwoLocusRouseFit adds a few bells and whistles to this, most notably dimensionally independent parameters. Still, the below should provide a good baseline for implementing your own fits.

[1]:
import numpy as np
from scipy.special import erfc

import bayesmsd
from noctiluca.analysis import MSD

class TwoLocusRouseFit(bayesmsd.Fit):
    def __init__(self, data, motion_blur_f=0):
        super().__init__(data)
        self.motion_blur_f = motion_blur_f

        self.ss_order = 0

        self.parameters = {
            'log(σ²)' : bayesmsd.Parameter(bounds=(-np.inf, np.inf)),
            'log(Γ)'  : bayesmsd.Parameter(bounds=(-np.inf, np.inf)),
            'log(J)'  : bayesmsd.Parameter(bounds=(-np.inf, np.inf)),
            'log(τ)'  : bayesmsd.Parameter(bounds=(-np.inf, np.inf)),
        }

        self.parameters['log(τ)'].fix_to = TwoLocusRouseFit.fix_tau

        self.constraints = []

    @staticmethod
    def fix_tau(params):
        return 2*(params['log(J)']-params['log(Γ)']) - np.log(np.pi)

    def params2msdm(self, params):
        sigma2 = np.exp(params['log(σ²)'])
        Gamma  = np.exp(params['log(Γ)'])
        J      = np.exp(params['log(J)'])
        tau    = np.exp(params['log(τ)'])

        @bayesmsd.deco.MSDfun
        @bayesmsd.deco.imaging(noise2=sigma2,
                               f=self.motion_blur_f,
                               alpha0=0.5,
                              )
        def msd(dt, G=G, J=J, tau=tau):
            return (  2*G*np.sqrt(dt)*(1-np.exp(-tau/dt))
                    + 2*J*erfc(np.sqrt(tau/dt))   )

        return self.d*[(msd, 0)]

    def initial_params(self):
        e_msd = MSD(self.data) / self.d

        J = np.nanmean(np.concatenate([traj[:]**2 for traj in data],
                                      axis=0,
                                     ))
        G = np.nanmean(e_msd[1:5]/np.sqrt(np.arange(1, 5)))
        sigma2 = e_msd[1]/2

        return {
            'log(σ²)' : np.log(sigma2),
            'log(Γ)'  : np.log(G),
            'log(J)'  : np.log(J),
        }

    # logprior is uniform by default, so can just omit it