Utils Module
Analysis utilities for molecular dynamics simulation post-processing.
Provides functions for statistical analysis, histogram/distribution conversion, PCA-based free-energy landscapes, block averaging, and data loading helpers built on top of NumPy, SciPy, and MDAnalysis.
- SciKit.utils.block_bootstrap(data, blocksize=1000, nsamples=1000, statistic=<function rms>)[source]
Estimate the standard deviation of a statistic via circular block bootstrap.
Accounts for temporal autocorrelation in MD time series by resampling contiguous blocks rather than individual frames. Uses the circular (wraparound) variant to avoid boundary effects at the end of the array.
- Parameters:
data (array_like) – 1-D time-ordered array of data points.
blocksize (int, optional) – Number of consecutive frames in each resampled block (default
1000). Should be larger than the autocorrelation time of the observable.nsamples (int, optional) – Number of independent bootstrap replicates to generate (default
1000).statistic (callable, optional) – Function applied to each bootstrap replicate to compute the quantity of interest. Must accept a 1-D array and return a scalar. Default is
rms(); other useful choices includenp.meanornp.std.
- Returns:
sd – Bootstrap standard deviation of
statisticacross all replicates, computed withddof=1.- Return type:
float
Notes
Each bootstrap replicate is built by drawing random start indices, extracting blocks of length
blocksizewith circular index wrapping (index % n), and concatenating until the sample reaches lengthn. The final sample is trimmed to exactlynframes.Choosing
blocksizeis the main tuning parameter: too small underestimates the error; too large increases variance of the estimator. A common heuristic is to set it to roughly 5–10× the autocorrelation time.Examples
>>> sd = block_bootstrap(rg_timeseries, blocksize=500, statistic=np.mean) >>> print(f'Bootstrap std of <Rg>: {sd:.4f} Å')
- SciKit.utils.block_mean(data, division)[source]
Estimate the mean and uncertainty of a time series using two-block averaging.
Discards an initial fraction of the data as equilibration, then splits the remaining production region into two equal blocks. The reported average is the mean of the two block means and the error is their standard deviation — a simple variance estimator that accounts for serial correlation in MD observables.
- Parameters:
data (array_like) – 1-D sequence of scalar values ordered along simulation time.
division (int) – Controls the equilibration cutoff: the first
1/divisionof the data is discarded. The remaining(1 - 1/division)fraction is split into two equal blocks for the error estimate. For example,division=5discards the first 20 % of frames.
- Returns:
average (float) – Mean of the two block averages.
error (float) – Standard deviation of the two block averages (half the difference between them), representing the statistical uncertainty.
Examples
>>> avg, err = block_mean(rg_timeseries, division=5) >>> print(f'Rg = {avg:.2f} ± {err:.2f} Å')
- SciKit.utils.get_overlap(x1, y1, x2, y2)[source]
Find the x values present in both datasets, and return the corresponding y values from each dataset at those shared x positions.
No interpolation is performed — only x values that literally appear in both x1 and x2 are included in the output. The result is independent of argument order: get_overlap(x1, y1, x2, y2) and get_overlap(x2, y2, x1, y1) will return the same x_overlap (though y1_overlap and y2_overlap will swap).
- Parameters:
x1 (array-like) – X-coordinates of the first dataset (assumed sorted ascending, no duplicates).
y1 (array-like) – Y-values corresponding to x1.
x2 (array-like) – X-coordinates of the second dataset (assumed sorted ascending, no duplicates).
y2 (array-like) – Y-values corresponding to x2.
- Returns:
x_overlap (np.ndarray) – X-values present in both x1 and x2, sorted ascending.
y1_overlap (np.ndarray) – Y-values from dataset 1 at x_overlap.
y2_overlap (np.ndarray) – Y-values from dataset 2 at x_overlap.
- Raises:
ValueError – If x1 and x2 share no common x values.
Example
>>> x1 = [1, 2, 3, 4, 5] >>> y1 = [10, 20, 30, 40, 50] >>> x2 = [3, 4, 5, 6, 7] >>> y2 = [300, 400, 500, 600, 700] >>> x, y1_out, y2_out = get_overlap(x1, y1, x2, y2) >>> # x = [3, 4, 5], y1_out = [30, 40, 50], y2_out = [300, 400, 500]
- SciKit.utils.mda_pca(psf, dcd, sel, align=True)[source]
Perform PCA on an MD trajectory and return the first two principal components.
Loads a CHARMM PSF + DCD trajectory pair, selects atoms, and runs MDAnalysis PCA. Returns the per-frame projections onto PC1 and PC2 along with their individual explained-variance fractions.
- Parameters:
psf (str) – Path to the CHARMM PSF topology file.
dcd (str) – Path to the DCD trajectory file.
sel (str) – MDAnalysis atom-selection string applied to both alignment and PCA (e.g.,
'backbone','name CA').align (bool, optional) – If
True(default), align all frames to the first frame before computing the covariance matrix.
- Returns:
pc1 (numpy.ndarray, shape (n_frames,)) – Per-frame projection onto the first principal component.
pc2 (numpy.ndarray, shape (n_frames,)) – Per-frame projection onto the second principal component.
var1 (float) – Explained variance fraction of PC1 (value in [0, 1]).
var2 (float) – Explained variance fraction of PC2 (value in [0, 1]).
Notes
var1is taken directly frompc.cumulated_variance[0].var2is the incremental variance of PC2, computed ascumulated_variance[1] - cumulated_variance[0]. Multiply by 100 to obtain a percentage.Examples
>>> pc1, pc2, var1, var2 = mda_pca('system.psf', 'traj.dcd', 'backbone') >>> print(f'PC1: {var1*100:.1f}% PC2: {var2*100:.1f}%')
- SciKit.utils.mylinregress(xs, ys, xrange=None)[source]
Perform linear regression on (x, y) data, optionally restricted to a specific x-range.
- Parameters:
xs (array-like) – Independent variable values.
ys (array-like) – Dependent variable values, same length as xs.
xrange (tuple of (float, float), optional) – If provided, only data points with x in [xrange[0], xrange[1]] are plotted.
- Returns:
r_value (float) – Correlation coefficient of the linear fit.
x_fit (np.ndarray) – X values for the fitted line, either spanning the full range of xs or limited to xrange if provided.
y_fit (np.ndarray) – Corresponding y values of the fitted line at x_fit.
- SciKit.utils.myload(filename, *args, **kwargs)[source]
Load a whitespace-delimited text file and unpack each column as a separate array.
Wraps
numpy.loadtxtand returns each column as an individual element of a tuple, making it convenient to unpack multi-column data files with a single assignment. Up to eight trailing0sentinels are appended so that callers can safely unpack into more variables than there are columns without raising aValueError.- Parameters:
filename (str or path-like) – Path to the text file to load.
*args – Additional positional arguments forwarded to
numpy.loadtxt(e.g.,skiprows).**kwargs – Additional keyword arguments forwarded to
numpy.loadtxt(e.g.,delimiter,usecols,dtype).
- Returns:
A tuple whose first
nelements are 1-Dnumpy.ndarrayobjects, one per column in the file, followed by eight integer zeros as padding. The total length is thereforen + 8.- Return type:
tuple
Notes
The eight trailing zeros allow patterns like:
col1, col2, *_ = myload('data.txt')
without the caller needing to know the exact column count. However, the padding means
len(myload(...))is alwaysn_cols + 8regardless of the actual file width.Examples
>>> time, rg, energy, *_ = myload('observables.dat', skiprows=1) >>> plt.plot(time, rg)
- SciKit.utils.pca2d(axs, psf, dcd, sel, num_bin=100, align=True, cmap='viridis')[source]
Compute and plot a 2-D PCA free-energy landscape in a single call.
Convenience wrapper that combines
mda_pca(),pca2fe(), andplt_pca()into one step. Loads the trajectory, runs PCA, converts to a free-energy surface, and renders the result directly ontoaxs. Prefer this function for quick exploratory plots; use the individual functions when you need intermediate results (e.g.,pc1/pc2for further analysis).- Parameters:
axs (matplotlib.axes.Axes) – Target axes on which the landscape will be drawn.
psf (str) – Path to the CHARMM PSF topology file.
dcd (str) – Path to the DCD trajectory file.
sel (str) – MDAnalysis atom-selection string (e.g.,
'backbone').num_bin (int, optional) – Number of bins along each PCA axis (default
100).align (bool, optional) – If
True(default), align frames before PCA.cmap (str or matplotlib.colors.Colormap, optional) – Colormap for the free-energy surface (default
'viridis').
Examples
>>> fig, axs = general_temp(1, 1, 6, 5) >>> pca2d(axs, 'system.psf', 'traj.dcd', 'backbone', num_bin=80)
- SciKit.utils.pca2fe(pc1, pc2, num_bin=100)[source]
Convert 2-D PCA projections to a free-energy landscape in units of kBT.
Bins
(pc1, pc2)pairs into a 2-D histogram, normalises to a PMF, and applies the Boltzmann inversionF = -ln(P). The global minimum of the finite free-energy values is shifted to zero.- Parameters:
pc1 (array_like, shape (n_frames,)) – Per-frame projection onto the first principal component.
pc2 (array_like, shape (n_frames,)) – Per-frame projection onto the second principal component.
num_bin (int, optional) – Number of bins along each axis of the 2-D histogram (default
100).
- Returns:
fe – Free-energy surface in units of kBT, with the global minimum set to 0. Bins with zero occupancy contain
inf.- Return type:
numpy.ndarray, shape (num_bin, num_bin)
Notes
The local minimum (highest-occupancy bin) coordinates in data space are computed internally but not currently returned. Retrieve them by inspecting
xedges/yedgesif needed in future extensions.Examples
>>> fe = pca2fe(pc1, pc2, num_bin=80) >>> plt.imshow(fe.T, origin='lower')
- SciKit.utils.plt_pca(axs, fe, var1=None, var2=None, cmap='viridis', contour=False, levels=5)[source]
Plot a 2-D free-energy landscape from PCA projections.
Displays the free-energy array as a colour-mapped image with an attached colorbar. Axis tick marks are suppressed so that the plot reflects relative rather than absolute PC coordinates. Optionally overlays iso-energy contour lines.
- Parameters:
axs (matplotlib.axes.Axes) – Target axes on which the landscape will be drawn.
fe (numpy.ndarray, shape (N, N)) – Free-energy surface in units of kBT, as returned by
pca2fe(). Transposed internally before display.var1 (float or None, optional) – Explained variance fraction of PC1 (value in [0, 1]). If provided together with
var2, percentage labels are added to the axis. DefaultNoneomits the percentage.var2 (float or None, optional) – Explained variance fraction of PC2 (value in [0, 1]). Default
Noneomits the percentage.cmap (str or matplotlib.colors.Colormap, optional) – Colormap for the free-energy surface (default
'viridis').contour (bool, optional) – If
True, overlay black iso-energy contour lines (defaultFalse).levels (int, optional) – Number of contour levels drawn when
contour=True(default5).
Examples
>>> fig, axs = general_temp(1, 1, 6, 5) >>> fe = pca2fe(pc1, pc2) >>> plt_pca(axs, fe, var1=0.35, var2=0.12, contour=True, levels=8)
- SciKit.utils.remove_zero(xs, ys)[source]
Find the x-range that spans the non-zero support of a distribution.
Scans
ysfrom both ends to locate the first and last non-zero values, then returns the correspondingxspositions offset by ±5 index positions to retain a small margin around the support.- Parameters:
xs (array_like) – 1-D array of x-axis values (e.g., bin centers).
ys (array_like) – 1-D array of y-axis values (e.g., PDF or PMF). Must be the same length as
xs.
- Returns:
x_nozero (1-D array) – x values within the non-zero support of the distribution.
y_nozero (1-D array) – Corresponding y values within the non-zero support.
Examples
>>> x, y = scatter2hist(distances, num_bin=50, styles='pdf') >>> x_new, y_new = remove_zero(x, y) >>> ax.plot(x_new, y_new)
- SciKit.utils.rms(data)[source]
Compute the root-mean-square (RMS) value of an array.
- Parameters:
data (array_like) – 1-D or N-D array of numeric values.
- Returns:
RMS value:
sqrt(mean(data**2)).- Return type:
float
Examples
>>> rms([1, -2, 3, -4]) 2.7386...
- SciKit.utils.scatter2hist(point_list, num_bin, styles)[source]
Convert a 1-D array of scatter points to a histogram or smooth distribution.
Computes a normalized histogram from raw data and returns either a stepped distribution (suitable for
ax.plotstep-style rendering) or a cubic/quadratic spline-smoothed curve, depending onstyles. Zero-valued padding is appended to both ends so the distribution reaches zero at the boundaries.- Parameters:
point_list (array_like) – 1-D sequence of scalar data points to histogram.
num_bin (int) – Number of histogram bins. For smoothed styles, the output x-grid is upsampled to
num_bin * 10points.styles ({'pdf', 'pmf', 's_pdf', 's_pmf'}) –
Output type:
'pdf': Stepped probability density function (area under curve ≈ 1).'pmf': Stepped probability mass function (sum of values ≈ 1).'s_pdf': Cubic spline-smoothed PDF curve.'s_pmf': Quadratic spline-smoothed PMF curve.
- Returns:
x (numpy.ndarray) – Bin-center positions (or smoothed x-grid for
's_pdf'/'s_pmf'), with one extra zero-padded point prepended and appended.y (numpy.ndarray) – Corresponding PDF or PMF values, zero-padded at both ends.
- Raises:
ValueError – If
stylesis not one of the four accepted strings.
Notes
The smoothed variants use
scipy.interpolate.make_interp_splinewith spline orderk=3(PDF) ork=2(PMF). Negative spline artefacts near the zero-padded boundaries are possible for smallnum_binvalues.Examples
>>> x, y = scatter2hist(distances, num_bin=50, styles='pdf') >>> ax.plot(x, y)
>>> x_smooth, y_smooth = scatter2hist(distances, num_bin=50, styles='s_pdf') >>> ax.plot(x_smooth, y_smooth)
- SciKit.utils.stack_jagged(arrays, model='min')[source]
Stack a list of 1D arrays as columns, aligning on the overlap.
- Parameters:
arrays (list of array-like) – Input arrays or lists, can have different lengths.
model ({'min', 'max'}, optional) – ‘min’ : trim all arrays to the shortest length (default). ‘max’ : pad shorter arrays with NaN to match the longest length.
- Returns:
2D array of shape (min_len, n_arrays) if model=’min’, or (max_len, n_arrays) if model=’max’.
- Return type:
np.ndarray