Analysis Workflow

This guide describes the internal workflow of Ensemble Analyzer, following the data from the initial input through the refinement pipeline to the final property generation.

1. Input & Initialization

The workflow begins by parsing the command-line arguments and initializing the core managers.

1.1 Data Loading

The launch.py entry point triggers the loading phase:

  • Ensemble Loading: The geometry file (e.g., .xyz) is parsed into a list of Conformer objects.

  • Protocol Loading: The JSON protocol is deserialized into a list of Protocol objects, defining the sequence of computational steps (e.g., Optimization \(\to\) Frequency \(\to\) Single Point).

  • Configuration: Global settings (temperature, CPU count, solvent models) are loaded into the CalculationConfig object.
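The ensemble-loading step can be pictured with a minimal sketch. It assumes the standard multi-structure XYZ layout (atom count, comment line, then one line per atom) and uses a hypothetical `Conformer` dataclass and `parse_multi_xyz` helper as stand-ins; the real classes in Ensemble Analyzer may store more fields and use different names.

```python
from dataclasses import dataclass, field


@dataclass
class Conformer:
    """Hypothetical stand-in for the real Conformer class."""
    index: int
    comment: str
    symbols: list = field(default_factory=list)
    coords: list = field(default_factory=list)  # (x, y, z) tuples


def parse_multi_xyz(text):
    """Split a multi-structure XYZ string into a list of Conformer objects."""
    lines = text.splitlines()
    conformers = []
    pos = 0
    while pos < len(lines):
        if not lines[pos].strip():  # tolerate blank lines between frames
            pos += 1
            continue
        n_atoms = int(lines[pos].split()[0])  # first line of a frame: atom count
        comment = lines[pos + 1].strip()      # second line: free-form comment
        conf = Conformer(index=len(conformers) + 1, comment=comment)
        for raw in lines[pos + 2 : pos + 2 + n_atoms]:
            symbol, x, y, z = raw.split()[:4]
            conf.symbols.append(symbol)
            conf.coords.append((float(x), float(y), float(z)))
        conformers.append(conf)
        pos += 2 + n_atoms
    return conformers
```

Calling `parse_multi_xyz(open("ensemble.xyz").read())` on a file with several frames would return one object per frame, in file order (the file name here is only an example).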

1.2 Initial Analysis

Before starting the refinement loop, if the ensemble contains sufficient structures (\(N > 30\)), an initial Principal Component Analysis (PCA) is performed on the input geometries to visualize the starting conformational space coverage.


2. The Refinement Loop

The core logic is handled by the CalculationOrchestrator, which iterates through each step defined in protocol.json. For each protocol step, the ProtocolExecutor performs the following operations:

2.1 Quantum Mechanical Calculations

For every active conformer that has not yet been calculated at the current level:

  1. Input Generation: The CalculationExecutor generates input files for the external QM engine (ORCA or Gaussian).

  2. Execution: The calculation is run (Optimization, Frequency, or Single Point).

  3. Parsing: Results (Energy, Geometry, Dipoles, Rotational Constants, Vibrational Frequencies) are parsed and stored in the EnergyStore of the conformer.

  4. Checkpointing: The CheckpointManager saves the state atomically after every calculation to prevent data loss.
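One common way to make a save atomic is to write to a temporary file and then rename it over the target. The sketch below illustrates that pattern with the standard library only; the function name `save_checkpoint_atomic` and the JSON layout are assumptions, not the actual CheckpointManager implementation.

```python
import json
import os
import tempfile


def save_checkpoint_atomic(state, path):
    """Write `state` as JSON to `path` so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the target so the final rename
    # stays on a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(state, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)     # swap the new file in place of the old one
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the process is killed mid-write, the previous checkpoint is still intact because only the temporary file was touched.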

2.2 Pruning Stage

After the calculation phase, the PruningManager filters the ensemble to remove high-energy structures and geometrically redundant conformers. This step is critical to reduce computational cost for subsequent, more expensive steps.

The pruning logic consists of two main filters:

  1. Energy Window Filtering: Conformers whose relative energy exceeds the threshold (\(\Delta E > \text{thrGMAX}\)) are immediately deactivated.

    • Parameter: thrGMAX (defined in threshold.json or protocol).

  2. Geometric Filtering (Duplicate Removal): Instead of computationally expensive RMSD alignments, EnAn uses Rotational Constants (\(B\)) and Electronic Energy (\(E\)) as descriptors to identify duplicates. Two conformers \(i\) and \(j\) are considered identical if: \(|\Delta E_{ij}| < \text{thrG} \quad \land \quad |\Delta B_{ij}| < \text{thrB}\)

    • Parameters: thrG (Energy tolerance), thrB (Rotational constant tolerance).

    • Validation: For logged duplicates, an RMSD based on the Euclidean Distance Matrix (EDM) eigenvalues is calculated for verification.

Note: Pruning can be disabled for specific steps by setting "no_prune": true in the protocol.
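The two filters can be sketched in a few lines. This is an illustration under stated assumptions, not the PruningManager code: conformers are plain dicts with hypothetical keys `"name"`, `"energy"` and `"B"`, the rotational descriptor is reduced to a single scalar, units are arbitrary but consistent, and the lowest-energy member of each duplicate pair is the one kept.

```python
def prune(conformers, thr_g_max, thr_g, thr_b):
    """Return the conformers that survive the energy window and duplicate filters."""
    e_min = min(c["energy"] for c in conformers)
    survivors = []
    # Visit conformers from lowest to highest energy, so that the
    # lowest-energy member of each duplicate group is the one retained.
    for conf in sorted(conformers, key=lambda c: c["energy"]):
        # Filter 1: energy window relative to the ensemble minimum.
        if conf["energy"] - e_min > thr_g_max:
            continue
        # Filter 2: duplicate if BOTH descriptors match an already kept conformer.
        is_duplicate = any(
            abs(conf["energy"] - kept["energy"]) < thr_g
            and abs(conf["B"] - kept["B"]) < thr_b
            for kept in survivors
        )
        if not is_duplicate:
            survivors.append(conf)
    return survivors
```

Because both conditions must hold at once, two conformers with nearly equal energies but clearly different rotational constants are both retained.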

2.3 Clustering & Analysis

If clustering is enabled in the protocol ("cluster": true or a specific integer), the ClusteringManager performs an unsupervised structural analysis to group conformers and optionally reduce the ensemble.

The workflow utilizes invariant features to avoid coordinate alignment issues:

  1. Feature Extraction: The eigenvalues of the Euclidean Distance Matrix (EDM) are computed for each conformer. These are invariant to translation and rotation.

  2. PCA (Principal Component Analysis): Dimensionality reduction is applied to the EDM eigenvalues.

  3. K-Means Clustering:

    • If a specific number of clusters is provided, K-Means is run directly.

    • If set to auto, the optimal number of clusters (\(k\)) is determined via Silhouette Score analysis (scanning \(k=2\) to \(k=N_{conf} \times 0.8\)).

  4. Ensemble Reduction: The ensemble is reduced by retaining only the representative conformer (lowest energy) from each cluster.
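The feature-extraction step can be illustrated with NumPy. The sketch below is an assumption about the details: it builds the matrix from plain interatomic distances (squared distances are another common convention, and the source does not say which is used), and the function name `edm_eigenvalues` is hypothetical.

```python
import numpy as np


def edm_eigenvalues(coords):
    """Sorted eigenvalues of the matrix of pairwise interatomic distances."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)  # symmetric (n_atoms, n_atoms) matrix
    return np.linalg.eigvalsh(dist)       # eigenvalues in ascending order


# A water-like geometry, then the same geometry rotated about z and translated.
water = np.array([[0.000, 0.000, 0.117],
                  [0.000, 0.757, -0.467],
                  [0.000, -0.757, -0.467]])
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta), np.cos(theta), 0.0],
                [0.0, 0.0, 1.0]])
moved = water @ rot.T + np.array([1.0, -2.0, 3.0])

same = np.allclose(edm_eigenvalues(water), edm_eigenvalues(moved))
```

The two eigenvalue vectors agree because pairwise distances do not change under rigid motion, which is why no alignment is needed before PCA and K-Means are applied to these features.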

2.4 Spectral Generation

The final stage of a protocol step involves generating continuous spectra from discrete transitions. This is handled by the graph module.

  • Vibrational Spectra (IR, VCD): Convolved using Lorentzian functions.

  • Electronic Spectra (UV-Vis, ECD): Convolved using Gaussian functions.

Population Weighting: All spectra are weighted based on the Boltzmann population of the conformers, calculated using the relative Gibbs Free Energy (\(\Delta G\)) at the specified temperature (\(T\)).
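The weighting and broadening described above can be sketched as follows. The populations follow the usual Boltzmann expression \(p_i = e^{-\Delta G_i / RT} / \sum_j e^{-\Delta G_j / RT}\). Everything else is an assumption made for illustration: energies in kcal/mol, area-normalized line shapes parameterized by their full width at half maximum (`fwhm`), and the function names themselves, which are not taken from the graph module.

```python
import math

R_KCAL = 0.0019872041  # gas constant in kcal/(mol K)


def boltzmann_populations(rel_g_kcal, temperature=298.15):
    """Normalized populations from relative Gibbs free energies (kcal/mol)."""
    g_min = min(rel_g_kcal)
    weights = [math.exp(-(g - g_min) / (R_KCAL * temperature)) for g in rel_g_kcal]
    total = sum(weights)
    return [w / total for w in weights]


def lorentzian(x, x0, fwhm):
    """Area-normalized Lorentzian centered at x0."""
    hwhm = fwhm / 2.0
    return (hwhm / math.pi) / ((x - x0) ** 2 + hwhm ** 2)


def gaussian(x, x0, fwhm):
    """Area-normalized Gaussian centered at x0."""
    sigma = fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))
    return math.exp(-0.5 * ((x - x0) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def weighted_spectrum(grid, conformer_lines, populations, shape, fwhm):
    """Population-weighted sum of broadened stick spectra.

    conformer_lines[i] is a list of (position, intensity) pairs for conformer i.
    """
    spectrum = [0.0] * len(grid)
    for lines, pop in zip(conformer_lines, populations):
        for position, intensity in lines:
            for k, x in enumerate(grid):
                spectrum[k] += pop * intensity * shape(x, position, fwhm)
    return spectrum
```

Passing `lorentzian` as `shape` corresponds to the IR/VCD case and `gaussian` to the UV-Vis/ECD case; the same weighting routine serves both.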

3. Finalization & Output

Once all protocol steps are completed, the CalculationOrchestrator finalizes the workflow:

3.1 Data Export

  • final_ensemble.xyz: A multi-structure XYZ file containing all surviving conformers, sorted by energy.

  • checkpoint.json: A complete state file allowing for restarts or post-processing analysis.

3.2 Comparative Plotting

The plot_comparative_graphs module automatically generates overlay plots (e.g., IR_comparison.png, UV_comparison.png). These plots visualize the evolution of the computed spectra across the different protocol levels (e.g., comparing the spectrum after SP vs OPT+FREQ), allowing for quick assessment of convergence and method dependence.

3.3 Reporting

A summary table is printed to the log (output.out), detailing:

  • Final energies (E, H, G) and ZPVE.

  • Boltzmann populations.

  • Total elapsed time and final retention rate.