Modeling cryo-EM structures in alternative states with AlphaFold2-based models and density-guided simulations - Communications Chemistry

Here, we present an approach (Fig. 2) that combines generative artificial intelligence (AI) methods and flexible fitting to refine protein structures based on sequence information and low- to medium-resolution cryo-EM data. Taking advantage of recent developments in neural network-based structure prediction18, we use stochastic subsampling of the multiple sequence alignment (MSA) depth in AlphaFold219 to generate a broader ensemble of potential starting structures, then run density-guided MD simulations from cluster representatives to select an optimized structure for a given experimental density. We demonstrate the applicability of this approach to three test cases, all membrane proteins that undergo conformational changes between experimentally resolved states. As reference, default density-guided simulations in GROMACS failed to accurately fit the structure built in one state to the density resolved in another. In contrast, using our new approach, we successfully resolve state-dependent differences including the bending of an individual helix, rearrangement of neighboring helices, and reformation of an entire domain, demonstrating applicability to a range of systems and conformational dynamics.

As a straightforward approach to incorporating AI-generated models with flexible fitting to determine challenging protein structures (Fig. 2 -- right), we first used stochastic subsampling of input MSA depth in AlphaFold2 to produce 1250 models for each of three protein test systems. To filter out substantially misfolded models, we prioritized those scoring better than (below) -100 on the basis of generalized orientation-dependent all-atom potential (GOAP) structure-quality scoring, as described in Methods. To identify a limited set of plausible models representative of the generated ensemble, we then aligned the filtered models to the structure in the already-known state, and clustered them by the k-means method on the basis of Cartesian coordinates. Finally, we used the model closest to each cluster centroid (cluster representative) for fitting to an experimental density, termed the target state.

After rigid-body alignment to the relevant density, we subjected each representative model to density-guided MD simulations. Across each simulation, we monitored the cross-correlation to the target map as a model-fitting metric, and selected the simulation with the highest mean cross-correlation for further analysis. The implementation of adaptive force scaling for fitting in GROMACS means the run will continue until the overall forces of the system are too large to be compatible with the integration time step, which can introduce distortions. To instead identify a balanced frame, we also monitor the GOAP score as a quality metric for structural geometry, as done in our previous study. After normalizing both the fitting and geometry metrics to [0, 1], the sum of the normalized GOAP score and cross-correlation was computed as a compound score for each simulation frame, with higher scores representing a combination of good fit (high correlation) and geometry (minimal penalty due to overfitting). We selected the frame with the highest compound score as a final model.

To compare our approach to a more standard flexible fitting protocol, we performed parallel density-guided MD simulations for each experimental map using a single experimental structure of the same protein resolved under different conditions, termed the known state (Fig. 2 -- left). This scenario is common in investigation of challenging targets, in which a structure has been determined in one functional state, but deviates substantially from an experimental density representing a different state. As for many experimental structures, for two of our three test systems fitting required interpolation of loops not reported in the initial known model, a step rendered unnecessary in our generated ensemble approach described above. In both ensemble-based and single-model approaches, we refrained from applying secondary structure restraints. Such restraints are commonly used in density-guided simulations to prevent nonphysical structure disruptions, but can limit the accommodation of conformational transitions -- for example, kinking of a helix, rearrangement among secondary structure elements, or remodeling of an entire domain.

We tested our approach by modeling publicly available experimental densities for three membrane proteins: the calcitonin receptor-like receptor (CLR, EMD 20906, 2.3 Å resolution), L-type amino acid transporter (LAT1, EMD 30841, 3.4 Å resolution) and alanine-serine-cysteine transporter 2 (ASCT2, EMD 12142, 3.4 Å resolution) (Table 1). These structures were not present in the AlphaFold2 training set. We tested fitting to densities with or without a 1 Å Gaussian blur applied, and observed comparable convergence to target structures (Supplementary Fig. 1). To emphasize the applicability of our approach to medium-resolution maps, we focused subsequent analyses on fitting to the blurred densities. In each case, an initial known-state model for standard flexible fitting was chosen from an experimental structure determined in a different state than the target density (CLR, PDB ID 7KNT; LAT1, PDB ID 6IRS; ASCT2, PDB ID 6RVX). For CLR, the known-state structure differed from the target-state map in the bending of a single helix; for LAT1, it differed in the arrangement of two neighboring helices; for ASCT2, it exhibited a substantial conformational transition involving most of the transmembrane helices (Fig. 3 - above). Subsequent to density-guided simulations, we used the deposited structure associated with each target state (CLR, PDB ID 6UVA; LAT1, PDB ID 7DSQ; ASCT2, PDB ID 7BCQ) as a reference for fitting quality (Fig. 3 - below). Although such a structure would not be available in an anticipated application of our approach to a newly reconstructed density, it served as a valuable ground truth for validation of this work; we did not use the RMSD values as a criterion to select the final model from simulations.

For cases where no structure in any alternative state is available, an alternative approach could be to cluster the initial AI-generated ensemble using k-medoids on the basis of pairwise internal distances. In parallel trials of all three of our test cases using this technique (Supplementary Fig. 2), RMSD from the target structure converged to similar values as with our primary approach, indicating this can be a useful alternative in relevant systems. Both k-means and k-medoids clustering appeared robust in enabling selection of starting models for density-guided simulations that converge to low RMSD-to-target: in the cluster producing the final best frame for each test system, the five structures closest to the centroid exhibited a spread in RMSD-to-target of ≤ 0.5 Å (Supplementary Table 1). Both clustering approaches were also effective in seeding distinct conformations. Among the five best-correlated density-guided simulations in each system, the frames with highest compound score sample different parts of structure space: the three different systems exhibit spreads in RMSD-to-target from 1.0 Å to 6.5 Å (Supplementary Table 2). In all cases, when selecting the highest-scoring frame from the simulation with the best mean cross-correlation it produced the lowest RMSD-to-target (or within 0.15 Å from the lowest in one case), indicating that both clustering approaches were effective in distinguishing starting models capable of converging to the target.

On the other hand, AI-based de-novo prediction without AlphaFold2 was challenging in all test cases: predictions using the state-of-the-art tool ModelAngelo included only a subset (43-85%) of each protein (Supplementary Table 3), leaving portions of key transitional regions unbuilt (Supplementary Fig. 3). Substantial subsequent user input would be required to complete model building and fitting, in contrast to the final models from our approach, which covered the entire input sequences. Accordingly, we focused the remainder of our study on documenting our approach based on AI-driven generation of a substantial and diverse ensemble, followed by clustering and screening of density-guided simulations to identify representative models capable of fitting to the target state.

We first investigated the characteristic conformational transition of the G-protein coupled calcitonin receptor-like receptor (CLR), a component of both the calcitonin gene-related peptide receptor (CGRPR) and adrenomedullin (AM) receptor, involved in wound healing, vasodilation and other widespread physiological functions. A notable conformational transition between functional states of CLR transmembrane domain occurs in the 6th transmembrane helix (TM6, residues 329-354 in the target structure), which is relatively straight in the inactive state without G protein (Fig. 4A -- left), but bends upon activation (Fig. 4A -- right). This rearrangement remodels the intracellular binding interface at the N-terminal end of TM6, allowing it to interact with the Gs-protein α subunit. We used an inactive structure of CLR transmembrane domain, extracted from the larger CGRPR (PDB ID 7KNT), as the known state. The cryo-EM density from an active state of CLR, extracted from the larger AM receptor (EMD 20906), served as a target for model refinement.

The bulk of our generated ensemble fell within the cutoff for good quality geometry (Supplementary Fig. 4). Note that deviation from the target state was monitored as an assessment of the ensemble quality, but it is not used for the refinement. After filtering by GOAP score and k-means clustering in MDAnalysis, several (12/20) cluster representative models produced density-guided simulations with mean cross-correlations over 0.5 (Fig. 4C); a frame within 2.1 ns in the best-fit trajectory had the highest compound score (Fig. 4D, E) and was chosen as a final model.

We then compared our approach to standard flexible fitting starting from the known structure of CLR in an inactivated state. The final model generated in our approach deviated from the activated experimental structure (target state) by only 1.25 Å Cα RMSD (Table 2, Fig. 3A), and included the active-state kink (Fig. 4G), such that even local deviation in TM6 was limited to 1.6 Å (Supplementary Fig. 5). In contrast, the best model from standard density-guided simulations of the inactive structure (known state) deviated 2.39 Å from the target globally, and by 6.42 Å locally, even over 20 independent replicates (Table 2, Fig. 3A, Supplementary Fig. 6). Similarly, another flexible fitting method not based on GROMACS, cryo_fit, also did not result in good agreement between the fitted model and the target structure (global RMSD from the target of 3.27 Å after fitting, Supplementary Table 3). Compared to standard known-state fitting, our approach also produced moderately better model quality (-95.9 vs. -91.2 scaled GOAP, 0.23 vs. 1.36 Clash, 1.41 vs. 1.88 MolProbity scores) and fit to the target density (0.79 vs. 0.75 cross-correlation) (Table 2).

Visual inspection of fitted models indicated that the relative success of the generative modeling approach was largely attributable to TM6. The best-fit cluster representative from our generated ensemble, as well as the final model from subsequent density-fitting, included the TM6 kink associated with the active target state (Fig. 4G, Supplementary Fig. 7A, B). Local deviation in TM6 was limited to 1.6 Å relative to the target structure (Supplementary Fig. 5), and key side chains at the intracellular interface were correctly oriented, despite the absence of the Gαs protein binding partner during fitting (Supplementary Fig. 7A, C). Conversely, TM6 remained largely straight after density-guided simulations of the inactive known structure (Fig. 4F), such that local deviation from the target structure in this region was 6.42 Å (Table 2 and Supplementary Fig. 5). Attempted fitting of straight TM6 to the density resulted in partial unwinding of the helix N-terminus, positioning different residues at the G-protein interface than in the target structure (Supplementary Fig. 7B, D). Thus, generated ensemble models offer distinct advantages in the case of a local helix transition between functional states of CLR.

To test our approach in context of a larger conformational transition, we next applied it to LAT1, the light-chain component of a heteromeric amino-acid transporter. Together with its heavy-chain binding partner 4F2hc, LAT1 carries hormones and drugs across the blood-brain barrier, and is overexpressed in several cancers. A member of solute carrier family 7 (also designated SLC7A5), LAT1 has a 12-TM LeuT fold, and cycles between various inward, occluded, and outward states in a rocking bundle mechanism to transport solutes across the plasma membrane. We used a structure of LAT1 in the inward-open configuration, extracted from the heterodimeric complex (PDB ID 6IRS), as the known state (Table 1). As a target, we used a cryo-EM density (EMD 30841) thought to represent an intermediate between outward-occluded and outward-open states, taken from a complex with the inhibitor 3,5-diiodo-L-tyrosine (diiodo-Tyr). The principal distinction between the known and target states involved rearrangement of the kinked helices TM1 and TM6 (residues 47-80 and 240-265 respectively in the target structure) around the diiodo-Tyr binding site (Fig. 5A, Supplementary Fig. 8). The protein density was reported at lower overall resolution than that of CLR (3.4 Å vs. 2.3 Å), offering an additional test case for lower-quality data. We still added 1 Å Gaussian blur to the density.

The results indicated that our LAT1 target density was more challenging than CLR, but could be accurately modeled. Our generated ensemble for LAT1 included more models that exceeded the cutoff for reasonable quality geometry (Supplementary Fig. 4); on the other hand, LAT1 models deviated less than 4 Å (Cα RMSD) from either the known or target (PDB ID 7DSQ) structures, indicating the presence of conformations similar to both states (Fig. 5B). Cluster representatives were successfully fitted using density-guided MD simulations, with 8/20 representatives showing over 0.5 mean cross-correlation (Fig. 5C); a final model was selected within 1.8 ns in the best-fit simulation on the basis of compound score (Fig. 5D, E).

As in the case of CLR, our approach resulted in better fit and geometry metrics than standard density fitting of the known LAT1 structure. The final model based on our generated ensemble was within 1.44 Å Cα RMSD from the inhibited target structure (PDB ID 7DSQ), while the best model based on fitting the inward-open known structure deviated by 2.33 Å for GROMACS density fitting (Table 2, Fig. 3B) or 2.96 Å for cryo_fit tool. Our pipeline model also deviated by only 1.52 Å in the TM1/TM6 region, and was compatible with binding diiodo-Tyr, despite the absence of this inhibitor model and its density during fitting (Table 2, Supplementary Fig. 8). Conversely, the known-state fitted model deviated over 4.08 Å (Table 2) in the TM1/TM6 region, and oriented a Tyr residue to clash with the anticipated inhibitor pose. Compared to standard known-state fitting, our generative ensemble setup produced modestly better model quality (-103.6 vs. -100.2 scaled GOAP, 1.41 vs. 1.98 Clash, 1.34 vs. 2.01 MolProbity scores) and fit to the target density (0.59 vs. 0.52 cross-correlation). Thus, our approach facilitated modeling in the context of a multi-helix conformational transition, including a ligand binding site in a functional intermediate.

To test the applicability of our approach to a more distributed transition between known and target states, we investigated ASCT2, a homotrimeric member of solute carrier family 1 (SLC1A5) with an 8-TM GltPh fold. This transporter is widely expressed in human tissues, and upregulated in numerous cancers, making it an important target for drug development as well as biophysical characterization. ASCT2 is structurally and mechanistically distinct from LAT1, consisting of a scaffold domain comprising helices TM1-2 and TM4-5, and a transport domain comprising TM3 and TM6-8 along with two interhelical hairpins. Transport occurs by a one-gate elevator mechanism, in which the transport domain moves between inward and outward states relative to the scaffold. We used a structure of an ASCT2 monomer in an inward-open configuration, extracted from the trimeric complex (PDB ID 6RVX), as the known state (Table 1). As a target, we used a cryo-EM density (EMD 12142) determined to similar resolution as LAT1 (3.4 Å) in an outward-open configuration, taken from the homotrimeric complex in the presence of the inhibitor 4-(4-phenylphenyl)carbonyloxypyrrolidine-2-carboxylic acid (Lc-BPE). Conformational cycling in this system involved evident rearrangements among all helices and hairpins; accordingly, we considered the entire monomer as a transition region (Fig. 6A).

Despite its relatively distributed transition between known and target states, our approach was similarly successful in modeling ASCT2 as previous test cases. Our generated ensemble included an intermediate fraction with low quality geometry (Supplementary Fig. 4); however, models within 2 Å Cα RMSD of the known structure deviated at least 10 Å from the target (PDB ID 7BCQ) and vice versa, consistent with substantial rearrangements observed between the two states (Fig. 6B). Most (16/20) cluster representatives could be fit with mean cross-correlation to the target map over 0.5 (Fig. 6C), with the final model selected within 0.8 ns in the best-fit simulation (Fig. 6D, E).

Fitting of ASCT2 was substantially more accurate using our ensemble approach than starting only from the known structure (Fig. 3C). The final model based on our generated ensemble was within 3.65 Å Cα RMSD from the outward-open target structure, while the best model based on fitting the inward-open known structure deviated by 10.16 Å for GROMACS density fitting or 11.73 Å for cryo_fit tool (Table 2 and Supplementary Table 3). The overall RMSD value converged to similar values with or without blurring the original cryo-EM maps during our simulation in the ensemble approach Supplementary Fig. 1. When aligned on the scaffold domain (TM1-2, TM4-5), all transmembrane helices in our final model were largely superimposable with the target structure (Fig. 6F); for the fitted known state, rearrangements were evident in helices comprising the transport domain (Fig. 6G). As in previous cases, our approach also produced modestly better model quality (-99.8 vs. -95.8 scaled GOAP, 0.62 vs. 1.23 Clash, 1.06 vs. 1.63 MolProbity scores) and fit to the target density (0.63 vs. 0.57 cross-correlation). Thus, generated ensemble model fitting appeared particularly useful in the context of a conformational transition across multiple transmembrane helices.

Interestingly, the initial ensemble for ASCT2 included a model with a modestly better RMSD relative to the target structure (2.35 Å) than the final result from our k-means-based pipeline (3.65 Å) (Fig. 6), or from our alternative k-medoids clustering (3.46 Å, Supplementary Fig. 2). As shown in previous sections, this issue did not appear with CLR or LAT1 where the best member of the starting ensemble included a model with 1.37 Å and 1.44 Å RMSD, respectively. After running our pipeline, our final models deviated equally or less from our k-means-based pipeline (1.25 Å and 1.44 Å, respectively), and from our alternative k-medoids clustering (1.46 Å and 1.64 Å, respectively) (Figs. 4 and 5 and Supplementary Fig. 2). Although in the case of ASCT2 our clustering method appeared to exclude a particularly optimal model, this hit could not have been identified given our presumed advance knowledge of only the target density, not the target structure. Alternatively, we attempted to select models approximating the target structure directly from the filtered ensemble, prior to clustering or simulations, on the basis of cross-correlation to the target density; however, the best-correlated models for ASCT2 were visibly different, and exhibited relatively high RMSD (9.57-11.39 Å), with respect to the target structure (Supplementary Fig. 9). Thus, knowledge of the target density was insufficient to identify plausible models prior to clustering and simulations, supporting the overall utility of our approach.

Motivated by challenging characteristics particularly of the ASCT2 test case, we also evaluated alternative approaches to selecting a final model. Although GOAP is an established metric for model quality particularly for simulations in GROMACS, MolProbity is a common alternative in structural biology. Monitoring MolProbity vs. GOAP scores during density-guided simulations resulted in final models with similar RMSDs relative to the target structure (3.39 Å vs. 3.65 Å respectively), indicating either metric could be effectively applied. Perhaps more critically, our primary approach relied on a compound score based on cross-correlation and GOAP scores normalized within the best-correlated density-guided simulation; an alternative strategy would be to normalize each value across all simulations of a given system. For CLR and LAT1, such global normalization selected precisely the same frame as our primary approach (Supplementary Fig. 10). For ASCT2, global normalization selected a model from a different trajectory than our primary approach, with a modestly better GOAP score but worse cross-correlation. Notably, RMSD with respect to the target structure was 8.26 Å for the final model using global normalization, vs. 3.65 Å when normalizing within each trajectory, indicating that our approach yields an equivalent or better model in all our test cases.

Modeling cryo-EM structures in alternative states with AlphaFold2-based models and density-guided simulations - Communications Chemistry

POPULAR CATEGORY

misc

entertainment

corporate

research

wellness

athletics