Young KRAB-zinc finger gene clusters are highly dynamic incubators of ERV-driven genetic heterogeneity in mice - Nature Communications


De novo Mus musculus assemblies reveal a much larger KZFP gene cluster at the end of Chromosome 4

KZFP genes are organized in genomic clusters on several chromosomes in the Mus musculus genome (Fig. 1b). While some KZFP gene clusters primarily comprise old genes shared across species, others, such as the double-cluster on Chromosome 12 (Chr12), harbor genes entirely unique to mice. A few clusters, like those at the end of Chr2 and Chr4, contain at least one KZFP gene shared with other rodent species while mostly encoding KZFP genes unique to mice. The KZFP cluster at the distal end of Chr4 stands out for several reasons. This KZFP gene cluster appears to be specific to the Murinae clade, likely originating in the last common ancestor of rats and mice (Fig. 1c). Despite the conserved syntenic block defined by Tnfrsf8 and Miip genes flanking this locus, the cluster is absent in other closely related muroids, such as gerbils (Fig. 1c, Supplementary Fig. 1b). Comparative analysis between mouse and rat reveals that this region expanded significantly in the mouse lineage, acquiring multiple new KZFP genes. Finally, this cluster has been repeatedly implicated in studies mapping modifier loci for variably regulated ERVs across mouse strains, making it an excellent model to explore the evolution and diversification of KZFP gene clusters. Extensive analyses of this locus have been hindered by persistent sequence gaps in this region even in the current GRCm39 reference assembly, the size and content of which have remained largely undefined (Fig. 1d, Supplementary Fig. 1c).

We therefore generated de novo genome assemblies to fill the gaps in this locus and in other young KZFP gene clusters for two widely used laboratory mouse strains, C57BL/6J (BL6J) and 129S1/SvImJ (129S1), by combining the high sequencing accuracy of PacBio HiFi sequencing with the ultralong reads of ONT sequencing (Fig. 2a). While our primary goal was to resolve gaps in KZFP gene cluster loci, the resulting assemblies achieved high overall quality as assessed by Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis, with several contigs spanning entire chromosomes (Fig. 2b, c, Supplementary Fig. 2a, b). To refine the assemblies, we retained and strand-corrected contigs aligned to known locations in the GRCm39 reference, removing unplaced and redundant nested contigs (Supplementary Fig. 2a, b).

Focusing on the KZFP gene cluster on Chr4, we uncovered over 2.5 Mb of previously unassembled sequence in BL6J, resolving five gaps in the GRCm39 reference. This expanded the total cluster size from 2.8 Mb to 5.4 Mb and revealed new regions of high sequence similarity within this locus (Fig. 2d). De novo transcriptome assembly, combining published RNA-seq data with newly generated PacBio IsoSeq data, allowed us to curate and complete the annotation of this locus. We identified 38 previously unannotated KZFP genes and resolved the full sequence of three genes (Zfp986, Zfp993, and Gm21411) whose zinc finger arrays were incomplete due to sequence gaps (Fig. 2d, Supplementary Data 1).

The identification of novel sequence nearly doubling the size of this locus underscores a key limitation of interpreting genomic datasets using the GRCm39 reference assembly, where short-read mapping of ChIP-seq and RNA-seq experiments can result in artificial read pileups when reads derived from gap sequences are misaligned to the highly similar sequences available in the reference assembly (Supplementary Fig. 2c, d).

Interestingly, the Chr4 KZFP gene cluster is even larger in the 129S1 strain than in BL6J, spanning 6.9 Mb and containing 83 KZFP genes (Fig. 2e, Supplementary Data 1). The accuracy of the de novo assemblies at the Chr4 KZFP gene cluster (as well as at the double-cluster on Chr12) was confirmed by inspecting the alignment of strain-specific long ONT reads (>100 kb) over these loci in the corresponding assembly (Supplementary Fig. 3).

Complete telomere-to-telomere (T2T) assemblies for the C57BL/6J and CAST/EiJ (CAST) mouse strains have recently become available, allowing us to extend our comparative analysis to a third mouse strain. Applying a similar gene annotation and curation strategy to the CAST Chr4 KZFP gene cluster, we identified 86 KZFP genes in a 6.6 Mb region (Supplementary Data 1). Importantly, despite differences in sequencing and assembly strategies, our BL6J assembly is structurally identical to the T2T C57BL/6J assembly at the Chr4 KZFP gene cluster, as well as at other young clusters (Supplementary Fig. 4). This suggests that the large differences in KZFP gene cluster size observed between the three mouse strains are due to strain-specific, rather than individual, locus divergence.

Sequence comparison of the BL6J, 129S1, and CAST Chr4 KZFP gene clusters revealed that this locus is fundamentally heterogeneous between the three mouse strains, beyond the size difference (Fig. 3a, b). While the beginning and end of the cluster are conserved, most of the locus is rearranged, with some regions of sequence similarity scrambled across the cluster and variably duplicated in the three strains. While the overall sequence comparison hints at high divergence of this locus in the three mouse strains, detailed comparison of the curated KZFP gene annotation in this cluster further revealed a disparate KZFP gene repertoire (Fig. 3c, d). Fingerprint amino acids - corresponding to the amino acids at the positions -1, +2, +3, and +6 within each zinc finger according to helical nomenclature - are major determinants of the KZFP DNA binding specificity as they directly contact the target nucleotide sequence. Thus, we focused on the arrays of fingerprint amino acids of the coding KZFP genes and identified the repertoire of distinct fingerprint arrays in the Chr4 cluster of each mouse strain (Supplementary Data 2). Several fingerprint arrays were shared across multiple KZFPs, and we identified 38, 47, and 45 distinct Chr4 KZFP fingerprint arrays in the BL6J, 129S1, and CAST strains, respectively. Only a small number of fingerprint arrays had an exact match across different strains; the majority are unique to the individual strains (Fig. 3d). Even fingerprint arrays shared between strains often have different representation, with varying numbers of KZFP copies in each strain (Supplementary Data 2). Our analysis indicates that the Chr4 KZFP gene cluster has undergone parallel evolution in the BL6J, 129S1, and CAST mouse strains, marked by independent duplication events within the locus. This conclusion is further supported by the distinct patterns of self-identical sequences observed in the locus across the three strains (Fig. 3e).
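The fingerprint-array comparison described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: it assumes canonical C2H2 fingers of the form C-X(2-4)-C-X(12)-H-X(3-5)-H and the common convention that the first zinc-coordinating histidine sits at helix position +7, so degenerate or atypically spaced fingers would be missed. The function names, the toy protein sequence, and the gene names are hypothetical.

```python
import re

# Canonical C2H2 zinc finger: C-X(2-4)-C-X(12)-H-X(3-5)-H (an assumption;
# non-canonical fingers are not detected by this pattern).
# The captured group is the 12 residues between the second Cys and the first His.
C2H2_FINGER = re.compile(r"C.{2,4}C(.{12})H.{3,5}H")

# With the first His at helix position +7, positions -1, +2, +3, +6 fall at
# these 0-based offsets inside the captured 12-residue segment.
FINGERPRINT_OFFSETS = (5, 7, 8, 11)


def fingerprint_array(protein):
    """Return one 4-residue fingerprint per canonical zinc finger, in order."""
    return tuple(
        "".join(match.group(1)[i] for i in FINGERPRINT_OFFSETS)
        for match in C2H2_FINGER.finditer(protein)
    )


def distinct_arrays(proteins):
    """Collapse KZFPs with identical fingerprint arrays into one set entry."""
    return {fingerprint_array(seq) for seq in proteins.values()}


# Hypothetical three-finger toy protein (not taken from this study).
TOY = ("ERPYACPVESCDRRFSRSDELTRHIRIHTGQKPFQCRICMRNFSRSDHLTTHIRTH"
       "TGEKPFACDICGRKFARSDERKRHTKIHLRQKD")

# Two hypothetical strains: strain_a carries two copies of the same array,
# strain_b carries one shared array and one with an altered second finger.
strain_a = {"geneA1": TOY, "geneA2": TOY}
strain_b = {"geneB1": TOY, "geneB2": TOY.replace("RSDHLTT", "QSSNLVR")}

shared = distinct_arrays(strain_a) & distinct_arrays(strain_b)
print(fingerprint_array(TOY), len(shared))
```

Representing each array as an ordered tuple means that two KZFPs count as the same only when every finger matches in the same order, which mirrors the exact-match criterion used in the text; copy number per array could be tracked with a counter instead of a set.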

Since the Mus musculus Chr4 KZFP gene cluster locus is much larger than the corresponding locus in rat, we investigated whether this cluster expansion is unique to Mus musculus or if it also occurred in other mouse species. To explore this, we generated a de novo assembly for Mus spretus using a partially inbred strain (SPRET2) and also extended our analysis to a de novo assembly of Mus pahari. These assemblies completely spanned the locus corresponding to Chr4 KZFP gene cluster, as well as other KZFP gene clusters analyzed in this study, allowing comparison of gapless sequences. Sequence comparisons among Rattus norvegicus, Mus pahari, Mus spretus, and the three Mus musculus strains reveal that the independent expansion of the Chr4 KZFP gene cluster is a common feature of the mouse lineage compared to rat. However, the Mus musculus strains exhibit substantially larger cluster sizes than those observed in Mus spretus and Mus pahari (Fig. 3f).

To explore modes of rapid locus expansion, we traced the regions of segmental duplications within the Chr4 KZFP gene cluster. Due to the highly repetitive nature of this locus, many short sequence stretches exhibit high similarity, as revealed by various strategies for identifying self-identical sequences (Fig. 2d, e, Fig. 3e). However, detailed annotation of the cluster's gene content in each strain showed that several genes shared identical or highly similar fingerprint arrays (Supplementary Data 2). This raised the question of whether these genes duplicated independently or as groups. To investigate this, we compared the sequences of the whole 3' exon (including both the portion encoding the zinc finger array and the 3'UTR) of all KZFP genes within the BL6J cluster. This strategy enabled us to compare the underlying DNA sequence of the exon that contributes the most to the individual KZFP gene identity, while disregarding its actual coding potential. Limiting our analysis to the coding region would have biased the comparison, as truncated arrays caused by isolated point mutations would appear highly dissimilar, even though the sequences are nearly identical (Supplementary Fig. 5). This analysis allowed us to identify similarity relationships between all the KZFP genes within the BL6J Chr4 cluster (Fig. 4a). By examining the gene positions within the cluster alongside their sequence similarity, we uncovered gene blocks - groups of genes with high similarity that were located in different regions of the cluster. These gene blocks ranged in size, containing between 3 and 7 genes (Fig. 4b). Interestingly, we also found evidence of partial duplications suggestive of multiple rounds of interstitial segmental duplications.
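The rationale for comparing whole 3' exons at the DNA level can be sketched in Python. The text does not name the comparison method, so this example swaps in an alignment-free k-mer Jaccard similarity with single-linkage grouping purely for illustration; the sequences, gene names, `k`, and the 0.5 threshold are all hypothetical choices.

```python
from itertools import combinations


def kmer_set(seq, k=8):
    """All overlapping k-mers of a DNA string."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b, k=8):
    """Alignment-free similarity in [0, 1]; 1.0 means identical k-mer content."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


def similarity_groups(exons, threshold=0.5, k=8):
    """Single-linkage groups of genes whose 3' exons exceed the threshold."""
    parent = {name: name for name in exons}

    def find(x):
        # Follow parent links to the group representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for g1, g2 in combinations(exons, 2):
        if jaccard(exons[g1], exons[g2], k) >= threshold:
            parent[find(g1)] = find(g2)

    groups = {}
    for name in exons:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical toy exons: geneB differs from geneA by a single point mutation,
# which could truncate the protein yet leaves the DNA nearly identical.
base = "ATGGCCTGTAAAGAGTGTGGGAAAGCCTTCAGTCAGAGTTCACACCTTATTAGACATCAG"
mutant = base[:30] + "T" + base[31:]
unrelated = "GATTACA" * 9

exons = {"geneA": base, "geneB": mutant, "geneC": unrelated}
print(similarity_groups(exons))
```

Because the score is computed on nucleotides, a single substitution only perturbs the few k-mers that overlap it, so the two near-identical genes still group together regardless of whether the mutation disrupts the reading frame.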

Collectively, this analysis revealed that the Chr4 KZFP gene cluster is a highly recombinogenic locus and that large segmental duplications have been responsible for the rapid expansion of the Mus musculus locus.

We next sought to identify features that may have contributed to the recombination and segmental duplications driving the expansion and divergence of the Chr4 KZFP gene cluster in the mouse lineage. Several molecular mechanisms can be responsible for segmental duplications. It has been observed that recombination and segmental duplication events in human are more frequent in genomic regions close to the subtelomeres. Given the proximity of the Chr4 KZFP gene cluster to the telomeric region, we investigated whether its chromosomal position might have played a crucial role in its recombinogenic potential. To this end, we analyzed two additional young KZFP gene clusters: one on Chr2 located at a similar distance from the telomere as the Chr4 KZFP gene cluster, and a non-telomeric KZFP gene double-cluster on Chr12 located much closer to the centromere (Fig. 1b). Like the Chr4 cluster, the KZFP gene cluster at the end of Chr2 had several gaps in both the GRCm39 reference assembly and the 129S1.v3 assembly (Supplementary Fig. 6a-c). Surprisingly, we observed that after filling the gaps, the BL6J locus is slightly smaller than initially predicted in the GRCm39 reference. A comparison of this KZFP gene cluster across mouse strains revealed that the locus is nearly identical between BL6J and 129S1 Mus musculus strains but shows dramatic expansion in the CAST strain (Supplementary Fig. 6d, e).

In contrast, the KZFP gene double-cluster locus on Chr12 exhibited pronounced heterogeneity among the three examined Mus musculus strains (Supplementary Fig. 7). Notably, this double-cluster is specific to the mouse lineage and is entirely absent in rat and other species. Interestingly, this locus not only underwent independent expansion in mice, but each of the two clusters expanded independently as well, as demonstrated by the varying sizes of the clusters in different mouse strains and species (Supplementary Fig. 7d, e). Furthermore, the CAST strain displayed evidence of recombination between the two KZFP clusters within this locus, resulting in an inversion detectable by the change in orientation of the non-KZFP genes located between the clusters. These findings indicate that the chromosomal position of the KZFP gene clusters does not necessarily dictate the recombinogenic potential of these loci: the Chr2 KZFP gene cluster, which is located near the telomere, shows little heterogeneity between BL6J and 129S1 mouse strains, while the Chr12 double-cluster, positioned far from the telomere, exhibits high heterogeneity across all three mouse strains analyzed. Moreover, inbreeding of Mus musculus strains also does not appear to be the primary driver of KZFP gene cluster expansion. This is demonstrated by the striking expansion of the Chr12 double-cluster in Mus spretus.

Since meiotic recombination can also promote the emergence of structural variants, we next investigated whether the young KZFP gene clusters are enriched for meiotic hotspots, which could explain the frequent recombination events and locus heterogeneity observed between mouse strains. To address this, we looked at available datasets from BL6J and CAST mouse testes for PRDM9 binding, which determines the position of meiotic double strand breaks, and for DMC1 binding to single-stranded DNA (SSDS), which indicates loci undergoing DNA break repair during meiosis (Supplementary Fig. 8). Young KZFP gene clusters in the BL6J and CAST strains do not appear to be significantly enriched for PRDM9 binding sites or DMC1-bound single-stranded DNA accumulation during spermatogenesis, which are rather excluded from these loci, compared to adjacent genomic regions. This is consistent with low meiotic recombination frequency observed at zinc finger gene and repeat loci also in human. While low-frequency meiotic recombination events cannot be entirely excluded, meiotic recombination does not appear to be a major driver of the sequence divergence observed at these loci.

Since the KZFP gene clusters on Chr4, Chr2, and Chr12 are all young clusters, we also examined two evolutionarily older KZFP gene clusters: one located on Chr6 and the other on Chr7 in Mus musculus. These older clusters harbor KZFP genes that are conserved across mammals and are located on the respective chromosomes at positions similar to the Chr12 double-cluster locus (Fig. 1b). Both older clusters (Chr6 and Chr7) exhibit remarkable sequence similarity not only between the mouse strains and species analyzed but also with rat (Supplementary Fig. 9a, c). Furthermore, these older clusters display a much lower load (Chr6) or even absence (Chr7) of self-identical sequences (Supplementary Fig. 9b, d), compared to the younger KZFP gene clusters. While large stretches of high sequence similarity in the young KZFP gene clusters are likely a consequence of recently duplicated gene blocks, young KZFP gene clusters also contain an abundance of small self-identical repetitive sequences compared to the older and more conserved KZFP gene clusters.

Repetitive sequences and TEs have been shown to contribute to structural variations, including large deletions, segmental duplications, and chromosomal rearrangements, and previous studies have demonstrated that mouse KZFP gene clusters display TE enrichment, particularly enrichment of ERVs. Thus, we hypothesized that the expansion and divergence of the young KZFP gene clusters might correlate with TE enrichment at these loci in the mouse lineage compared to rat. To address this, we compared the TE content across the genomes of rat, Mus pahari, Mus spretus, and the three Mus musculus strains. Genome-wide, we observed only minor differences in the overall abundance of different TE classes, with a subtle increase in LTR elements in the mouse lineage compared to rat (Supplementary Fig. 10a). However, when analyzing the TE content specifically within the Chr4 KZFP gene cluster, we found that the rat cluster is heavily enriched in LINE elements (45% of the cluster), whereas in mice, this locus acquired a higher LTR load (Supplementary Fig. 10c). A similar pattern was observed for the Chr2 KZFP gene cluster (Supplementary Fig. 10b). The LTR load was particularly striking at the Chr12 KZFP gene double-cluster locus in Mus spretus and in the three analyzed Mus musculus strains. This locus, which underwent the largest expansion in Mus spretus, is absent in rat and quite small in Mus pahari, where it displays a marked LINE-rich composition compared to Mus spretus and Mus musculus (Supplementary Fig. 10d). This trend is less prominent in the older KZFP gene clusters located on Chr6 and Chr7 (Supplementary Fig. 10e, f). Further analysis of the enrichment of individual LTR families at each examined KZFP gene cluster revealed that distinct mouse-specific ERVs have colonized different young clusters, particularly in Mus musculus and Mus spretus compared to Mus pahari and rat (Supplementary Fig. 10g, Supplementary Data 3). In contrast, older KZFP gene clusters in mice displayed enrichment for many fewer LTR elements, which were mostly shared with rat. Finally, analysis of the individual LINE families at the Chr4 KZFP gene cluster locus revealed that the LINE families with the highest representation at this cluster in mice are conserved at the corresponding rat locus (Supplementary Data 3), suggesting that this cluster might have originated as a LINE-rich region and rapidly expanded in the mouse lineage together with the infiltration of new, lineage-specific ERVs. Close inspection of the Chr4 KZFP gene cluster revealed examples of chimeric LINEs and ERVs (Fig. 4e, f, Supplementary Fig. 11). Such chimeric elements have been described as the result of TE-mediated non-allelic homologous recombination, in which the initial retrotransposition event can also cause the initial DNA breaks, and they are often associated with the establishment of structural variants. We found examples of these recombination scars at the boundaries of recombined and duplicated portions of the BL6J Chr4 KZFP gene cluster (Supplementary Fig. 11). Altogether, our analysis suggests that the infiltration of new ERVs in the young KZFP gene clusters in the mouse lineage likely increased the frequency of non-allelic homologous recombination events, thus providing a possible mechanism whereby ERVs directly promote the expansion and divergence of young KZFP gene clusters.
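The per-locus TE composition discussed above (for example, the share of a cluster covered by LINE versus LTR elements) reduces to summing annotated bases per class within the locus. The following Python sketch shows one way to do this under simplifying assumptions: annotations are given as 0-based half-open intervals that do not overlap each other, and the coordinates and class labels are invented for illustration rather than taken from the study.

```python
from collections import defaultdict


def te_class_fractions(annotations, locus_start, locus_end):
    """Fraction of a locus covered by each TE class.

    annotations: iterable of (start, end, te_class) tuples, 0-based half-open,
    assumed not to overlap one another (a simplification).
    """
    locus_len = locus_end - locus_start
    covered = defaultdict(int)
    for start, end, te_class in annotations:
        # Clip each annotation to the locus boundaries before counting bases.
        clipped_start = max(start, locus_start)
        clipped_end = min(end, locus_end)
        if clipped_end > clipped_start:
            covered[te_class] += clipped_end - clipped_start
    return {cls: bp / locus_len for cls, bp in covered.items()}


# Hypothetical annotations around a 1,000-bp toy locus spanning 1000-2000.
toy_annotations = [
    (900, 1200, "LINE"),   # partially outside the locus; only 200 bp count
    (1300, 1550, "LINE"),
    (1600, 1700, "LTR"),
    (2100, 2200, "LTR"),   # entirely outside the locus; ignored
]

fractions = te_class_fractions(toy_annotations, 1000, 2000)
for te_class, fraction in sorted(fractions.items()):
    print(f"{te_class}: {fraction:.1%}")
```

Running the same function on the whole genome and on each cluster would give the genome-wide and locus-specific compositions that are contrasted in the text.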

Interestingly, while we observed increased content of mouse ERVs in several young KZFP gene clusters, we also found that the large duplicated gene blocks in the Chr4 KZFP gene cluster harbored copies of the TEs that had integrated prior to the segmental duplication events (Fig. 4c, d, Supplementary Fig. 11a). This suggests that the rapid gain of TE enrichment in KZFP gene clusters was driven by the segmental duplication events, rather than by retrotransposition events alone. Supporting this interpretation, we observed that despite the much smaller size of the Mus spretus Chr4 KZFP gene cluster locus relative to its Mus musculus counterpart, the overall enrichment for several ERV families remains similar between the two species (Supplementary Fig. 10g). This implies that the ERV load increased proportionally with locus expansion during duplication events in the mouse lineage. Further evidence for gain of TE enrichment independent of transposition comes from the enrichment of DNA transposons, prominent at the Chr12 KZFP double-cluster locus (Supplementary Fig. 10h, Supplementary Data 3). Because DNA transposons are mostly inactive fossil elements in mammals - and even the few active ones replicate via a cut-and-paste mechanism rather than copy-and-paste - their increased enrichment suggests that pre-existing copies were duplicated during segmental duplication of the surrounding genomic regions.

To better understand these patterns, we focused on the BL6J Chr4 KZFP gene cluster, identifying specific ERVs that are highly represented relative to their genome-wide distribution. Of all the annotated MLTR18A_MM elements, 25.2% were found within the Chr4 KZFP gene cluster. Similarly, a large portion of all annotated ERVs of other distinct families (18.5% for LTRIS6, 16.1% for MMTV-int, 14.8% for RLTR13D3, 13.3% for LTRIS3, and 11.2% for RLTR1D2_MM elements) were found in the same cluster. These percentages are particularly striking, given that the Chr4 KZFP gene cluster accounts for only 0.2% of the total BL6J genome. Analysis of the sequence divergence of ERV subfamilies (measured as percentage of divergence to the consensus) for which more than 2% of total annotations occur at the Chr4 KZFP gene cluster revealed that while genome-wide there is a continuum of divergence (as expected for TEs that independently accumulate mutations) the Chr4 cluster displayed several groups of ERV copies with nearly identical sequence divergence (Fig. 5a). This is distinct from the overall bimodal distribution of sequence divergence observed for some ERV families, and likely reflects the segmental duplication of ERVs as opposed to independent insertions. Similar analyses in the 129S1 and CAST strains, as well as in Mus spretus, revealed strain-specific differences in ERV representation and distribution of sequence divergence (Supplementary Fig. 11a, Supplementary Fig. 12), underscoring the dynamic nature of these loci.
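To make the scale of these percentages concrete, the short Python calculation below converts them into approximate fold enrichments. It assumes the simplest null model, in which ERV copies are spread uniformly across the genome so that a locus covering 0.2% of the genome would hold 0.2% of the copies. The percentages are those quoted above; because the 0.2% figure is rounded, the resulting fold values are only approximate, and the function and variable names are illustrative.

```python
def fold_enrichment(pct_copies_in_cluster, pct_genome_in_cluster):
    """Observed share of copies in the cluster divided by the share expected
    if copies were distributed uniformly along the genome (an assumption)."""
    return pct_copies_in_cluster / pct_genome_in_cluster


# Percent of all annotated copies of each ERV family found in the BL6J Chr4
# KZFP gene cluster, as quoted in the text.
ERV_PCT_IN_CHR4_CLUSTER = {
    "MLTR18A_MM": 25.2,
    "LTRIS6": 18.5,
    "MMTV-int": 16.1,
    "RLTR13D3": 14.8,
    "LTRIS3": 13.3,
    "RLTR1D2_MM": 11.2,
}

# Rounded share of the BL6J genome occupied by the cluster, as quoted.
CLUSTER_PCT_OF_GENOME = 0.2

enrichment = {
    family: fold_enrichment(pct, CLUSTER_PCT_OF_GENOME)
    for family, pct in ERV_PCT_IN_CHR4_CLUSTER.items()
}

for family, fold in enrichment.items():
    print(f"{family}: ~{fold:.0f}-fold over uniform expectation")
```

Under this assumption, even the least concentrated family listed is over-represented more than 50-fold in the cluster, and MLTR18A_MM is over-represented roughly 126-fold.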

Taken together, our findings suggest the following model (Fig. 5b): the integration of new ERVs within KZFP gene clusters may have increased the recombinogenic potential of these loci by promoting non-allelic homologous recombination events, driven by regions of microhomology shared between different ERVs. Repair processes leading to segmental duplications resulted in cluster expansion, possibly occasionally counterbalanced by repair events that caused cluster contraction. These two forces, operating independently and in parallel across different mouse strains, alongside species and strain specific ERV integrations, have likely driven the divergence of young KZFP gene clusters in mice. As a result, the loci now appear highly divergent, with the emergence of distinct young KZFP gene repertoires. This model would also explain how the content of ERVs and KZFP genes can increase in concert, with the KZFP gene clusters gaining more recombinogenic potential as they expand, due to the concomitant increase in repeat load. This raises an additional question: could the emergence of new KZFP genes that bind to and silence these newly integrated ERVs act as a brake on this self-reinforcing recombinogenic system?

To address whether there is a relationship between the ERVs that promoted KZFP gene cluster recombination and the emergence of new KZFP genes that could bind and repress them, we characterized the DNA binding properties of all the KZFPs encoded in the BL6J Chr4 cluster - with at least one amino acid difference - combining new ChIP-seq data for 42 KZFPs (including KZFPs with new or updated annotation) with previously published ChIP-seq data for 8 KZFPs. Thus, we generated a TE target map for all the KZFPs encoded in the BL6J Chr4 cluster (Fig. 6a, Supplementary Data 5). As expected from previous studies, we observed that several KZFPs specifically target distinct TEs, and we were able to identify target motifs by integrating canonical motif discovery from experimentally determined peak regions with target motif prediction accounting for the KZFP amino acid sequence (Supplementary Data 6). This analysis allowed us to identify the KZFPs responsible for targeting and silencing RLTR4 elements (Fig. 6b), that had been previously shown to be over-expressed in mESCs lacking the Chr4 cluster and for which the specific KZFP responsible for their repression in wild-type cells had remained unknown. Furthermore, we found that many KZFPs exhibiting specific TE targeting were unique to the BL6J strain compared to 129S1 and CAST strains. Among them, we found several KZFPs that target distinct subfamilies of IAP LTRs and IAPEz internal regions (Fig. 6c), thereby identifying the modifiers likely responsible for reported variable methylation of these IAP elements in different mouse strains. Lastly, we found examples of KZFPs that did not display strong or specific binding to TEs, and for which we could not even identify a general target motif (Supplementary Data 6), suggesting that not all the KZFPs that have emerged thus far in the BL6J Chr4 cluster have a specific function.

Interestingly, we also observed that several ERVs targeted by KZFPs encoded in the Chr4 cluster are only moderately, rather than strongly, enriched at this locus (Supplementary Fig. 10g, Fig. 6a). This observation hints at the intriguing possibility that the emergence of KZFPs targeting these ERVs may have acted as a brake on their enrichment. We speculate that KZFPs may limit the further expansion of their target ERVs in two synergistic ways: because KZFPs can repress the ERVs they bind, they can reduce their retrotransposition; at the same time, because the ERVs cannot increase their numbers through new integrations, they cannot further increase the recombinogenic potential of the KZFP gene cluster locus, which further limits their expansion by segmental duplication.
