Refined Protein Structural Prediction from 3D Patterson Maps via Recycling Training Runs


Qiutai Pan1
Evan Dramko1
Mitchell D. Miller1
George N. Phillips Jr.1
Anastasios Kyrillidis1,2
1Rice University
2Ken Kennedy Institute

Code [GitHub] Paper [arXiv]

Abstract

Determining protein structures is crucial for understanding biological functions. Traditional methods like X-ray crystallography face challenges such as the phase problem. We present RecCrysFormer, a hybrid model that integrates transformers and convolutional layers to predict electron density maps directly from Patterson maps. Incorporating known partial structures and a recycling training regimen, our model achieves high accuracy on synthetic datasets, bridging experimental and computational approaches.



Introduction

Proteins, composed of amino acids, play vital roles in biological processes. Understanding their 3D structure is essential for insights into their functionality. While experimental methods like X-ray crystallography are widely used, they face challenges, including the crystallographic phase problem. Machine learning approaches like AlphaFold2 have advanced the field but often neglect the use of raw experimental data like Patterson maps.

RecCrysFormer addresses these challenges by combining domain-specific knowledge with machine learning architectures. By leveraging Patterson maps, partial structures, and a novel recycling training regimen, it provides accurate electron density predictions, enabling seamless integration with established crystallographic refinement processes.



Key Contributions

Our study introduces the following innovations:

- A hybrid architecture combining 3D convolutional layers and vision transformers that predicts electron density maps directly from Patterson maps.
- Integration of known partial structures as an additional model input.
- A recycling training regimen that reuses the model's earlier predictions to refine its outputs.
- A training loss that augments mean squared error with a negative Pearson correlation term drawn from crystallographic practice.



Methods

The core of RecCrysFormer is its ability to process raw experimental data in the form of Patterson maps, which capture interatomic vector relationships in protein crystals. These maps, while useful, lack direct information about atomic coordinates. To bridge this gap, RecCrysFormer employs a novel hybrid architecture combining 3D Convolutional Neural Networks (CNNs) and vision transformers.
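
As background, a Patterson map is the autocorrelation of the electron density, computable as the inverse Fourier transform of the squared structure-factor amplitudes. The NumPy sketch below illustrates this relationship on a gridded density; it is standard crystallographic math, not the paper's data-generation pipeline, and the function name is ours.

    import numpy as np

    def patterson_from_density(rho):
        # The Patterson function is the autocorrelation of the density:
        # the inverse Fourier transform of |F|^2, with all phases discarded.
        F = np.fft.fftn(rho)                         # structure factors on the grid
        patterson = np.fft.ifftn(np.abs(F) ** 2).real
        return np.fft.fftshift(patterson)            # move the origin peak to the center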

The workflow begins with Patterson maps and, optionally, partial structure electron density maps as inputs. The Patterson maps are divided into small 3D patches, each transformed into feature tokens through convolutional layers and multi-layer perceptrons (MLPs). Positional embeddings are added to retain spatial information, and the resulting tokens are passed through a transformer core. The transformer uses attention mechanisms to capture global dependencies, allowing the model to integrate both Patterson maps and partial structures into coherent electron density predictions.
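
A minimal PyTorch sketch of this patch-tokenization-plus-transformer pipeline follows. All names, patch sizes, and dimensions are illustrative choices of ours rather than the authors' implementation; the convolutional feature extractor is reduced to a single strided convolution, and the output grid is assumed to match the input grid.

    import torch
    import torch.nn as nn

    class PattersonPatchTransformer(nn.Module):
        # Illustrative sketch: tokenize a 3D map into patches, run a transformer
        # encoder over the tokens, and project the tokens back to voxel patches.
        def __init__(self, in_ch=1, patch=4, max_tokens=512, dim=256, depth=6, heads=8):
            super().__init__()
            self.patch = patch
            # A strided 3D convolution splits the map into non-overlapping patches
            # and projects each patch to a feature token (assumes the map
            # dimensions are divisible by `patch`).
            self.to_tokens = nn.Conv3d(in_ch, dim, kernel_size=patch, stride=patch)
            # Learned positional embeddings retain the patches' spatial layout
            # (assumes the number of patches does not exceed `max_tokens`).
            self.pos_emb = nn.Parameter(torch.zeros(1, max_tokens, dim))
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
            self.to_voxels = nn.Linear(dim, patch ** 3)   # one density patch per token

        def forward(self, x):                             # x: (B, C, D, H, W)
            tok = self.to_tokens(x)                       # (B, dim, d, h, w)
            b, c, d, h, w = tok.shape
            tok = tok.flatten(2).transpose(1, 2)          # (B, N, dim), N = d*h*w
            tok = tok + self.pos_emb[:, : tok.shape[1]]
            tok = self.encoder(tok)                       # global attention across patches
            out = self.to_voxels(tok)                     # (B, N, patch**3)
            p = self.patch
            out = out.reshape(b, d, h, w, p, p, p)        # fold tokens back into a 3D grid
            return out.permute(0, 1, 4, 2, 5, 3, 6).reshape(b, d * p, h * p, w * p)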

[Figure: RecCrysFormer architecture overview]

Key innovations include a hybrid training loss, the use of known partial structures as a secondary input, and a recycling regimen that reuses the model's earlier predictions during training.

The loss function combines mean squared error (MSE) with the negative Pearson correlation coefficient computed between predicted and ground-truth electron density maps, a measure commonly used in crystallography to assess map agreement. This hybrid loss balances voxel-wise regression accuracy with a domain-specific criterion, encouraging the model to capture both global structure and fine detail.
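
A minimal PyTorch sketch of such a loss, assuming batched 3D map tensors; the relative weighting `alpha` is an illustrative choice and is not taken from the paper.

    import torch

    def hybrid_map_loss(pred, target, alpha=1.0, eps=1e-8):
        # Mean squared error over all voxels.
        mse = torch.mean((pred - target) ** 2)
        # Per-sample Pearson correlation between the flattened maps.
        p = pred.flatten(1) - pred.flatten(1).mean(dim=1, keepdim=True)
        t = target.flatten(1) - target.flatten(1).mean(dim=1, keepdim=True)
        pearson = (p * t).sum(dim=1) / (p.norm(dim=1) * t.norm(dim=1) + eps)
        # Minimizing the negative correlation pushes predicted and true maps to agree.
        return mse - alpha * pearson.mean()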

RecCrysFormer also supports partial protein structures, which are derived from well-studied amino acid conformations. These are integrated as a secondary input, enabling the model to refine predictions based on known molecular patterns.
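
One plausible way to wire these auxiliary inputs in is sketched below, under the assumption that the partial-structure density and a prediction recycled from an earlier training run share the Patterson map's grid and are supplied as extra input channels; the paper's actual combination scheme may differ.

    import torch

    def build_model_input(patterson, partial_density=None, recycled_pred=None):
        # Stack the available inputs as channels: the Patterson map, an optional
        # partial-structure density, and an optional prediction recycled from an
        # earlier training run. Each tensor is (B, 1, D, H, W); missing inputs
        # are replaced with zeros so the channel layout stays fixed.
        zeros = torch.zeros_like(patterson)
        channels = [
            patterson,
            partial_density if partial_density is not None else zeros,
            recycled_pred if recycled_pred is not None else zeros,
        ]
        return torch.cat(channels, dim=1)            # (B, 3, D, H, W)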


The final result is a powerful, data-efficient architecture capable of translating experimental Patterson maps into accurate electron density maps. This capability represents a significant step toward bridging the gap between experimental crystallographic methods and machine learning.



Results

RecCrysFormer was benchmarked on synthetic datasets, where incorporating known partial structures and the recycling training regimen yielded high-accuracy electron density predictions.
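
Agreement between a predicted map and the ground truth is commonly summarized by their Pearson correlation; a small NumPy sketch of that evaluation is given below (our choice of metric code, not the paper's evaluation script).

    import numpy as np

    def map_correlation(pred, truth):
        # Pearson correlation between two 3D maps, flattened to vectors.
        return np.corrcoef(pred.ravel(), truth.ravel())[0, 1]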



Conclusions and Future Work

RecCrysFormer bridges experimental and computational methods, providing a robust framework for protein structure determination. Future work will involve scaling to full-length proteins, handling experimental noise, and extending the recycling procedure for enhanced performance.



Acknowledgements

This research was supported by the Welch Foundation Grant A22-0307.