Boltz-1 GitHub | Boltz-1 Paper
Boltz-1 is an open-source AI model for biomolecular structure prediction whose reported performance rivals that of DeepMind's AlphaFold3 in modeling proteins and their complexes. Unlike conventional docking tools, Boltz-1 uses a diffusion-based generative approach to predict the 3D structures of protein–ligand complexes. It also integrates evolutionary information in the form of multiple sequence alignments (MSAs), which improves accuracy and makes it a useful tool for drug discovery and structural biology.
This guide demonstrates how to use Boltz-1 for drug discovery, specifically for modeling the binding of small-molecule inhibitors to a well-known protein target. You'll walk through choosing a target and ligands, setting up an AWS EC2 GPU instance, preparing inputs, running the model, and interpreting results.
We choose SARS-CoV-2 Mpro (main protease, also called 3CL protease), an enzyme essential for coronavirus replication. Its structure is well characterized (e.g., PDB 6LU7). The 306-residue sequence used in this guide is:
SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDVVYCPRHVICTSEDMLNPNYEDLLIRKSNHNFLVQAGNVQLRVIGHSMQNCVLKLKVDTANPKTPKYKFVRIQPGQTFSVLACYNGSPSGVYQCAMRPNFTIKGSFLNGSCGSVGFNIDYDCVSFCYMHHMELPTGVHAGTDLEGNFYGPFVDRQTAQAAGTDTTITVNVLAWLYAAVINGDRWFLNRFTTTLNDFNLVAMKYNYEPLTQDHVDILGPLSAQTGIAVLDMCASLKELLQNGMNGRTILGSALLEDEFTPFDVVRQCSGVTFQ
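Because the sequence is usually pasted from a wrapped source, it is worth checking it before building input files. The following is a minimal sketch, not part of Boltz-1: the helper names `clean_sequence` and `residue_at` are hypothetical. It strips whitespace, rejects non-canonical letters, and confirms that the catalytic residues sit at the expected 1-based positions.

```python
# Mpro sequence as given in this guide, split into chunks for readability.
# Adjacent string literals are concatenated, so no whitespace ends up in the result.
MPRO_SEQUENCE = (
    "SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDVVYCPRHVICTSEDMLNPNYEDLLIRKSNHNFLVQAGNVQLRV"
    "IGHSMQNCVLKLKVDTANPKTPKYKFVRIQPGQTFSVLACYNGSPSGVYQCAMRPNFTIKGSFLNGSCGSVGFNIDYD"
    "CVSFCYMHHMELPTGVHAGTDLEGNFYGPFVDRQTAQAAGTDTTITVNVLAWLYAAVINGDRWFLNRFTTTLNDFNLV"
    "AMKYNYEPLTQDHVDILGPLSAQTGIAVLDMCASLKELLQNGMNGRTILGSALLEDEFTPFDVVRQCSGVTFQ"
)

CANONICAL_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def clean_sequence(raw: str) -> str:
    """Remove whitespace and upper-case a pasted sequence, rejecting unknown letters."""
    seq = "".join(raw.split()).upper()
    unknown = sorted(set(seq) - CANONICAL_AMINO_ACIDS)
    if unknown:
        raise ValueError(f"Non-canonical residue codes found: {unknown}")
    return seq


def residue_at(seq: str, position: int) -> str:
    """Return the one-letter code at a 1-based residue position."""
    return seq[position - 1]


if __name__ == "__main__":
    seq = clean_sequence(MPRO_SEQUENCE)
    # Length, then the catalytic dyad positions His41 and Cys145.
    print(len(seq), residue_at(seq, 41), residue_at(seq, 145))
```

Running the script prints the length followed by the residues at positions 41 and 145, which should be the catalytic histidine and cysteine.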
Two known inhibitors are selected: Nirmatrelvir and X77.
Nirmatrelvir is an antiviral agent that inhibits the SARS-CoV-2 main protease (Mpro). It has been co-crystallized with Mpro, and the structures are available in the Protein Data Bank (PDB). Notable entries include:
- PDB ID 7SI9: SARS-CoV-2 main protease in complex with Nirmatrelvir, determined by X-ray diffraction at a resolution of 1.80 Å.
- PDB ID 7VH8: Another crystal structure of the SARS-CoV-2 main protease bound to Nirmatrelvir, resolved at 1.60 Å.
Nirmatrelvir SMILES:
N#C[C@H](C[C@@H]1CCNC1=O)NC(=O)[C@H]1N(C[C@H]2[C@@H]1C2(C)C)C(=O)[C@H](C(C)(C)C)NC(=O)C(F)(F)F
X77 is a non-covalent inhibitor that binds to the SARS-CoV-2 main protease (Mpro) with high affinity (Kd ≈ 57 nM). It has been co-crystallized with Mpro, and the structures are available in the Protein Data Bank (PDB) under the following entries:
- PDB ID 6W79: SARS-CoV-2 main protease in complex with the X77 inhibitor, determined by X-ray diffraction at a resolution of 1.46 Å.
- PDB ID 7PHZ: Another crystal structure of the SARS-CoV-2 main protease bound to X77, resolved at 1.66 Å.
These structures show the binding interactions between X77 and Mpro in detail, which is valuable for drug discovery efforts targeting SARS-CoV-2.
X77 SMILES:
CC(C)(C)c1ccc(cc1)N([C@@H](C(=O)NC1CCCCC1)c2cccnc2)C(=O)c3c[nH]cn3
Our goal is to use Boltz-1 for virtual screening and pose prediction. Given the protein target (Mpro) and candidate ligands (Nirmatrelvir, X77), Boltz-1 will predict the 3D structure of each protein–ligand complex. This serves to:
- Validate known binders by checking whether Boltz-1 places them in the active site with high confidence.
- Screen new compounds by comparing predicted binding confidence scores. A higher interface confidence from Boltz-1 could indicate a promising binder, guiding which molecules to prioritize for synthesis or experimental testing.
- Optimize ligands by tweaking the input SMILES (e.g., modifying functional groups) and re-running Boltz-1 to see how the binding pose or score changes, thus suggesting beneficial modifications.

This is an advanced alternative to classical docking: Boltz-1 considers protein flexibility and learned interactions, potentially yielding more accurate poses (PubMedCentral, drugtargetreview). In this guide, we demonstrate the process for our two example ligands, but it can be extended to any number of candidate SMILES in a virtual screen.
We recommend using a GPU-enabled device, preferably g6.2xlarge or better, to ensure good performance.
Once the compute unit reaches the READY state, click on it to proceed.
Next, add the Boltz-1 project to the compute unit. During this step, you will be prompted to configure its parameters. Ensure all parameters are correctly set before starting the prediction process.
To prepare the input files for Boltz-1, create the following YAML files and place them in a directory (e.g., /volume/boltz_inputs/). Ensure the directory exists before uploading the files.

Create mpro_nirmatrelvir.yaml with the content below. The sequence is written on a single line, because a folded YAML scalar (>-) turns line breaks into spaces, which would end up inside the sequence.

    version: 1
    sequences:
      - protein:
          id: A
          sequence: SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDVVYCPRHVICTSEDMLNPNYEDLLIRKSNHNFLVQAGNVQLRVIGHSMQNCVLKLKVDTANPKTPKYKFVRIQPGQTFSVLACYNGSPSGVYQCAMRPNFTIKGSFLNGSCGSVGFNIDYDCVSFCYMHHMELPTGVHAGTDLEGNFYGPFVDRQTAQAAGTDTTITVNVLAWLYAAVINGDRWFLNRFTTTLNDFNLVAMKYNYEPLTQDHVDILGPLSAQTGIAVLDMCASLKELLQNGMNGRTILGSALLEDEFTPFDVVRQCSGVTFQ
      - ligand:
          id: X
          smiles: "N#C[C@H](C[C@@H]1CCNC1=O)NC(=O)[C@H]1N(C[C@H]2[C@@H]1C2(C)C)C(=O)[C@H](C(C)(C)C)NC(=O)C(F)(F)F"
Create mpro_x77.yaml in the same way, changing the SMILES to that of X77:

    version: 1
    sequences:
      - protein:
          id: A
          sequence: SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDVVYCPRHVICTSEDMLNPNYEDLLIRKSNHNFLVQAGNVQLRVIGHSMQNCVLKLKVDTANPKTPKYKFVRIQPGQTFSVLACYNGSPSGVYQCAMRPNFTIKGSFLNGSCGSVGFNIDYDCVSFCYMHHMELPTGVHAGTDLEGNFYGPFVDRQTAQAAGTDTTITVNVLAWLYAAVINGDRWFLNRFTTTLNDFNLVAMKYNYEPLTQDHVDILGPLSAQTGIAVLDMCASLKELLQNGMNGRTILGSALLEDEFTPFDVVRQCSGVTFQ
      - ligand:
          id: X
          smiles: "CC(C)(C)c1ccc(cc1)N([C@@H](C(=O)NC1CCCCC1)c2cccnc2)C(=O)c3c[nH]cn3"
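For a larger screen, writing one YAML file per ligand by hand becomes tedious. The sketch below generates the files from a dictionary of SMILES strings. It is an illustration, not part of Boltz-1: the function name `write_boltz_inputs` and the `mpro_<name>.yaml` naming pattern are assumptions modeled on the two files above. It uses only the standard library and reproduces the YAML layout shown in this guide.

```python
from pathlib import Path

# Template mirroring the YAML layout used in this guide.
# The SMILES is single-quoted: inside YAML single quotes, backslashes (used in
# some SMILES for double-bond stereochemistry) are taken literally.
YAML_TEMPLATE = """\
version: 1
sequences:
  - protein:
      id: A
      sequence: {sequence}
  - ligand:
      id: X
      smiles: '{smiles}'
"""


def write_boltz_inputs(sequence: str, ligands: dict, out_dir: str) -> list:
    """Write one YAML input per ligand and return the paths that were created."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    clean_seq = "".join(sequence.split())  # the sequence must not contain whitespace
    written = []
    for name, smiles in ligands.items():
        if "'" in smiles:
            raise ValueError(f"Unexpected quote character in SMILES for {name}")
        path = target / f"mpro_{name}.yaml"  # hypothetical naming pattern
        path.write_text(YAML_TEMPLATE.format(sequence=clean_seq, smiles=smiles))
        written.append(path)
    return written
```

Calling `write_boltz_inputs(sequence, {"nirmatrelvir": "...", "x77": "..."}, "/volume/boltz_inputs")` with the Mpro sequence and the two SMILES strings from this guide would recreate the two example files.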
You can upload the files by clicking the Browser icon on your project or instance page. Alternatively, navigate to the Storage page and locate your Compute-Unit in the left-hand panel.
To execute the Boltz-1 model, ensure the input YAML files are uploaded to the /volume/boltz_inputs/ directory. Use the following script to run the prediction:

    #!/bin/bash

    # Set precision for matrix operations (variable name as used in the original script)
    export TORCH_FORCE_FLOAT32_MATMUL_PRECISION=medium

    # Run Boltz-1 prediction.
    # --output_format pdb writes .pdb files; without it, Boltz-1 writes mmCIF (.cif) files.
    boltz predict /volume/boltz_inputs/ \
      --use_msa_server \
      --out_dir /volume/boltz_output/ \
      --cache /volume/cache \
      --recycling_steps 20 \
      --diffusion_samples 5 \
      --output_format pdb
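If you prefer to drive runs from Python (for example, inside a screening loop), the same call can be assembled as an argument list. This is a minimal sketch under stated assumptions: `build_boltz_command` and `run_boltz` are hypothetical helper names, and the flags are only those that appear in this guide's script. Any further options can be passed through `extra_args`.

```python
import shutil
import subprocess


def build_boltz_command(input_path, out_dir, cache_dir,
                        recycling_steps=20, diffusion_samples=5, extra_args=()):
    """Assemble a `boltz predict` call as an argument list (no shell quoting needed)."""
    command = [
        "boltz", "predict", str(input_path),
        "--use_msa_server",
        "--out_dir", str(out_dir),
        "--cache", str(cache_dir),
        "--recycling_steps", str(recycling_steps),
        "--diffusion_samples", str(diffusion_samples),
    ]
    command.extend(str(arg) for arg in extra_args)
    return command


def run_boltz(command):
    """Run the command, failing early if the executable is not on PATH."""
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(f"{command[0]} executable not found; is Boltz-1 installed?")
    # check=True raises CalledProcessError if the prediction exits with an error.
    subprocess.run(command, check=True)
```

Building the list separately from running it makes the command easy to log or inspect before any GPU time is spent.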
In this command:
- /volume/boltz_inputs/ should contain the prepared YAML files (e.g., mpro_nirmatrelvir.yaml and mpro_x77.yaml).
- Results are written to /volume/boltz_output/.
- The --use_msa_server flag enables automatic fetching of multiple sequence alignments (MSAs) for the protein sequence.
- You can generate multiple samples (e.g., --diffusion_samples 5) to get an ensemble of predictions, but this increases runtime. You can also adjust --recycling_steps (the number of refinement iterations; the default is 3). More steps can improve accuracy at the cost of additional compute.

Running time: Expect several minutes per complex on a T4/V100-class GPU, depending on protein length and MSA size. Our 306-residue protein with one ligand should take on the order of 5–15 minutes for a single sample once the MSA is ready (times can vary), and the total scales with the values chosen for --recycling_steps and --diffusion_samples.

As Boltz-1 runs, it prints progress and eventually reports saving the prediction. After it has processed both YAML files, check the output directory. The output includes predicted structures (.pdb files) and confidence metrics (.json files) for each protein–ligand complex.
Tip: Enhance Your Workflow with DiPhyx Development Tools
To streamline debugging, monitoring, and post-processing of results, consider running a development environment such as Coder, Visual Studio Code, or Jupyter Lab with the /volume folder mounted, so that your input and output files are directly accessible from it.
By integrating these tools, you can improve efficiency and ensure a smoother experience when working with Boltz-1.
Boltz-1 organizes results in the directory specified by --out_dir. Each prediction folder contains:
- *_model_0.pdb: the predicted 3D structure.
- confidence_*_model_0.json: confidence metrics, including confidence_score, ligand_iptm, ptm, iptm, plddt, and pair_chains_iptm.

Using our example path /volume/boltz_output/, the structure will look like this:

    boltz_output/
    ├── predictions/
    │   ├── mpro_nirmatrelvir/
    │   │   ├── mpro_nirmatrelvir_model_0.pdb
    │   │   ├── confidence_mpro_nirmatrelvir_model_0.json
    │   │   ├── plddt_mpro_nirmatrelvir_model_0.npz
    │   │   └── ... (other outputs)
    │   └── mpro_x77/
    │       ├── mpro_x77_model_0.pdb
    │       ├── confidence_mpro_x77_model_0.json
    │       └── ...
    └── processed/    (processed input features)
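With the layout above, the confidence files can be collected and ranked programmatically. The following sketch assumes the folder and file names shown in the tree and the metric keys named in this guide. `load_confidence` and `rank_ligands` are hypothetical helper names, and the JSON files are assumed to hold the metrics as top-level numeric fields.

```python
import json
from pathlib import Path


def load_confidence(out_dir: str, name: str, sample: int = 0) -> dict:
    """Load the confidence JSON for one complex and one diffusion sample."""
    path = Path(out_dir) / "predictions" / name / f"confidence_{name}_model_{sample}.json"
    with path.open() as handle:
        return json.load(handle)


def rank_ligands(out_dir: str, key: str = "ligand_iptm") -> list:
    """Rank every complex under predictions/ by one metric of its model_0 sample."""
    scores = []
    for folder in sorted((Path(out_dir) / "predictions").iterdir()):
        if not folder.is_dir():
            continue
        metrics = load_confidence(out_dir, folder.name)
        scores.append((folder.name, float(metrics[key])))
    # Highest score first.
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

For example, `rank_ligands("/volume/boltz_output")` returns `(name, ligand_iptm)` pairs in descending order. Passing `key="confidence_score"` ranks by the overall score instead, and the `sample` argument of `load_confidence` lets you inspect other diffusion samples of the same complex.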
Each input file gets its own subfolder under predictions/. The main files of interest are:
- *_model_0.pdb: The predicted 3D structure of the complex (protein and ligand coordinates).
- confidence_*.json: A JSON file containing confidence metrics for the prediction. Key metrics include:
  - confidence_score: An overall confidence score (0 to 1), which is a weighted combination of pLDDT (local model confidence) and interface TM-score.
  - ptm / iptm: Predicted TM-score for the whole complex and at interfaces, respectively. Higher values (closer to 1) indicate confidence in the relative arrangement.
  - ligand_iptm: The interface TM-score considering only the protein–ligand interface. A high ligand_iptm suggests a well-defined binding mode.
  - complex_plddt / complex_iplddt: Average pLDDT (per-residue confidence) over the whole complex and over the interface. These values range from 0–100 or 0–1 (depending on scaling).
  - pair_chains_iptm: Pairwise interface scores for multiple chains.
- *_model_0.cif: Present if using mmCIF format.
- npz files: Contain per-residue or per-pair score arrays such as PAE (predicted aligned error), similar to AlphaFold outputs. These are advanced details.

Interpretation: For each ligand, compare the confidence_score and ligand_iptm. Both Nirmatrelvir and X77 are true binders, so we expect high confidence. For instance, you might see a confidence_score of ~0.8–0.9 (out of 1) and a ligand_iptm well above 0 (perhaps ~0.5–0.8). If you were screening an unknown compound and saw a very low ligand_iptm (near 0), it might mean that Boltz-1 could not place the ligand stably, a sign that the compound is less likely to bind or that multiple binding modes confused the model. High scores, on the other hand, would encourage prioritizing that compound.

Open the PDB structures to visualize the binding pose. Boltz-1 predictions often place known ligands correctly in the pocket (comparable to crystal poses), especially if confidence metrics are high. For Nirmatrelvir, check whether the nitrile group is near the catalytic Cys145, which would be consistent with the known covalent binding mode. For X77, see whether it sits in the substrate-binding cleft, forming hydrogen bonds with residues like Glu166 or pi-stacking with His41 (as known from the literature).

Because Boltz-1 leverages learned structural patterns, it may also adjust the protein conformation to accommodate the ligand, something static docking might miss. For example, loop movements or side-chain rotations in the binding site could be predicted.

If you requested multiple diffusion samples (--diffusion_samples n), you will have model_0, model_1, ..., up to model_{n-1} files, each a slightly different sampled pose. In that case, Boltz-1 orders them by confidence by default (model_0 has the highest confidence). You can compare these to look for alternate binding modes.
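The visual check near Cys145 can be complemented by a quick numeric one. The sketch below parses a predicted PDB file with the standard fixed-column layout and reports the shortest distance between the Cys145 sulfur (SG) and any ligand atom. It rests on two assumptions that you should verify against your own output: that ligand atoms are written as HETATM records, and that residue numbering starts at 1 and follows the input sequence. The function names are hypothetical.

```python
import math


def parse_atoms(pdb_text: str) -> list:
    """Parse ATOM/HETATM records using the fixed PDB column layout."""
    atoms = []
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        if record not in ("ATOM", "HETATM"):
            continue
        atoms.append({
            "record": record,
            "name": line[12:16].strip(),
            "resname": line[17:20].strip(),
            "chain": line[21],
            "resseq": int(line[22:26]),
            "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        })
    return atoms


def min_ligand_distance(pdb_text, chain="A", resseq=145, atom_name="SG"):
    """Shortest distance (in Å) from one protein atom to any HETATM atom."""
    atoms = parse_atoms(pdb_text)
    anchor = [a for a in atoms
              if a["record"] == "ATOM" and a["chain"] == chain
              and a["resseq"] == resseq and a["name"] == atom_name]
    ligand = [a for a in atoms if a["record"] == "HETATM"]
    if not anchor or not ligand:
        raise ValueError("Anchor atom or ligand atoms not found")
    return min(math.dist(anchor[0]["xyz"], atom["xyz"]) for atom in ligand)
```

Reading a file with `open(path).read()` and passing the text to `min_ligand_distance` gives a single number per pose. A distance of a few ångströms suggests the ligand sits next to the catalytic cysteine, whereas a large value suggests the pose lies outside the active site. Treat this as a coarse filter, not a substitute for inspecting the pose.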