Last edit April 11, 2025

Predicting Protein–Ligand Binding with Boltz-1: A Practical Guide (SARS-CoV-2 Mpro Example)

Boltz-1 GitHub | Boltz-1 Paper   

Boltz-1 is an open-source AI model for biomolecular structure prediction whose performance rivals that of DeepMind’s AlphaFold3 in modeling proteins and their complexes. Unlike conventional docking tools, Boltz-1 uses a diffusion-based generative approach to predict the 3D structures of protein–ligand complexes. It also integrates evolutionary information from multiple sequence alignments (MSAs), which improves accuracy and makes it a useful tool for drug discovery and structural biology.

This guide demonstrates how to use Boltz-1 for drug discovery, specifically for modeling the binding of small-molecule inhibitors to a well-known protein target. You'll walk through choosing a target and ligands, setting up an AWS EC2 GPU instance, preparing inputs, running the model, and interpreting results.

1. Protein Target: SARS-CoV-2 Main Protease (Mpro)

We choose SARS-CoV-2 Mpro (Main protease / 3CL protease), an enzyme essential for coronavirus replication. Its structure is well-known (e.g., PDB 6LU7).

Protein Sequence

SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDVVYCPRHVICTSEDMLNPNYEDLLIRKSNHNFLVQAGNVQLRV
IGHSMQNCVLKLKVDTANPKTPKYKFVRIQPGQTFSVLACYNGSPSGVYQCAMRPNFTIKGSFLNGSCGSVGFNIDYD
CVSFCYMHHMELPTGVHAGTDLEGNFYGPFVDRQTAQAAGTDTTITVNVLAWLYAAVINGDRWFLNRFTTTLNDFNLV
AMKYNYEPLTQDHVDILGPLSAQTGIAVLDMCASLKELLQNGMNGRTILGSALLEDEFTPFDVVRQCSGVTFQ
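
Before building input files, it can help to confirm that the sequence was copied intact. The sketch below joins the four lines above and reports the length, any non-standard characters, and the residues at positions 41 and 145 (the catalytic His41 and Cys145 referred to later in this guide). The function name `check_sequence` is a hypothetical helper, not part of Boltz-1.

```python
# The four sequence lines from this guide, joined into one string.
MPRO_SEQUENCE = (
    "SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDVVYCPRHVICTSEDMLNPNYEDLLIRKSNHNFLVQAGNVQLRV"
    "IGHSMQNCVLKLKVDTANPKTPKYKFVRIQPGQTFSVLACYNGSPSGVYQCAMRPNFTIKGSFLNGSCGSVGFNIDYD"
    "CVSFCYMHHMELPTGVHAGTDLEGNFYGPFVDRQTAQAAGTDTTITVNVLAWLYAAVINGDRWFLNRFTTTLNDFNLV"
    "AMKYNYEPLTQDHVDILGPLSAQTGIAVLDMCASLKELLQNGMNGRTILGSALLEDEFTPFDVVRQCSGVTFQ"
)

STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def check_sequence(seq: str) -> dict:
    """Return basic facts about a one-letter protein sequence (hypothetical helper)."""
    return {
        "length": len(seq),
        "unexpected_characters": sorted(set(seq) - STANDARD_AMINO_ACIDS),
        # Residue numbering is 1-based; Python indexing is 0-based.
        "residue_41": seq[40],
        "residue_145": seq[144],
    }


if __name__ == "__main__":
    print(check_sequence(MPRO_SEQUENCE))
```

A stray space or line break would show up under `unexpected_characters`, and a shifted residue at position 41 or 145 would indicate a truncated copy.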

2. Ligands

Two known inhibitors are selected:

Nirmatrelvir (PF-07321332)

An antiviral agent that inhibits the SARS-CoV-2 main protease (Mpro). It has been co-crystallized with Mpro, and the structures are available in the Protein Data Bank (PDB). Notable entries include:

  • PDB ID 7SI9: SARS-CoV-2 main protease in complex with Nirmatrelvir, determined by X-ray diffraction at a resolution of 1.80 Å.
  • PDB ID 7VH8: Another crystal structure of the SARS-CoV-2 main protease bound to Nirmatrelvir, resolved at 1.60 Å.

  • SMILES: N#C[C@H](C[C@@H]1CCNC1=O)NC(=O)[C@H]1N(C[C@H]2[C@@H]1C2(C)C)C(=O)[C@H](C(C)(C)C)NC(=O)C(F)(F)F

X77 (non-covalent binder)

X77 is a non-covalent inhibitor that binds to the SARS-CoV-2 main protease (Mpro) with high affinity (Kd ≈ 57 nM). It has been co-crystallized with Mpro, and the structures are available in the Protein Data Bank (PDB) under the following entries:

  • PDB ID 6W79: SARS-CoV-2 main protease in complex with the X77 inhibitor, determined by X-ray diffraction at a resolution of 1.46 Å (RCSB PDB).
  • PDB ID 7PHZ: Another crystal structure of the SARS-CoV-2 main protease bound to X77, resolved at 1.66 Å (RCSB PDB).

These structures show the binding interactions between X77 and Mpro in detail, which is valuable for drug discovery efforts targeting SARS-CoV-2.

  • SMILES: CC(C)(C)c1ccc(cc1)N([C@@H](C(=O)NC1CCCCC1)c2cccnc2)C(=O)c3c[nH]cn3
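
A copy-paste error in a SMILES string is easy to miss. As a rough pre-flight check, the sketch below looks only at parenthesis balance, bracket balance, and pairing of single-digit ring closures. It is a hypothetical helper written for this guide, it does not validate chemistry, and it is no substitute for a cheminformatics toolkit.

```python
NIRMATRELVIR = (
    "N#C[C@H](C[C@@H]1CCNC1=O)NC(=O)[C@H]1N(C[C@H]2[C@@H]1C2(C)C)"
    "C(=O)[C@H](C(C)(C)C)NC(=O)C(F)(F)F"
)
X77 = "CC(C)(C)c1ccc(cc1)N([C@@H](C(=O)NC1CCCCC1)c2cccnc2)C(=O)c3c[nH]cn3"


def smiles_sanity_check(smiles: str) -> list[str]:
    """Return structural problems found in a SMILES string (hypothetical helper).

    Checks only bracket/parenthesis balance and single-digit ring-closure
    pairing. Two-digit closures written as %nn are not handled.
    """
    problems = []
    depth = 0            # current parenthesis nesting
    in_bracket = False   # inside a [...] atom specification
    open_rings = set()   # ring-closure digits seen an odd number of times

    for ch in smiles:
        if in_bracket:
            # Digits inside [...] are isotopes, H counts, or charges.
            if ch == "]":
                in_bracket = False
            continue
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            problems.append("unmatched ']'")
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                problems.append("unmatched ')'")
                depth = 0
        elif ch.isdigit():
            open_rings ^= {ch}  # toggle: opened once, closed once

    if in_bracket:
        problems.append("unclosed '['")
    if depth > 0:
        problems.append("unclosed '('")
    if open_rings:
        problems.append("unpaired ring closure(s): " + ", ".join(sorted(open_rings)))
    return problems


if __name__ == "__main__":
    for name, smiles in {"nirmatrelvir": NIRMATRELVIR, "x77": X77}.items():
        print(name, smiles_sanity_check(smiles) or "ok")
```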

3. Use Case: Virtual Screening & Pose Prediction

Our goal is to use Boltz-1 for virtual screening and pose prediction. Given the protein target (Mpro) and candidate ligands (Nirmatrelvir, X77), Boltz-1 predicts the 3D structure of each protein–ligand complex. This serves to:

  • Validate known ligand poses: Check whether Boltz-1 places known binders in the active site with high confidence.
  • Compare binding confidence: Screen new compounds by comparing predicted confidence scores. A higher interface confidence from Boltz-1 could indicate a promising binder, guiding which molecules to prioritize for synthesis or experimental testing.
  • Suggest ligand optimization strategies: Modify the input SMILES (e.g., change functional groups) and re-run Boltz-1 to see how the binding pose or score changes, which can point to beneficial modifications.

This is an advanced alternative to classical docking: Boltz-1 considers protein flexibility and learned interactions, potentially yielding more accurate poses (PubMed Central; Drug Target Review). In this guide, we demonstrate the process for our two example ligands, but it can be extended to any number of candidate SMILES in a virtual screen.

***

4. Step-by-Step: Run Boltz-1 on DiPhyx

4.1 Create a Compute-Unit

We recommend using a GPU-enabled device, preferably g6.2xlarge or better, to ensure optimal performance.

  1. Create the Compute Unit: Navigate to your cloud provider's interface and create a compute unit with the required specifications. Ensure the GPU is enabled.
  2. Wait for the Compute Unit to be Ready: Once the compute unit is in the READY state, click on it to proceed.
  3. Add the Boltz-1 Project: Attach the Boltz-1 project to the compute unit. During this step, you will be prompted to configure parameters such as:
    • Host and Project Mounted Volumes: Specify the directories for input and output data.
    • Execution Script: Provide the script that will run the Boltz-1 prediction.

Ensure all parameters are correctly set before starting the prediction process.

4.2 Prepare Input YAML Files

To prepare the input files for Boltz-1, create the following YAML files and place them in a directory (e.g., /volume/boltz_inputs/). Ensure the directory exists before uploading the files.

mpro_nirmatrelvir.yaml
yaml
version: 1
sequences:
  - protein:
      id: A
      # Keep the sequence on one line: a folded scalar (>-) would insert a
      # space at every line break.
      sequence: SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDVVYCPRHVICTSEDMLNPNYEDLLIRKSNHNFLVQAGNVQLRVIGHSMQNCVLKLKVDTANPKTPKYKFVRIQPGQTFSVLACYNGSPSGVYQCAMRPNFTIKGSFLNGSCGSVGFNIDYDCVSFCYMHHMELPTGVHAGTDLEGNFYGPFVDRQTAQAAGTDTTITVNVLAWLYAAVINGDRWFLNRFTTTLNDFNLVAMKYNYEPLTQDHVDILGPLSAQTGIAVLDMCASLKELLQNGMNGRTILGSALLEDEFTPFDVVRQCSGVTFQ
  - ligand:
      id: X
      smiles: "N#C[C@H](C[C@@H]1CCNC1=O)NC(=O)[C@H]1N(C[C@H]2[C@@H]1C2(C)C)C(=O)[C@H](C(C)(C)C)NC(=O)C(F)(F)F"

Create mpro_x77.yaml similarly, changing the SMILES to that of X77:

mpro_x77.yaml
yaml
version: 1
sequences:
  - protein:
      id: A
      # Keep the sequence on one line: a folded scalar (>-) would insert a
      # space at every line break.
      sequence: SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDVVYCPRHVICTSEDMLNPNYEDLLIRKSNHNFLVQAGNVQLRVIGHSMQNCVLKLKVDTANPKTPKYKFVRIQPGQTFSVLACYNGSPSGVYQCAMRPNFTIKGSFLNGSCGSVGFNIDYDCVSFCYMHHMELPTGVHAGTDLEGNFYGPFVDRQTAQAAGTDTTITVNVLAWLYAAVINGDRWFLNRFTTTLNDFNLVAMKYNYEPLTQDHVDILGPLSAQTGIAVLDMCASLKELLQNGMNGRTILGSALLEDEFTPFDVVRQCSGVTFQ
  - ligand:
      id: X
      smiles: "CC(C)(C)c1ccc(cc1)N([C@@H](C(=O)NC1CCCCC1)c2cccnc2)C(=O)c3c[nH]cn3"
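
For a screen with more than a couple of ligands, writing each file by hand becomes tedious. The following is a minimal sketch that writes one input file per ligand using the same layout as the two files above. The function name `write_inputs`, the `mpro_<name>.yaml` naming, and the use of single quotes around the SMILES (so that a backslash in a SMILES string is not read as a YAML escape) are choices made for this illustration.

```python
from pathlib import Path

# Same layout as the YAML files in this guide. The sequence is written on a
# single line and the SMILES is single-quoted (an illustrative choice).
TEMPLATE = """version: 1
sequences:
  - protein:
      id: A
      sequence: {sequence}
  - ligand:
      id: X
      smiles: '{smiles}'
"""


def write_inputs(sequence: str, ligands: dict[str, str], out_dir: str) -> list[Path]:
    """Write one Boltz-1 input file per ligand (hypothetical helper)."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, smiles in ligands.items():
        path = target / f"mpro_{name}.yaml"
        path.write_text(TEMPLATE.format(sequence=sequence, smiles=smiles))
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Placeholder sequence for the demo; use the full Mpro sequence in practice.
    demo = write_inputs(
        "SGFRKMAFPSGKVEGCMVQV",
        {"x77": "CC(C)(C)c1ccc(cc1)N([C@@H](C(=O)NC1CCCCC1)c2cccnc2)C(=O)c3c[nH]cn3"},
        "boltz_inputs_demo",
    )
    print(demo[0].read_text())
```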

4.3 Upload Input Files

You can upload the files by clicking the Browser icon on your project or instance page. Alternatively, navigate to the Storage page and locate your Compute-Unit in the left-hand panel.

4.4 Run Prediction

To execute the Boltz-1 model, ensure the input YAML files are uploaded to the /volume/boltz_inputs/ directory. Use the following script to run the prediction:  

bash
#!/bin/bash
# Set precision for matrix operations
export TORCH_FORCE_FLOAT32_MATMUL_PRECISION=medium

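# Run Boltz-1 prediction on every YAML file in the input directory.
# --output_format pdb writes .pdb structures (the default is mmCIF).
boltz predict /volume/boltz_inputs/ \
  --use_msa_server \
  --out_dir /volume/boltz_output/ \
  --cache /volume/cache \
  --recycling_steps 20 \
  --diffusion_samples 5 \
  --output_format pdb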

Key Notes:

  • Input Directory:  /volume/boltz_inputs/ should contain the prepared YAML files (e.g., mpro_nirmatrelvir.yaml and mpro_x77.yaml).
  • Output Directory: Results will be saved in /volume/boltz_output/.
  • MSA Server: The --use_msa_server flag enables automatic fetching of multiple sequence alignments (MSAs) for the protein sequence.
  • Recycling Steps: --recycling_steps sets the number of refinement iterations (the Boltz-1 default is 3). It is set to 20 here for extra refinement; more steps can improve accuracy at the cost of memory, so lower the value if memory constraints arise.
  • Diffusion Samples: --diffusion_samples 5 generates 5 structural predictions per input for ensemble analysis, which increases runtime.
  • Running Time: Expect several minutes per complex on a T4/V100-class GPU, depending on protein length and MSA size. Our 306-residue protein with one ligand should take on the order of 5–15 minutes for a single sample once the MSA is ready (times can vary).

As Boltz-1 runs, it prints progress and eventually reports that the prediction has been saved. After it has processed both YAML files, check the output directory.
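
If you prefer to launch runs from Python, for example to vary the settings per batch, the command can be assembled as an argument list. This is a sketch that uses only the flags shown in the script above; the function name `build_boltz_command` and its default values are assumptions made for this illustration.

```python
import shlex


def build_boltz_command(
    input_dir: str,
    out_dir: str,
    cache_dir: str,
    recycling_steps: int = 3,
    diffusion_samples: int = 1,
    use_msa_server: bool = True,
) -> list[str]:
    """Assemble a `boltz predict` argument list (hypothetical helper)."""
    command = [
        "boltz", "predict", input_dir,
        "--out_dir", out_dir,
        "--cache", cache_dir,
        "--recycling_steps", str(recycling_steps),
        "--diffusion_samples", str(diffusion_samples),
    ]
    if use_msa_server:
        command.append("--use_msa_server")
    return command


if __name__ == "__main__":
    cmd = build_boltz_command(
        "/volume/boltz_inputs/", "/volume/boltz_output/", "/volume/cache",
        recycling_steps=20, diffusion_samples=5,
    )
    # Print the shell-quoted form; nothing is executed here.
    print(shlex.join(cmd))
```

To actually run the prediction, the list can be passed to `subprocess.run(cmd, check=True)` on a machine where the `boltz` command is installed.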

Expected Behavior:

  • Boltz-1 will process each YAML file in the input directory.
  • It will allocate GPU memory for heavy computations. Ensure your GPU has sufficient capacity; otherwise, reduce --recycling_steps or --diffusion_samples.
  • The output will include predicted 3D structures (.pdb files) and confidence metrics (.json files) for each protein–ligand complex.

Troubleshooting:

  • Out-of-Memory Errors: Use a larger GPU instance or reduce the number of recycling steps.
  • Slow MSA Fetching: This depends on the sequence length and server load. Be patient or consider precomputing MSAs if delays are significant.
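
When a run fails with an out-of-memory error, one way to apply the advice above is to retry with progressively lighter settings. The sketch below only generates such a schedule: it halves the number of diffusion samples first and then the recycling steps. The halving strategy, the floor of 3 recycling steps, and the function name `fallback_settings` are all assumptions made for this illustration.

```python
from typing import Iterator


def fallback_settings(
    recycling_steps: int, diffusion_samples: int, min_recycling: int = 3
) -> Iterator[tuple[int, int]]:
    """Yield (recycling_steps, diffusion_samples) pairs, heaviest first.

    Hypothetical retry schedule: reduce samples to 1 before touching the
    recycling steps, then halve the steps down to `min_recycling`.
    """
    while True:
        yield recycling_steps, diffusion_samples
        if diffusion_samples > 1:
            diffusion_samples = max(1, diffusion_samples // 2)
        elif recycling_steps > min_recycling:
            recycling_steps = max(min_recycling, recycling_steps // 2)
        else:
            return


if __name__ == "__main__":
    for steps, samples in fallback_settings(20, 5):
        print(f"--recycling_steps {steps} --diffusion_samples {samples}")
```

Each pair would be tried in order until a run completes; if even the lightest setting fails, a larger GPU instance is the remaining option.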

By following these steps, you can efficiently run Boltz-1 predictions and generate high-confidence protein–ligand complex structures.

Tip: Enhance Your Workflow with DiPhyx Development Tools

To streamline debugging, monitoring, and post-processing of results, consider running a development environment such as Coder, Visual Studio Code, or Jupyter Lab with the /volume folder mounted. This setup allows you to:

  • Debug: Easily inspect input files, scripts, and logs in real-time.
  • Monitor: Track the progress of Boltz-1 predictions and resource utilization.
  • Post-Process: Analyze output files, visualize structures, and refine workflows directly within the development environment.

By integrating these tools, you can improve efficiency and ensure a smoother experience when working with Boltz-1.

4.5 Examine and Interpret Results

Boltz-1 organizes results in the directory given by --out_dir. With our example path /volume/boltz_output, the contents of each prediction folder and the overall directory layout are as follows.

Each prediction folder contains:

  • *_model_0.pdb – predicted 3D structure
  • confidence_*.json – includes:
    • confidence_score
    • ligand_iptm
    • ptm, iptm, plddt, pair_chains_iptm
bash
boltz_output/
├── predictions/
│   ├── mpro_nirmatrelvir/
│   │   ├── mpro_nirmatrelvir_model_0.pdb
│   │   ├── confidence_mpro_nirmatrelvir_model_0.json
│   │   ├── plddt_mpro_nirmatrelvir_model_0.npz
│   │   └── ... (other outputs)
│   └── mpro_x77/
│       ├── mpro_x77_model_0.pdb
│       ├── confidence_mpro_x77_model_0.json
│       └── ... 
└── processed/  (processed input features)
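
To collect the output files programmatically, you can walk the output directory. The sketch below searches recursively for any predictions/ folder, so it does not depend on the exact nesting under the output directory, and groups structure and confidence files by input name. The function name `find_predictions` is a hypothetical helper.

```python
from pathlib import Path


def find_predictions(out_dir: str) -> dict[str, dict[str, list[Path]]]:
    """Group structure and confidence files by prediction folder (hypothetical helper)."""
    found = {}
    for pred_root in Path(out_dir).rglob("predictions"):
        if not pred_root.is_dir():
            continue
        for folder in sorted(p for p in pred_root.iterdir() if p.is_dir()):
            structures = list(folder.glob("*_model_*.pdb")) + list(folder.glob("*_model_*.cif"))
            found[folder.name] = {
                "structures": sorted(structures),
                "confidence": sorted(folder.glob("confidence_*.json")),
            }
    return found


if __name__ == "__main__":
    for name, files in find_predictions("/volume/boltz_output").items():
        print(name, len(files["structures"]), "structure(s),",
              len(files["confidence"]), "confidence file(s)")
```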

Each input file gets its own subfolder under predictions/. The main files of interest are:

  • *_model_0.pdb: The predicted 3D structure of the complex (protein and ligand coordinates).
    • Open this in PyMOL, UCSF Chimera, or any molecular viewer to inspect the binding pose.
    • Check that the ligand is located in the Mpro active site (near His41 and Cys145) and observe interactions.
  • confidence_*.json: A JSON file containing confidence metrics for the prediction. Key metrics include:
    • confidence_score: An overall confidence score (0 to 1), which is a weighted combination of pLDDT (local model confidence) and interface TM-score.
    • ptm / iptm: Predicted TM-score for the whole complex and at interfaces, respectively. Higher values (closer to 1) indicate confidence in the relative arrangement.
    • ligand_iptm: The interface TM-score considering only the protein–ligand interface.
      • A higher ligand_iptm suggests a well-defined binding mode.
      • A value near 0 indicates the model did not find a consistent binding site, implying a poor binder.
    • complex_plddt / complex_iplddt: Average pLDDT (per-residue confidence) over the whole complex and over the interface, respectively. Boltz-1 reports these on a 0–1 scale.
    • pair_chains_iptm: Pairwise interface scores for multiple chains.
      • Not needed for single-chain + ligand cases, except the ligand might be treated as a “chain” X.
  • Additional files:
    • *_model_0.cif: Present if using mmCIF format.
    • .npz files: Contain per-residue or per-pair score arrays like PAE (predicted alignment error), similar to AlphaFold outputs. These are advanced details.
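
The metrics above can be read with the standard json module. The sketch below extracts the keys named in this section and, as a cross-check, recomputes the aggregate score. The weighting used here (0.8 × complex_plddt + 0.2 × iptm) follows our reading of the Boltz documentation and should be treated as an assumption to verify against your installed version; both function names are hypothetical helpers.

```python
import json

METRIC_KEYS = (
    "confidence_score", "ptm", "iptm", "ligand_iptm",
    "complex_plddt", "complex_iplddt",
)


def summarize_confidence(path: str) -> dict:
    """Return the headline metrics from a confidence_*.json file.

    Keys that are absent from the file are returned as None.
    """
    with open(path) as handle:
        data = json.load(handle)
    return {key: data.get(key) for key in METRIC_KEYS}


def recompute_confidence(complex_plddt: float, iptm: float,
                         plddt_weight: float = 0.8) -> float:
    """Weighted combination of pLDDT and interface TM-score.

    The 0.8 / 0.2 split is an assumption, not confirmed by this guide.
    """
    return plddt_weight * complex_plddt + (1.0 - plddt_weight) * iptm


if __name__ == "__main__":
    # Illustrative numbers only, not real Boltz-1 output.
    print(round(recompute_confidence(complex_plddt=0.9, iptm=0.7), 3))
```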

Interpretation: For each ligand, compare the confidence_score and ligand_iptm. Both Nirmatrelvir and X77 are true binders, so we expect high confidence. For instance, you might see a confidence_score of ~0.8–0.9 (out of 1) and a ligand_iptm significantly above 0 (maybe ~0.5–0.8). If you were screening an unknown compound and saw a very low ligand_iptm (near 0), it might mean Boltz-1 could not dock the ligand stably, a sign that the compound is less likely to bind or that multiple binding modes confused the model. High scores, on the other hand, would encourage prioritizing that compound.

Open the PDB structures to visualize the binding pose. Boltz-1 predictions often place known ligands correctly in the pocket (comparable to crystal poses), especially if confidence metrics are high. For Nirmatrelvir, check whether the nitrile group is near the catalytic Cys145; this would indicate that Boltz-1 recognized the covalent binding mode. For X77, see whether it sits in the substrate-binding cleft, forming hydrogen bonds with residues like Glu166 or pi-stacking with His41 (as known from the literature).

Because Boltz-1 leverages learned structural patterns, it may also adjust the protein conformation to accommodate the ligand, something static docking might miss. For example, loop movements or side-chain rotations in the binding site could be predicted.

If you included multiple diffusion samples (--diffusion_samples n), you will have model_0, model_1, ..., up to model_{n-1} files, each a slightly different sampled pose. In that case, Boltz-1 orders them by confidence by default (model_0 is the highest confidence). You can compare these for alternate binding modes.
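
The "is the ligand near Cys145" check can also be done numerically. The sketch below reads a PDB file using the standard fixed-column layout and returns the shortest distance from the Cys145 SG atom to any ligand atom. It assumes the protein is chain A and that ligand atoms are written as HETATM records; confirm both against your own output files. The function name is a hypothetical helper.

```python
import math


def min_distance_to_ligand(pdb_path: str, chain: str = "A",
                           resnum: int = 145, atom: str = "SG") -> float:
    """Shortest distance (in Å) from one protein atom to any HETATM atom.

    Assumes ligand atoms are stored as HETATM records (an assumption about
    the output files, not something this guide confirms).
    """
    target = None
    ligand_atoms = []
    with open(pdb_path) as handle:
        for line in handle:
            record = line[:6].strip()
            if record not in ("ATOM", "HETATM"):
                continue
            # PDB fixed columns: x = 31-38, y = 39-46, z = 47-54 (1-based).
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            if record == "HETATM":
                ligand_atoms.append(xyz)
            elif (line[21] == chain and int(line[22:26]) == resnum
                  and line[12:16].strip() == atom):
                target = xyz
    if target is None:
        raise ValueError(f"atom {atom} of residue {resnum} in chain {chain} not found")
    if not ligand_atoms:
        raise ValueError("no HETATM records found")
    return min(math.dist(target, xyz) for xyz in ligand_atoms)
```

For example, `min_distance_to_ligand("mpro_nirmatrelvir_model_0.pdb")` returns a single number; a small value means at least one ligand atom sits next to the catalytic cysteine, and the pose should still be inspected visually.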

Interpreting Output

  • confidence_score ≈ 0.8–0.9 → strong prediction
  • ligand_iptm → measures how confidently the ligand is docked
  • Visualize using ParaView or PyMOL on DiPhyx
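
For a screen with many ligands, the comparison can be automated. The sketch below reads the top-ranked confidence file (model_0) for each ligand and sorts the ligands by ligand_iptm, highest first. The function name `rank_ligands` and the choice of ligand_iptm as the sort key are illustrative; the ranking is a prioritization aid, not a measure of binding affinity.

```python
import json
from pathlib import Path


def rank_ligands(out_dir: str, metric: str = "ligand_iptm") -> list[tuple]:
    """Return (name, metric value, confidence_score) rows, best first.

    Hypothetical helper. Entries that lack the metric are placed last.
    """
    rows = []
    for path in Path(out_dir).rglob("confidence_*_model_0.json"):
        data = json.loads(path.read_text())
        rows.append((path.parent.name, data.get(metric), data.get("confidence_score")))
    rows.sort(key=lambda row: (row[1] is None, -(row[1] or 0.0)))
    return rows


if __name__ == "__main__":
    for name, value, score in rank_ligands("/volume/boltz_output"):
        print(f"{name}: ligand_iptm={value} confidence_score={score}")
```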