/ / 56 min read

Guide3D A Bi-planar X-ray Dataset for 3D Shape Reconstruction (Part 3)

Methodology

Utilizing the Guide3D dataset, we build a benchmark for the shape prediction task, a critical component in endovascular intervention. Accurate shape prediction of the guidewire is essential for successful navigation and intervention. Here, we introduce a novel shape prediction network designed to predict the guidewire shape from a sequence of monoplanar images. This approach leverages deep learning to learn spatio-temporal correlations from a static camera observing a dynamic scene. Unlike conventional reconstruction methods that require biplanar images, our network uses a sequence of images to extract temporal information, allowing it to map a single image IA\mathbf{I}_A to the 3D guidewire curve C(u)\mathbf{C}(\mathbf{u}). By adopting this deep learning approach, we aim to simplify the shape prediction process while maintaining high accuracy. This method has the potential to enhance endovascular navigation by providing real-time, accurate predictions of the guidewire shape, ultimately improving procedural outcomes and reducing reliance on specialized equipment.

Network Key Components: The figure illustrates the essential components of the proposed model. a) Spherical coordinates (r,θ,ϕ)(r, \theta, \phi) are used for predicting the guidewire shape. b) The model predicts the 3D shape of a guidewire from image sequences It\mathbf{I}_t. A Vision Transformer (ViT) extracts spatial features zt\mathbf{z}_t, which a Gated Recurrent Unit (GRU) processes to capture temporal dependencies, producing hidden states ht\mathbf{h}_t. The final hidden state drives three prediction heads: the Tip Prediction Head for the 3D tip position pR3\mathbf{p} \in \mathbb{R}^3, the Spherical Offset Prediction Head for coordinate offsets (Δϕ,Δθ)(\Delta \phi, \Delta \theta), and the Stop Prediction Head for terminal point probability S\mathbf{S}.

Spherical Coordinates Representation

Predicting 3D points directly can be challenging due to the high degree of freedom. To mitigate this, we use spherical coordinates, which offer significant advantages over Cartesian coordinates for guidewire shape prediction. Spherical coordinates, as represented in Fig. 1a, are defined by the radius rr, polar angle θ\theta, and azimuthal angle ϕ\phi. They provide a more natural representation for the position and orientation of points along the guidewire, which is typically elongated and curved.

Mathematically, a point in spherical coordinates (r,θ,ϕ)(r, \theta, \phi) can be converted to Cartesian coordinates (x,y,z)(x, y, z) using the transformations:

x=rsinθcosϕ,y=rsinθsinϕ,z=rcosθ.x = r \sin \theta \cos \phi, \quad y = r \sin \theta \sin \phi, \quad z = r \cos \theta.

This conversion simplifies the modeling of angular displacements and rotations, as spherical coordinates directly encode directional information.

Predicting angular displacements (Δθ,Δϕ)(\Delta \theta, \Delta \phi) relative to a known radius rr aligns with the physical constraints of the guidewire, facilitating more accurate and interpretable shape predictions. By predicting an initial point (tip position) and representing subsequent points as offsets in Δϕ\Delta \phi and Δθ\Delta \theta while keeping rr fixed, this method simplifies shape comparison and reduces the parameter space. This approach enhances the model’s ability to capture the guidewire’s spatial configuration and improves overall prediction performance.

Network Architecture

The proposed model (shown in Fig. 1b) addresses the problem of predicting the 3D shape of a guidewire from a sequence of images. Each image sequence captures the guidewire from different time steps IA,t\mathbf{I}_{A,t}, and the goal is to infer the continuous 3D shape Ct(ut)\mathbf{C}_t(\mathbf{u}_t). This many-to-one prediction task is akin to generating a variable-length sequence from variable-length input sequences, a technique commonly utilized in fields such as machine translation and video analysis.

To achieve this, the input pipeline consists of a sequence of images depicting the guidewire. A Vision Transformer (ViT), pre-trained on ImageNet, is employed to extract high-dimensional spatial feature representations from these images. The ViT generates feature maps ztR\mathbf{z}_t \in \mathbb{R}. These feature maps are then fed into a Gated Recurrent Unit (GRU) to capture the temporal dependencies across the image sequence. The GRU processes the feature maps zt\mathbf{z}_t from consecutive time steps, producing a sequence of hidden states ht\mathbf{h}_t. Formally, the GRU operation at time step tt is defined as:

ht=GRU(zt,ht1).\mathbf{h}_t = \text{GRU}(\mathbf{z}_t, \mathbf{h}_{t-1}).

The final hidden state ht\mathbf{h}_t from the GRU is used by three distinct prediction heads, each tailored for a specific aspect of the guidewire shape prediction: the Tip Prediction Head, responsible for predicting the 3D coordinates of the guidewire’s tip through a fully connected layer that maps the hidden state ht\mathbf{h}_t to a Cartesian anchoring point pR3\mathbf{p} \in \mathbb{R}^3; the Spherical Offset Prediction Head, which predicts the spherical coordinate offsets (Δϕ,Δθ)(\Delta \phi, \Delta \theta) for points along the guidewire with a fixed radius rr; and the Stop Prediction Head, which outputs the probability distribution indicating the terminal point of the guidewire by using a softmax layer to produce a probability tensor S\mathbf{S}, where each element Sj\mathbf{S}_j indicates the probability of the jj-th point being the terminal point.

Loss Function

The custom loss function for training the model combines multiple components to handle the point-wise tip error, variable guidewire length (stop criteria), and tip position predictions. The overall loss function Ltotal\mathcal{L}_{\text{total}} is defined as:

Ltotal=1Ni=1N(λtipp^ipi2+λoffset((ϕ^iϕi)2+(θ^iθi)2)+λstop(silog(s^i)(1si)log(1s^i)))\mathcal{L}_{\text{total}} = \frac{1}{N} \sum_{i=1}^N \bigg( \lambda_{\text{tip}} \left \| \hat{\mathbf{p}}_i - \mathbf{p}_i \right \|^2 + \lambda_{\text{offset}} \big( (\hat{\boldsymbol{\phi}}_i - \boldsymbol{\phi}_i)^2 + (\hat{\boldsymbol{\theta}}_i - \boldsymbol{\theta}_i)^2 \big) + \lambda_{\text{stop}} \big( -\mathbf{s}_i \log (\hat{\mathbf{s}}_i) - (1 - \mathbf{s}_i) \log (1 - \hat{\mathbf{s}}_i) \big) \bigg)

where NN is the number of samples, and λtip\lambda_{\text{tip}}, λoffset\lambda_{\text{offset}}, and λstop\lambda_{\text{stop}} are weights that balance the contributions of each loss component. The tip prediction loss (Ltip\mathcal{L}_{\text{tip}}) uses mean squared error (MSE) to ensure accurate 3D tip coordinates. The spherical offset loss (Loffset\mathcal{L}_{\text{offset}}) also uses MSE to align predicted and ground truth angular offsets, capturing the guidewire’s shape. The stop prediction loss (Lstop\mathcal{L}_{\text{stop}}) employs binary cross-entropy (BCE) to accurately predict the guidewire’s endpoint.

Training Details

The model was trained end-to-end using the loss from Equation above. The NAdam optimizer was used with an initial learning rate of 1×1041 \times 10^{-4}. Additionally, a learning rate scheduler was employed to adjust the learning rate dynamically based on the validation loss. Specifically, the ReduceLROnPlateau scheduler was configured to reduce the learning rate by a factor of 0.1 if the validation loss did not improve for 10 epochs. The model was trained for 400 epochs, with early stopping based on the validation loss to further prevent overfitting.

Next

In the next part, we will validate the effectiveness of Guidewire Shape Prediction dataset and methodology.

Like What You See?