
Lightweight Language-driven Grasp Detection using Conditional Consistency Model (Part 2)

Methodology of Conditional Consistency Model for Lightweight Language-driven Grasp Detection.


1. Lightweight Language-driven Grasp Detection

Overview

Given an input RGB image and a text prompt describing the object of interest, we aim to detect the grasping pose on the image that best matches the text prompt. We follow the popular rectangle grasp convention widely used in previous work to define the grasp.

In the diffusion model, we represent the target grasp pose as $\mathbf{x}_0$. The objective of our language-driven grasp detection diffusion process is to denoise from a noisy state $\mathbf{x}_T$ to the original grasp pose $\mathbf{x}_0$, conditioned on the input image and grasp instruction, represented by $y$. The forward process in a traditional conditional diffusion model is defined as:

$$q(\mathbf{x}_t|\mathbf{x}_{t-1}) = \mathcal{N}(\sqrt{1-\beta_t}\mathbf{x}_{t-1},\beta_t\mathbf{I})~, \tag{1}$$

where the hyperparameter $\beta_t$ is the amount of noise added at diffusion step $t \in [0,T] \subseteq \mathbb{R}$.
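As a concrete illustration, the sketch below (a minimal PyTorch snippet, assuming a hypothetical 5-D rectangle grasp pose and an illustrative noise value, neither taken from the paper) applies one forward step of Equation (1):

```python
import torch

def forward_step(x_prev: torch.Tensor, beta_t: float) -> torch.Tensor:
    """One forward step of Eq. (1): x_t ~ N(sqrt(1 - beta_t) x_{t-1}, beta_t I)."""
    noise = torch.randn_like(x_prev)
    return (1.0 - beta_t) ** 0.5 * x_prev + beta_t ** 0.5 * noise

# Hypothetical 5-D rectangle grasp pose (x, y, width, height, angle), noised one step.
x0 = torch.tensor([0.42, 0.31, 0.10, 0.05, 0.78])
x1 = forward_step(x0, beta_t=1e-4)
```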

To train a diffusion model with condition y, we use a neural network to learn the reverse process:

$$p_\phi(\mathbf{x}_{t-1}|\mathbf{x}_t,y) = \mathcal{N}(\mu_\phi(\mathbf{x}_t,t,y),\Sigma_\phi(\mathbf{x}_t,t,y))~. \tag{2}$$

In our approach, we utilize the diffusion process in the continuous domain, where $\mathbf{x}_t$ is the grasp pose state at an arbitrary time index $t$. Unlike popular discrete diffusion models, a continuous space lets us traverse the diffusion process at arbitrary timesteps, which improves sample quality, reduces inference time, and allows more fine-grained control over the denoising process.

Method Overview

Figure 1: Overview of our method. First, the input RGB image and text prompt are fed into the feature encoder and ALBEF fusion. Subsequently, we concurrently train two models with the same architecture: a score network that estimates the probability flow Ordinary Differential Equation (ODE) trajectory of the diffusion process, and a conditional consistency model that determines the grasp pose within a few denoising steps.

Conditional Consistency Model for LLGD

To reduce the inference time during the denoising step of the diffusion model, we aim to estimate the original grasp pose with just a few denoising steps. Since our language-driven grasp detection task has the condition $y$, we introduce a conditional consistency model, based on the consistency concept, to directly infer the original grasp pose during inference:

$$\mathbf{f}_\theta(\mathbf{x}_t,t,y) = \begin{cases} \mathbf{x}_t & t \in [0,\epsilon] \\ \mathbf{F}_\theta(\mathbf{x}_t,t,y) & t \in (\epsilon,T] \end{cases}~, \tag{3}$$

where $\mathbf{f}_\theta(\mathbf{x}_\epsilon, \epsilon, y) = \mathbf{x}_\epsilon$ is the boundary condition, and $\mathbf{F}_\theta(\mathbf{x}_t,t,y)$ is a free-form deep neural network whose output has the same dimensionality as $\mathbf{x}_t$.
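A minimal sketch of how the piecewise definition in Equation (3) could be parameterized is shown below; the MLP standing in for $\mathbf{F}_\theta$, its layer sizes, and the 5-D pose and 256-D condition dimensions are assumptions, not the authors' implementation (the boundary $\epsilon$ defaults to 1, matching the value reported later):

```python
import torch
import torch.nn as nn

class ConditionalConsistency(nn.Module):
    """Piecewise consistency function f_theta(x_t, t, y) of Eq. (3)."""

    def __init__(self, pose_dim: int = 5, cond_dim: int = 256, eps: float = 1.0):
        super().__init__()
        self.eps = eps  # boundary epsilon; the paper reports eps = 1
        # Placeholder for the free-form network F_theta.
        self.F_theta = nn.Sequential(
            nn.Linear(pose_dim + 1 + cond_dim, 256), nn.SiLU(),
            nn.Linear(256, pose_dim),
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        h = torch.cat([x_t, t.unsqueeze(-1), y], dim=-1)
        pred = self.F_theta(h)                     # F_theta(x_t, t, y)
        near_zero = (t <= self.eps).unsqueeze(-1)  # identity branch for t in [0, eps]
        return torch.where(near_zero, x_t, pred)
```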

To train our conditional consistency model, we employ knowledge distillation from a continuous diffusion process:

$$d\mathbf{x}_{t} = -\frac{1}{2}\gamma_t\mathbf{x}_t dt + \sqrt{\gamma_t} d\mathbf{w}_t~, \tag{4}$$

where $\gamma_t$ is a non-negative function referred to as the noise schedule, and $\mathbf{w}_t$ is the standard Brownian motion. This forward process creates a trajectory of grasp poses $\{\mathbf{x}_t\}_{t=0}^T$. The grasp pose state $\mathbf{x}_t$ depends on the time index $t$ and the input image and text prompt. The grasp distribution $p(\mathbf{x}_0|y)$ from the dataset is transformed into $p(\mathbf{x}_T|y) \sim \mathcal{N}(0, \mathbf{I})$. Given the ground truth grasp pose $\mathbf{x}_0$, we can sample $\mathbf{x}_t$ at arbitrary $t$:

$$p(\mathbf{x}_t|\mathbf{x}_0) = \mathcal{N}(\mu_t, \Sigma_t)~, \tag{5}$$

where

$$\mu_t = e^{\frac{1}{2}\rho_t} \mathbf{x}_0,\quad \Sigma_t = (1 - e^{\rho_t})\mathbf{I},\quad \rho_t = -\int_{0}^{t} \gamma_s\, ds~.$$
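Because the marginal in Equation (5) is Gaussian with closed-form mean and covariance, $\mathbf{x}_t$ can be sampled in one shot. The sketch below assumes a constant noise schedule $\gamma_t = \gamma$ (so $\rho_t = -\gamma t$), which is only an illustrative choice rather than the paper's schedule:

```python
import torch

def marginal_sample(x0: torch.Tensor, t: torch.Tensor, gamma: float = 0.01) -> torch.Tensor:
    """Sample x_t ~ N(mu_t, Sigma_t) of Eq. (5); x0: (B, D), t: (B,)."""
    rho_t = (-gamma * t).view(-1, 1)            # rho_t = -∫_0^t gamma_s ds (constant gamma)
    mean = torch.exp(0.5 * rho_t) * x0          # mu_t = e^{rho_t / 2} x_0
    std = torch.sqrt(1.0 - torch.exp(rho_t))    # sqrt of the diagonal of Sigma_t
    return mean + std * torch.randn_like(x0)

# Example: noising a batch of hypothetical 5-D grasp poses at t = 200.
xt = marginal_sample(torch.rand(8, 5), torch.full((8,), 200.0))
```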

The forward SDE in Equation (4) admits a corresponding probability flow ODE. With the conditional variable $y$, this ODE can be written as:

$$\frac{d\mathbf{x}_t}{dt} = -\frac{1}{2}\gamma_t\left[\mathbf{x}_t + \nabla\log p(\mathbf{x}_t|y)\right]~, \tag{6}$$

where $\nabla\log p(\mathbf{x}_t|y)$ is the score function of the conditional diffusion model.

Suppose that we have a neural network $\mathbf{s}_\phi(\mathbf{x}_t, t, y)$ that approximates the score function, i.e., $\mathbf{s}_\phi(\mathbf{x}_t, t, y) \approx \nabla\log p(\mathbf{x}_t|y)$. After training the score network, we can replace the $\nabla\log p(\mathbf{x}_t|y)$ term in Equation (6) with the network:

$$\frac{d\mathbf{x}_t}{dt} = -\frac{1}{2}\gamma_t\left[\mathbf{x}_t + \mathbf{s}_\phi(\mathbf{x}_t, t, y)\right]~. \tag{7}$$

Score Function Loss. In order to approximate the score function $\nabla\log p(\mathbf{x}_t|y)$, the conditional denoising estimator minimizes the following objective:

$$\mathcal{L}_{\rm score}=\mathbb{E}_{ \begin{subarray}{l} t \sim \mathcal{U}[0, T] \\ \mathbf{x}_0,y \sim p(\mathbf{x}_0,y) \\ \mathbf{x}_t \sim p(\mathbf{x}_t|\mathbf{x}_0) \end{subarray} }\left[\lambda(t) \|\nabla\log p(\mathbf{x}_t|\mathbf{x}_0) - \mathbf{s}_\phi(\mathbf{x}_t,t,y)\|^2 \right]~, \tag{8}$$

where $\lambda(t) \in \mathbb{R}^+$ is a positive weighting function.
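For the Gaussian transition in Equation (5), the target $\nabla\log p(\mathbf{x}_t|\mathbf{x}_0)$ has the closed form $-(\mathbf{x}_t - \mu_t)/\sigma_t^2$, so the loss in Equation (8) can be sketched as below. The score network interface, the constant schedule, and the uniform weighting $\lambda(t) \equiv 1$ are placeholder assumptions:

```python
import torch
import torch.nn.functional as F

def score_loss(score_net, x0: torch.Tensor, y: torch.Tensor, t: torch.Tensor, gamma: float = 0.01):
    """Denoising score-matching objective of Eq. (8) with lambda(t) = 1."""
    rho_t = (-gamma * t).view(-1, 1)
    mu_t = torch.exp(0.5 * rho_t) * x0
    var_t = 1.0 - torch.exp(rho_t)
    x_t = mu_t + var_t.sqrt() * torch.randn_like(x0)   # sample from Eq. (5)
    target = -(x_t - mu_t) / var_t                     # grad log p(x_t | x_0)
    return F.mse_loss(score_net(x_t, t, y), target)    # s_phi(x_t, t, y) vs. target
```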

Proposition 1. Suppose that $\mathbf{x}_t$ is conditionally independent of $y$ given $\mathbf{x}_0$; then minimizing $\mathcal{L}_{\rm score}$ is the same as minimizing:

$$\mathbb{E}_{ \begin{subarray}{l} t \sim \mathcal{U}[0, T] \\ \mathbf{x}_t,y \sim p(\mathbf{x}_t,y) \end{subarray} }\left[\lambda(t) \|\nabla\log p(\mathbf{x}_t|y) - \mathbf{s}_\phi(\mathbf{x}_t,t,y)\|^2 \right]~.$$

Proof. Because $\mathbf{x}_t$ is conditionally independent of $y$ given $\mathbf{x}_0$, we have:

$$\begin{aligned}
&\mathbb{E}_{ \begin{subarray}{l} t \sim \mathcal{U}[0, T] \\ \mathbf{x}_0,y \sim p(\mathbf{x}_0,y) \\ \mathbf{x}_t \sim p(\mathbf{x}_t|\mathbf{x}_0) \end{subarray} }\left[\lambda(t) \|\nabla\log p(\mathbf{x}_t|\mathbf{x}_0) - \mathbf{s}_\phi(\mathbf{x}_t,t,y)\|^2 \right] \\
&= \mathbb{E}_{ \begin{subarray}{l} t \sim \mathcal{U}[0, T] \\ y \sim p(y) \\ \mathbf{x}_0 \sim p(\mathbf{x}_0|y)\\ \mathbf{x}_t \sim p(\mathbf{x}_t|\mathbf{x}_0) \end{subarray} }\left[\lambda(t) \|\nabla\log p(\mathbf{x}_t|\mathbf{x}_0) - \mathbf{s}_\phi(\mathbf{x}_t,t,y)\|^2 \right] \\
&= \mathbb{E}_{ \begin{subarray}{l} t \sim \mathcal{U}[0, T] \\ y \sim p(y) \\ \mathbf{x}_0 \sim p(\mathbf{x}_0|y)\\ \mathbf{x}_t \sim p(\mathbf{x}_t|\mathbf{x}_0,y) \end{subarray} }\left[\lambda(t) \|\nabla\log p(\mathbf{x}_t|\mathbf{x}_0,y) - \mathbf{s}_\phi(\mathbf{x}_t,t,y)\|^2 \right] \\
&= \mathbb{E}_{ \begin{subarray}{l} t \sim \mathcal{U}[0, T] \\ y \sim p(y) \end{subarray} }\left[\Phi(t,y)\right]~,
\end{aligned} \tag{9}$$

where

$$\Phi(t,y) = \mathbb{E}_{ \begin{subarray}{l} \mathbf{x}_0 \sim p(\mathbf{x}_0|y)\\ \mathbf{x}_t \sim p(\mathbf{x}_t|\mathbf{x}_0,y) \end{subarray} }\left[\lambda(t) \|\nabla\log p(\mathbf{x}_t|\mathbf{x}_0,y) - \mathbf{s}_\phi(\mathbf{x}_t,t,y)\|^2 \right]~.$$

If $y$ and $t$ are fixed, we can define a distribution and a function that do not depend on these variables: $q(\mathbf{x}_0) = p(\mathbf{x}_0|y)$ and $\kappa(\mathbf{x}_t)=\mathbf{s}_\phi(\mathbf{x}_t,t,y)$. According to Vincent (2011), we have:

$$\begin{aligned}
\Phi(t,y) &= \mathbb{E}_{ \begin{subarray}{l} \mathbf{x}_0 \sim q(\mathbf{x}_0)\\ \mathbf{x}_t \sim q(\mathbf{x}_t|\mathbf{x}_0) \end{subarray} }\left[\lambda(t) \|\nabla\log q(\mathbf{x}_t|\mathbf{x}_0) - \kappa(\mathbf{x}_t)\|^2 \right] \\
&= \mathbb{E}_{ (\mathbf{x}_0,\mathbf{x}_t) \sim q(\mathbf{x}_0,\mathbf{x}_t) }\left[\lambda(t) \|\nabla\log q(\mathbf{x}_t|\mathbf{x}_0) - \kappa(\mathbf{x}_t)\|^2 \right] \\
&= \mathbb{E}_{ \mathbf{x}_t \sim q(\mathbf{x}_t) }\left[\lambda(t) \|\nabla\log q(\mathbf{x}_t) - \kappa(\mathbf{x}_t)\|^2 \right] \\
&= \mathbb{E}_{ \mathbf{x}_t \sim p(\mathbf{x}_t|y) }\left[\lambda(t) \|\nabla\log p(\mathbf{x}_t|y) - \mathbf{s}_\phi(\mathbf{x}_t,t,y)\|^2 \right]~.
\end{aligned} \tag{10}$$

From Equations (9) and (10), we obtain the equivalence of the two objective functions:

$$\begin{aligned}
&\mathbb{E}_{ \begin{subarray}{l} t \sim \mathcal{U}[0, T] \\ \mathbf{x}_0,y \sim p(\mathbf{x}_0,y) \\ \mathbf{x}_t \sim p(\mathbf{x}_t|\mathbf{x}_0) \end{subarray} }\left[\lambda(t) \|\nabla\log p(\mathbf{x}_t|\mathbf{x}_0) - \mathbf{s}_\phi(\mathbf{x}_t,t,y)\|^2 \right] \\
&= \mathbb{E}_{ \begin{subarray}{l} t \sim \mathcal{U}[0, T] \\ y \sim p(y) \\ \mathbf{x}_t \sim p(\mathbf{x}_t|y) \end{subarray} }\left[\lambda(t) \|\nabla\log p(\mathbf{x}_t|y) - \mathbf{s}_\phi(\mathbf{x}_t,t,y)\|^2 \right] \\
&= \mathbb{E}_{ \begin{subarray}{l} t \sim \mathcal{U}[0, T] \\ (\mathbf{x}_t,y) \sim p(\mathbf{x}_t,y) \end{subarray} }\left[\lambda(t) \|\nabla\log p(\mathbf{x}_t|y) - \mathbf{s}_\phi(\mathbf{x}_t,t,y)\|^2 \right]~.
\end{aligned} \tag{11}$$

Discretization. Consider discretizing the time horizon $[\epsilon,T]$ into $N-1$ sub-intervals with boundaries $t_1=\epsilon<t_2<t_3<\ldots<t_{N}=T$. If $N$ is sufficiently large, we can use an ODE solver to estimate the next discretization step:

$$\hat{\mathbf{x}}_{t_i} = \mathbf{x}_{t_{i+1}} + (t_i - t_{i+1}) \left. \frac{d\mathbf{x}_t}{dt} \right|_{t = t_{i+1}} = \mathbf{x}_{t_{i+1}} - \frac{1}{2}\gamma_{t_{i+1}} (t_i - t_{i+1})\left[\mathbf{x}_{t_{i+1}} + \mathbf{s}_\phi(\mathbf{x}_{t_{i+1}},t_{i+1},y)\right]~. \tag{12}$$

Conditional Consistency Model Loss. To enable fast sampling, we expect the predicted point $\hat{\mathbf{x}}_{t_i}$ and $\mathbf{x}_{t_{i+1}}$ to lie on the same probability flow ODE trajectory. We propose a conditional consistency loss to enforce this constraint:

$$\mathcal{L}_{\rm consistency} = \mathbb{E}_{ \begin{subarray}{l} i \sim \mathcal{U}[1, N - 1] \\ \mathbf{x}_{t_{i+1}} \sim p(\mathbf{x}_{t_{i+1}}|\mathbf{x}_0) \end{subarray} } \left[\lambda(t_i) \|\mathbf{f}_\theta(\mathbf{x}_{t_{i+1}},t_{i+1},y) - \mathbf{f}_{\theta^*}(\hat{\mathbf{x}}_{t_{i}},t_{i},y)\|^2 \right]~, \tag{13}$$

where $\hat{\mathbf{x}}_{t_i}$ is calculated as in Equation (12), $\mathbf{x}_{t_{i+1}}$ is sampled from the Gaussian distribution in Equation (5), and $\theta$ denotes the parameters of the neural network $\mathbf{f}$.
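The sketch below illustrates Equation (13): the same clean pose should be predicted from $\mathbf{x}_{t_{i+1}}$ and from the ODE-estimated $\hat{\mathbf{x}}_{t_i}$. Treating $\theta^*$ as a detached (stop-gradient) copy of $\theta$, the constant schedule, and $\lambda \equiv 1$ are assumptions made only for illustration:

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, target_model, score_net, x0, y, t_i, t_next, gamma: float = 0.01):
    """Conditional consistency loss of Eq. (13), with lambda = 1."""
    # Sample x_{t_{i+1}} from the closed-form marginal of Eq. (5).
    rho = torch.as_tensor(-gamma * t_next)
    x_next = torch.exp(0.5 * rho) * x0 + torch.sqrt(1.0 - torch.exp(rho)) * torch.randn_like(x0)
    # One probability-flow ODE (Euler) step, Eq. (12), to estimate x_hat_{t_i}.
    drift = -0.5 * gamma * (x_next + score_net(x_next, t_next, y))
    x_hat = x_next + (t_i - t_next) * drift
    # Both points should map to the same clean grasp pose; theta* is treated
    # here as a detached copy of theta (an assumption of this sketch).
    return F.mse_loss(model(x_next, t_next, y), target_model(x_hat, t_i, y).detach())
```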

Additionally, we need to minimize the discrepancy between the predicted and ground truth grasp poses with the detection loss:

$$\mathcal{L}_{\rm detection} = \mathbb{E}_{ \begin{subarray}{l} i \sim \mathcal{U}[1, N] \\ \mathbf{x}_{t_{i}} \sim \mathcal{N}(\mu_{t_{i}},\Sigma_{t_{i}}) \\ \mathbf{x}_0,y \sim p(\mathbf{x}_0,y) \end{subarray} }\left[\lambda(t_i)\|\mathbf{f}_\theta(\mathbf{x}_{t_i}, t_i, y) - \mathbf{x}_0\|^2\right]~. \tag{14}$$

The overall training objective for our method is:

$$\mathcal{L}_{\rm total} = \mathcal{L}_{\rm consistency} + \mathcal{L}_{\rm detection}~. \tag{15}$$
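A compact sketch of Equations (14) and (15), reusing `consistency_loss` from the previous sketch; again the constant schedule and unit weighting are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def detection_loss(model, x0, y, t_i, gamma: float = 0.01):
    """Detection loss of Eq. (14): f_theta(x_{t_i}, t_i, y) should recover x_0."""
    rho = torch.as_tensor(-gamma * t_i)
    x_t = torch.exp(0.5 * rho) * x0 + torch.sqrt(1.0 - torch.exp(rho)) * torch.randn_like(x0)
    return F.mse_loss(model(x_t, t_i, y), x0)

def total_loss(model, target_model, score_net, x0, y, t_i, t_next):
    """Total objective of Eq. (15)."""
    return consistency_loss(model, target_model, score_net, x0, y, t_i, t_next) \
        + detection_loss(model, x0, y, t_i)
```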

Network Details

The input of our network is the image and a corresponding grasping text prompt, represented as $e$ (for example, "grasp the fork at its handle"). We first extract the image features using a 12-layer ViT image encoder. The input text prompt is encoded by a text encoder using BERT or CLIP. We then combine and learn the features of the input text prompt and input image using the ALBEF fusion network. The fused features are fed into the score network, and our conditional consistency model is used to learn the grasp pose. Figure 1 shows the details of our network.
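The following is a deliberately simplified stand-in for this conditioning pipeline: the real ViT, BERT/CLIP, and ALBEF modules are replaced by dummy token sequences and a single cross-attention block, purely to show how a fused conditioning embedding $y$ could be produced; it is not the authors' code.

```python
import torch
import torch.nn as nn

class FusionStub(nn.Module):
    """ALBEF-style cross-attention fusion, heavily simplified for illustration."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, img_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
        # Text tokens attend to image tokens, then pool to one conditioning vector y.
        fused, _ = self.cross_attn(txt_tokens, img_tokens, img_tokens)
        return self.proj(fused.mean(dim=1))

# Dummy token sequences standing in for ViT and text-encoder outputs.
y = FusionStub()(torch.randn(2, 196, 256), torch.randn(2, 12, 256))
```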

Score Network. In practice, we utilize a score network composed of several MLP layers that processes three inputs: the noisy grasp pose $\mathbf{x}_t$, the time index $t$, and the conditional vision-language embedding $y$. These features are concatenated, and the score function is produced by a final MLP layer. It is crucial that the output dimension of the score network is identical to the dimension of the input $\mathbf{x}_t$ because, fundamentally, the score function is the gradient of the grasp pose distribution given the condition $y$. Our conditional consistency model has an architecture similar to the score network; however, its output is the predicted grasp pose.
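A minimal sketch of such an MLP-based score network follows; all layer widths and the 5-D pose / 256-D condition dimensions are assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class ScoreNetwork(nn.Module):
    """MLP score network s_phi(x_t, t, y); output dimension equals the pose dimension."""

    def __init__(self, pose_dim: int = 5, cond_dim: int = 256, hidden: int = 256):
        super().__init__()
        self.pose_mlp = nn.Sequential(nn.Linear(pose_dim, hidden), nn.SiLU())
        self.time_mlp = nn.Sequential(nn.Linear(1, hidden), nn.SiLU())
        self.cond_mlp = nn.Sequential(nn.Linear(cond_dim, hidden), nn.SiLU())
        self.out = nn.Sequential(nn.Linear(3 * hidden, hidden), nn.SiLU(),
                                 nn.Linear(hidden, pose_dim))

    def forward(self, x_t: torch.Tensor, t: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # x_t: (B, pose_dim), t: (B, 1), y: (B, cond_dim).
        h = torch.cat([self.pose_mlp(x_t), self.time_mlp(t), self.cond_mlp(y)], dim=-1)
        return self.out(h)
```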

Algorithm 1: Inference Process

Input: Image and text prompt, conditional consistency model $\mathbf{f}_\theta(\mathbf{x},t,y)$, number of inference steps $P$, sequence of time points $t_1 = \epsilon < t_2 < t_3 < \dots < t_{P} = T$, noise scheduler $\alpha_t = e^{\rho_t}$.

$y \gets \text{ALBEF}(\text{image}, \text{prompt})$

Initial grasp noise $\mathbf{x}_T \sim \mathcal{N}(0,\mathbf{I})$

$\mathbf{x}_0 \gets \mathbf{f}_\theta(\mathbf{x}_T,T,y)$

For $i = P - 1$ to $2$:

  • Sample $\mathbf{z} \sim \mathcal{N}(0,\mathbf{I})$
  • $\mathbf{x}_{t_i} \gets \sqrt{\alpha_{t_i}}\mathbf{x}_0 + \sqrt{1 - \alpha_{t_i}}\mathbf{z}$
  • $\mathbf{x}_0 \gets \mathbf{f}_\theta(\mathbf{x}_{t_i},t_i,y)$

Output: Final grasp pose $\mathbf{x}_0$
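In code, Algorithm 1 could be sketched as follows; `model` stands for $\mathbf{f}_\theta$, `fuse` for the encoder-plus-ALBEF pipeline, `alpha` for the noise scheduler $\alpha_t = e^{\rho_t}$, and the 5-D pose is an assumed parameterization:

```python
import torch

@torch.no_grad()
def infer_grasp(model, fuse, image, prompt, timesteps, alpha):
    """Few-step sampling following Algorithm 1; timesteps = [t_1, ..., t_P]."""
    y = fuse(image, prompt)                 # vision-language condition
    x = torch.randn(1, 5)                   # x_T ~ N(0, I), assumed 5-D pose
    x0 = model(x, timesteps[-1], y)         # first consistency estimate at t_P = T
    for t in reversed(timesteps[1:-1]):     # i = P - 1 down to 2
        z = torch.randn_like(x0)
        a = alpha(t)                        # alpha_t = e^{rho_t}
        x_t = (a ** 0.5) * x0 + ((1.0 - a) ** 0.5) * z
        x0 = model(x_t, t, y)
    return x0
```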

Training and Inference

During training, we freeze the text and image encoders and train the ALBEF fusion, the score network, and the consistency model end-to-end. The score network and the conditional consistency model share the same architecture. We train both models simultaneously for 1000 epochs with a batch size of 8 using the Adam optimizer. Training takes approximately three days on an NVIDIA A100 GPU. Regarding the parameters of the conditional consistency model, we empirically set $T = 1000$, $\epsilon = 1$, and $N = 2000$. After training the score network and the conditional consistency model $\mathbf{f}_\theta(\mathbf{x}_t,t,y)$, we can sample the grasp pose given the input image and language instruction prompt in a few denoising steps using Algorithm 1.
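One plausible reading of this joint training setup is sketched below, reusing the loss helpers from the earlier sketches; the data loader, feature shapes, and the decision to optimize the score-matching loss together with $\mathcal{L}_{\rm total}$ in a single step are assumptions for illustration only:

```python
import torch

def train(fusion, score_net, model, target_model, loader, timesteps, epochs: int = 1000):
    """Joint training sketch: frozen encoders are assumed to feed `loader`."""
    params = list(fusion.parameters()) + list(score_net.parameters()) + list(model.parameters())
    opt = torch.optim.Adam(params)
    for _ in range(epochs):
        for img_feat, txt_feat, x0 in loader:        # placeholder batch format
            y = fusion(img_feat, txt_feat)
            i = torch.randint(0, len(timesteps) - 1, (1,)).item()
            t = torch.full((x0.shape[0], 1), timesteps[i])
            loss = score_loss(score_net, x0, y, t) \
                 + total_loss(model, target_model, score_net, x0, y, timesteps[i], timesteps[i + 1])
            opt.zero_grad()
            loss.backward()
            opt.step()
```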

Figure 2: Robot hands with different utilities.

Next

In the next post, we will evaluate the effectiveness of our proposal.
