Dataset. We use the Grasp-Anything dataset in our experiments. Grasp-Anything is a large-scale dataset for language-driven grasp detection with 1M samples. Each image in the dataset is accompanied by one or several prompts describing either a general object grasping action or grasping an object at a specific location.
Evaluation Metrics. Our primary evaluation metric is the success rate, defined as in previous works: a predicted grasp is successful if its IoU with the ground truth grasp exceeds 25% and its offset angle is less than 30°. We also use the harmonic mean ('H') to measure the overall success rate. Latency (inference time) in seconds is reported for all methods on the same NVIDIA A100 GPU.
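To make the success criterion concrete, the following is a minimal sketch of the check described above. The 5-tuple rectangle-grasp representation (center x, y, width, height, rotation angle) and the use of shapely for rotated-rectangle IoU are illustrative assumptions, not the exact evaluation implementation.

```python
# Minimal sketch of the grasp success criterion (IoU > 25%, angle offset < 30 deg).
from shapely.geometry import Polygon
from shapely.affinity import rotate

def rect_to_polygon(x, y, w, h, theta_deg):
    """Axis-aligned rectangle centered at (x, y), then rotated by theta_deg."""
    rect = Polygon([(x - w / 2, y - h / 2), (x + w / 2, y - h / 2),
                    (x + w / 2, y + h / 2), (x - w / 2, y + h / 2)])
    return rotate(rect, theta_deg, origin=(x, y))

def grasp_success(pred, gt, iou_thresh=0.25, angle_thresh=30.0):
    """pred, gt: (x, y, w, h, theta_deg). True if the predicted grasp is a success."""
    p, g = rect_to_polygon(*pred), rect_to_polygon(*gt)
    union = p.union(g).area
    iou = p.intersection(g).area / union if union > 0 else 0.0
    # Grasp rectangles are symmetric modulo 180 degrees.
    diff = abs(pred[4] - gt[4]) % 180.0
    angle_offset = min(diff, 180.0 - diff)
    return iou > iou_thresh and angle_offset < angle_thresh
```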
Table 1: Comparison with Traditional Grasp Detection Methods.
We compare our LLGD with GR-CNN, Det-Seg-Refine, GG-CNN, CLIPORT, MaskGrasp, and CLIP-Fusion. Table 1 compares our method with these baselines on the Grasp-Anything dataset. The table shows that our proposed LLGD outperforms traditional grasp detection methods by a clear margin. Our inference time is also competitive with the other methods.
Table 2: Comparison with Diffusion Models for Language-Driven Grasp Detection.
In this experiment, we compare our LLGD with other diffusion models for language-driven grasp detection. In particular, we compare with LGD using DDPM, as well as recent lightweight diffusion works: SnapFusion with 500 timesteps and LightGrad with 250 timesteps.
Table 2 shows the results of diffusion models for language-driven grasp detection. The accuracy and inference time of the classical diffusion model LGD depend strongly on the number of denoising timesteps. LGD with 1000 timesteps achieves reasonable accuracy but suffers from significantly longer latency. Lightweight diffusion models such as SnapFusion and LightGrad show reasonable accuracy and inference speed. However, our method achieves the highest accuracy with the fastest inference speed.
Figure 1: Consistency model analysis. Given the text prompt "Grasp the cup at its handle", we compare the grasp pose trajectories of our method and LGD. The top row illustrates the trajectory of LGD, while the bottom row corresponds to the trajectory of our LLGD.
In this analysis, we verify the effectiveness of our conditional consistency model. In Figure 1, we visualize the grasp pose with respect to the time index t. Since the LGD model employs a discrete diffusion model with T=1000, it has to perform the diffusion steps with a step size of 1, which results in very slow inference; moreover, its grasp pose trajectory still exhibits significant fluctuations. Our method, in contrast, can select boundary time points arbitrarily thanks to the continuous consistency model. The number of iterations required by our method is significantly smaller than that of LGD for the same value of T, which contributes to the "lightweight" factor. Furthermore, our grasp pose at t=603 has almost converged to the ground truth, while LGD using DDPM at t=350 has not yet achieved a successful grasp.
Figure 2: Visualization of detection results of different language-driven grasp detection methods.
Visualization. Figure 2 shows qualitative results of our method and other baselines. The outcomes suggest that, given the same text query, our LLGD generates more semantically plausible grasp poses than the other baselines. In particular, other methods often produce grasp poses at locations not well-aligned with the text query, while our method yields more suitable detection results.
Figure 3: In the wild detection results. Images are from the internet.
In the Wild Detection. Figure 3 illustrates the outcomes of applying our method to random images from the internet. The results demonstrate that our LLGD can effectively detect grasp poses given language instructions on real-world images. Our method showcases a promising zero-shot ability, as it successfully interprets grasp actions on images it has never encountered during training.
Figure 4: Prediction failure cases.
Failure Cases. Although promising results have been achieved, our method still predicts incorrect grasp poses in some cases. The diversity of objects and grasping prompts poses a challenging problem, as the network cannot capture all the circumstances that arise in real life. Figure 4 depicts some failure cases where LLGD predicts incorrect results, which can be attributed to multiple similar objects that are difficult to distinguish and to text prompts that lack detailed descriptions for accurate result determination.
Robotic Setup. Our lightweight language-driven grasp detection pipeline is incorporated into a robotic grasping framework that employs a KUKA LBR iiwa R820 robot to deliver quantifiable outcomes. A RealSense D435i camera is used to translate the grasp detected by LLGD into a 6DoF grasp pose, similar to previous works. A trajectory optimization planner is then used to execute the grasping action. Experiments were conducted on a table surface for both the single-object scenario and the cluttered-scene scenario, in which various objects were placed to test each setup. Table 3 shows the success rate of our method and other baseline models.
Our method outperforms other baselines in both single-object and cluttered scenarios. Furthermore, our lightweight model enables rapid execution without sacrificing the accuracy of visual grasp detection.
Limitation. Despite achieving notable results in real-time applications, our method still has limitations and predicts incorrect grasp poses in challenging real-world images. Faulty grasp poses often arise when the correlation between the text and the attention map of the visual features is not well-aligned, as shown in Figure 4. From our experiments, we observe that when grasp instruction sentences contain rare or challenging nouns that are underrepresented in the dataset, ambiguity arises in parsing the text prompt, which is usually the main cause of incorrect grasp pose predictions. Therefore, providing instruction prompts with clear meanings is essential for the robot to understand and execute the correct grasping action.
Future work. We see several prospects for improvement in future work:
1. Expanding our method to 3D space is essential, implementing it for 3D point clouds and RGB-D images to address the lack of depth information in robotic applications.
2. Addressing the gap between the semantic concept of text prompts and input images, analyzing the detailed geometry of objects to effectively distinguish between items with similar structures.
3. Expanding the problem to more complex language-driven manipulation applications. For instance, if the robot wants to grasp a plate containing apples, it would need to manipulate the objects in such a manner that prevents the apples from falling.
Given an input RGB image and a text prompt describing the object of interest, we aim to detect the grasping pose on the image that best matches the text prompt. We follow the rectangle grasp convention widely used in previous works to define the grasp.
In the diffusion model, we represent the target grasp pose as $x_0$. The objective of our diffusion process for language-driven grasp detection is to denoise from a noisy state $x_T$ to the original grasp pose $x_0$, conditioned on the input image and grasp instruction, jointly represented by $y$. The forward process in the traditional conditional diffusion model is defined as:
$$ q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(\sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\right), \tag{1} $$
where the hyperparameter $\beta_t$ is the amount of noise added at diffusion step $t \in [0, T] \subseteq \mathbb{R}$.
To train a diffusion model with condition y, we use a neural network to learn the reverse process:
$$ p_\phi(x_{t-1} \mid x_t, y) = \mathcal{N}\!\left(\mu_\phi(x_t, t, y),\ \Sigma_\phi(x_t, t, y)\right). \tag{2} $$
In our approach, we utilize the diffusion process in the continuous domain, where $x_t$ is the grasp pose state at an arbitrary time index $t$. Unlike popular discrete diffusion models, using a continuous space allows us to improve sample quality and reduce inference time, since we can traverse the diffusion process at arbitrary timesteps and gain more fine-grained control over the denoising process.
Figure 1: The overview of our method. First, the input RGB image and text prompt are fed into the feature encoder and ALBEF fusion. Subsequently, we concurrently train two models with the same architecture: a score network to estimate the probability flow Ordinary Differential Equation (ODE) trajectory of the diffusion process, and a conditional consistency model to determine the grasp pose within a few denoising steps.
To reduce the inference time during the denoising step of the diffusion model, we aim to estimate the original grasp pose with just a few denoising steps. Since our language-driven grasp detection task has the condition $y$, we introduce a conditional consistency model, based on the consistency concept, to directly infer the original grasp pose during inference:
$$ f_\theta(x_t, t, y) = \begin{cases} x_t, & t \in [0, \epsilon], \\ F_\theta(x_t, t, y), & t \in (\epsilon, T], \end{cases} \tag{3} $$
where $f_\theta(x_\epsilon, \epsilon, y) = x_\epsilon$ is the boundary condition, and $F_\theta(x_t, t, y)$ is a free-form deep neural network whose output has the same dimensionality as $x_t$.
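As a side note, a common way to enforce this boundary condition differentiably in consistency models is through skip connections; we sketch this standard parameterization for illustration, without implying it is the exact design used here:
$$ f_\theta(x_t, t, y) = c_{\text{skip}}(t)\, x_t + c_{\text{out}}(t)\, F_\theta(x_t, t, y), \qquad c_{\text{skip}}(\epsilon) = 1, \quad c_{\text{out}}(\epsilon) = 0, $$
so that $f_\theta$ reduces to the identity at $t = \epsilon$ while remaining trainable for $t > \epsilon$.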
To train our conditional consistency model, we employ knowledge distillation from a continuous diffusion process:
$$ dx_t = -\tfrac{1}{2}\gamma_t x_t\, dt + \sqrt{\gamma_t}\, dw_t, \tag{4} $$
where $\gamma_t$ is a non-negative function referred to as the noise schedule, and $w_t$ is the standard Brownian motion. This forward process creates a trajectory of grasp poses $\{x_t\}_{t=0}^{T}$. The grasp pose state $x_t$ depends on the time index $t$ and on the input image and text prompt. The grasp distribution $p(x_0 \mid y)$ from the dataset is transformed into $p(x_T \mid y) \approx \mathcal{N}(0, \mathbf{I})$. Given the ground truth grasp pose $x_0$, we can sample $x_t$ at an arbitrary $t$:
$$ p(x_t \mid x_0) = \mathcal{N}(\mu_t, \Sigma_t), \tag{5} $$
where
$$ \mu_t = e^{\frac{1}{2}\rho_t} x_0, \qquad \Sigma_t = \left(1 - e^{\rho_t}\right)\mathbf{I}, \qquad \rho_t = -\int_0^t \gamma_s\, ds. $$
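The closed-form sampling in Eq. (5) can be written in a few lines of code. The constant noise schedule $\gamma_s = \gamma$ in this sketch is an illustrative assumption; the text does not specify the schedule here.

```python
# Minimal sketch of sampling x_t from Eq. (5) given a ground-truth grasp pose x0.
import numpy as np

def rho(t, gamma=1.0):
    # rho_t = -integral_0^t gamma_s ds; with a constant schedule this is -gamma * t.
    return -gamma * t

def sample_xt(x0, t, gamma=1.0, rng=np.random.default_rng()):
    """Sample x_t ~ N(exp(rho_t / 2) * x0, (1 - exp(rho_t)) * I)."""
    r = rho(t, gamma)
    mean = np.exp(0.5 * r) * x0
    std = np.sqrt(1.0 - np.exp(r))
    return mean + std * rng.standard_normal(x0.shape)
```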
Equation (4) is a stochastic differential equation; it admits an associated probability flow ODE that shares the same marginal distributions. With the conditional variable $y$, this ODE can be written as:
$$ \frac{dx_t}{dt} = -\frac{1}{2}\gamma_t \left[ x_t + \nabla \log p(x_t \mid y) \right], \tag{6} $$
where $\nabla \log p(x_t \mid y)$ is the score function of the conditional diffusion model.
Suppose that we have a neural network $s_\phi(x_t, t, y)$ that can approximate the score function $\nabla \log p(x_t \mid y)$, i.e., $s_\phi(x_t, t, y) \approx \nabla \log p(x_t \mid y)$. After training the score network, we can replace the $\nabla \log p(x_t \mid y)$ term in Equation (6) with the network:
$$ \frac{dx_t}{dt} = -\frac{1}{2}\gamma_t \left[ x_t + s_\phi(x_t, t, y) \right]. \tag{7} $$
Score Function Loss. In order to approximate the score function $\nabla \log p(x_t \mid y)$, the conditional denoising estimator minimizes the following objective:
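A plausible form of this objective, sketched here under the notation of Eq. (5) and the standard conditional denoising score matching formulation rather than as the exact loss used in training, is:
$$ \mathcal{L}_{\text{score}} = \mathbb{E}_{x_0, t}\left[ \lambda(t)\, \big\| s_\phi(x_t, t, y) - \nabla_{x_t} \log p(x_t \mid x_0) \big\|_2^2 \right], \qquad \nabla_{x_t} \log p(x_t \mid x_0) = -\Sigma_t^{-1}\left(x_t - \mu_t\right), $$
where $\lambda(t)$ is a positive weighting function.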
If $y$ and $t$ are fixed, we can define a transition probability that does not depend on these variables by setting $q(x_0) = p(x_0 \mid y)$ and $\kappa(x_t) = s_\phi(x_t, t, y)$. According to Vincent (2011), we have:
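In this notation, Vincent's denoising score matching identity reads (sketched here with $q(x_t) = \int q(x_t \mid x_0)\, q(x_0)\, dx_0$ and $C$ a constant independent of $\kappa$):
$$ \mathbb{E}_{q(x_t)}\big[ \| \kappa(x_t) - \nabla \log q(x_t) \|_2^2 \big] = \mathbb{E}_{q(x_0)\, q(x_t \mid x_0)}\big[ \| \kappa(x_t) - \nabla_{x_t} \log q(x_t \mid x_0) \|_2^2 \big] + C, $$
so minimizing the denoising objective above also drives $s_\phi(x_t, t, y)$ toward the marginal score $\nabla \log p(x_t \mid y)$.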
Discretization. Consider discretizing the time horizon $[\epsilon, T]$ into $N-1$ sub-intervals with boundaries $t_1 = \epsilon < t_2 < t_3 < \dots < t_N = T$. If $N$ is sufficiently large, we can use an ODE solver to estimate the next discretization step:
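Using the probability flow ODE in Eq. (7), a one-step Euler estimate from $t_{i+1}$ to $t_i$, given here as a plausible form of this update, is:
$$ \hat{x}_{t_i} = x_{t_{i+1}} + (t_i - t_{i+1}) \left. \frac{dx_t}{dt} \right|_{t = t_{i+1}} = x_{t_{i+1}} - \frac{1}{2} (t_i - t_{i+1})\, \gamma_{t_{i+1}} \left[ x_{t_{i+1}} + s_\phi(x_{t_{i+1}}, t_{i+1}, y) \right]. $$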
Conditional Consistency Model Loss. To enable fast sampling, we expect the predicted point $\hat{x}_{t_i}$ and $x_{t_{i+1}}$ to lie on the same probability flow ODE trajectory. We propose a conditional consistency loss to enforce this constraint:
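A plausible form of this loss, sketched here following standard consistency distillation, is:
$$ \mathcal{L}_{\text{consistency}} = \mathbb{E}_{i, x_0}\Big[ d\big( f_\theta(x_{t_{i+1}}, t_{i+1}, y),\ f_{\theta^{-}}(\hat{x}_{t_i}, t_i, y) \big) \Big], $$
where $\theta^{-}$ denotes an exponential moving average (or stop-gradient copy) of $\theta$, $\hat{x}_{t_i}$ is the Euler estimate above, and $d(\cdot, \cdot)$ is a distance such as the squared $\ell_2$ norm.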
The input of our network is an image and a corresponding grasping text prompt represented as $e$ (for example, "grasp the fork at its handle"). We first extract the image feature using a 12-layer vision transformer (ViT) image encoder. The input text prompt is encoded by a text encoder using BERT or CLIP. We then fuse the features of the text prompt and input image using the ALBEF fusion network. The fused features are fed into the score network, and our conditional consistency model is used to learn the grasp pose. Figure 1 shows the details of our network.
Score Network. In practice, we utilize a score network composed of several MLP layers that encode three components: the noisy grasp pose $x_t$, the time index $t$, and the conditional vision-language embedding $y$. These features are then concatenated, and the score is produced by a final MLP layer. It is crucial that the output dimension of the score network is identical to the dimension of the input $x_t$, because the score function is fundamentally the gradient of the grasp pose distribution given the condition $y$. Our conditional consistency model has an architecture similar to the score network; however, its output is the predicted grasp pose.
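The following is a minimal PyTorch sketch of such an MLP-based score network. The hidden sizes, the linear time embedding, the pose dimension (5 for a rectangle grasp), and the fusion embedding dimension are illustrative assumptions, not the exact configuration.

```python
# Minimal sketch of the MLP-based score network described above.
import torch
import torch.nn as nn

class ScoreNetwork(nn.Module):
    def __init__(self, pose_dim=5, cond_dim=768, time_dim=64, hidden=256):
        super().__init__()
        self.pose_mlp = nn.Sequential(nn.Linear(pose_dim, hidden), nn.ReLU())
        self.time_mlp = nn.Sequential(nn.Linear(1, time_dim), nn.ReLU())
        self.cond_mlp = nn.Sequential(nn.Linear(cond_dim, hidden), nn.ReLU())
        # The final MLP maps the concatenated features back to the pose dimension,
        # so the output has the same shape as x_t (as the score must).
        self.out = nn.Sequential(
            nn.Linear(hidden + time_dim + hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, pose_dim),
        )

    def forward(self, x_t, t, y):
        # x_t: (B, pose_dim), t: (B, 1), y: (B, cond_dim) vision-language embedding.
        h = torch.cat([self.pose_mlp(x_t), self.time_mlp(t), self.cond_mlp(y)], dim=-1)
        return self.out(h)
```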
Algorithm 1: Inference Process
Input: Image and text prompt, conditional consistency model $f_\theta(x, t, y)$, number of inference steps $P$, sequence of time points $t_1 = \epsilon < t_2 < t_3 < \dots < t_P = T$, noise scheduler $\alpha_t = e^{\rho_t}$.
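Given these inputs, the following is a minimal sketch of what the inference process could look like as multistep conditional consistency sampling. The interface of $f_\theta$ and the scheduler $\alpha_t = e^{\rho_t}$ follow the text; all other names and shapes are illustrative assumptions.

```python
# Minimal sketch of multistep conditional consistency sampling.
import torch

@torch.no_grad()
def infer_grasp(f_theta, y, time_points, alpha, pose_dim=5, device="cpu"):
    """time_points: [t_1=eps, ..., t_P=T]; alpha(t) = exp(rho_t)."""
    T = time_points[-1]
    x = torch.randn(1, pose_dim, device=device)                 # x_T ~ N(0, I)
    x = f_theta(x, torch.full((1, 1), T, device=device), y)     # first denoising step
    # Walk back through the remaining time points, re-noising then denoising.
    for t in reversed(time_points[1:-1]):
        a = alpha(t)
        z = torch.randn_like(x)
        x_t = (a ** 0.5) * x + ((1.0 - a) ** 0.5) * z           # sample x_t via Eq. (5)
        x = f_theta(x_t, torch.full((1, 1), t, device=device), y)
    return x                                                     # predicted grasp pose x_0
```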
During training, we freeze the text and image encoders, and train the ALBEF fusion, the score network, and the consistency model end-to-end. The score network and the conditional consistency model share the same architecture. We train both models simultaneously for 1000 epochs with a batch size of 8 using the Adam optimizer. Training takes approximately three days on an NVIDIA A100 GPU. Regarding the parameters of the conditional consistency model, we empirically set $T = 1000$, $\epsilon = 1$, and $N = 2000$. After training the score network and the conditional consistency model $f_\theta(x_t, t, y)$, we can sample the grasp pose given the input image and language instruction prompt in a few denoising steps using Algorithm 1.
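To make the joint training concrete, the following is a minimal sketch of one training step that combines the score matching loss and the conditional consistency loss under the forms sketched earlier in this text. The constant noise schedule and all helper logic are illustrative assumptions, not the released training code.

```python
# Minimal sketch of one joint training step (score matching + consistency distillation).
import torch
import torch.nn.functional as F

def training_step(score_net, consistency_net, ema_consistency_net,
                  x0, y, t_i, t_ip1, gamma=1.0):
    # --- Score matching loss, using the closed form of Eq. (5) ---
    z = torch.randn_like(x0)
    rho = -gamma * t_ip1                              # constant schedule (assumption)
    mu, var = torch.exp(0.5 * rho) * x0, 1.0 - torch.exp(rho)
    x_tip1 = mu + var.sqrt() * z
    target_score = -(x_tip1 - mu) / var               # grad log p(x_t | x_0)
    score_loss = F.mse_loss(score_net(x_tip1, t_ip1, y), target_score)

    # --- Consistency loss, with a one-step Euler estimate from t_{i+1} to t_i ---
    with torch.no_grad():
        drift = -0.5 * gamma * (x_tip1 + score_net(x_tip1, t_ip1, y))
        x_ti_hat = x_tip1 + (t_i - t_ip1) * drift
        ema_pred = ema_consistency_net(x_ti_hat, t_i, y)
    consistency_loss = F.mse_loss(consistency_net(x_tip1, t_ip1, y), ema_pred)
    return score_loss + consistency_loss
```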
Language-driven grasp detection is a fundamental yet challenging task in robotics with various industrial applications. This work presents a new approach for language-driven grasp detection that leverages lightweight diffusion models to achieve fast inference time. By integrating diffusion processes with grasping prompts in natural language, our method can effectively encode visual and textual information, enabling more accurate and versatile grasp positioning that aligns well with the text query. To overcome the long inference time of diffusion models, we leverage the image and text features as the condition in a consistency model to reduce the number of denoising timesteps during inference. Intensive experimental results show that our method outperforms other recent grasp detection methods and lightweight diffusion models by a clear margin. We further validate our method in real-world robotic experiments to demonstrate its fast inference capability.
Grasping is one of the fundamental tasks in robotics, enabling robots to interact with the physical world through a broad spectrum of applications, from industrial automation and human-robot interaction to service robotics. Recent advancements in machine vision have significantly improved the grasp detection capabilities of robots. Prior research has demonstrated encouraging grasp detection results in both 2D images and 3D point clouds. However, most existing works define grasp detection as a region localization problem while ignoring the use of natural language to localize possible grasps on the object.
Figure 1: Virtual Demonstration of grasping a commanded object.
With the recent advances in Large Language Models (LLM), integrating language into robotic systems has become more popular. Pretrained models such as ChatGPT and CLIP have revolutionized various applications, and their adaptability to the robotic domain has shown encouraging results. Although several works address language-driven robotic manipulation, most focus on understanding high-level actions and overlook the fundamental grasping task. In this paper, we tackle the language-driven grasp detection task that allows the robot to grasp specific objects based on the language command. With language-driven grasping ability, robots can interact more effectively with the surrounding environment and humans.
Language-driven grasping offers several advantages compared to the traditional grasp detection task without text. Firstly, we communicate with robots by providing language prompts that direct them to execute precise tasks; therefore, the incorporation of natural language instructions augments robotic systems with the ability to respond to dynamic, real-time tasks interactively. Secondly, using natural language addresses the challenge of ambiguity in identifying target objects within cluttered environments or distinguishing among objects with similar shapes. Lastly, linguistic guidance enriches robotic systems with semantic information, enhancing their learning capabilities without necessitating expert demonstrations or specific engineering.
Several works on grasp detection have recently utilized diffusion models as the essential technique and shown encouraging results. This is motivated by the proven efficacy of diffusion models in conditional generation tasks such as image synthesis, image segmentation, and visual grounding. The effectiveness of diffusion models comes from their iterative approach to gradually refine data from an initial state of pure noise toward a meaningful output. Nonetheless, applying diffusion models to language-driven tasks in robotics faces a key challenge: the inference time of diffusion models is usually not fast enough for real-time robotic applications. Consequently, recent studies have introduced techniques to tackle the inference speed problem of diffusion models using approaches such as rapid sampling, knowledge distillation, or model optimization. However, these models still cannot perform fast sampling with language conditions during inference to meet the real-time requirements of robotic grasping.
In this paper, we propose a new lightweight diffusion model to tackle the inference speed problem when utilizing diffusion models for the language-driven grasp detection task. To this end, we exploit the capabilities of flow-based generative models to improve the precision of robots in identifying grasp poses from textual inputs. In particular, we develop a conditional consistency model for fast inference speed in real-time robotic applications. We verify our proposed method on a recent large-scale language-driven grasping dataset and achieve superior accuracy and inference speed compared with recent approaches. Furthermore, our method enables zero-shot learning and generalizes to real-world robotic grasping applications.
We present Lightweight Language-driven Grasp Detection (LLGD), a fast diffusion model for language-driven grasp detection.
We conduct intensive analysis to validate our method and demonstrate that it outperforms other approaches in terms of both accuracy and execution speed.
Grasp Detection. Grasp detection has been a central topic in robotics, aiming to equip robots with the ability to identify and execute object grasping in complex environments. Several works have set the foundation for robot grasping by using convolutional neural networks (CNNs). Most previous grasp detection methods are often limited to simple tasks with a fixed number of classes and rely solely on raw image data. Several works have extended the problem by using RGB-D images or 3D point clouds to output the results in 3D space. However, they still have not focused on integrating language as the input instruction in the grasp detection problem.
Language-driven Grasping. Language-driven grasp detection introduces the use of natural language to inform grasp detection tasks. The standard approach is to divide the task into a two-step process: one stage identifies the target object, and the second generates grasp poses based on the established visual-text correlations. Foundation models such as GroundDINO and CLIP have emerged, enabling zero-shot detection and segmentation. These models allow for the localization of the target object without training. However, due to their large size, they result in longer inference times. Accessing such commercial foundation models is not always possible, especially since LLMs often require paid APIs, which come at a high cost.
Lightweight Diffusion Model. Lightweight diffusion models that maintain performance while reducing computational overhead have become crucial in machine learning. Researchers have utilized knowledge distillation on low-resolution features to reduce the number of parameters in U-Net. Recently, consistency models have surfaced as a robust class of generative models capable of producing high-quality images within a single or a limited number of steps. Although they have significant applications in generative tasks, these models are primarily unconditional. Robotic applications, on the other hand, remain discriminative, making unconditional diffusion models not entirely suitable. In this study, we address this issue by building a lightweight diffusion model with language conditions. We aim to enhance consistency models to inherit their fast inference time while adding language conditions to make them more suitable for the language-driven grasping task.
Figure 2: The GraspNet dataset, a widely used dataset for grasp detection.