Overall Summary
The existing literature extensively explores the rapid evolution of feature design strategies for video sequences, ranging from manually crafted features to deep learning-based representations. These features underpin real-time, robust, and computationally efficient frameworks for recognizing abnormal human activities. Application contexts vary widely, spanning fall detection, Ambient Assistive Living (AAL), homeland security, surveillance, and crowd analysis. Notably, the choice of feature design methodology adapts to the input modality: RGB, depth, or skeleton data.
In recent years, the proliferation of infrared sensors, exemplified by the Microsoft Kinect, has significantly advanced abnormal human activity recognition using depth and skeleton sequences. However, Kinect sensors are practically usable only within a limited operating range. Consequently, experiments predominantly employ depth sensors alone for fall detection, AAL, and smart-home applications of abnormal human action recognition. Depth and skeleton representations offer valuable view and illumination invariance, addressing critical challenges in crowded or public scenes, where dynamic backgrounds and varying illumination are common in open areas. For scenarios involving multiple individuals and abnormal human activities, RGB images serve as a complementary source of information. Deep feature-based descriptions have outperformed hand-crafted features due to their ability to learn directly from video scenes, albeit at the cost of substantial computational resources.
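To make the view-invariance claim concrete, here is a minimal sketch (not drawn from any of the surveyed systems) of a skeleton descriptor built from pairwise joint distances; because Euclidean distances survive rigid rotation and translation, the descriptor is unchanged when the camera viewpoint changes. The 20-joint count matches the first-generation Kinect, while the random skeleton and the 40-degree rotation are arbitrary values chosen only for the demonstration.

```python
import numpy as np

def pairwise_distance_descriptor(joints):
    """Flattened upper-triangular matrix of Euclidean distances between joints.

    Joint-to-joint distances are preserved under rigid rotation and translation,
    so the descriptor does not depend on the camera viewpoint.
    """
    diffs = joints[:, None, :] - joints[None, :, :]   # (J, J, 3) pairwise offsets
    dists = np.linalg.norm(diffs, axis=-1)            # (J, J) distance matrix
    iu = np.triu_indices(len(joints), k=1)
    return dists[iu]                                  # J*(J-1)/2 values

# Hypothetical 20-joint skeleton (the first-generation Kinect reports 20 joints per frame).
rng = np.random.default_rng(0)
skeleton = rng.normal(size=(20, 3))

# Simulate a different viewpoint: rotate 40 degrees about the vertical axis and shift the camera.
theta = np.deg2rad(40.0)
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0,           1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
skeleton_rotated = skeleton @ R.T + np.array([0.5, 0.0, 2.0])

d1 = pairwise_distance_descriptor(skeleton)
d2 = pairwise_distance_descriptor(skeleton_rotated)
print(np.allclose(d1, d2))  # True: the descriptor is unchanged by the viewpoint change
```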
In this study, we systematically investigated publicly available datasets related to fall detection, Ambient Assistive Living (AAL), and abnormal human action recognition. Specifically, we focused on Abnormal Human Activity Recognition (AbHAR) datasets. Among these, several 3D datasets are based on Kinect technology, encompassing pose-based human activity datasets such as MoCap (Subtle Walking from CMU Mocap Dataset, 2018), MHAD (Teleimmersion Lab, 2018), and the MSRAction3D dataset (MSR Action 3D Dataset). Additionally, various human interaction datasets, including G3Di (Bloom et al., 2014), K3Hi (K3HI Kinect-based 3D Human Interaction Dataset, 2018), CONVERSE (Edwards et al., 2016), InHOUSE Dataset (Saini et al., 2017), and Fu Kinect Fall Dataset (Aslan et al., 2017), have been generated. Notably, while these datasets cover a wide range of activities, only a limited number explicitly address abnormal actions, as identified by previous works (Nguyen et al., 2016; Khan and Sohn, 2013, 2011; Han et al., 2016; Gasparrini et al., 2015; Sucerquia et al., 2017).
In the future, systems for recognizing abnormal human actions must address specific challenges to enable real-time deployment at a cost affordable to everyday users. Such systems play a crucial role in enhancing security in daily routines.
During our survey, we observed that both handcrafted approaches (AlNawash et al., 2016; Roshtkhari and Levine, 2013; Wang et al., 2017b; Uddina et al., 2011; Yang et al., 2016; Triantafyllou et al., 2016; Yu et al., 2013) and deep feature-based recognition systems (Li and Chuah, 2018) struggle to sustain near real-time performance, partly because sufficiently large, real-world dataset samples for validation are lacking. This limitation affects not only system generalizability but also robustness. While human skeletons provide view invariance for action recognition, the availability of abnormal actions captured from different viewpoints remains a significant challenge. Additionally, existing work (Diraco et al., 2010) often relies on artificially generated data, which fails to capture real-life complexities. Therefore, developing meaningful datasets that represent abnormal actions across various scenarios (e.g., office, home, coffee shop) is essential.
Contemporary researchers are actively advancing deep architectures, ranging from basic Convolutional Neural Networks (CNNs) to more complex models such as Recurrent CNNs (RCNNs), Recurrent Neural Networks (RNNs), and auto-encoders. However, making efficient, deeper abnormal action recognition systems built on these architectures practically accessible to the broader user base remains a critical concern.
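As one concrete instance of the auto-encoder family mentioned above, the sketch below follows the common reconstruction-error formulation of abnormality detection: a small fully connected auto-encoder is trained on features of normal activity only, and frames whose reconstruction error exceeds a threshold are flagged as abnormal. The layer sizes, feature dimension, training data, and threshold rule are illustrative assumptions, not a reproduction of any system covered in the survey.

```python
import torch
import torch.nn as nn

FEATURE_DIM = 190  # assumed per-frame feature size, e.g., the pairwise-distance descriptor above

# A small fully connected auto-encoder: compress the feature, then reconstruct it.
autoencoder = nn.Sequential(
    nn.Linear(FEATURE_DIM, 64), nn.ReLU(),
    nn.Linear(64, 16), nn.ReLU(),          # bottleneck
    nn.Linear(16, 64), nn.ReLU(),
    nn.Linear(64, FEATURE_DIM),
)
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
criterion = nn.MSELoss()

# Train on *normal* activity features only (random data stands in for a real dataset here).
normal_features = torch.randn(1024, FEATURE_DIM)
for epoch in range(50):
    optimizer.zero_grad()
    loss = criterion(autoencoder(normal_features), normal_features)
    loss.backward()
    optimizer.step()

# Score frames by reconstruction error; large errors suggest activity unlike the training data.
with torch.no_grad():
    errors = ((autoencoder(normal_features) - normal_features) ** 2).mean(dim=1)
    threshold = errors.mean() + 3 * errors.std()   # simple, assumed decision rule

def is_abnormal(frame_feature: torch.Tensor) -> bool:
    """Flag a single frame-level feature vector as abnormal."""
    with torch.no_grad():
        err = ((autoencoder(frame_feature) - frame_feature) ** 2).mean()
    return bool(err > threshold)
```

Swapping the fully connected encoder for a convolutional or recurrent one yields the CNN- and RNN-based variants discussed above, at a corresponding increase in computational cost.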
While three-dimensional data has significantly improved the performance of recognition systems, the computational complexity of processing such data remains a challenge. In particular, the transition from raw three-dimensional data to depth- or skeleton-based feature descriptors is the first hurdle for real-time AbHAR systems. Addressing it means improving the true detection rate while simultaneously minimizing computational demands.
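A rough back-of-the-envelope comparison illustrates the point; the depth resolution and joint count below are the published Kinect v2 values, and the pairwise-distance descriptor is the same illustrative one sketched earlier, not a recommendation from the surveyed literature.

```python
# Per-frame data volume: raw Kinect v2 depth map vs. a 25-joint skeleton descriptor.
depth_values = 512 * 424                 # Kinect v2 depth resolution (pixels per frame)
skeleton_values = 25 * 3                 # 25 joints x (x, y, z) coordinates
descriptor_values = 25 * 24 // 2         # pairwise joint distances (300 values)

print(depth_values)                      # 217088 values per frame
print(descriptor_values)                 # 300 values per frame
print(depth_values / descriptor_values)  # ~723x less data to process per frame
```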
This is the final part of the series "Abnormal Human Activity Recognition". We hope this information helps you better understand the topic.