Abnormal Human Activity Recognition (Part 7 - Deep features based action description)

A review of Abnormal Human Activity Recognition (deep features based), continued.

Deep features based action description

Feature engineering has shifted over time from 2D to 3D features in order to enrich the representation of actions. A main driver of the move from shallow, handcrafted features to deep, learned ones is the complexity involved in designing handcrafted features; applying deep learning to recognition systems has raised the practical applicability of action recognition algorithms to a new level.

Although deep learning and its architectures have existed since the 1980s (Koohzadi and Charkari, 2017), they could not perform to their full potential because of a lack of data and computing power. LeNet (LeCun et al., 1998), developed in 1998, was the first successful real-world implementation of a CNN, applied to handwritten digit recognition. With the later availability of large datasets and hardware resources, various deeper architectures have been reported (Russakovsky et al., 2014) and are being used in a variety of application areas, including computer vision (Herath et al., 2017; Paul et al., 2013; Taylor et al., 2010), speech recognition (Vesperini et al., 2018), brain-computer interaction (Cecotti and Graser, 2010), and natural language processing (Yin et al., 2017).

Deep models learn a hierarchy of characteristics by building high-level features out of low-level ones. The CNN, a type of deep model composed of neurons with learnable weights and biases, was initially used on 2D images for visual object segmentation (Iannizzotto et al., 2005) and recognition (Wang et al., 2018; Uddin et al., 2017) tasks. Later, several researchers used CNNs to recognize activity in individual video frames by treating them as still images; this approach, however, can learn only spatial information. By extending the 2D CNN to a 3D CNN, several authors (Ji et al., 2013) added temporal information: to encode motion together with appearance, a 3D CNN performs 3D convolution in its convolution layers, applying a 3D kernel to a stack of contiguous frames, as in the sketch below. The concept of multi-stream CNN based action recognition (Simonyan and Zisserman, 2014) was developed further in subsequent years by allowing the deep recognition system to analyze multiple sets of inputs, including RGB images, optical flow (Simonyan and Zisserman, 2014), dynamic images (Jing et al., 2017), and depth images, which strengthened the feature description of an action. In parallel, long short-term memory (LSTM) has become one of the most popular recurrent models for learning temporal frame ordering and forecasting time-series data (Liu et al., 2018).
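To make the 2D-versus-3D distinction concrete, here is a minimal PyTorch sketch; the clip size, channel counts, and kernel sizes are illustrative assumptions, not taken from any of the cited papers. A 2D convolution processes one frame at a time and captures only spatial structure, while a 3D convolution slides its kernel across a stack of contiguous frames and therefore also encodes motion.

```python
# Illustrative sketch: 2D convolution (per frame) vs. 3D convolution
# (over a stack of contiguous frames). Shapes are assumptions.
import torch
import torch.nn as nn

clip = torch.randn(1, 3, 16, 112, 112)  # (batch, RGB, 16 stacked frames, H, W)

# 2D convolution: applied to a single frame, captures spatial information only.
conv2d = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
per_frame = conv2d(clip[:, :, 0])        # one frame -> (1, 64, 112, 112)

# 3D convolution: the 3x3x3 kernel also spans the temporal axis,
# so the output mixes appearance and motion information.
conv3d = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
spatio_temporal = conv3d(clip)           # -> (1, 64, 16, 112, 112)

print(per_frame.shape, spatio_temporal.shape)
```

Running several such streams in parallel on different inputs (RGB, optical flow, depth) and fusing their outputs is, in essence, the multi-stream design mentioned above.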

Single person based abnormal human action recognition

Table-1

Table 1: Effectiveness of wearable devices and visual systems for fall detection.

Relatively little work has applied deep architectures to the recognition of anomalous human actions for AAL (ambient assisted living) and smart homes. Deeper architectures must be employed in order to build stronger algorithms (Zhang et al., 2017) for applications such as AAL and smart homes that recognize single person-specific aberrant behaviours. The single person based aberrant human action recognition work is presented here, and Table 1 summarizes the generalized deep frameworks for human action recognition.

Fig-1

Figure 1: (a) LSTM network with hidden layers containing LSTM cells and a final Softmax layer at the top. (b) Bi-directional LSTM network with two parallel tracks processing the sequence toward the future (green) and toward the past (red). (c) Convolutional network containing layers of convolution and max-pooling, followed by fully-connected layers and a Softmax layer. (d) Fully connected feed-forward network with hidden (ReLU) layers.

Typical behavioural changes in dementia (Alzheimer's, Parkinson's disease) include disturbed sleep, difficulty walking, and failure to perform daily activities. Monitoring elderly people in smart homes to spot these changes early can be more helpful than waiting for a medical diagnosis; a significant obstacle to such non-medical diagnosis, however, is the lack of real-world data on dementia sufferers. (Arifoglu and Bouchachia, 2017) addressed abnormal behaviour detection for elderly people with dementia using three variants of recurrent neural networks (RNNs): Vanilla RNNs (VRNN), Long Short-Term Memory RNNs (LSTM), and Gated Recurrent Unit RNNs (GRU), see Fig. 1; they also introduced an approach to generate synthetic data representing the behaviour of dementia patients. (Park et al., 2018) recently used residual-recurrent neural networks (Residual-RNN) to assess sequences of activities captured in ambient intelligent environments, such as smart homes and cities equipped with various sensors. Their comprehensive trials support the claim that, in terms of recognition accuracy, the proposed model (Fig. 2) outperforms LSTM and Gated Recurrent Units (GRU); in terms of computational speed, however, GRU performs marginally better. In a different study, (Hammerla et al., 2016) trained LSTMs, bi-directional LSTMs, and convolutional networks on movement data obtained from wearable sensors to detect abnormal actions of the patient under observation. A minimal recurrent-classifier sketch follows below.
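As a rough illustration of the recurrent pipelines in Fig. 1, the following PyTorch sketch classifies fixed-length windows of sensor readings as normal or abnormal. The sensor count, window length, and class set are hypothetical, not the configuration used by any of the cited studies.

```python
# Hypothetical sketch: an LSTM over windows of smart-home sensor features,
# with a final linear (Softmax) head as in Fig. 1(a). Shapes are assumptions.
import torch
import torch.nn as nn

class SensorLSTM(nn.Module):
    def __init__(self, n_sensors=32, hidden=64, n_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(n_sensors, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)  # scores: normal vs. abnormal

    def forward(self, x):                 # x: (batch, time, n_sensors)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])      # classify from the last time step

model = SensorLSTM()
windows = torch.randn(4, 50, 32)          # 4 windows of 50 sensor snapshots
logits = model(windows)                   # -> (4, 2)
```

Swapping nn.LSTM for nn.GRU or nn.RNN yields the GRU and vanilla-RNN variants studied by (Arifoglu and Bouchachia, 2017), and passing bidirectional=True gives the two-track structure of Fig. 1(b).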

Fig-2

Figure 2: Residual-RNN structure for activity recognition in smart homes.
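The exact wiring of (Park et al., 2018) is not reproduced here, but the core residual idea behind Fig. 2 can be sketched as a skip connection around a recurrent layer, letting gradients bypass the recurrence. The dimensions below are assumptions for illustration.

```python
# Hedged sketch of a residual connection around a recurrent layer,
# illustrating the skip-connection idea of a Residual-RNN (not the
# authors' exact architecture). Shapes are assumptions.
import torch
import torch.nn as nn

class ResidualRNNBlock(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, x):                  # x: (batch, time, dim)
        out, _ = self.rnn(x)
        return x + out                     # identity shortcut around the GRU

x = torch.randn(4, 50, 64)
y = ResidualRNNBlock()(x)                  # same shape as the input
```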

Given that deep models are outperforming the state of the art in almost all application areas, including face recognition, handwriting detection (Tolosana et al., 2018), and audio signal processing, several related strategies have appeared: pre-trained, fully-connected networks for automatic assessment of Parkinson's disease (Hammerla and Plotz, 2015), DNNs replacing the emission models (i.e., GMM, RF) of Hidden Markov Models (HMMs) (Zhang et al., 2015), and a DNN-based mobile audio sensing framework (Alsheikh et al., 2016). These works (Hammerla and Plotz, 2015; Zhang et al., 2015; Alsheikh et al., 2016) rely only on scarce, labeled accelerometer and wearable-sensor data, which severely restricts their applicability in real-world situations where vision sensors are used.
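To make the emission-replacement idea concrete, here is a hedged sketch (not the method of Zhang et al., 2015): a small feed-forward network scores each frame per hidden state, and standard Viterbi decoding uses those scores in place of GMM emission likelihoods. The state set, feature shapes, and transition matrix are toy assumptions.

```python
# Toy sketch of "DNN as HMM emission model": a feed-forward net produces
# per-frame state posteriors, which Viterbi decoding uses as emission scores.
import numpy as np
import torch
import torch.nn as nn

N_STATES, FEAT_DIM, T = 3, 8, 20          # hypothetical: 3 activity states

dnn = nn.Sequential(nn.Linear(FEAT_DIM, 32), nn.ReLU(), nn.Linear(32, N_STATES))

frames = torch.randn(T, FEAT_DIM)                   # stand-in sensor features
log_post = torch.log_softmax(dnn(frames), dim=1)    # log p(state | frame)
emissions = log_post.detach().numpy()               # replaces GMM log-likelihoods

# Toy transition matrix with a self-transition bias (assumption).
trans = np.full((N_STATES, N_STATES), 0.1)
np.fill_diagonal(trans, 0.8)
log_trans = np.log(trans / trans.sum(1, keepdims=True))

# Standard Viterbi decoding over the DNN emission scores.
delta = emissions[0].copy()
back = np.zeros((T, N_STATES), dtype=int)
for t in range(1, T):
    scores = delta[:, None] + log_trans             # indexed (prev, cur)
    back[t] = scores.argmax(0)
    delta = scores.max(0) + emissions[t]
path = [int(delta.argmax())]
for t in range(T - 1, 0, -1):
    path.append(int(back[t][path[-1]]))
print("decoded state sequence:", path[::-1])
```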

In the next blog, we will discuss "Multiple persons based abnormal event detection" works in detail.