Video recognition and categorization (Part 6 - A basic tutorial)

A basic implementation of a video classification task.


A video is a sequence of frames, or images, played one after the other. Most of the videos we encounter in daily life have a frame rate greater than 30 frames per second. As a result, even short clips carry a considerable amount of data compared with image classification. Because consecutive frames are highly correlated, it is common to skip intermediate frames and process fewer frames per second.
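To make this concrete, here is a minimal sketch of the idea behind uniform temporal subsampling (the strategy used by pytorchvideo's UniformTemporalSubsample transform later in this post): instead of processing every frame, we keep a handful of evenly spaced ones.

import torch

# Keep 8 evenly spaced frames out of a 60-frame clip (2 s at 30 fps)
# instead of processing all of them.
num_frames, num_samples = 60, 8
indices = torch.linspace(0, num_frames - 1, num_samples).long()
print(indices)  # tensor([ 0,  8, 16, 25, 33, 42, 50, 59])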

In this post, we'll show you how to use PyTorchVideo models, datasets, and transforms to build a basic video classification training pipeline. We will train a video classifier on the Kinetics dataset using a 3D ResNet and a standard set of video transform augmentations. Because PyTorchVideo doesn't ship training code, we'll rely on PyTorch Lightning, a lightweight PyTorch training framework.

Install Necessary Libraries and Data Preparation

Please install the necessary libraries using the following commands:

!pip install torch pytorch_lightning pytorchvideo
!pip install --upgrade youtube-dl

The dataset must first be prepared. To train our model, we'll need a training dataset, as well as a test or validation dataset to evaluate it. For this purpose, we will use the Kinetics dataset: a collection of large-scale, high-quality datasets of URL links to up to 650,000 video clips covering 400/600/700 human action classes, depending on the dataset version. The videos include human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands and hugging. Each action class has at least 400/600/700 video clips, and each clip is human-annotated with a single action class and lasts around 10 seconds.

Figure 1: A few sample videos from the Kinetics dataset.

To prepare the Kinetics dataset, please follow the steps below:

  • First, clone this repository and make sure that all the submodules are also cloned properly.
    !git clone https://github.com/activitynet/ActivityNet.git
    %cd ActivityNet/Crawler/Kinetics
  • Second, download a dataset split by calling:
    mkdir "data_dir"
    python download.py "dataset_split".csv "data_dir"
    Here, we choose the Kinetics 400 version. Note that the dataloader defined later expects each split in its own subdirectory, so we give each split its own output folder (a quick sanity check of the download follows this list):
    !mkdir kinetics-400
    !python download.py "data/kinetics-400_train.csv" "kinetics-400/train"
    !python download.py "data/kinetics-400_val.csv" "kinetics-400/val"
    !python download.py "data/kinetics-400_test.csv" "kinetics-400/test"

Define Dataloader

In this step, we define the Kinetics data module, which provides the train and validation sets along with their corresponding transforms.

import os

import pytorch_lightning
import pytorchvideo.data
import torch.utils.data
from pytorchvideo.transforms import (
    ApplyTransformToKey,
    Normalize,
    RandomShortSideScale,
    RemoveKey,
    ShortSideScale,
    UniformTemporalSubsample,
)
from torchvision.transforms import (
    Compose,
    Lambda,
    RandomCrop,
    RandomHorizontalFlip,
)


class KineticsDataModule(pytorch_lightning.LightningDataModule):
    # Dataset configuration
    # Insert the data path here
    _DATA_PATH = "ActivityNet/Crawler/Kinetics/kinetics-400"
    _CLIP_DURATION = 2  # Duration (in seconds) of the clip sampled from each video
    _BATCH_SIZE = 8
    _NUM_WORKERS = 8  # Number of parallel processes fetching data

    def train_dataloader(self):
        """
        Create the Kinetics train partition from the list of video labels
        in {self._DATA_PATH}/train
        """
        train_transform = Compose(
            [
                ApplyTransformToKey(
                    key="video",
                    transform=Compose(
                        [
                            UniformTemporalSubsample(8),
                            Lambda(lambda x: x / 255.0),
                            Normalize((0.45, 0.45, 0.45), (0.225, 0.225, 0.225)),
                            RandomShortSideScale(min_size=256, max_size=320),
                            RandomCrop(244),
                            RandomHorizontalFlip(p=0.5),
                        ]
                    ),
                ),
            ]
        )
        train_dataset = pytorchvideo.data.Kinetics(
            data_path=os.path.join(self._DATA_PATH, "train"),
            clip_sampler=pytorchvideo.data.make_clip_sampler("random", self._CLIP_DURATION),
            transform=train_transform,
        )
        return torch.utils.data.DataLoader(
            train_dataset,
            batch_size=self._BATCH_SIZE,
            num_workers=self._NUM_WORKERS,
        )

    def val_dataloader(self):
        """
        Create the Kinetics validation partition from the list of video labels
        in {self._DATA_PATH}/val
        """
        val_transform = Compose(
            [
                ApplyTransformToKey(
                    key="video",
                    transform=Compose(
                        [
                            UniformTemporalSubsample(8),
                            Lambda(lambda x: x / 255.0),
                            Normalize((0.45, 0.45, 0.45), (0.225, 0.225, 0.225)),
                        ]
                    ),
                ),
            ]
        )
        val_dataset = pytorchvideo.data.Kinetics(
            data_path=os.path.join(self._DATA_PATH, "val"),
            clip_sampler=pytorchvideo.data.make_clip_sampler("uniform", self._CLIP_DURATION),
            transform=val_transform,
        )
        return torch.utils.data.DataLoader(
            val_dataset,
            batch_size=self._BATCH_SIZE,
            num_workers=self._NUM_WORKERS,
        )
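Before training, it is worth confirming that a batch comes out in the shape the model will expect. A quick smoke test (this assumes the dataset was downloaded as described above; the shapes follow from the transforms, 8 subsampled frames and a 244-pixel crop):

data_module = KineticsDataModule()
batch = next(iter(data_module.train_dataloader()))
print(batch["video"].shape)  # expected: torch.Size([8, 3, 8, 244, 244]), i.e. (B, C, T, H, W)
print(batch["label"].shape)  # expected: torch.Size([8]), one class index per clip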

Model Implementation

All PyTorchVideo models and layers can be built with simple, reproducible factory functions. We call this the "flat" model interface because the arguments don't require hierarchical configs. Below is an example of how to create a default ResNet.

import pytorchvideo.models.resnet
import torch.nn as nn


def make_kinetics_resnet():
    return pytorchvideo.models.resnet.create_resnet(
        input_channel=3,      # RGB input from Kinetics
        model_depth=50,       # For the tutorial let's just use a 50-layer network
        model_num_class=400,  # Kinetics has 400 classes, so the final head must match
        norm=nn.BatchNorm3d,
        activation=nn.ReLU,
    )
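A quick way to verify the factory works is to push a dummy clip through it. The sketch below assumes the (B, C, T, H, W) input layout described in the training step later on:

import torch

model = make_kinetics_resnet()
dummy_clip = torch.randn(1, 3, 8, 244, 244)  # (B, C, T, H, W)
logits = model(dummy_clip)
print(logits.shape)  # expected: torch.Size([1, 400]), one logit per Kinetics class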

Next, we define the training and validation steps, as well as the optimizer.

import torch
import torch.nn as nn
import torch.nn.functional as F


class VideoClassificationLightningModule(pytorch_lightning.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = make_kinetics_resnet()

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        # The model expects a video tensor of shape (B, C, T, H, W), which is the
        # format provided by the dataset.
        y_hat = self.model(batch["video"])
        # Compute cross-entropy loss; loss.backward() is called behind the scenes
        # by PyTorch Lightning after this method returns.
        loss = F.cross_entropy(y_hat, batch["label"])
        # Log the train loss to TensorBoard.
        self.log("train_loss", loss.item())
        return loss

    def validation_step(self, batch, batch_idx):
        y_hat = self.model(batch["video"])
        loss = F.cross_entropy(y_hat, batch["label"])
        self.log("val_loss", loss)
        return loss

    def configure_optimizers(self):
        """
        Set up the Adam optimizer. Note that this function can also return a
        learning-rate scheduler, which is often useful for training video models.
        """
        return torch.optim.Adam(self.parameters(), lr=1e-1)
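As the docstring notes, configure_optimizers can also return a learning-rate scheduler. A minimal sketch of how that might look (the subclass name, the CosineAnnealingLR choice, and the T_max value are all illustrative assumptions, not part of the original tutorial):

class VideoClassificationWithScheduler(VideoClassificationLightningModule):
    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-1)
        # Anneal the learning rate over 10 epochs; values here are illustrative.
        scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)
        return [optimizer], [scheduler]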

Model Training

We put everything together in a train() function, which includes the model, the data module, and a trainer.

def train():
    classification_module = VideoClassificationLightningModule()
    data_module = KineticsDataModule()
    trainer = pytorch_lightning.Trainer(gpus=1)
    trainer.fit(classification_module, data_module)
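For longer runs, you will usually want to cap the number of epochs and enable mixed precision. A hedged variant of train() (the argument values are illustrative, and the argument names follow the PyTorch Lightning 1.x API used in this post):

def train():
    classification_module = VideoClassificationLightningModule()
    data_module = KineticsDataModule()
    trainer = pytorch_lightning.Trainer(
        gpus=1,          # single-GPU training
        max_epochs=10,   # illustrative cap on training length
        precision=16,    # mixed precision helps with large video batches
    )
    trainer.fit(classification_module, data_module)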

Finally, kick off the training:

train()

Hopefully this video classification series gives you a clear overview of the task, so that you can build on it with bolder ideas of your own.