Multispectral Pedestrian Detection via Deep Fusion Neural Network

Introduction

Motivation: Create an improved pedestrian detection system.

Accurate detection and identification of humans is key to behavior analysis, and it also supports applications in surveillance and autonomous driving. Conventional visible color imaging detects pedestrians well under normal conditions, when the person is clearly visible. However, factors such as dim lighting, large distances, and cluttered backgrounds can greatly reduce accuracy. Because ambient lighting has less effect on thermal imaging, thermal cameras, which sense long-wavelength infrared radiation, have become widely used in human tracking and activity recognition. Thermal images, however, tend to lose the fine visual details of people that optical cameras can capture.

Objectives: Utilize the advantages of both visible and thermal sensors by proposing a deep fusion neural network based on the state-of-the-art YOLOv3 model.

The thermal and color cameras each capture details that the other sometimes misses. Building our fusion models on YOLOv3 gives us detection networks that are both accurate and fast.

Methods: Create, test, and compare several fusion models for best performance and accuracy.

We propose four models that fuse the color and thermal branches at different points in the network. Each model is trained independently on our data. The models are then tested, compared, and ranked by performance. All models were trained and tested on the Korea Advanced Institute of Science and Technology (KAIST) Multispectral Pedestrian Dataset.

Proposed Fusion Models

Early Fusion: Concatenates the feature maps from the color and thermal branches immediately after the second convolutional block in Darknet-53. We then introduce a 1×1 convolutional layer that reduces the channel dimension of the concatenated feature map. The output feeds the remaining Darknet-53 layers and the YOLOv3 prediction blocks.
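The concatenate-then-reduce step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `fuse_concat_1x1`, the tensor sizes, and the random weights are all assumptions made for illustration. It relies only on the fact that a 1×1 convolution is a per-pixel linear map over channels.

```python
import numpy as np

def fuse_concat_1x1(rgb_feat, thermal_feat, weight, bias=None):
    """Concatenate two (C, H, W) feature maps on the channel axis,
    then apply a 1x1 convolution given as a (C_out, 2*C) matrix."""
    if rgb_feat.shape != thermal_feat.shape:
        raise ValueError("both branches must produce feature maps of the same shape")
    fused = np.concatenate([rgb_feat, thermal_feat], axis=0)  # (2C, H, W)
    # A 1x1 convolution mixes channels independently at every pixel.
    out = np.einsum("oc,chw->ohw", weight, fused)  # (C_out, H, W)
    if bias is not None:
        out = out + bias[:, None, None]
    return out

# Illustrative sizes only; real Darknet-53 feature maps are much larger.
rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
rgb = rng.standard_normal((C, H, W))
thermal = rng.standard_normal((C, H, W))
w = rng.standard_normal((C, 2 * C))  # reduce 2C channels back to C

fused = fuse_concat_1x1(rgb, thermal, w)
print(fused.shape)  # (4, 8, 8)
```

Reducing the concatenated map back to C channels lets the fused output feed the remaining backbone layers without changing their expected input width.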

Halfway Fusion: Concatenates the feature maps from the color and thermal branches after the middle (11th) convolutional block in Darknet-53. As in early fusion, the output of the concatenation layer feeds the remaining half of the Darknet-53 layers and the YOLOv3 prediction blocks. Compared with early fusion, the fused features contain more detailed information.
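The difference between early and halfway fusion is only where the two branches merge. The toy sketch below shows this with a configurable fusion point. Everything in it is hypothetical: the blocks are simple scale-and-ReLU stand-ins rather than convolutions, the total block count `NUM_BLOCKS` is an assumed value, and the 1×1 reduction layer is omitted for brevity.

```python
import numpy as np

def make_block(scale):
    """Toy stand-in for a convolutional block: scale, then ReLU (shape-preserving)."""
    return lambda x: np.maximum(scale * x, 0.0)

def two_stream_forward(rgb, thermal, rgb_blocks, thermal_blocks,
                       shared_blocks, fusion_after):
    """Run `fusion_after` blocks per modality, concatenate on the channel
    axis, then run the remaining blocks on the fused features."""
    if not 0 <= fusion_after <= len(shared_blocks):
        raise ValueError("fusion_after must lie within the backbone depth")
    for i in range(fusion_after):
        rgb = rgb_blocks[i](rgb)
        thermal = thermal_blocks[i](thermal)
    x = np.concatenate([rgb, thermal], axis=0)
    for block in shared_blocks[fusion_after:]:
        x = block(x)
    return x

NUM_BLOCKS = 23  # assumed backbone depth, for illustration only
rgb_blocks = [make_block(1.0) for _ in range(NUM_BLOCKS)]
thermal_blocks = [make_block(0.9) for _ in range(NUM_BLOCKS)]
shared_blocks = [make_block(1.05) for _ in range(NUM_BLOCKS)]

rng = np.random.default_rng(0)
rgb = rng.standard_normal((4, 16, 16))
thermal = rng.standard_normal((4, 16, 16))

early = two_stream_forward(rgb, thermal, rgb_blocks, thermal_blocks,
                           shared_blocks, fusion_after=2)
halfway = two_stream_forward(rgb, thermal, rgb_blocks, thermal_blocks,
                             shared_blocks, fusion_after=11)
print(early.shape, halfway.shape)  # (8, 16, 16) (8, 16, 16)
```

A later fusion point means more blocks are run separately per modality and fewer operate on the fused features, which is the trade-off the models compare.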

Late Fusion I: Concatenates the feature maps from the color and thermal branches at the output of Darknet-53. The concatenation layers take their inputs from the RGB channel, the thermal channel, and the fused channel.

Late Fusion II: Combines Halfway Fusion and Late Fusion I. The prediction blocks receive features from both the single-channel layers and the fusion layers.

Results

Our results showed that halfway fusion performed best among our four proposed models, and that all of our models outperformed the baseline for pedestrian detection. Late Fusion II is an interesting case that merits further study and could serve as a starting point for related new architectures. Overall, our research confirmed the benefit of adding new sensor information to conventional single-sensor pedestrian detection methods.