Fire Detection via Effective Vision Transformers
- The Korean Institute of Next Generation Computing (한국차세대컴퓨팅학회)
- Journal of the Korean Institute of Next Generation Computing (한국차세대컴퓨팅학회 논문지)
- 17(5)
- 2021.10, pp. 21-30 (10 pages)
- DOI: http://dx.doi.org/10.23019/kingpc.17.5.202110.002
In today"s modern age, smart and safe cities are one of the major concerns of the research community. The cities are surrounded by open areas, agricultural land, and forests, where fire incidence can make human lives threatening, damaging their properties as well. Recently, vision sensors-based fire detection has attracted computer vision domain experts, where the leading performance is achieved by a variety of convolution neural networks (CNN) in the recent literature. However, these techniques are translation invariant, locality-sensitive, and lacking a global understanding of images. Furthermore, CNN-based models use the pooling layers strategy for dimensionality reduction to reduce the computational cost but it also loses a lot of meaningful information such as the precise location of the most active feature detector. To overcome these problems, in this work, we developed Vision Transformers (ViT) based model for fire detection. The ViT split the input image into image patches and then feed these patches to the transformer in a sequence structure similar to word embeddings. We evaluate the performance of the proposed work on the benchmark fire dataset and achieve good results when compared to state-of-the-art (SOTA) CNN methods.
In today"s modern age, smart and safe cities are one of the major concerns of the research community. The cities are surrounded by open areas, agricultural land, and forests, where fire incidence can make human lives threatening, damaging their properties as well. Recently, vision sensors-based fire detection has attracted computer vision domain experts, where the leading performance is achieved by a variety of convolution neural networks (CNN) in the recent literature. However, these techniques are translation invariant, locality-sensitive, and lacking a global understanding of images. Furthermore, CNN-based models use the pooling layers strategy for dimensionality reduction to reduce the computational cost but it also loses a lot of meaningful information such as the precise location of the most active feature detector. To overcome these problems, in this work, we developed Vision Transformers (ViT) based model for fire detection. The ViT split the input image into image patches and then feed these patches to the transformer in a sequence structure similar to word embeddings. We evaluate the performance of the proposed work on the benchmark fire dataset and achieve good results when compared to state-of-the-art (SOTA) CNN methods.