Efficient transformer networks for video object detection

Sustainable Machine Learning


Transformer networks have emerged as a “Swiss army knife” in machine learning, showing superior performance on a wide range of tasks, from natural language processing (e.g. ChatGPT) to image generation (e.g. DALL-E). However, their adoption for other tasks is slowed by the high resource and energy consumption of these models. The problem is particularly pronounced in transformer models for object detection in high-resolution images, such as DETR [1,2] (see Fig. 1 for an illustration of the DETR model). Scaling these models up leads to high latency and large memory requirements. These costs weigh all the more heavily because image inputs contain a great deal of redundant information and vision tasks demand fast detection models.
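To make the scaling issue concrete, the following back-of-the-envelope sketch estimates how the attention maps of a single self-attention layer grow with image resolution. The feature-map stride of 32, the 8 attention heads, and 32-bit floats are illustrative assumptions, not values taken from this project description, and `attention_cost` is a hypothetical helper.

```python
def attention_cost(height, width, stride=32, heads=8, bytes_per_value=4):
    """Estimate token count and attention-map memory for one self-attention layer.

    Assumes (illustratively) that the backbone downsamples the image by `stride`
    and that every head stores a full tokens x tokens weight matrix.
    """
    tokens = (height // stride) * (width // stride)
    attn_values = heads * tokens * tokens  # one tokens x tokens map per head
    return tokens, attn_values * bytes_per_value


for h, w in [(480, 640), (960, 1280)]:
    tokens, n_bytes = attention_cost(h, w)
    print(f"{h}x{w}: {tokens} tokens, {n_bytes / 2**20:.1f} MiB of attention weights")
```

Under these assumptions, doubling both image sides quadruples the number of tokens and multiplies the attention-map memory by 16, because the map grows quadratically with the token count.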

Figure 1: Schematics of the DETR model (adapted from [1]).

To address these efficiency issues, sparsity is a promising approach [3,4]. Sparsity has demonstrated its potential to reduce memory consumption, lower inference latency, and improve overall performance in many deep-learning models. Notably, recent observations indicate a degree of inherent sparsity in DETR-based models.

This project aims to conduct a comprehensive analysis of DETR-like models and their main components such as the (self-)attention mechanisms, focusing on the emergence of sparsity in various model components, including network activations and parameters. The goal is to achieve increased efficiency in terms of resource consumption, speed, and performance without significant sacrifices in model capabilities.
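As a starting point for such an analysis, sparsity can be quantified as the fraction of (near-)zero entries in a tensor. The sketch below uses NumPy with a randomly initialised, hypothetical feed-forward block; the layer sizes, the negative bias, and the pruning threshold of 0.1 are assumptions chosen only for illustration and do not come from a trained DETR model.

```python
import numpy as np


def sparsity(x, tol=0.0):
    """Fraction of entries whose magnitude is at most `tol`."""
    x = np.asarray(x)
    return float(np.mean(np.abs(x) <= tol))


rng = np.random.default_rng(0)

# Hypothetical feed-forward block: 16 tokens, model width 32, hidden width 128.
tokens = rng.standard_normal((16, 32))
w1 = rng.standard_normal((32, 128)) / np.sqrt(32)
b1 = np.full(128, -1.0)  # negative bias, chosen only so the ReLU output is sparse

# Activation sparsity: share of hidden units that are exactly zero after ReLU.
hidden = np.maximum(tokens @ w1 + b1, 0.0)

# Parameter sparsity: share of weights zeroed by a simple magnitude threshold.
w1_pruned = np.where(np.abs(w1) < 0.1, 0.0, w1)

print(f"activation sparsity: {sparsity(hidden):.2f}")
print(f"parameter sparsity:  {sparsity(w1_pruned):.2f}")
```

In an actual study, the same `sparsity` measure would be applied to activations and weights recorded from a trained model rather than to random tensors.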

Research Objectives:

  • Analyze the characteristics of attention mechanisms in DETR-based models.
  • Investigate the emergence of sparsity in different components of DETR-like models, including activations and parameters.
  • Optimize models for efficiency, targeting resource consumption, speed, and overall performance.
  • Evaluate the impact of sparsity-induced optimizations on model performance.

This research contributes to the ongoing efforts to make transformer-based models, especially those used in object detection tasks, more practical and scalable by addressing key efficiency challenges through the lens of sparsity analysis.
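The attention analysis named in the objectives could begin with simple summary statistics of the attention weights. The following NumPy sketch computes scaled dot-product attention for random, hypothetical queries and keys and reports two sparsity indicators: the fraction of weights below a threshold and the mean entropy per query. The shapes, the threshold of 0.01, and the helper names are assumptions for illustration; real attention maps would be extracted from a trained model.

```python
import numpy as np


def softmax(scores, axis=-1):
    """Numerically stable softmax along `axis`."""
    scores = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=axis, keepdims=True)


def attention_weights(q, k):
    """Scaled dot-product attention weights, one row per query."""
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d))


def attention_stats(attn, threshold=1e-2):
    """Return (fraction of weights below `threshold`, mean entropy per query)."""
    near_zero = float(np.mean(attn < threshold))
    entropy = -np.sum(attn * np.log(attn + 1e-12), axis=-1)
    return near_zero, float(entropy.mean())


rng = np.random.default_rng(0)
q = rng.standard_normal((10, 64))   # 10 hypothetical object queries
k = rng.standard_normal((300, 64))  # 300 hypothetical image tokens

attn = attention_weights(q, k)
near_zero, mean_entropy = attention_stats(attn)
print(f"weights below threshold: {near_zero:.2f}")
print(f"mean entropy: {mean_entropy:.2f} (uniform attention: {np.log(300):.2f})")
```

A low entropy relative to the uniform value, or a large share of near-zero weights, indicates that each query attends to only a few tokens, which is the kind of sparsity that could be exploited for efficiency.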


Requirements:

  • Proficiency in Python, especially with deep learning libraries like PyTorch and NumPy.
  • Completed courses, internships, or projects related to artificial intelligence or machine learning.
  • Enthusiasm for learning and a strong interest in mathematically analysing attention mechanisms.

References:

[1] Carion, Nicolas, et al. "End-to-end object detection with transformers." European Conference on Computer Vision. Cham: Springer International Publishing, 2020.

[2] Zhu, Xizhou, et al. "Deformable DETR: Deformable transformers for end-to-end object detection." arXiv preprint arXiv:2010.04159 (2020).

[3] Subramoney, Anand, et al. "Efficient recurrent architectures through activity sparsity and sparse back-propagation through time." The Eleventh International Conference on Learning Representations. 2023.

[4] Li, Zonglin, et al. "The lazy neuron phenomenon: On emergence of activation sparsity in transformers." The Eleventh International Conference on Learning Representations. 2023.
