Research

I study multimodal retrieval and reasoning over large-scale video data. The work centers on indexing, evidence retrieval, and evaluation for long-form collections.

What I work on

The central problem is to retrieve the right evidence from long-form video and use it for reasoning. This is difficult because video is temporally distributed, weakly structured, and queried through language, metadata, and temporal constraints. I build systems that index video into searchable units and connect retrieval outputs to grounded reasoning pipelines.

A second problem is evaluation. Reported progress on video tasks can change with frame sampling, segmentation, and benchmark construction. I study evaluation protocols that expose those dependencies and support comparison under real operational constraints.
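As a minimal illustration of the frame sampling dependency, the sketch below compares two common strategies under the same frame budget. It is not taken from the benchmark listed on this page: the function names, parameters, and the 60-second clip are hypothetical and chosen only for illustration.

```python
from typing import List, Optional


def uniform_indices(num_frames: int, k: int) -> List[int]:
    """Pick k frame indices spread evenly over the whole video.

    The video is split into k equal segments and the midpoint of each
    segment is selected (hypothetical helper, for illustration only).
    """
    if num_frames <= 0 or k <= 0:
        return []
    k = min(k, num_frames)
    step = num_frames / k
    return [int(step * i + step / 2) for i in range(k)]


def fixed_rate_indices(
    num_frames: int,
    native_fps: float,
    target_fps: float,
    max_frames: Optional[int] = None,
) -> List[int]:
    """Pick one frame every 1/target_fps seconds, starting at frame 0.

    If max_frames is set, the list is truncated to that budget, which
    means the end of a long video may never be seen.
    """
    if num_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        return []
    stride = max(1, round(native_fps / target_fps))
    indices = list(range(0, num_frames, stride))
    return indices if max_frames is None else indices[:max_frames]


if __name__ == "__main__":
    # Assumed example: a 60-second clip at 25 fps and an 8-frame budget.
    total = 60 * 25
    print("uniform   :", uniform_indices(total, 8))
    print("fixed-rate:", fixed_rate_indices(total, 25.0, 1.0, max_frames=8))
```

With the same 8-frame budget, the uniform sampler spans the whole clip, while the fixed-rate sampler at 1 fps only reaches the first 7 seconds. A model evaluated on the two inputs sees different evidence, which is one way reported scores can diverge.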

Research pillars

Video understanding

Long-form video carries meaning across time, modalities, and context. I build shot detection, chaptering, and multimodal representation pipelines that turn raw footage into structured, machine-understandable units.

Video retrieval

Multimodal video queries require systems to align language, vision, and time. I design retrieval and reasoning pipelines that ground outputs in relevant video segments.

AI fairness

Models inherit and amplify the biases present in their training data. I study how gaps in the representation of disability and minority groups propagate through vision and language systems, and how to evaluate and mitigate them.

Selected work

This selection is curated around the core agenda of multimodal retrieval and reasoning over video data. The complete list is below.

Paper
Towards Retrieval Augmented Generation over Large Video Libraries

Studies how to ground generation in large video libraries by retrieving relevant video segments as evidence. HSI 2024.

arXiv · Hugging Face · Research note · DOI
Paper
Frame Sampling Strategies Matter: A Benchmark for small vision language models

Shows that frame sampling changes measured video reasoning performance and provides a benchmark for more reliable comparison.

arXiv · Code · Hugging Face · Research note
Paper
Multimodal Chaptering for Long-Form TV Newscast Video

Addresses the lack of temporal structure in long broadcasts by building a multimodal chaptering pipeline for retrieval and analysis.

arXiv · Hugging Face · Colab · Research note
Patent
Computer-based platforms and methods for efficient AI-based digital video shot indexing

Presents a shot-level indexing method designed for large production video platforms. US 12,288,377 (2025).

Patent record

Reviewing activities

I reviewed for ICASSP 2026, ICPRAI 2026, and ICME 2025. I also served on the scientific committee of JETSAN 2025.

Teaching

Speaker Diarization — Guest lecture

2024 · Graduate course on multimodal speech and speaker recognition. Covered diarization pipelines, multi-stream voice activity detection, evaluation, and fairness in real-world conditions.

Publications

Recent work centers on multimodal retrieval and reasoning over video data. Earlier publications cover multimodal speech processing, robustness, and fairness, which inform the current research agenda. For authoritative citation data, see Google Scholar.

2025

Video understanding
Frame Sampling Strategies Matter: A Benchmark for small vision language models

Marija Brkic, Anas Filali Razzouki, Yannis Tevissen, Khalil Guetari, Mounim A. El Yacoubi · arXiv:2509.14769 · arXiv

Code · Hugging Face · Research note
Video retrieval
Computer-based platforms and methods for efficient AI-based digital video shot indexing

Frédéric Petitpont, Philippe Petitpont, Yannis Tevissen, Khalil Guetari · US Patent 12,288,377 · Apr 2025

Patent record

2024

Video understanding
Systems and methods for AI generation of image captions enriched with multiple AI modalities

Frédéric Petitpont, Yannis Tevissen, Khalil Guetari · US Patent 12,148,233 · Nov 2024

Patent record
Video retrieval
Towards Retrieval Augmented Generation over Large Video Libraries

Yannis Tevissen, Khalil Guetari, Frédéric Petitpont · HSI 2024 · Best Presentation Paper

arXiv · Hugging Face · Research note · DOI
Fairness
Disability Representations: Finding Biases in Automatic Image Generation

Yannis Tevissen · CVPR 2024 Workshop AVA · arXiv

Hugging Face
Video understanding
Multimodal Chaptering for Long-Form TV Newscast Video

Khalil Guetari, Yannis Tevissen, Frédéric Petitpont · 2024 · arXiv

Hugging Face · Colab · Research note
Video understanding
Inserting Faces inside Captions: Image Captioning with Attention Guided Merging

Yannis Tevissen, Khalil Guetari, Marine Tassel, Erwan Kerleroux, Frédéric Petitpont · arXiv:2405.02305 · arXiv

Hugging Face Dataset
Speech processing
Privacy Preserving Personal Assistant with On-Device Diarization and Spoken Dialogue System for Home and Beyond

Gérard Chollet et al. · IHIET 2024

2023

Speech processing
Diarisation multimodale: vers des modèles robustes et justes en contexte réel

Yannis Tevissen · Institut Polytechnique de Paris
Speech processing
Détection d'activité vocale Multi-flux pour la Diarisation du locuteur

Yannis Tevissen, Jérôme Boudy, Gérard Chollet, Frédéric Petitpont · GRETSI 2023
Speech processing
Home monitoring for frailty detection through sound and speaker diarization analysis

Yannis Tevissen et al. · JETSAN 2023
Fairness
Towards measuring and scoring speaker diarization fairness

Yannis Tevissen, Jérôme Boudy, Gérard Chollet, Frédéric Petitpont · arXiv:2302.09991 · arXiv

2022

Speech processing
Multi-stream voice activity detection for robust speaker diarization

Yannis Tevissen, Jérôme Boudy, Gérard Chollet · GDR ISIS 2022
Speech processing
The Newsbridge-Telecom SudParis VoxCeleb Speaker Recognition Challenge 2022 System Description

Yannis Tevissen, Jérôme Boudy, Frédéric Petitpont · VoxCeleb SRC 2022 Task 4