Rethinking Vision-Language Model in Face Forensics:
Multi-modal Interpretable Forged Face Detector

Anonymous Submission

Demo Videos

Demo videos show results on: FF++, FFIW, Wild-Deepfake, Celeb-DF, DFDC, StyleGANv2, Instant-ID, Midjourney, and Real.

Abstract

Deepfake detection is a long-established research topic crucial for combating the spread of malicious misinformation. Unlike previous methods that provide either a binary classification result or a textual explanation for deepfake detection, we propose a novel method that delivers both simultaneously. Our method harnesses the multi-modal learning power of the pre-trained CLIP and the interpretability of large language models (LLMs) to enhance both the generalization and interpretability of deepfake detection. Specifically, we introduce a multi-modal face forgery detector (M2F2-Det) that employs specially designed face forgery prompt learning, integrating the zero-shot capabilities of the pre-trained CLIP to improve generalization to unseen forgeries. In addition, M2F2-Det incorporates an LLM to provide detailed explanations for its detection decisions, offering strong interpretability by bridging the gap between natural language and the subtle nuances of facial forgery. Empirically, we evaluate M2F2-Det on both detection and sentence-generation tasks, achieving state-of-the-art performance on each and demonstrating its effectiveness in detecting and explaining diverse and unseen forgeries. Code and models will be released upon publication.

Introduction


(a) and (b) illustrate conventional deepfake detectors and DDVQA-BLIP, which take an image as input and output a fake probability or a textual description, respectively. (c) In this work, we propose a multi-modal face forgery detector (M2F2-Det) that produces both the fake probability and a reasoning description, as sketched below.
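
As a rough illustration of this dual-output design, the following PyTorch sketch shows one forward pass returning both a fake probability and a textual explanation. The class name, constructor arguments, feature dimension, and the `llm.generate` call are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class M2F2DetSketch(nn.Module):
    """Hypothetical wrapper: one forward pass yields both a fake probability
    and a textual explanation (names and shapes are assumptions)."""

    def __init__(self, clip_model, deepfake_encoder, llm, feat_dim=768):
        super().__init__()
        self.clip = clip_model                    # frozen pre-trained CLIP
        self.deepfake_encoder = deepfake_encoder  # trainable forgery branch
        self.head = nn.Linear(feat_dim, 1)        # binary detection head
        self.llm = llm                            # language model for explanations

    @torch.no_grad()
    def forward(self, image):
        clip_feat = self.clip.encode_image(image)            # global visual embedding
        forgery_feat = self.deepfake_encoder(image)          # learned forgery representation
        prob_fake = torch.sigmoid(self.head(forgery_feat))   # detection probability
        # Hypothetical call: condition the LLM on both feature streams.
        explanation = self.llm.generate(visual_tokens=clip_feat,
                                        forgery_tokens=forgery_feat)
        return prob_fake, explanation
```

With such a wrapper, `prob_fake, text = model(face_image)` would return the detection score and its accompanying explanation from a single forward pass, in contrast to the single-output designs in (a) and (b).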

Method


(a) The multi-modal face forgery detector (M2F2-Det) contains the pre-trained CLIP image and text encoders (i.e., \( \mathcal{E}_{I} \) and \( \mathcal{E}_{T} \)), the deepfake encoder, and the LLM. Taking universal forgery prompts (UF-prompts) as inputs, \( \mathcal{E}_{T} \) generates the global text embedding \( g_{T} \), which helps obtain the forged attention mask \( M_{b} \). The deepfake encoder utilizes the bridge adapter \( \mathcal{E}_{A} \) for detecting face forgeries, and the LLM generates descriptions based on the outputs of \( \mathcal{E}_{I} \) and the learned forgery representation \( F_{0} \). (b) In the pre-trained CLIP text encoder, we introduce trainable layer-wise forgery tokens as inputs to each Transformer encoder layer.
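
To make the caption concrete, the sketch below illustrates, under stated assumptions, the two mechanisms it names: trainable layer-wise forgery tokens prepended to each (frozen) CLIP text-encoder layer, and a forged attention map obtained by comparing the resulting global text embedding \( g_{T} \) with image patch embeddings. Tensor shapes, the number of tokens, the batch-first layer interface, and the softmax normalization are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def encode_with_layerwise_forgery_tokens(text_layers, x, forgery_tokens):
    """text_layers:    frozen CLIP text Transformer layers (batch-first, assumed)
    x:              (B, L, D) embedded universal forgery prompt (UF-prompt)
    forgery_tokens: (n_layers, n_tok, D) trainable layer-wise forgery tokens"""
    B = x.size(0)
    for i, layer in enumerate(text_layers):
        lf = forgery_tokens[i].unsqueeze(0).expand(B, -1, -1)  # this layer's tokens
        x = layer(torch.cat([lf, x], dim=1))   # prepend before the layer
        x = x[:, lf.size(1):]                  # drop them before the next layer
    return x[:, -1]                            # global text embedding g_T (EOT position)

def forged_attention(patch_embeddings, g_T):
    """patch_embeddings: (B, N, D) CLIP image patch features; g_T: (B, D).
    Per-patch cosine similarity with g_T, normalized into an attention map M_b."""
    sim = F.cosine_similarity(patch_embeddings, g_T.unsqueeze(1), dim=-1)  # (B, N)
    return sim.softmax(dim=-1)

# Only these tokens are trainable; the CLIP text encoder weights stay fixed.
forgery_tokens = nn.Parameter(torch.randn(12, 4, 512) * 0.02)
```

Under this reading, only the forgery tokens (together with the bridge adapter and detection head) would be optimized, while the pre-trained CLIP encoders remain frozen, consistent with the frozen-encoder design shown in the figure.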

Experiments


Forged attention maps on samples from 6 datasets.


Additional generated forged attention maps. [Key: LF: layer-wise forgery tokens.]