This repository contains my Master's thesis research at Tokyo Institute of Technology (Tokyo Tech) on multimodal video retrieval and emotion recognition. It collects the proposed model and the baseline models used in the thesis, together with their evaluation results and a comparative analysis.
Graduation/
├── ADEPT/ # Our proposed model - Adaptive Dialogue-Enhanced Parameter Tuning
├── MERLIN/ # MERLIN baseline - Multimodal Embedding Refinement via LLM-based Iterative Navigation
├── Clip4Clip/ # Clip4Clip baseline for video-text retrieval
├── Emotion-LLaMA/ # Emotion-LLaMA for multimodal emotion recognition
├── IVR-QA-baselines/ # Interactive Video Retrieval with Questions and Answers baseline
├── llava_next_video_deepseek/ # LLaVA-NeXT video understanding model
└── output_analysis/ # Comprehensive analysis of all model outputs
Paper: Adaptive Dialogue-Enhanced Parameter Tuning for Multimodal Video Retrieval
Description: Our proposed model, which extends MERLIN with adaptive parameter tuning. It uses entropy analysis to choose automatically between the ASK and REFINE actions in affective video retrieval tasks.
Key Features:
- Adaptive parameter tuning for the optimal (m, α, β) combination
- Entropy-based strategy selection (ASK vs REFINE)
- Specialized optimization for affective video datasets (MAFW, MER2024)
- Two-phase approach: parameter tuning and best parameter testing
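The entropy-based strategy selection listed above can be illustrated with a minimal sketch. It assumes the retriever returns similarity scores for the candidates, turns the top-m scores into a softmax distribution, and picks ASK when the normalized entropy is high (the candidates are hard to tell apart) and REFINE otherwise. The function names, the temperature, the threshold, and the direction of the rule are all assumptions made for illustration; the decision rule actually used in ADEPT, including the exact roles of m, α, and β, is defined by the ADEPT code and not by this snippet.

```python
import math
from typing import List


def softmax(scores: List[float], temperature: float = 1.0) -> List[float]:
    """Convert raw similarity scores into a probability distribution."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs: List[float]) -> float:
    """Shannon entropy divided by log(n), so the result lies in [0, 1]."""
    if len(probs) < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


def choose_action(scores: List[float], m: int, threshold: float) -> str:
    """Hypothetical rule: ASK when the top-m candidates look ambiguous."""
    top_m = sorted(scores, reverse=True)[:m]
    entropy = normalized_entropy(softmax(top_m))
    return "ASK" if entropy >= threshold else "REFINE"
```

With equal scores the distribution is uniform and the sketch returns ASK; with one dominant score the entropy drops and it returns REFINE.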
Best Parameters:
- MAFW: m=12, α=0.0075, β=0.062
- MER2024: m=4, α=0.006, β=0.007
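The two-phase approach (tune first, then test with the best setting) can be sketched as a plain grid search over (m, α, β). This is only an illustration under stated assumptions: `evaluate` is a hypothetical callable standing in for a retrieval run that returns a score such as Recall@1, and the grids and the toy objective below are made up. The search actually implemented in the ADEPT tuning code may work differently.

```python
from itertools import product
from typing import Callable, Dict, Iterable, Tuple

Params = Tuple[int, float, float]  # (m, alpha, beta)


def grid_search(
    evaluate: Callable[[int, float, float], float],
    m_values: Iterable[int],
    alpha_values: Iterable[float],
    beta_values: Iterable[float],
) -> Tuple[Params, float, Dict[Params, float]]:
    """Phase 1: score every (m, alpha, beta) combination on a tuning split."""
    results: Dict[Params, float] = {}
    for m, alpha, beta in product(m_values, alpha_values, beta_values):
        results[(m, alpha, beta)] = evaluate(m, alpha, beta)
    best = max(results, key=results.get)
    return best, results[best], results


def toy_score(m: int, alpha: float, beta: float) -> float:
    """Made-up objective for the example only; it peaks at (8, 0.01, 0.05)."""
    return -abs(m - 8) - abs(alpha - 0.01) - abs(beta - 0.05)


best_params, best_score, all_scores = grid_search(
    toy_score, [4, 8, 12], [0.005, 0.01], [0.05, 0.1]
)
# Phase 2 would evaluate the held-out test split once, using best_params.
```

Keeping the full `all_scores` dictionary makes it easy to inspect how sensitive the score is to each parameter, not just which combination won.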
Paper: Multimodal Embedding Refinement via LLM-based Iterative Navigation
Description: The original MERLIN framework on which ADEPT is based. It uses LLM-based iterative navigation in a text-video retrieve-and-rerank pipeline.
Components:
- Questioner: Generates questions about video content
- Reranker: Reorders candidate videos using Google Vertex AI
- Answerer: Simulates human agent interactions
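The way these three components interact can be sketched as a simple loop. The skeleton below is an assumption about the control flow, not MERLIN's actual code: `questioner`, `answerer`, and `reranker` are hypothetical callables supplied by the caller, and the real implementation (including its Vertex AI calls) is not reproduced here.

```python
from typing import Callable, List, Sequence, Tuple

History = List[Tuple[str, str]]  # (question, answer) pairs


def interactive_retrieval(
    query: str,
    candidates: Sequence[str],
    questioner: Callable[[str, History, List[str]], str],
    answerer: Callable[[str], str],
    reranker: Callable[[str, History, Sequence[str]], List[str]],
    num_rounds: int,
) -> Tuple[List[str], History]:
    """Alternate question, answer, and rerank steps for a fixed number of rounds."""
    history: History = []
    # Initial ranking from the query alone.
    ranking = reranker(query, history, candidates)
    for _ in range(num_rounds):
        question = questioner(query, history, ranking)
        answer = answerer(question)  # simulated user reply
        history.append((question, answer))
        # Rerank again with the dialogue collected so far.
        ranking = reranker(query, history, candidates)
    return ranking, history
```

Passing the components in as callables keeps the loop independent of any particular LLM or embedding backend, which also makes it straightforward to test with stubs.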
Paper: CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval
Description: A baseline model for video-text retrieval built on the CLIP architecture. It extends CLIP to video by encoding multiple frames per clip.
Implementation:
- mafw.py: MAFW dataset evaluation
- mer2024.py: MER2024 dataset evaluation
- Results stored in metrics_*.txt files
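As an illustration of how per-frame features can be combined into one video representation, the sketch below mean-pools frame embeddings and scores the result against a text embedding with cosine similarity. Mean pooling is the parameter-free similarity variant described in the CLIP4Clip paper. The helper names are hypothetical and are not taken from mafw.py or mer2024.py, and the short hand-written vectors stand in for real CLIP outputs.

```python
import math
from typing import List, Sequence

Vector = List[float]


def l2_normalize(vec: Sequence[float]) -> Vector:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def mean_pool(frames: Sequence[Sequence[float]]) -> Vector:
    """Average L2-normalized frame embeddings into one video embedding."""
    if not frames:
        raise ValueError("at least one frame embedding is required")
    normalized = [l2_normalize(f) for f in frames]
    dim = len(normalized[0])
    pooled = [sum(f[i] for f in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(pooled)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


# Two toy frame embeddings and one toy text embedding.
video_embedding = mean_pool([[1.0, 0.0], [0.0, 1.0]])
score = cosine_similarity(video_embedding, [1.0, 1.0])
```

Ranking every video by this score for a given text query yields the retrieval list from which Recall@K is computed.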
Paper: Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning
Description: A multimodal emotion recognition model that integrates audio, visual, and textual inputs through emotion-specific encoders.
Key Features:
- MERR dataset with 28,618 coarse-grained and 4,487 fine-grained samples
- Emotion-specific encoders for audio, visual, and textual inputs
- Instruction tuning with modified LLaMA model
- Top performance on EMER, MER2023, and DFEW datasets
Performance:
- Clue Overlap: 7.83, Label Overlap: 6.25 on EMER
- F1 score: 0.9036 on MER2023 challenge
- UAR: 45.59, WAR: 59.37 on DFEW dataset
Paper: Simple Baselines for Interactive Video Retrieval with Questions and Answers
Description: An ICCV 2023 paper presenting simple yet effective baselines for interactive video retrieval via question answering.
Key Features:
- Interactive video retrieval through question-answering
- VideoQA model to simulate user interactions
- Simple but effective baselines for interactive retrieval
- Evaluated on MSR-VTT, MSVD, and AVSD datasets
Methods:
- Heuristic approach
- Auto-text generation
- Auto-text-video combination
Repository: LLaVA-NeXT
Description: A video understanding model based on the LLaVA-NeXT architecture, used here for multimodal video analysis.
Implementation:
- deepseek_interaction.py: DeepSeek-based video interaction
- ichise_interaction.py: Ichise-based video interaction
- llava_next_video_caption.py: Video caption generation
The output_analysis/ directory contains comprehensive evaluation results and comparative analysis:
output_analysis/
├── analysis.ipynb # Main analysis notebook
├── MAFW/ # MAFW dataset results
├── MER2024/ # MER2024 dataset results
└── entropy_analysis_outputs/ # Entropy analysis results
- Recall@1/5/10: Retrieval accuracy at different ranks
- Entropy Analysis: Inter-cluster and intra-cluster entropy
- Parameter Optimization: Best parameter combinations for each model
- Comparative Performance: Cross-model performance comparison
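For reference, Recall@K is the fraction of queries whose ground-truth video appears within the top K retrieved results. The snippet below is a minimal sketch of that definition; the function name and the input format (one 1-based rank per query) are assumptions and need not match the code in analysis.ipynb.

```python
from typing import Dict, Iterable, List


def recall_at_k(ranks: List[int], ks: Iterable[int] = (1, 5, 10)) -> Dict[int, float]:
    """ranks[i] is the 1-based rank of the ground-truth video for query i."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    return {k: sum(r <= k for r in ranks) / len(ranks) for k in ks}


# Four toy queries whose correct videos were ranked 1st, 3rd, 7th, and 12th.
scores = recall_at_k([1, 3, 7, 12])
# scores == {1: 0.25, 5: 0.5, 10: 0.75}
```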
MAFW:
- Type: Affective video dataset
- Focus: Emotional video content analysis
- Usage: Primary dataset for affective video retrieval evaluation
MER2024:
- Type: Multimodal emotion recognition dataset
- Focus: Emotion recognition across multiple modalities
- Usage: Secondary dataset for comprehensive evaluation
- Python 3.8+
- PyTorch
- Transformers
- Google Cloud Vertex AI API
- OpenAI API
ADEPT Model:
cd ADEPT
python run_adept.py --dataset mafw --data_path /path/to/data --num_rounds 5
Parameter Tuning:
cd ADEPT/tuning
python parameter_tuning.py --dataset mafw --data_path /path/to/data
Baseline Models:
- Clip4Clip: Run mafw.py or mer2024.py
- Emotion-LLaMA: Follow the original repository setup
- IVR-QA: Use eval_interactive.py with appropriate configs
- LLaVA-NeXT: Use the provided interaction scripts
- ADEPT: Adaptive parameter tuning with entropy analysis
- MERLIN: Original framework without parameter optimization
- Clip4Clip: Standard video-text retrieval baseline
- Emotion-LLaMA: Specialized for emotion recognition
- IVR-QA: Interactive retrieval with question-answering
- LLaVA-NeXT: Advanced multimodal video understanding
- ADEPT shows improved performance through adaptive parameter tuning
- Entropy analysis helps in better strategy selection
- Affective video datasets benefit from specialized optimization
- Interactive approaches generally outperform single-shot retrieval
- Adaptive Parameter Tuning: Novel approach to automatically find optimal parameters for different datasets
- Entropy Analysis: Systematic analysis of cluster entropy for strategy selection
- Affective Video Retrieval: Specialized optimization for emotional content
- Comprehensive Evaluation: Multi-model comparison on affective video datasets
If you use this code or find it helpful, please cite the relevant papers:
@article{merlin2024,
title={Multimodal Embedding Refinement via LLM-based Iterative Navigation},
author={Original Authors},
journal={arXiv preprint},
year={2024}
}
@article{clip4clip2021,
title={CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval},
author={Original Authors},
journal={arXiv preprint},
year={2021}
}
@article{emotionllama2024,
title={Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning},
author={Original Authors},
journal={arXiv preprint},
year={2024}
}
@inproceedings{ivrqa2023,
title={Simple Baselines for Interactive Video Retrieval with Questions and Answers},
author={Liang, Kaiqu and Albanie, Samuel},
booktitle={ICCV},
year={2023}
}