Benchmark

VideoNorms Explained: AI That Measures Cultural Understanding

Featured paper: This post covers the paper "VideoNorms: Benchmarking Cultural Awareness of Video Language Models". In a nutshell: an explanation of the VideoNorms paper. A...

2025.10.11

Paper Summaries, IT & Programming

Featured paper: This post covers the paper "ArenaBencher: Automatic Benchmark Evolution via Multi-Model Competitive Evaluation". In a nutshell: Are...

2025.10.10

Paper Summaries, IT & Programming

Featured paper: This post covers the paper "AudioMarathon: A Comprehensive Benchmark for Long-Context Audio Understanding and Efficiency in Audio LLMs"...

2025.10.10

Paper Summaries, IT & Programming

Featured paper: This post covers the paper "Agent Bain vs. Agent McKinsey: A New Text-to-SQL Benchmark for the Business Domain". In a nutshell: business...

2025.10.09

Paper Summaries, IT & Programming

Featured paper: This post covers the paper "InfoMosaic-Bench: Evaluating Multi-Source Information Seeking in Tool-Augmented Agents". In a nutshell...

2025.10.05

Paper Summaries, IT & Programming

Featured paper: This post covers the paper "Deconstructing Self-Bias in LLM-generated Translation Benchmarks". In a nutshell: when LLMs automatically build translation benchmarks, the self-...

2025.10.03

Paper Summaries, IT & Programming

Featured paper: This post covers the paper "VoiceAssistant-Eval: Benchmarking AI Assistants across Listening, Speaking, and Viewing". In a nutshell...

2025.09.29

Paper Summaries, IT & Programming

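Featured paper: This post covers the paper "DRES: Benchmarking LLMs for Disfluency Removal". In a nutshell: "disfluencies" that get in the way of conversational understanding. With the DRES benchmark, LLM removal performance is thoroughly eval...

2025.09.26

Paper Summaries, IT & Programming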

Featured paper: This post covers the paper "DRISHTIKON: A Multimodal Multilingual Benchmark for Testing Language Models' Understanding on Indian Cultur...

2025.09.24

Paper Summaries, IT & Programming

Featured paper: This post covers the paper "An Evaluation-Centric Paradigm for Scientific Visualization Agents". In a nutshell: an evaluation paradigm for scientific visualization agents...

2025.09.21

Paper Summaries, IT & Programming