Benchmarks

RoParQ Explained: Overcoming LLM Weaknesses and Improving Accuracy

Featured paper: This post covers the paper "RoParQ: Paraphrase-Aware Alignment of Large Language Models Towards Robustness to Paraphrased Questions"...

2025.11.29

Paper Summaries / IT & Programming

Featured paper: This post covers the paper "On Evaluating LLM Alignment by Evaluating LLMs as Judges". In a nutshell: ALIGNEVAL, an approach that revolutionizes how LLMs are evaluated. LL...

2025.11.26

Paper Summaries / IT & Programming

Featured paper: This post covers the paper "Think Visually, Reason Textually: Vision-Language Synergy in ARC". In a nutshell: on the ARC-AGI benchmark, visual information and ...

2025.11.20

Paper Summaries / IT & Programming

Featured paper: This post covers the paper "Agent-Omni: Test-Time Multimodal Reasoning via Model Coordination for Understanding Anything". In a nutshell...

2025.11.06

Paper Summaries / IT & Programming

Featured paper: This post covers the paper "Oolong: Evaluating Long Context Reasoning and Aggregation Capabilities". In a nutshell: an in-depth walkthrough of the Oolong paper. ...

2025.11.06

Paper Summaries / IT & Programming

Featured paper: This post covers the paper "Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark". In a nutshell...

2025.10.31

Paper Summaries / IT & Programming

Featured paper: This post covers the paper "AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite". In a nutshell: AstaBen...

2025.10.27

Paper Summaries / IT & Programming

Featured paper: This post covers the paper "DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation". In a nutshell: Dia...

2025.10.19

Paper Summaries / IT & Programming

Featured paper: This post covers the paper "Generative Universal Verifier as Multimodal Meta-Reasoner". In a nutshell: even Google Gemini 2.5 Pro struggles with Vi...

2025.10.16

Paper Summaries / IT & Programming

Featured paper: This post covers the paper "When Agents Trade: Live Multi-Market Trading Benchmark for LLM Agents". In a nutshell: LLM agents in financial markets ...

2025.10.15

Paper Summaries / IT & Programming