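BLEU and SPICE: Automatic Evaluation Methods for Machine Translation and Image Captioning Explained

In machine translation and image captioning, evaluating model performance is an important and difficult problem. Judging the quality of machine-generated text at scale calls for a method that is automatic, fast, and reliable. BLEU (Bilingual Evaluation Understudy) and SPICE (Semantic Propositional Image Caption Evaluation) are two widely used metrics of this kind. This article explains how each one works and where each falls short.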

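BLEU: A Machine Translation Evaluation Metric

1. Basic Principle of BLEU

BLEU is an automatic machine translation metric that scores translation quality by counting n-gram matches between a candidate translation and a reference translation. An n-gram is a sequence of n adjacent words in a sentence. For the sentence "it is a nice day today", for example:

1-gram: {it, is, a, nice, day, today}
2-gram: {it is, is a, a nice, nice day, day today}
3-gram: {it is a, is a nice, a nice day, nice day today}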
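
The n-gram sets above can be reproduced with a few lines of Python. This is a minimal sketch: the helper name `ngrams` and the whitespace tokenization are assumptions made for illustration, not part of any particular BLEU implementation.

```python
def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list, in order, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Whitespace tokenization is a simplification chosen for this example.
sentence = "it is a nice day today".split()

for n in (1, 2, 3):
    print(f"{n}-gram:", [" ".join(gram) for gram in ngrams(sentence, n)])
```

A sentence with fewer than n tokens simply yields an empty list, which matters later when higher-order precisions are computed for short candidates.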

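2. Computing n-gram Similarity

The n-gram similarity (precision) p_n is the number of candidate n-grams that also appear in the reference, divided by the total number of candidate n-grams.

Suppose the candidate sentence is "It is a nice day today" and the reference sentence is "Today is a nice day". After lowercasing, the matches are as follows:

1-gram similarity: {it, is, a, nice, day, today} vs {today, is, a, nice, day}; the matches are {is, a, nice, day, today}, so p1 = 5/6 = 0.83.
2-gram similarity: {it is, is a, a nice, nice day, day today} vs {today is, is a, a nice, nice day}; the matches are {is a, a nice, nice day}, so p2 = 3/5 = 0.6.
3-gram similarity: {it is a, is a nice, a nice day, nice day today} vs {today is a, is a nice, a nice day}; the matches are {is a nice, a nice day}, so p3 = 2/4 = 0.5.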
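
The three values above can be checked mechanically. The sketch below is illustrative only: `ngram_precision` is a hypothetical helper, tokens are lowercased and split on whitespace, and repeated n-grams are not yet clipped.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference (no clipping)."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    matched = sum(count for gram, count in cand_counts.items() if gram in ref_counts)
    total = sum(cand_counts.values())
    return matched / total if total else 0.0


candidate = "It is a nice day today".lower().split()
reference = "Today is a nice day".lower().split()

for n in (1, 2, 3):
    print(f"p{n} =", ngram_precision(candidate, reference, n))
```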

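3. Handling Special Cases

Case 1: Repeated words

Consider the candidate "the the the the" and the reference "The cat is standing on the ground". Without any restriction, every candidate word appears in the reference, so the 1-gram similarity p1 would equal 1. To prevent this, word counts are clipped: the count credited to a word is the minimum of its count in the candidate and its count in the reference. With case-insensitive matching, as in the example above, "the" occurs twice in the reference, so p1 = 2/4 = 0.5. With case-sensitive matching only the lowercase "the" would count, giving p1 = 1/4.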
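
To make the effect of clipping concrete, here is a small sketch that computes the 1-gram precision with and without it. The function names are hypothetical, and the text is lowercased first, so "The" and "the" are treated as the same word; that lowercasing is an assumption of this example.

```python
from collections import Counter


def unclipped_unigram_precision(candidate, reference):
    """Every candidate word found anywhere in the reference counts as a match."""
    ref_words = set(reference)
    return sum(1 for word in candidate if word in ref_words) / len(candidate)


def clipped_unigram_precision(candidate, reference):
    """Each word is credited at most as many times as it occurs in the reference."""
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    clipped = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    return clipped / len(candidate)


candidate = "the the the the".split()
reference = "The cat is standing on the ground".lower().split()

print("unclipped:", unclipped_unigram_precision(candidate, reference))
print("clipped:  ", clipped_unigram_precision(candidate, reference))
```

After lowercasing, "the" occurs twice in the reference, so the clipped count is 2 out of 4 candidate words. If the reference were left in its original case, only one "the" would match and the clipped count would be 1.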

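Case 2: Brevity penalty

For the candidate "a nice day" and the reference "Today is a nice day", every candidate n-gram appears in the reference, so the precisions are high even though the candidate omits part of the reference. Short candidates are favored in this way. To counter this, BLEU introduces a brevity penalty (BP), defined as:

$\mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - r/c)} & \text{if } c \leq r \end{cases}$

Here c is the length of the candidate translation and r is the length of the reference translation.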
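
The penalty is easy to evaluate directly. The sketch below follows the piecewise definition above. The function name `brevity_penalty` and the handling of an empty candidate (returning 0) are choices made for this example, not something the definition specifies.

```python
import math


def brevity_penalty(c, r):
    """Brevity penalty for candidate length c and reference length r."""
    if c == 0:
        return 0.0  # guard chosen for this sketch; avoids dividing by zero
    if c > r:
        return 1.0
    return math.exp(1 - r / c)


# "a nice day" (c = 3) against "today is a nice day" (r = 5)
print(brevity_penalty(3, 5))
# "it is a nice day today" (c = 6) against the same reference: no penalty
print(brevity_penalty(6, 5))
```

For the three-word candidate the penalty is e^(-2/3), roughly 0.51, so its otherwise perfect precisions are scaled down by about half.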

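4. The Final BLEU Formula

To balance the contributions of the different n-gram orders, N = 4 is typically used, and the logarithms of the precisions are combined in a weighted sum with uniform weights $w_n = 1/N$, which amounts to taking the geometric mean of $p_1, \dots, p_N$. The final BLEU formula is:

$\mathrm{BLEU} = \mathrm{BP} \cdot \exp \left( \sum_{n=1}^{N} w_n \log p_n \right)$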
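
Putting the pieces together, the following is a minimal single-reference sketch of the formula with uniform weights. It is meant to illustrate the computation, not to replace a library implementation: the function names are hypothetical, no smoothing is applied, and the score is defined as 0 whenever any precision is 0 (since log 0 is undefined).

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of the candidate against one reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0


def bleu(candidate, reference, max_n=4):
    """BP * exp(sum_n w_n log p_n) with uniform weights w_n = 1 / max_n."""
    c, r = len(candidate), len(reference)
    if c == 0:
        return 0.0
    bp = 1.0 if c > r else math.exp(1 - r / c)
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # a zero precision makes the geometric mean zero
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return bp * math.exp(log_mean)


candidate = "it is a nice day today".split()
reference = "today is a nice day".split()
print("BLEU:", bleu(candidate, reference))
```

For this pair the precisions are 5/6, 3/5, 2/4 and 1/3, and the candidate is longer than the reference so BP = 1. The score is therefore the fourth root of 1/12, about 0.537. The short candidate "a nice day" contains no 4-grams at all, so this unsmoothed sketch scores it 0.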

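5. BLEU Code Example

The following Python example computes a BLEU score with NLTK:

from nltk.translate.bleu_score import sentence_bleu

# sentence_bleu takes a list of tokenized references and one tokenized candidate.
reference = [['today', 'is', 'a', 'nice', 'day']]
candidate = ['it', 'is', 'a', 'nice', 'day', 'today']
# The default weights are uniform over 1-gram to 4-gram precisions.
score = sentence_bleu(reference, candidate)
print('BLEU score:', score)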

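SPICE: An Image Captioning Evaluation Metric

1. Basic Principle of SPICE

BLEU performs poorly in some situations. For example, "A shiny metal pot filled with some diced veggies." and "The pan on the stove has chopped vegetables in it." have similar meanings, yet they share almost no n-grams, so the BLEU score can be low. To address this weakness, SPICE evaluates image captions by comparing their semantic propositional content.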

2. Scene Graphs

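SPICE evaluates a caption by comparing the scene graph of the candidate caption with the scene graphs of the reference captions. A scene graph encodes the objects in a scene, the attributes of those objects, and the relations between objects, as illustrated in Figure 1.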

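3. Computing the F-score

Let c be the candidate caption and S = {s1, s2, …, sn} the set of reference captions. G() denotes the scene graph of a caption (or set of captions), and T() denotes the logical tuples extracted from a scene graph. For example, the logical tuples for Figure 1 are: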

{(girl), (court), (girl, young), (girl, standing), (court, tennis), (girl, on-top-of, court)}
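
One way to picture T() is as the union of three kinds of tuples: one-element tuples for objects, two-element tuples for attributes, and three-element tuples for relations. The sketch below builds the graph by hand and is purely illustrative: the `SceneGraph` class and `logical_tuples` function are hypothetical names, and an actual SPICE implementation derives the graph from the caption text with a parser rather than taking it as input.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Hand-built scene graph: objects, (object, attribute) and (subject, relation, object)."""
    objects: set = field(default_factory=set)
    attributes: set = field(default_factory=set)
    relations: set = field(default_factory=set)


def logical_tuples(graph):
    """T(G): the union of object, attribute, and relation tuples."""
    object_tuples = {(obj,) for obj in graph.objects}
    return object_tuples | graph.attributes | graph.relations


graph = SceneGraph(
    objects={"girl", "court"},
    attributes={("girl", "young"), ("girl", "standing"), ("court", "tennis")},
    relations={("girl", "on-top-of", "court")},
)

for tup in sorted(logical_tuples(graph), key=lambda t: (len(t), t)):
    print(tup)
```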

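With precision P and recall R defined over these tuples, SPICE is computed as:

$P(c, S) = \frac{|T(G(c)) \otimes T(G(S))|}{|T(G(c))|}$
$R(c, S) = \frac{|T(G(c)) \otimes T(G(S))|}{|T(G(S))|}$
$\mathrm{SPICE}(c, S) = F_1(c, S) = \frac{2 \, P(c, S) \, R(c, S)}{P(c, S) + R(c, S)}$

Here $\otimes$ is the matching operator that returns the matching tuples of two scene graphs. Matches are found using WordNet synonym lookup, so tuples need not be string-identical to count as a match.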
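
The precision, recall, and F-score computation can be sketched on tuple sets directly. Note the simplification: this example matches tuples by exact equality (plain set intersection) and omits the WordNet synonym lookup entirely. The function name `spice_f1` and the candidate tuples are hypothetical and chosen only for illustration.

```python
def spice_f1(candidate_tuples, reference_tuples):
    """Return (precision, recall, F1) over two tuple sets, using exact matching only."""
    matched = candidate_tuples & reference_tuples
    if not matched:
        return 0.0, 0.0, 0.0
    precision = len(matched) / len(candidate_tuples)
    recall = len(matched) / len(reference_tuples)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Reference tuples taken from the example above.
reference_tuples = {
    ("girl",), ("court",),
    ("girl", "young"), ("girl", "standing"), ("court", "tennis"),
    ("girl", "on-top-of", "court"),
}
# Hypothetical candidate tuples, invented for this example.
candidate_tuples = {
    ("girl",), ("racket",),
    ("girl", "young"), ("girl", "holding", "racket"),
}

precision, recall, f1 = spice_f1(candidate_tuples, reference_tuples)
print(f"P = {precision:.3f}, R = {recall:.3f}, F1 = {f1:.3f}")
```

Two of the four candidate tuples match, giving P = 2/4 and R = 2/6, hence an F-score of 0.4. With synonym matching, a candidate tuple such as (woman) could additionally match (girl) if the lexical resource treats them as synonyms, which would raise the score.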

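4. SPICE Code Example

The following Python example shows how a SPICE score might be computed. It is illustrative only: the `spice` module and its `spice_score` function are assumed to be a locally available wrapper, not a standard published package. The reference implementation of SPICE is a Java program.

# Assumed local wrapper around a SPICE implementation; not a standard package.
from spice import spice_score

reference = "The pan on the stove has chopped vegetables in it."
candidate = "A shiny metal pot filled with some diced veggies."
score = spice_score(reference, candidate)
print('SPICE score:', score)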

References

BLEU: a method for automatic evaluation of machine translation.
SPICE: Semantic Propositional Image Caption Evaluation.
Meteor Universal: Language Specific Translation Evaluation for Any Target Language.