Empirical Analysis of Unlabeled Entity Problem in Named Entity Recognition

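Entity annotation is labor-intensive, and annotation quality often falls short of expectations. The "unlabeled entity problem" is one common failure mode: annotators do not mark every entity that appears in the text. "Empirical Analysis of Unlabeled Entity Problem in Named Entity Recognition", published at ICLR 2021, uses negative sampling to effectively reduce the impact of unlabeled entities on model training.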

two causes of the performance degradation

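Unlabeled entities hurt model training in two main ways:

They reduce the number of annotated entities, so there are fewer positive examples to learn from.
The model treats unlabeled entities as negative examples during training, so it learns from incorrect supervision.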

So which of the two is the main problem to solve?

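The authors take two datasets, CoNLL-2003 and OntoNotes 5.0, and mask out annotated entities at different ratios to create synthetic datasets with varying degrees of the unlabeled entity problem.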
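The masking step can be sketched as follows. This is a minimal illustration, assuming each annotated entity is dropped independently with probability `p`; the function name `mask_entities`, the `(start, end, label)` tuple layout, and the example entities are hypothetical and not taken from the paper.

```python
import random


def mask_entities(entities, p, seed=0):
    """Drop each annotated entity independently with probability p.

    entities: list of (start, end, label) tuples for one sentence.
    Returns the entities that remain annotated after masking.
    """
    rng = random.Random(seed)
    # rng.random() is in [0, 1), so p=0 keeps everything and p=1 drops everything.
    return [e for e in entities if rng.random() >= p]


# Hypothetical gold annotation for one sentence.
gold = [(0, 1, "PER"), (4, 4, "LOC"), (7, 9, "ORG")]
fully_labeled = mask_entities(gold, 0.0)
fully_masked = mask_entities(gold, 1.0)
partially_masked = mask_entities(gold, 0.5)
```

The dropped entities stay in the text but lose their labels, which is what turns them into unlabeled entities for training.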

notation:
We denote an input sentence as $x = [x_1, x_2, \cdots, x_n]$ and the annotated named entity set as $y = \{y_1, y_2, \cdots , y_m\}$. $n$ is the sentence length and $m$ is the number of entities. Each member $y_k$ of set $y$ is a tuple $(i_k,j_k,l_k)$. $(i_k,j_k)$ is the span of an entity which corresponds to the phrase $x_{i_k,j_k} =[x_{i_k},x_{i_k+1}, \cdots ,x_{j_k}]$ and $l_k$ is its label.

$$
[h_1,h_2, \cdots ,h_n] = \operatorname{BERT}(x)
$$

$$
q_i = \operatorname{Softmax}(Wh_i)
$$

loss:
$$
\sum_{i=1}^n -\log q_i[z_i]
$$
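where $z = [z_1, z_2, \cdots, z_n]$ is the gold tag sequence, with one tag per token, and $q_i[z_i]$ is the probability the model assigns to the gold tag of token $x_i$.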
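The tagging loss above can be sketched in plain Python. This is only an illustration: `logits[i]` stands in for $Wh_i$ (a real model would compute it from BERT outputs), and the tag indices and scores are made-up values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def tagging_loss(logits, tags):
    """Sum over tokens of -log q_i[z_i].

    logits[i]: scores W h_i for token i, one score per tag.
    tags[i]:   index z_i of the gold tag for token i.
    """
    return sum(-math.log(softmax(row)[z]) for row, z in zip(logits, tags))


# Two tokens, three possible tags (hypothetical numbers).
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
tags = [0, 2]
loss = tagging_loss(logits, tags)
```

When an entity is unlabeled, its tokens carry the "outside" tag in `tags`, so this loss actively pushes the model away from the correct prediction.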

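To separate and compare the two effects, the authors redefine the training loss on the synthetic datasets so that the masked entities no longer act as negative examples.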

Two quantities are then defined, the erosion rate $\alpha_p$ and the misguidance rate $\beta_p$, in terms of the following F1 scores:

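$f_p^a$: F1 of the model trained with the modified loss at masking ratio $p$
$f_0^a$: F1 of the model trained with the modified loss at $p=0$ (the dataset with all entities annotated)
$f_p$: F1 of the model trained with the original loss at masking ratio $p$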

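$\alpha_p$ measures the effect of having fewer annotated entities; $\beta_p$ measures the effect of treating unlabeled entities as negative examples.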
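One way to read the two rates is as relative F1 drops. The sketch below assumes $\alpha_p = (f_0^a - f_p^a)/f_0^a$ and $\beta_p = (f_p^a - f_p)/f_p^a$. Both these forms and the F1 values used are illustrative assumptions, not figures from the paper, and should be checked against the original definitions.

```python
def erosion_rate(f0_adj, fp_adj):
    """Relative F1 drop caused only by having fewer annotated entities.

    f0_adj: F1 with the modified loss at p = 0   (f_0^a)
    fp_adj: F1 with the modified loss at ratio p (f_p^a)
    """
    return (f0_adj - fp_adj) / f0_adj


def misguidance_rate(fp_adj, fp):
    """Relative F1 drop caused by treating unlabeled entities as negatives.

    fp_adj: F1 with the modified loss at ratio p (f_p^a)
    fp:     F1 with the original loss at ratio p (f_p)
    """
    return (fp_adj - fp) / fp_adj


# Hypothetical F1 scores for a single masking ratio p.
alpha = erosion_rate(0.90, 0.88)
beta = misguidance_rate(0.88, 0.66)
```

With these made-up scores `beta` is far larger than `alpha`, which is the pattern the comparison of the two rates is meant to expose.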

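Comparing the second and third columns of the results, $\beta_p$ clearly grows faster, which indicates that treating unlabeled entities as negative examples is the more damaging effect.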

METHODOLOGY

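BERT is used as the encoder. Each span $(i, j)$ gets a representation $\mathbf{s}_{i,j}$ built from the encoder outputs $\mathbf{h}_i$ and $\mathbf{h}_j$, and a label distribution $\mathbf{o}_{i,j}$:

$$ \mathbf{s}_{i, j}=\mathbf{h}_{i} \oplus \mathbf{h}_{j} \oplus\left(\mathbf{h}_{i}-\mathbf{h}_{j}\right) \oplus\left(\mathbf{h}_{i} \odot \mathbf{h}_{j}\right) $$

$$ \mathbf{o}_{i, j}=\operatorname{Softmax}\left(\mathbf{U} \tanh \left(\mathbf{V} \mathbf{s}_{i, j}\right)\right) $$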
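The span representation is a concatenation of four vectors, which can be sketched with plain lists. This is a toy illustration: the two-dimensional vectors are made up, and a real implementation would use tensors from the BERT encoder.

```python
def span_representation(h_i, h_j):
    """Concatenate h_i, h_j, (h_i - h_j), and the element-wise product h_i * h_j."""
    diff = [a - b for a, b in zip(h_i, h_j)]
    prod = [a * b for a, b in zip(h_i, h_j)]
    return h_i + h_j + diff + prod


# Hypothetical 2-dimensional encoder outputs for the span boundaries.
h_start = [1.0, 2.0]
h_end = [3.0, 4.0]
s = span_representation(h_start, h_end)
```

For hidden size $d$ the result has $4d$ dimensions, so $\mathbf{V}$ must accept a $4d$-dimensional input.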

loss:

$$ \left(\sum_{(i, j, l) \in \mathbf{y}}-\log \left(\mathbf{o}_{i, j}[l]\right)\right)+\left(\sum_{\left(i^{\prime}, j^{\prime}, l^{\prime}\right) \in \widehat{\mathbf{y}}}-\log \left(\mathbf{o}_{i^{\prime}, j^{\prime}}\left[l^{\prime}\right]\right)\right) $$

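where $\widehat{\mathbf{y}}$ is the set of negative instances sampled from the spans that are not annotated as entities. Its size is $\lceil\lambda n\rceil$ with $0<\lambda<1$.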
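The sampling of $\widehat{\mathbf{y}}$ can be sketched as below. Several details are assumptions made for illustration: spans are 0-indexed and inclusive, candidates are all spans with `i <= j` that are not annotated, sampling is uniform without replacement, and the negative label is written `"O"`. The function name `sample_negative_spans` is hypothetical.

```python
import math
import random


def sample_negative_spans(n, annotated, lam, seed=0):
    """Sample ceil(lam * n) non-annotated spans as negative instances.

    n:         sentence length
    annotated: list of (start, end, label) tuples that are labeled entities
    lam:       sampling ratio, 0 < lam < 1
    """
    rng = random.Random(seed)
    labeled = {(i, j) for i, j, _ in annotated}
    # Every span that is not an annotated entity is a candidate negative.
    candidates = [
        (i, j) for i in range(n) for j in range(i, n) if (i, j) not in labeled
    ]
    k = min(math.ceil(lam * n), len(candidates))
    return [(i, j, "O") for i, j in rng.sample(candidates, k)]


# Hypothetical sentence of length 10 with two annotated entities.
annotated = [(0, 1, "PER"), (4, 4, "LOC")]
negatives = sample_negative_spans(10, annotated, 0.25)
```

Because only a small random subset of spans is used as negatives, an unlabeled entity is rarely selected, which limits the misguidance it can cause.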

Experimental results:

lower bound

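Assume a sentence of length $n$ (with $n > 5$) contains one unlabeled entity. Then the probability that negative sampling never picks that unlabeled entity is greater than $1-\frac{2}{n-5}$.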

$$ \begin{array}{l} \prod_{0 \leq i<\lceil\lambda n\rceil}\left(1-\frac{1}{\frac{n(n-1)}{2}-m-i}\right)>\left(1-\frac{1}{\frac{n(n-1)}{2}-m-\lceil\lambda n\rceil}\right)^{\lceil\lambda n\rceil} \\ >\left(1-\frac{1}{\frac{n(n-1)}{2}-n-n}\right)^{n} \geq\left(1-n \cdot \frac{1}{\frac{n(n-1)}{2}-n-n}\right)=1-\frac{2}{n-5} \end{array} $$

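where $m$ is the number of annotated entities and $\frac{1}{\frac{n(n-1)}{2}-m-i}$ is the probability that a single draw selects the unlabeled entity as a negative sample. The second inequality uses $m \leq n$ and $\lceil\lambda n\rceil \leq n$, and the last inequality is Bernoulli's inequality.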
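The bound can be checked numerically by comparing the exact product with $1-\frac{2}{n-5}$. The sketch below follows the formula as written, with $\frac{n(n-1)}{2}$ candidate spans; the values of `n`, `m`, and `lam` are arbitrary choices for illustration.

```python
import math


def prob_miss_unlabeled(n, m, lam):
    """Exact probability that none of the ceil(lam * n) draws hits the unlabeled entity."""
    total = n * (n - 1) / 2
    k = math.ceil(lam * n)
    p = 1.0
    for i in range(k):
        # On draw i, total - m - i candidates remain and one is the unlabeled entity.
        p *= 1.0 - 1.0 / (total - m - i)
    return p


def lower_bound(n):
    """The closed-form lower bound 1 - 2 / (n - 5), valid for n > 5."""
    return 1.0 - 2.0 / (n - 5)


# Arbitrary example: sentence length 20, two annotated entities, lambda = 0.25.
exact = prob_miss_unlabeled(20, 2, 0.25)
bound = lower_bound(20)
```

For this example the product telescopes to $183/188$, comfortably above the bound of $1 - 2/15$, and the gap shows that the bound is loose for typical sentence lengths.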