BERT: Principles and Source Code Walkthrough

This post is organized as follows:
  • BERT network structure
    • Transformer
  • Key characteristics of BERT
  • BERT input and preprocessing
    • tokenization.py
    • create_pretraining_data.py
  • BERT model code
    • modeling.py
  • Pre-training
  • Downstream tasks

1. BERT Network Structure

[Figure: comparison of the BERT, GPT, and ELMo pre-training architectures]

  • BERT replaces the LSTM feature extractor used in ELMo with the Transformer;
  • BERT fixes the unidirectionality of GPT's language model by conditioning on both directions;
  • BERT uses the fine-tuning approach. (ELMo is feature-based: the trained representations are used as features for downstream tasks; word, sentence, paragraph, and document vectors all work this way.)

BERT comes in two model sizes, a smaller and a larger one, with the following hyperparameters:

  • \(BERT_{BASE}\): L=12, H=768, A=12, 110M parameters in total;
  • \(BERT_{LARGE}\): L=24, H=1024, A=16, 340M parameters in total.

Here L is the number of layers (i.e., the number of Transformer blocks), H is the hidden size, and A is the number of self-attention heads.
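
As a point of reference, these two settings can be expressed with modeling.BertConfig (a sketch: vocab_size=30522 matches the released English models, and intermediate_size is 4H in both; treat the exact values as illustrative):

# Sketch: the two published configurations expressed via modeling.BertConfig.
base_config = modeling.BertConfig(
    vocab_size=30522, hidden_size=768,
    num_hidden_layers=12, num_attention_heads=12,
    intermediate_size=3072)

large_config = modeling.BertConfig(
    vocab_size=30522, hidden_size=1024,
    num_hidden_layers=24, num_attention_heads=16,
    intermediate_size=4096)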

Transformer

[Figure: the Transformer architecture from "Attention Is All You Need" (encoder on the left, decoder on the right)]

The Transformer is an encoder-decoder architecture: as shown above, the left half is the encoder and the right half is the decoder.

Encoder

The encoder consists of 6 identical layers, each containing two sub-layers: a multi-head self-attention mechanism and a fully connected feed-forward network. Each sub-layer is wrapped in a residual connection followed by layer normalization, so a sub-layer's output can be written as:

\[sub\_layer\_output = LayerNorm(x + SubLayer(x))\]
  1. Multi-head self-attention mechanism

    Attention can be written as:

\[attention\_output = Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V\]

Multi-head attention projects Q, K, and V through h different learned linear transformations, runs attention on each projection, and concatenates the results:

\[MultiHead(Q, K, V) = Concat(head_1, \cdots, head_h) W^O \\ head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)\]

Self-attention is the special case where Q, K, and V all come from the same input. (A minimal sketch of both sub-layers follows this list.)

  2. Position-wise feed-forward networks

    A fully connected feed-forward network applied to each position separately and identically: \(FFN(x) = max(0, xW_1 + b_1)W_2 + b_2\).
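
To make the two sub-layers concrete, here is a minimal NumPy sketch of scaled dot-product self-attention and the position-wise FFN (illustrative names and shapes only, not the BERT source):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: [seq_len, d_k]; attention weights: [seq_len, seq_len]
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V

def position_wise_ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position independently
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

# Self-attention: Q, K, and V all come from the same input.
x = np.random.randn(4, 8)                      # 4 positions, width 8
out = scaled_dot_product_attention(x, x, x)    # [4, 8]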

2. Key Characteristics of BERT

  • The input to the BERT pre-training model is the sum Token Embeddings + Segment Embeddings + Position Embeddings (see the sketch after this list).
  • The vectors BERT trains are not whole-word vectors but WordPiece embeddings; to obtain English word embeddings from BERT, the WordPiece embeddings must be combined back into word embeddings.
  • The backbone is the self-attention-based Transformer.
  • Pre-training jointly optimizes two tasks, Masked Language Model (masked LM) and Next Sentence Prediction (NSP); the total loss is the sum of the two tasks' losses.
  • In the MLM task, 15% of the tokens are randomly masked, rather than predicting every token the way CBOW does; the loss is computed only over the masked tokens. Of the selected tokens, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged.
  • The activation function is GELU.
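
As a quick illustration of the first bullet, the three embeddings are summed element-wise (NumPy sketch; the ids and shapes are hypothetical):

import numpy as np

vocab_size, type_vocab_size, max_position, hidden = 30522, 2, 512, 768
token_table = np.random.randn(vocab_size, hidden) * 0.02      # truncated normal in the real code
segment_table = np.random.randn(type_vocab_size, hidden) * 0.02
position_table = np.random.randn(max_position, hidden) * 0.02

input_ids = np.array([101, 7592, 2088, 102])   # hypothetical ids for: [CLS] hello world [SEP]
segment_ids = np.array([0, 0, 0, 0])
positions = np.arange(len(input_ids))

# Token + Segment + Position; the real model then applies layer norm and dropout.
embedding_output = (token_table[input_ids]
                    + segment_table[segment_ids]
                    + position_table[positions])   # [4, 768]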

The rest of this post walks through BERT in detail, starting from the source code.

3. BERT Input and Preprocessing

tokenization.py
  • FullTokenizer

    Tokenization in BERT is implemented mainly by the FullTokenizer class, which delegates to BasicTokenizer and WordpieceTokenizer: the former performs ordinary splitting on whitespace and punctuation, while the latter further splits that output into finer-grained WordPieces. A usage sketch follows the class.

    class FullTokenizer(object):
        """Runs end-to-end tokenziation."""
      
        def __init__(self, vocab_file, do_lower_case=True):
            self.vocab = load_vocab(vocab_file)
            self.inv_vocab = {v: k for k, v in self.vocab.items()}
            self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
            self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)
          
        def tokenize(self, text):
            split_tokens = []
            for token in self.basic_tokenizer.tokenize(text):
                for sub_token in self.wordpiece_tokenizer.tokenize(token):
                    split_tokens.append(sub_token)
            return split_tokens
      
        def convert_tokens_to_ids(self, tokens):
            return convert_by_vocab(self.vocab, tokens)
      
        def convert_ids_to_tokens(self, ids):
            return convert_by_vocab(self.inv_vocab, ids)
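
    A minimal usage sketch (the vocab path and the exact split are hypothetical; both depend on the vocabulary file):

    tokenizer = FullTokenizer(vocab_file="vocab.txt", do_lower_case=True)
    tokens = tokenizer.tokenize("He was unaffable")
    # e.g. ["he", "was", "un", "##aff", "##able"], depending on the vocab
    ids = tokenizer.convert_tokens_to_ids(tokens)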
    
  • BasicTokenizer: ordinary splitting on whitespace and punctuation; Chinese text is additionally split into individual characters.

    def _is_chinese_char(self, cp):
        """判断一个unicode字符是否是汉字"""
        if ((cp >= 0x4E00 and cp <= 0x9FFF) or 
            (cp >= 0x3400 and cp <= 0x4DBF) or 
            (cp >= 0x20000 and cp <= 0x2A6DF) or 
            (cp >= 0x2A700 and cp <= 0x2B73F) or
            (cp >= 0x2B740 and cp <= 0x2B81F) or
            (cp >= 0x2B820 and cp <= 0x2CEAF) or
            (cp >= 0xF900 and cp <= 0xFAFF) or
            (cp >= 0x2F800 and cp <= 0x2FA1F)):
            return True
        return False
      
    def _run_strip_accents(self, text):
        """去除accents"""
        text = unicodedata.normalize("NFD", text)
        output = []
        for char in text:
            cat = unicodedata.category(char)
            if cat == "Mn":
                continue
            output.append(char)
        return "".join(output)
    
  • WordpieceTokenizer: splits English words into finer-grained WordPieces. (With whole-word masking enabled, BERT masks all WordPieces belonging to the same word together.) A walk-through of the matching loop follows the code.

    def tokenize(self, text):
        """
        Tokenizes a piece of text into word pieces using a greedy longest-match-first
        algorithm. The "##" prefix marks a piece that continues the previous one, so the
        WordPiece split is reversible: the "real" word can always be recovered.
        """
        text = convert_to_unicode(text)
      
        output_tokens = []
        for token in whitespace_tokenize(text):
            chars = list(token)
            if len(chars) > self.max_input_chars_per_word:
                output_tokens.append(self.unk_token)
                continue
      
            is_bad = False
            start = 0
            sub_tokens = []
            while start < len(chars):
                end = len(chars)
                cur_substr = None
                while start < end:
                    substr = "".join(chars[start:end])
                    if start > 0:
                        substr = "##" + substr
                    if substr in self.vocab:
                        cur_substr = substr
                        break
                    end -= 1
                if cur_substr is None:
                    is_bad = True
                    break
                sub_tokens.append(cur_substr)
                start = end
      
            if is_bad:
                output_tokens.append(self.unk_token)
            else:
                output_tokens.extend(sub_tokens)
        return output_tokens
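
    A walk-through of the greedy loop, assuming "un", "##aff", and "##able" are in the vocabulary:

    # input token: "unaffable"
    # start=0: try "unaffable", "unaffabl", ... shrink until "un" is in the vocab
    # start=2: try "##affable", ... shrink until "##aff" is in the vocab
    # start=5: "##able" is in the vocab
    # output: ["un", "##aff", "##able"]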
    
create_pretraining_data.py

The most important piece here is the function create_instances_from_document, which builds a list of TrainingInstance objects from a single document. It contains the concrete data-preparation details for both Masked Language Model (masked LM) and Next Sentence Prediction (NSP).

def create_instances_from_document(all_documents, document_index, max_seq_length, short_seq_prob,
                                   masked_lm_prob, max_predictions_per_seq, vocab_words, rng):
    """从单个文档生成TrainingInstance对象"""
    document = all_documents[document_index]

    # Reserve room for [CLS], [SEP], [SEP]
    max_num_tokens = max_seq_length - 3

    # To add some randomness and reduce the gap between pre-training and fine-tuning, a target_seq_length is set.
    # target_seq_length is only a rough target: the final length is not guaranteed to stay below it, but it never exceeds max_seq_length.
    target_seq_length = max_num_tokens
    if rng.random() < short_seq_prob:
        target_seq_length = rng.randint(2, max_num_tokens)

    # Build segment_A and segment_B from the sentences in the document (a segment may be a single sentence or several sentences concatenated)
    instances = []
    current_chunk = []
    current_length = 0
    i = 0
    while i < len(document):
        segment = document[i]
        current_chunk.append(segment)
        current_length += len(segment)
        if i == len(document) - 1 or current_length >= target_seq_length:
            if current_chunk:
                # a_end is the number of segments from current_chunk that go into segment_A
                a_end = 1
                if len(current_chunk) >= 2:
                    a_end = rng.randint(1, len(current_chunk) - 1)

                tokens_a = []
                for j in range(a_end):
                    tokens_a.extend(current_chunk[j])

                tokens_b = []
                is_random_next = False
                if len(current_chunk) == 1 or rng.random() < 0.5:
                    is_random_next = True
                    target_b_length = target_seq_length - len(tokens_a)

                    # Randomly pick a different document
                    for _ in range(10):
                        random_document_index = rng.randint(0, len(all_documents) - 1)
                        if random_document_index != document_index:
                            break

                    random_document = all_documents[random_document_index]
                    random_start = rng.randint(0, len(random_document) - 1)
                    for j in range(random_start, len(random_document)):
                        tokens_b.extend(random_document[j])
                        if len(tokens_b) >= target_b_length:
                            break
                    # Rewind the loop index to the first unused segment of current_chunk, so that no sentence information is wasted.
                    num_unused_segments = len(current_chunk) - a_end
                    i -= num_unused_segments
                else:
                    # Use the actual next sentences
                    is_random_next = False
                    for j in range(a_end, len(current_chunk)):
                        tokens_b.extend(current_chunk[j])
                # Truncate the longer of tokens_a and tokens_b, trimming randomly from the front or the back
                truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng)

                assert len(tokens_a) >= 1
                assert len(tokens_b) >= 1

                tokens = []
                segment_ids = []
                tokens.append("[CLS]")
                segment_ids.append(0)
                for token in tokens_a:
                    tokens.append(token)
                    segment_ids.append(0)

                tokens.append("[SEP]")
                segment_ids.append(0)

                for token in tokens_b:
                    tokens.append(token)
                    segment_ids.append(1)
                tokens.append("[SEP]")
                segment_ids.append(1)
                # Randomly select 15% of the tokens for masking: 80% become '[MASK]', 10% stay
                # unchanged, 10% become a random token (see the sketch after this function)
                (tokens, masked_lm_positions,masked_lm_labels) \
                    = create_masked_lm_predictions(tokens, masked_lm_prob, max_predictions_per_seq, vocab_words, rng)
                instance = TrainingInstance(
                                tokens=tokens,
                                segment_ids=segment_ids,
                                is_random_next=is_random_next,
                                masked_lm_positions=masked_lm_positions,
                                masked_lm_labels=masked_lm_labels)
                instances.append(instance)
            current_chunk = []
            current_length = 0
        i += 1
    return instances
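
The masking itself happens in create_masked_lm_predictions. Below is a simplified sketch of its core 80/10/10 logic (whole-word masking and edge cases omitted; not the original code):

def simple_masked_lm_predictions(tokens, masked_lm_prob,
                                 max_predictions_per_seq, vocab_words, rng):
    """Simplified sketch of create_masked_lm_predictions."""
    cand_indexes = [i for i, t in enumerate(tokens) if t not in ("[CLS]", "[SEP]")]
    rng.shuffle(cand_indexes)
    num_to_predict = min(max_predictions_per_seq,
                         max(1, int(round(len(tokens) * masked_lm_prob))))

    output_tokens = list(tokens)
    masked_lm_positions, masked_lm_labels = [], []
    for index in sorted(cand_indexes[:num_to_predict]):
        if rng.random() < 0.8:                      # 80%: replace with [MASK]
            masked_token = "[MASK]"
        elif rng.random() < 0.5:                    # 10%: keep the original token
            masked_token = tokens[index]
        else:                                       # 10%: replace with a random token
            masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)]
        output_tokens[index] = masked_token
        masked_lm_positions.append(index)
        masked_lm_labels.append(tokens[index])      # the label is always the original token
    return output_tokens, masked_lm_positions, masked_lm_labels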

4. BERT Model Code

modeling.py

class BertModel(object):
    """
    BERT model ("Bidirectional Encoder Representations from Transformers").
    Example usage:
    # Already been converted into WordPiece token ids
    input_ids = tf.constant([[31, 51, 99], [15, 5, 0]])
    input_mask = tf.constant([[1, 1, 1], [1, 1, 0]])
    token_type_ids = tf.constant([[0, 0, 1], [0, 2, 0]])

    config = modeling.BertConfig(vocab_size=32000, hidden_size=512,
                                 num_hidden_layers=8, num_attention_heads=6, intermediate_size=1024)

    model = modeling.BertModel(config=config, is_training=True,
                               input_ids=input_ids, input_mask=input_mask, token_type_ids=token_type_ids)

    label_embeddings = tf.get_variable(...)
    pooled_output = model.get_pooled_output()
    logits = tf.matmul(pooled_output, label_embeddings)
    """

    def __init__(self, config,
                is_training,
                input_ids,
                input_mask=None,
                token_type_ids=None,
                use_one_hot_embeddings=False,
                scope=None):
        """
        Constructor for BertModel.
        """
        config = copy.deepcopy(config)
        if not is_training:
            config.hidden_dropout_prob = 0.0
            config.attention_probs_dropout_prob = 0.0

        input_shape = get_shape_list(input_ids, expected_rank=2)
        batch_size = input_shape[0]
        seq_length = input_shape[1]

        if input_mask is None:
            input_mask = tf.ones(shape=[batch_size, seq_length], dtype=tf.int32)

        if token_type_ids is None:
            token_type_ids = tf.zeros(shape=[batch_size, seq_length], dtype=tf.int32)

        with tf.variable_scope(scope, default_name="bert"):
            with tf.variable_scope("embeddings"):
                # The input token ids are mapped to config.hidden_size-dimensional (768 for BASE) embeddings
                (self.embedding_output, self.embedding_table) = embedding_lookup(
                    input_ids=input_ids,
                    vocab_size=config.vocab_size,
                    embedding_size=config.hidden_size,
                    initializer_range=config.initializer_range,
                    word_embedding_name="word_embeddings",
                    use_one_hot_embeddings=use_one_hot_embeddings)
                # Add token_type_embeddings and position_embeddings to the token embeddings (an element-wise
                # sum, not a concatenation); all three embedding tables are initialized from a truncated
                # normal distribution.
                self.embedding_output = embedding_postprocessor(
                    input_tensor=self.embedding_output,
                    use_token_type=True,
                    token_type_ids=token_type_ids,
                    token_type_vocab_size=config.type_vocab_size,
                    token_type_embedding_name="token_type_embeddings",
                    use_position_embeddings=True,
                    position_embedding_name="position_embeddings",
                    initializer_range=config.initializer_range,
                    max_position_embeddings=config.max_position_embeddings,
                    dropout_prob=config.hidden_dropout_prob)
            # The Transformer encoder: num_hidden_layers layers deep, num_attention_heads heads per layer
            with tf.variable_scope("encoder"):
                attention_mask = create_attention_mask_from_input_mask(input_ids, input_mask)

                self.all_encoder_layers = transformer_model(
                    input_tensor=self.embedding_output,
                    attention_mask=attention_mask,
                    hidden_size=config.hidden_size,
                    num_hidden_layers=config.num_hidden_layers,
                    num_attention_heads=config.num_attention_heads,
                    intermediate_size=config.intermediate_size,
                    intermediate_act_fn=get_activation(config.hidden_act),
                    hidden_dropout_prob=config.hidden_dropout_prob,
                    attention_probs_dropout_prob=config.attention_probs_dropout_prob,
                    initializer_range=config.initializer_range,
                    do_return_all_layers=True)

            self.sequence_output = self.all_encoder_layers[-1]
            # The "pooler" converts the encoder output for the first token ([CLS])
            # into a fixed-size sentence representation with a dense + tanh layer.
            with tf.variable_scope("pooler"):
                first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
                self.pooled_output = tf.layers.dense(
                    first_token_tensor,
                    config.hidden_size,
                    activation=tf.tanh,
                    kernel_initializer=create_initializer(config.initializer_range))

transformer_model uses the GELU activation, defined as \(GELU(x) = x \cdot P(X \le x) = x\Phi(x)\). GELU can be seen as combining the ideas of dropout and ReLU: the input is multiplied by a value in the interval (0, 1), so the smaller the input, the higher its probability of being "dropped".

For the standard normal distribution, the paper gives the following approximation for \(GELU(x)\): \(GELU(x)=0.5x(1+\tanh[\sqrt{2/\pi}(x+0.044715x^3)])\)
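
modeling.py implements this approximation along these lines (np and tf are modeling.py's numpy/tensorflow imports):

def gelu(x):
    """Gaussian Error Linear Unit (tanh approximation)."""
    cdf = 0.5 * (1.0 + tf.tanh(
        (np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))
    return x * cdf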

def transformer_model(input_tensor, attention_mask=None, hidden_size=768, num_hidden_layers=12, num_attention_heads=12, 
                      intermediate_size=3072, intermediate_act_fn=gelu, hidden_dropout_prob=0.1, 
                      attention_probs_dropout_prob=0.1, initializer_range=0.02, do_return_all_layers=False):
    """
    Multi-headed, multi-layer Transformer from "Attention is All You Need".
    """
    # Multi-head self-attention requires the hidden size to be evenly divisible by the number of heads
    if hidden_size % num_attention_heads != 0:
        raise ValueError(
            "The hidden size (%d) is not a multiple of the number of attention heads (%d)"
            % (hidden_size, num_attention_heads))

    attention_head_size = int(hidden_size / num_attention_heads)
    input_shape = get_shape_list(input_tensor, expected_rank=3)
    batch_size = input_shape[0]
    seq_length = input_shape[1]
    input_width = input_shape[2]

    # The Transformer performs sum residuals on all layers so the input needs to be the same as the hidden size.
    if input_width != hidden_size:
        raise ValueError("The width of the input tensor (%d) != hidden size (%d)" 
                         % (input_width, hidden_size))

    prev_output = reshape_to_matrix(input_tensor)

    all_layer_outputs = []
    for layer_idx in range(num_hidden_layers):
        with tf.variable_scope("layer_%d" % layer_idx):
            layer_input = prev_output

            with tf.variable_scope("attention"):
                attention_heads = []
                with tf.variable_scope("self"):
                    attention_head = attention_layer(
                        from_tensor=layer_input,
                        to_tensor=layer_input,
                        attention_mask=attention_mask,
                        num_attention_heads=num_attention_heads,
                        size_per_head=attention_head_size,
                        attention_probs_dropout_prob=attention_probs_dropout_prob,
                        initializer_range=initializer_range,
                        do_return_2d_tensor=True,
                        batch_size=batch_size,
                        from_seq_length=seq_length,
                        to_seq_length=seq_length)
                    attention_heads.append(attention_head)

                attention_output = None
                if len(attention_heads) == 1:
                    attention_output = attention_heads[0]
                else:
                    attention_output = tf.concat(attention_heads, axis=-1)

                # Linear projection of the attention output, then residual + layer norm.
                with tf.variable_scope("output"):
                    attention_output = tf.layers.dense(
                        attention_output,
                        hidden_size,
                        kernel_initializer=create_initializer(initializer_range))
                    attention_output = dropout(attention_output, hidden_dropout_prob)
                    attention_output = layer_norm(attention_output + layer_input)

            # The position-wise feed-forward ("intermediate") layer.
            with tf.variable_scope("intermediate"):
                intermediate_output = tf.layers.dense(
                    attention_output,
                    intermediate_size,
                    activation=intermediate_act_fn,
                    kernel_initializer=create_initializer(initializer_range))

            # Down-project back to `hidden_size` then add the residual.
            with tf.variable_scope("output"):
                layer_output = tf.layers.dense(
                    intermediate_output,
                    hidden_size,
                    kernel_initializer=create_initializer(initializer_range))
                layer_output = dropout(layer_output, hidden_dropout_prob)
                layer_output = layer_norm(layer_output + attention_output)
                prev_output = layer_output
                all_layer_outputs.append(layer_output)

    if do_return_all_layers:
        final_outputs = []
        for layer_output in all_layer_outputs:
            final_output = reshape_from_matrix(layer_output, input_shape)
            final_outputs.append(final_output)
        return final_outputs
    else:
        final_output = reshape_from_matrix(prev_output, input_shape)
        return final_output


def attention_layer(from_tensor,
                    to_tensor,
                    attention_mask=None,
                    num_attention_heads=1,
                    size_per_head=512,
                    query_act=None,
                    key_act=None,
                    value_act=None,
                    attention_probs_dropout_prob=0.0,
                    initializer_range=0.02,
                    do_return_2d_tensor=False,
                    batch_size=None,
                    from_seq_length=None,
                    to_seq_length=None):
    """
    Performs multi-headed attention from `from_tensor` to `to_tensor`.
    """
    def transpose_for_scores(input_tensor, batch_size, num_attention_heads, seq_length, width):
        output_tensor = tf.reshape(input_tensor, [batch_size, seq_length, num_attention_heads, width])
        output_tensor = tf.transpose(output_tensor, [0, 2, 1, 3])
        return output_tensor

    from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
    to_shape = get_shape_list(to_tensor, expected_rank=[2, 3])

    if len(from_shape) != len(to_shape):
        raise ValueError("The rank of `from_tensor` must match the rank of `to_tensor`.")

    if len(from_shape) == 3:
        batch_size = from_shape[0]
        from_seq_length = from_shape[1]
        to_seq_length = to_shape[1]
    elif len(from_shape) == 2:
        if (batch_size is None or from_seq_length is None or to_seq_length is None):
            raise ValueError("When passing in rank 2 tensors to attention_layer, the values for `batch_size`, "
                             "`from_seq_length`, and `to_seq_length` must all be specified.")

    # Scalar dimensions referenced here:
    #   B = batch size (number of sequences)
    #   F = `from_tensor` sequence length
    #   T = `to_tensor` sequence length
    #   N = `num_attention_heads`
    #   H = `size_per_head`

    from_tensor_2d = reshape_to_matrix(from_tensor)
    to_tensor_2d = reshape_to_matrix(to_tensor)

    # `query_layer` = [B*F, N*H]
    query_layer = tf.layers.dense(
        from_tensor_2d,
        num_attention_heads * size_per_head,
        activation=query_act,
        name="query",
        kernel_initializer=create_initializer(initializer_range))

    # `key_layer` = [B*T, N*H]
    key_layer = tf.layers.dense(
        to_tensor_2d,
        num_attention_heads * size_per_head,
        activation=key_act,
        name="key",
        kernel_initializer=create_initializer(initializer_range))

    # `value_layer` = [B*T, N*H]
    value_layer = tf.layers.dense(
        to_tensor_2d,
        num_attention_heads * size_per_head,
        activation=value_act,
        name="value",
        kernel_initializer=create_initializer(initializer_range))

    # `query_layer` = [B, N, F, H]
    query_layer = transpose_for_scores(query_layer, batch_size, num_attention_heads, from_seq_length, size_per_head)

    # `key_layer` = [B, N, T, H]
    key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads, to_seq_length, size_per_head)

    # Take the dot product between "query" and "key" to get the raw attention scores.
    # `attention_scores` = [B, N, F, T]
    attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
    attention_scores = tf.multiply(attention_scores, 1.0 / math.sqrt(float(size_per_head)))

    if attention_mask is not None:
        # `attention_mask` = [B, 1, F, T]
        attention_mask = tf.expand_dims(attention_mask, axis=[1])

        # Since attention_mask is 1.0 for positions we want to attend and 0.0 for masked positions, this operation 
        # will create a tensor which is 0.0 for positions we want to attend and -10000.0 for masked positions.
        adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0

        # Since we are adding it to the raw scores before the softmax, this is effectively the same as removing these entirely.
        attention_scores += adder

    # Normalize the attention scores to probabilities.
    # `attention_probs` = [B, N, F, T]
    attention_probs = tf.nn.softmax(attention_scores)

    # This is actually dropping out entire tokens to attend to, which might
    # seem a bit unusual, but is taken from the original Transformer paper.
    attention_probs = dropout(attention_probs, attention_probs_dropout_prob)

    # `value_layer` = [B, T, N, H]
    value_layer = tf.reshape(value_layer, [batch_size, to_seq_length, num_attention_heads, size_per_head])

    # `value_layer` = [B, N, T, H]
    value_layer = tf.transpose(value_layer, [0, 2, 1, 3])

    # `context_layer` = [B, N, F, H]
    context_layer = tf.matmul(attention_probs, value_layer)

    # `context_layer` = [B, F, N, H]
    context_layer = tf.transpose(context_layer, [0, 2, 1, 3])

    if do_return_2d_tensor:
        # `context_layer` = [B*F, N*H]
        context_layer = tf.reshape(context_layer, [batch_size * from_seq_length, num_attention_heads * size_per_head])
    else:
        # `context_layer` = [B, F, N*H]
        context_layer = tf.reshape(context_layer, [batch_size, from_seq_length, num_attention_heads * size_per_head])

    return context_layer
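
The -10000.0 adder can be sanity-checked in isolation: after softmax, masked positions receive essentially zero probability (NumPy sketch):

import numpy as np

scores = np.array([2.0, 1.0, 0.5])
mask = np.array([1.0, 1.0, 0.0])            # attend to the first two positions only
adder = (1.0 - mask) * -10000.0
probs = np.exp(scores + adder)
probs /= probs.sum()
# probs ≈ [0.73, 0.27, 0.00]: the masked position is effectively removed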

5. Pre-training

The pre-training loss is the sum of the losses of the two tasks, Masked Language Model (masked LM) and Next Sentence Prediction (NSP). For the details, the BERT source (run_pretraining.py) is straightforward to read; a schematic sketch follows.
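
Schematically, run_pretraining.py combines the two losses like this (a sketch; see get_masked_lm_output and get_next_sentence_output in the source for the cross-entropy details):

# Both terms are cross-entropy losses over the model outputs.
(masked_lm_loss, masked_lm_example_loss, masked_lm_log_probs) = get_masked_lm_output(
    bert_config, model.get_sequence_output(), model.get_embedding_table(),
    masked_lm_positions, masked_lm_ids, masked_lm_weights)

(next_sentence_loss, next_sentence_example_loss, next_sentence_log_probs) = get_next_sentence_output(
    bert_config, model.get_pooled_output(), next_sentence_labels)

total_loss = masked_lm_loss + next_sentence_loss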

6. Downstream Tasks

Because BERT is pre-trained on a rich corpus with a very large model, it is applicable to a wide range of downstream tasks:

  • Sequence labeling: word segmentation, POS tagging, NER, semantic labeling
  • Classification: text classification, sentiment analysis
  • Sentence-pair tasks: entailment, QA, natural language inference
  • Generative tasks: machine translation, text summarization, etc.

[Figure: fine-tuning BERT on different downstream tasks]
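
For sentence-level classification, for example, run_classifier.py essentially adds one dense layer plus softmax on top of the pooled [CLS] output. A simplified sketch of its create_model (model, labels, and num_labels come from the surrounding function; TF 1.x):

output_layer = model.get_pooled_output()            # [batch_size, hidden_size]
hidden_size = output_layer.shape[-1].value

output_weights = tf.get_variable(
    "output_weights", [num_labels, hidden_size],
    initializer=tf.truncated_normal_initializer(stddev=0.02))
output_bias = tf.get_variable(
    "output_bias", [num_labels], initializer=tf.zeros_initializer())

output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)
logits = tf.matmul(output_layer, output_weights, transpose_b=True)
logits = tf.nn.bias_add(logits, output_bias)
log_probs = tf.nn.log_softmax(logits, axis=-1)

# Standard cross-entropy over the label set.
one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
loss = tf.reduce_mean(per_example_loss)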