Tensorflow：使用 GPU 进行 BERT 微调

训练数据的短缺是自然语言处理面临的最大挑战之一。因为 NLP 是一个多元化的领域，在多语言数据中具有多种任务。最特定于任务的数据集仅包含几千个训练数据，这不足以实现更好的准确性。

为了提高现代基于深度学习的 NLP 模型的性能，需要数百万或数十亿的训练数据。研究人员已经开发出各种方法来使用网络上的大量未注释文本来训练通用语言表示模型。这称为预训练。

这些预训练模型可用于为广泛的 NLP 任务（例如问答和测试分类）创建最先进的模型。它被称为微调。当我们没有足够数量的训练样本时，微调是有效的。

BERT

BERT 代表来自 Transformers 的双向编码器表示。 BERT 是由 Google AI 的研究人员推出的 NLP 框架。它是一种新的预训练语言表示模型，可在各种自然语言处理 (NLP) 任务上获得最先进的结果。只需添加单个输出层即可对预训练的 BERT 模型进行微调。你可以在这里找到 BERT 的学术论文：https://arxiv.org/abs/1810.04805。

在本教程中，您将通过一个示例学习对 BERT 模型进行微调。可以参考之前的 BERT 教程，里面已经解释了 BERT 模型的架构。

我们将使用 Kaggle 的 Quora Insincere Questions Classification 任务数据进行演示。

In [1]:
# Let's load the required packages
import pandas as pd
import numpy as np
import datetime
import zipfile
import sys
import os

下载预训练的 BERT 模型以及模型权重和配置文件

In [2]: !wget storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip

提取下载的模型 zip 文件。

In [3]:
repo = 'model_repo'
if not os.path.exists(repo):
    print("Dir created!")
    os.mkdir(repo)
with zipfile.ZipFile("uncased_L-12_H-768_A-12.zip","r") as zip_ref:
    zip_ref.extractall(repo)

In [4]:
BERT_MODEL = 'uncased_L-12_H-768_A-12'
BERT_PRETRAINED_DIR = f'{repo}/uncased_L-12_H-768_A-12'

OUTPUT_DIR = f'{repo}/outputs'
if not os.path.exists(OUTPUT_DIR):
    os.makedirs(OUTPUT_DIR)

print(f'***** Model output directory: {OUTPUT_DIR} *****')
print(f'***** BERT pretrained directory: {BERT_PRETRAINED_DIR} *****') 

Out[4]:

***** Model output directory: model_repo/outputs *****
***** BERT pretrained directory: model_repo/uncased_L-12_H-768_A-12 *****

准备和导入 BERT 模块

以下 BERT 模块是从 GitHub 克隆源代码并导入模块。

In [5]:
# Download the BERT modules
!wget raw.githubusercontent.com/google-research/bert/master/modeling.py 
!wget raw.githubusercontent.com/google-research/bert/master/optimization.py 
!wget raw.githubusercontent.com/google-research/bert/master/run_classifier.py 
!wget raw.githubusercontent.com/google-research/bert/master/tokenization.py
!wget raw.githubusercontent.com/google-research/bert/master/run_classifier_with_tfhub.py

In [6]: # Import BERT modules 
import modeling 
import optimization 
import run_classifier 
import tokenization 
import tensorflow as tf 
import run_classifier_with_tfhub

准备训练数据

在这里，我们将在一小部分训练数据上训练 BERT 模型。

In [7]:
from sklearn.model_selection import train_test_split

train_df =  pd.read_csv('input/train.csv')
train_df = train_df.sample(2000)                 # Train on 2000 data

train, val = train_test_split(train_df, test_size = 0.1, random_state=42)

train_lines, train_labels = train.question_text.values, train.target.values
val_lines, val_labels = val.question_text.values, val.target.values

label_list = ['0', '1']

In [8]:
def create_examples(lines, set_type, labels=None):
    guid = f'{set_type}'
    examples = []
    if guid == 'train':
        for line, label in zip(lines, labels):
            text_a = line
            label = str(label)
            examples.append(
              run_classifier.InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
    else:
        for line in lines:
            text_a = line
            label = '0'
            examples.append(
              run_classifier.InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
    return examples

指定 BERT 预训练模型。

这里使用的是 uncased_L-12_H-768_A-12 型号。该模型由12层、768个隐藏、12个头、110M个参数组成。它是一个 Uncased 模型，这意味着文本在标记化之前已被小写。

In [9]:
BERT_MODEL = 'uncased_L-12_H-768_A-12' 
BERT_MODEL_HUB = 'https://tfhub.dev/google/bert_' + BERT_MODEL + '/1'

初始化模型超参数。

In [10]:
TRAIN_BATCH_SIZE = 32
EVAL_BATCH_SIZE = 8
LEARNING_RATE = 2e-5
NUM_TRAIN_EPOCHS = 3.0
WARMUP_PROPORTION = 0.1
MAX_SEQ_LENGTH = 128

# Model Configuration
SAVE_CHECKPOINTS_STEPS = 1000 
ITERATIONS_PER_LOOP = 1000
NUM_TPU_CORES = 8

VOCAB_FILE = os.path.join(BERT_PRETRAINED_DIR, 'vocab.txt')
CONFIG_FILE = os.path.join(BERT_PRETRAINED_DIR, 'bert_config.json')
INIT_CHECKPOINT = os.path.join(BERT_PRETRAINED_DIR, 'bert_model.ckpt')
DO_LOWER_CASE = BERT_MODEL.startswith('uncased')

tpu_cluster_resolver = None   # Model trained on GPU, we won't need a cluster resolver

def get_run_config(output_dir):
    return tf.contrib.tpu.RunConfig(
    cluster=tpu_cluster_resolver,
    model_dir=output_dir,
    save_checkpoints_steps=SAVE_CHECKPOINTS_STEPS,
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=ITERATIONS_PER_LOOP,
        num_shards=NUM_TPU_CORES,
        per_host_input_for_training=tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2))

加载分词器模块

注意：当您使用 Cased 模型时，传递 do_lower_case = False。

In [11]:
tokenizer = tokenization.FullTokenizer(vocab_file=VOCAB_FILE, do_lower_case=DO_LOWER_CASE)
train_examples = create_examples(train_lines, 'train', labels=train_labels)

# compute number of train and warmup steps from batch size
num_train_steps = int( len(train_examples) / TRAIN_BATCH_SIZE * NUM_TRAIN_EPOCHS)
num_warmup_steps = int(num_train_steps * WARMUP_PROPORTION)

微调来自 TF Hub 的预训练 BERT 模型

本节说明了来自 TensorFlow 集线器模块的微调预训练 BERT 模型。

In [12]:

model_fn = run_classifier_with_tfhub.model_fn_builder(
  num_labels=len(label_list),
  learning_rate=LEARNING_RATE,
  num_train_steps=num_train_steps,
  num_warmup_steps=num_warmup_steps,
  use_tpu=False,
  bert_hub_module_handle=BERT_MODEL_HUB
)

estimator_from_tfhub = tf.contrib.tpu.TPUEstimator(
  use_tpu=False,    #If False training will fall on CPU or GPU
  model_fn=model_fn,
  config=get_run_config(OUTPUT_DIR),
  train_batch_size=TRAIN_BATCH_SIZE,
  eval_batch_size=EVAL_BATCH_SIZE,
)

In [13]:
# Train the model
def model_train(estimator):
    print('Please wait...')
    train_features = run_classifier.convert_examples_to_features(
      train_examples, label_list, MAX_SEQ_LENGTH, tokenizer)
    print('***** Started training at {} *****'.format(datetime.datetime.now()))
    print('  Num examples = {}'.format(len(train_examples)))
    print('  Batch size = {}'.format(TRAIN_BATCH_SIZE))
    tf.logging.info("  Num steps = %d", num_train_steps)
    train_input_fn = run_classifier.input_fn_builder(
      features=train_features,
      seq_length=MAX_SEQ_LENGTH,
      is_training=True,
      drop_remainder=True)
    estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
    print('***** Finished training at {} *****'.format(datetime.datetime.now()))

In [14]: model_train(estimator_from_tfhub)

In [15]:
# Evaluate the model
def model_eval(estimator):
    
    eval_examples = create_examples(val_lines, 'test')
    
    eval_features = run_classifier.convert_examples_to_features(
        eval_examples, label_list, MAX_SEQ_LENGTH, tokenizer)
        
    print('***** Started evaluation at {} *****'.format(datetime.datetime.now()))
    print('  Num examples = {}'.format(len(eval_examples)))
    print('  Batch size = {}'.format(EVAL_BATCH_SIZE))
    
    eval_steps = int(len(eval_examples) / EVAL_BATCH_SIZE)
    
    eval_input_fn = run_classifier.input_fn_builder(
      features=eval_features,
      seq_length=MAX_SEQ_LENGTH,
      is_training=False,
      drop_remainder=True)
    
    result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps)
    
    print('***** Finished evaluation at {} *****'.format(datetime.datetime.now()))
    
    print("***** Eval results *****")
    for key in sorted(result.keys()):
        print('  {} = {}'.format(key, str(result[key])))

In [16]: model_eval(estimator_from_tfhub)

从检查点微调预训练的 BERT 模型

您还可以从保存的检查点加载预训练的 BERT 模型。

In [17]:
CONFIG_FILE = os.path.join(BERT_PRETRAINED_DIR, 'bert_config.json')
INIT_CHECKPOINT = os.path.join(BERT_PRETRAINED_DIR, 'bert_model.ckpt')

OUTPUT_DIR = f'{repo}/outputs_checkpoints'
if not os.path.exists(OUTPUT_DIR):
    os.makedirs(OUTPUT_DIR)

model_fn = run_classifier.model_fn_builder(
    bert_config=modeling.BertConfig.from_json_file(CONFIG_FILE),
    num_labels=len(label_list),
    init_checkpoint=INIT_CHECKPOINT,
    learning_rate=LEARNING_RATE,
    num_train_steps=num_train_steps,
    num_warmup_steps=num_warmup_steps,
    use_tpu=False, #If False training will fall on CPU or GPU, 
    use_one_hot_embeddings=True)

estimator_from_checkpoints = tf.contrib.tpu.TPUEstimator(
    use_tpu=False,
    model_fn=model_fn,
    config=get_run_config(OUTPUT_DIR),
    train_batch_size=TRAIN_BATCH_SIZE,
    eval_batch_size=EVAL_BATCH_SIZE)

In [18]: 
# Train the Model
model_train(estimator_from_checkpoints)

# Evaluate the Model
In [19]: model_eval(estimator_from_checkpoints)