Deep Learning – gitweixin

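TensorFlow, Artificial Intelligence · April 5, 2022

Fixing "module 'tensorflow' has no attribute 'InteractiveSession'" on Kaggle and other online platforms

Besides installing TensorFlow locally, I like to experiment on online platforms when learning it, especially ones like Kaggle that come with datasets. I copied the example below from a book and was surprised when it failed to run.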

# Enter an interactive TensorFlow session.
import tensorflow as tf

sess = tf.InteractiveSession()

x = tf.Variable([1.0, 2.0])
a = tf.constant([3.0, 3.0])

# Initialize 'x' by calling run() on its initializer op.
x.initializer.run()

# Add a subtract op that subtracts 'a' from 'x', then run it and print the result.
sub = tf.subtract(x, a)
print(sub.eval())

Running it produces the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_33/152899263.py in <module>
      2 import tensorflow as tf
      3 
----> 4 sess = tf.InteractiveSession()
      5 #tf.compat.v1.disable_eager_execution()
      6 #sess = tf.compat.v1.InteractiveSession()

AttributeError: module 'tensorflow' has no attribute 'InteractiveSession'

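This is a version issue: TensorFlow 2.x no longer exposes InteractiveSession at the top level, and the 1.x session API now lives under tf.compat.v1. Replace the statement with:

sess = tf.compat.v1.InteractiveSession()

That fixes the first problem, but a new one appears:
AttributeError: 'NoneType' object has no attribute 'run'
TensorFlow 2.x executes eagerly by default, so the variable has no initializer op to run. The fix is to call tf.compat.v1.disable_eager_execution() before creating the session.

The complete revised code is: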

# Enter an interactive TensorFlow session.
import tensorflow as tf

tf.compat.v1.disable_eager_execution()
sess = tf.compat.v1.InteractiveSession()
x = tf.Variable([1.0, 2.0])
a = tf.constant([3.0, 3.0])

# Initialize 'x' by calling run() on its initializer op.
x.initializer.run()

# Add a subtract op that subtracts 'a' from 'x', then run it and print the result.
sub = tf.subtract(x, a)
print(sub.eval())
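
If compatibility with the book's 1.x style is not required, the same subtraction can also be written in the native TensorFlow 2.x style. The following is only a minimal sketch, assuming a TensorFlow 2.x installation with eager execution left at its default; it is an alternative to the compat.v1 fix above, not part of the book's example.

```python
import tensorflow as tf

# Eager execution is the TensorFlow 2.x default, so no session is created.
x = tf.Variable([1.0, 2.0])
a = tf.constant([3.0, 3.0])

# The subtraction runs immediately and returns a concrete tensor.
sub = tf.subtract(x, a)
print(sub.numpy())
```

In this style the variable is initialized as soon as it is created, so there is no initializer op to run and no eval() call.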

Author: east

Deep Learning · March 29, 2022

What Is Google BERT and How Do You Optimize for It?

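Have you heard of BERT, Google's new update? If you follow search engine optimization (SEO), you probably have. The hype around Google BERT in the SEO world is justified, because BERT makes search more about the semantics, or meaning, behind words than about the words themselves.

In other words, search intent matters more than ever. Google's recent BERT update affects about 1 in 10 search queries, and Google expects that share to grow as the update rolls out to more languages and regions. Because BERT will have such a large effect on search, high-quality content matters more than ever.

To help your content perform at its best for BERT (and for search intent), this article covers how BERT works with search and how you can use BERT to bring more traffic to your site.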

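What is BERT?
BERT stands for Bidirectional Encoder Representations from Transformers. That is a name packed with technical machine learning terms!

Here is what it means:

Bidirectional: BERT encodes a sentence in both directions at once
Encoder Representations: BERT translates the sentence into representations of word meaning that it can work with
Transformers: allow BERT to encode every word in a sentence with its relative position, since context depends heavily on word order (a more efficient approach than memorizing exactly how sentences were fed into the framework)
Paraphrased, BERT uses Transformers to encode representations of the words on both sides of a target word. At its core, BERT is a state-of-the-art natural language processing (NLP) framework of a kind that had not been implemented before. This kind of architecture adds a layer of machine learning to Google's AI that is designed to understand human language better.

In other words, with this update Google's AI algorithms can read sentences and queries with a higher level of human-like contextual understanding and common sense than before. It does not understand language as well as humans do, but it is still a big step forward for NLP in machine language understanding.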

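What BERT is not
Google BERT does not change how web pages are judged the way earlier algorithm updates such as Penguin or Panda did. It does not rate pages as positive or negative. Instead, it improves the results for conversational search queries so that they better match the intent behind them.

BERT history
BERT has been around longer than the big update that rolled out a few months ago. The natural language processing (NLP) and machine learning (ML) communities have been discussing it since October 2018, when the research paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding was published. Shortly afterward, Google released a groundbreaking open-source NLP framework based on the paper, which the NLP community could use to study NLP and integrate into their own projects.

Since then, several new NLP frameworks based on or incorporating BERT have appeared, including ALBERT from Google and Toyota, Facebook's RoBERTa, Microsoft's MT-DNN, and IBM's BERT-mtl. The waves BERT made in the NLP community account for most of its mentions on the internet, but mentions of BERT in the SEO world are gaining traction. That is because BERT focuses on the language in long-tail queries and on reading websites the way a human would, in order to return better results for search queries.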

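How does BERT work?
Google BERT is a very complex framework, and understanding it fully would take years of studying NLP theory and practice. The SEO world does not need to go that deep, but knowing what it does and why is useful for understanding how it will affect search results from here on.

So, here is how Google BERT works:

Google BERT explained
Here is how BERT looks at the context of a sentence or search query as a whole:

BERT takes a query
Breaks it down word by word
Looks at all the possible relationships between the words
Builds a bidirectional map outlining the relationships between words in both directions
Analyzes the contextual meanings behind the words when they are paired with each other.
To understand this better, we will use the following example:

Each line represents how the meaning of "pandas" changes the meaning of other words in the sentence, and vice versa. The relationships go both ways, so the arrows are double-headed. Of course, this is a very, very simple example of how BERT looks at context.

This example examines only the relationships between our target word, "pandas", and the other meaningful segments of the sentence. BERT, however, analyzes the contextual relationships of all the words in the sentence. This image may be a bit more accurate: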

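BERT's analogy
The original Transformer architecture uses encoders and decoders to analyze the relationships between words; BERT itself keeps only the encoder side. Imagining how such a model would work as a translation process is still a good illustration of what the encoder does. You start with an input, whatever sentence you want to translate into another language.

Say you want to translate the panda sentence above from English into Korean. BERT does not understand English or Korean, though, so it uses its encoder to translate "What do pandas eat other than bamboo?" into a language it does understand. This is a language it builds for itself while analyzing language (which is where the encoder representations come from).

BERT tags words according to their relative positions and their importance to the meaning of the sentence. It then maps them onto abstract vectors, creating a kind of imaginary language. So the encoder converts our English sentence into this imaginary language, and in a full translation model a decoder would then convert the imaginary language into Korean.

This process is well suited to translation, but it also improves the ability of any BERT-based NLP model to resolve linguistic ambiguities correctly, such as:

Pronoun references
Synonyms and homonyms
Or words with multiple definitions, such as "run"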
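BERT is pre-trained
BERT is pre-trained, which means it starts with a lot of learning already done. One thing that sets BERT apart from earlier NLP frameworks is that it was pre-trained on plain text. Other NLP frameworks required a database of words painstakingly tagged for syntax by linguists in order to make sense of words.

Linguists had to tag every word in the database with its part of speech. It is a rigorous and demanding process that can spark long, heated debates among linguists. Parts of speech can be tricky, especially when a word's part of speech changes because of the other words in the sentence.

BERT does this itself, and it does it unsupervised, which Google describes as a first for a deeply bidirectional NLP framework. It was trained using Wikipedia. That is more than 2.5 billion words!

BERT may not always be accurate, but the more data it analyzes, the more accurate it becomes.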

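BERT is bidirectional
BERT encodes sentences bidirectionally. Simply put, BERT takes a target word in a sentence and looks at all the words surrounding it in either direction. When it was released, BERT's deeply bidirectional encoder was unique among NLP frameworks.

Earlier NLP frameworks such as OpenAI GPT encode sentences in only one direction, left to right in the case of OpenAI GPT. Models such as ELMo can train on both the left and the right side of a target word, but they concatenate the two encodings independently. That causes a disconnect in context between the two sides of the target word.

BERT, on the other hand, identifies the context of all the words on both sides of the target word, and it does so simultaneously. This means it can fully see and understand how the meaning of a word affects the context of the sentence as a whole.

How words relate to one another (meaning how often they occur together) is what linguists call collocation.

Collocates are words that frequently occur together. For example, "Christmas" and "presents" often appear within a few words of each other. Being able to identify collocates helps determine what a word means. In our earlier example image, "trunk" can have several meanings: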

the main woody stem of a tree
the torso of a person or animal
a large box for holding travel items
the prehensile nose of an elephant
the storage compartment of a vehicle.

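The only way to determine which meaning of the word is used in this sentence is to look at the surrounding collocates. "Subwoofer" often occurs with "car", and so does "trunk", so given the context, the "storage compartment of a vehicle" definition is probably the right answer. That is exactly what BERT does when it looks at a sentence.

It identifies the context of each word in a sentence by using the word collocations it learned during pre-training. If BERT read the sentence in one direction only, it might miss the shared collocate "car" between "subwoofer" and "trunk". The ability to look at the sentence bidirectionally and as a whole solves this problem.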

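BERT uses Transformers
BERT's bidirectional encoding works with Transformers, which makes sense. If you remember, the "T" in BERT stands for Transformers. Google credits BERT to its breakthroughs in Transformer research.

Google defines Transformers as "models that process words in relation to all the other words in a sentence, rather than one-by-one in order." Transformers use encoders and decoders to process the relationships between the words in a sentence. BERT takes each word of the sentence and gives it a representation of the word's meaning. How strongly the meanings of the words relate to each other is shown by the saturation of the lines.

In the image below, on the left, "it" is most strongly connected to "the" and "animal", which identifies what "it" refers to in this context. On the right, "it" is most strongly connected to "street". Pronoun references like this used to be one of the main problems language models struggled with, but BERT can resolve them.

If you are an NLP enthusiast who wants the details of what Transformers are and how they work, you can watch this video based on the seminal paper Attention Is All You Need.

They are a great video and an excellent paper (but honestly, they went straight over my head). For the rest of us muggles, the technical effect of the Transformers behind BERT translates into an update in which Google Search better understands the context behind searches, also known as user intent.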

BERT uses Masked Language Modeling (MLM)
BERT's training includes predicting words in a sentence using Masked Language Modeling. This masks 15% of the words in a sentence, like so:

What do [MASK] eat other than bamboo?

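BERT then has to predict what the masked word is. This does two things: it trains BERT on word context, and it provides a way to measure how much BERT is learning. The masked words keep BERT from learning to simply copy and paste the input.

Other training tasks, such as shifting the decoder right, next sentence prediction, or answering contextual and sometimes unanswerable questions, can serve the same purpose. The output BERT produces shows that it is learning and applying its knowledge of word context.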
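
To make the masking step concrete, here is a minimal sketch in plain Python. It assumes whitespace tokenization and a toy vocabulary built from the sentence itself, both of which are simplifications, and the helper name mask_tokens is hypothetical. The 80/10/10 split between [MASK], a random token, and the unchanged token follows the BERT paper rather than anything stated in this article.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted tokens, labels); labels are None where nothing is predicted."""
    rng = random.Random(seed)
    # Choose at least one position so that short sentences still get a target.
    n_targets = max(1, round(len(tokens) * mask_prob))
    targets = set(rng.sample(range(len(tokens)), n_targets))
    corrupted, labels = [], []
    for i, tok in enumerate(tokens):
        if i not in targets:
            corrupted.append(tok)
            labels.append(None)
            continue
        labels.append(tok)  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK_TOKEN)         # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            corrupted.append(tok)                # 10%: keep the token unchanged
    return corrupted, labels

tokens = "what do pandas eat other than bamboo ?".split()
vocab = sorted(set(tokens))
corrupted, labels = mask_tokens(tokens, vocab)
print(corrupted)
print(labels)
```

Only the positions with a non-None label contribute to the training loss, which is why the model cannot succeed by copying its input.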

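What does BERT affect?
What does this mean for search? Mapping queries bidirectionally with Transformers, as BERT does, is especially important.

It means the algorithm takes into account the subtle but meaningful nuances behind words such as prepositions, which can drastically change the intent behind a query. Take these two search results pages as an example. We will continue our earlier theme of pandas and bamboo.

The keywords are: What do pandas eat other than bamboo

Panda bamboo

Notice how similar the results pages are? Nearly half of the organic results are identical, and the People Also Ask (PAA) sections contain some very similar questions. The search intent, however, is very different.

"Panda bamboo" is very broad, so the intent is hard to pin down, but it is probably asking about a panda's bamboo diet. That results page is pretty good. The search intent of "what do pandas eat other than bamboo", on the other hand, is very specific, and the results on the search page miss it completely.

The only results that come close to the intent are perhaps two PAA questions:

What meat do giant pandas eat?
How do giant pandas survive eating only bamboo?
And arguably two Quora questions, one of which is quite funny:

Can pandas be trained to eat something other than bamboo?
Do pandas eat humans?
Slim pickings, indeed. In this search query, the words "other than" play a major role in the meaning of the search intent. Before the BERT update, Google's algorithms regularly ignored function and filler words such as "other than" when returning information.

That led to search pages that failed to match search intent, like this one. Since BERT affects only 10% of search queries, it is no surprise that the page on the left had not been affected by BERT at the time of writing. This example, which Google provides on its BERT explanation page, shows how BERT affects search results: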

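Featured snippets
One of the most significant effects BERT will have is on featured snippets. Featured snippets are organic and rely on machine learning algorithms, and BERT fits that bill exactly. Featured snippet results are most often pulled from the first page of search results, but there may now be some exceptions.

Because they are organic, many factors can change them, including new algorithm updates such as BERT. With BERT, the algorithms behind featured snippets can better analyze the intent behind a search query and better match results to it. BERT will also likely be able to take lengthy result text, find the core concepts, and summarize the content as a featured snippet.

International search
Because languages share similar underlying grammar rules, BERT can improve translation accuracy. Each time BERT learns to translate a new language, it gains new language skills. These skills transfer and help BERT translate, with higher accuracy, languages it has never seen before.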

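How do I optimize my site for BERT?
Now we come to the big question: how do you optimize for Google BERT? The short answer?

You can't. BERT is an AI framework. It learns from every new piece of information it gets.

The rate at which it processes information and makes decisions means that even BERT's developers cannot predict the choices it will make. Most likely, BERT does not even know why it makes the decisions it does. If it does not know, SEOs cannot optimize for it directly.

What you can do to rank on search pages, however, is keep producing human-friendly content that satisfies search intent. BERT's purpose is to help Google understand user intent, so optimizing for user intent is optimizing for BERT.

So, do what you have been doing all along.
Research your target keywords.
Focus on your users and produce the content they want to see.
Ultimately, when you write content, ask yourself:

Can my readers find what they are looking for in my content?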

Author: east

Deep Learning · March 27, 2022

What Is the BERT Model and What Does It Do?

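Google recently rolled out a major algorithm update called Google BERT to better understand searches and return results for more natural-language queries. The update will also feed natural language and search context into its AI technology. The billions of searches made every day will help strengthen Google's AI capabilities, improving search results, improving the understanding of voice search, and helping Google better understand consumer behavior.

Say hello to Google BERT!

BERT is Google's biggest search algorithm update since it introduced RankBrain in 2015. In fact, Google says this update represents "the biggest leap forward in the past five years, and one of the biggest leaps forward in the history of Search." BERT makes search more focused by understanding the user's intent in queries with a more conversational structure.

Let's get to know BERT better and see how it can help optimize your searches.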

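What is BERT?
BERT is an artificial intelligence (AI) system whose name stands for Bidirectional Encoder Representations from Transformers. This search advance is the result of Google's research on Transformers, models that process words in relation to all the other words in a sentence rather than one by one in order. In short, the update focuses on phrases rather than on individual words.

In terms of ranking results, BERT will affect one in ten search queries. The update is also being applied to help deliver better search to people around the world. What is learned from one language can be applied to many others. Google is using a BERT model to improve snippets in many countries, supporting more than 70 languages, including Korean, Hindi, and Portuguese.

BERT+
BERT is more than a search algorithm, however. It is also a machine learning framework for natural language processing, an evolving tool for computational efficiency, and an open-source research project and academic paper, first published in October 2018 under the title BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

How it works
The beauty of BERT is that it can work out your search and surface relevant information regardless of how words are spelled or what order they appear in within the query. BERT can train language models on the entire set of words in a sentence, rather than on a traditional ordered sequence of words such as left-to-right or a combination of left-to-right and right-to-left. Google can now resolve ambiguous phrases made up of many words with multiple meanings.

In addition, everyday language carries nuances that computers do not fully understand the way humans do. So when a search contains a phrase, BERT interprets it and returns results based on how the sentence is constructed and phrased. This matters because even the simplest phrase can mean something entirely different from its individual words. For example, in phrases such as "New York to LA" and "a quarter to nine", the word "to" has different meanings, which can confuse a search engine. BERT distinguishes these nuances to deliver more relevant searches.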

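RankBrain is still at work
RankBrain was Google's first AI method for understanding queries. It looks at both the search and the content of the web pages in Google's index to better understand what words mean. BERT does not replace RankBrain; it is an additional way to better understand content, natural language, and queries. RankBrain will still be used, but when Google thinks a query is better handled with BERT's help, search will use the new model. It seems the old saying holds: two search algorithms are better than one!

Smarter search results
As Google's latest algorithm update, BERT affects search by understanding natural language better, especially in conversational phrasing. BERT will affect about 10% of queries, along with organic rankings and featured snippets. So this is a big deal for Google, and for all of us. With so many questions being asked, getting relevant results that match our "normal" phrasing will certainly make our search experience easier. Happy searching!

Author: east

Deep Learning · March 27, 2022

A Deep Dive into the Code of the BERT Model: Breaking Down the Hugging Face BERT Implementation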

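There are already plenty of tutorials on how to build a simplified BERT model from scratch and how it works. In this article we do something slightly different: we break BERT down into all of its components by walking through the actual Hugging Face implementation.

Introduction
Over the past few years, Transformer models have revolutionized the field of NLP. BERT (Bidirectional Encoder Representations from Transformers) is one of the most successful Transformers. It outperformed previous SOTA models such as LSTMs both in quality, because the attention mechanism gives it a better understanding of context, and in training time, because unlike the recurrent structure of an LSTM, BERT can be parallelized.
Without further ado, let's dive into the code and see how it works. First we load the BERT model and print the BertModel architecture: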

import math

import torch
# the bertviz package lets us output attentions and hidden states
from bertviz.transformers_neuron_view import BertModel, BertConfig
from transformers import BertTokenizer

max_length = 256
config = BertConfig.from_pretrained("bert-base-cased", output_attentions=True, output_hidden_states=True, return_dict=True)
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
config.max_position_embeddings = max_length

model = BertModel(config)
model = model.eval()  # evaluation mode: dropout layers are disabled

display(model)  # display() is available in Jupyter/IPython notebooks
# output : 

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(256, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): BertLayerNorm()
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): BertLayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): BertLayerNorm()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): BertLayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): BertLayerNorm()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      
      ......

      (11): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): BertLayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): BertLayerNorm()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
  )
  (pooler): BertPooler(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (activation): Tanh()
  )
)

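We analyze three parts separately: the Embeddings, the Encoder with its 12 repeated BERT layers, and the Pooler. At the end we add a classification layer.
BERT Embeddings:
Starting from raw text, the first step is to split our sentences into tokens, which we can then pass to BertEmbeddings. We use BertTokenizer, which is based on WordPiece, a trainable subword tokenization algorithm that helps balance vocabulary size against out-of-vocabulary words. Unseen words are split into subwords that were derived during the tokenizer's training phase. Now let's import a few sentences from the 20newsgroups dataset and tokenize them: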

from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')
inputs_tests = tokenizer(newsgroups_train['data'][:3], truncation=True, padding=True, max_length=max_length, return_tensors='pt')

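Once a sentence is split into tokens, each token is assigned a representative numeric vector that represents it in an n-dimensional space. Each dimension holds some information about the word. If we suppose the features are Wealth, Gender, and Cuddly, then after training the embedding layer the model might represent the word king with the 3-dimensional vector (0.98, 1, 0.01) and cat with (0.02, 0.5, 1). We can then use these vectors to compute the similarity between words (using cosine distance) and do many other things.
Note: in reality we cannot work out what these features actually mean, but thinking of them this way helps build a clearer picture.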
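
As a small worked example of the similarity computation mentioned above, the sketch below computes the cosine similarity of the two toy vectors in plain Python. The vectors are the illustrative ones from the text, not real BERT embeddings, and cosine distance is taken here as one minus the cosine similarity.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors from the text, features ordered as (Wealth, Gender, Cuddly).
king = (0.98, 1.0, 0.01)
cat = (0.02, 0.5, 1.0)

similarity = cosine_similarity(king, cat)
print(f"similarity(king, cat) = {similarity:.3f}")
print(f"distance(king, cat)   = {1 - similarity:.3f}")
```

A value near 1 means the two vectors point in almost the same direction, while a value near 0 means they share little along the chosen features.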
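So word_embeddings in this case is a matrix of shape (30522, 768), where the first dimension is the vocabulary size and the second is the embedding dimension, that is, the number of features we use to represent a word. For base BERT it is 768, and it grows for larger models. In general, the higher the embedding dimension, the better we can represent certain words. That is true only up to a point: eventually, increasing the dimension no longer improves the model's accuracy much, while the computational cost keeps growing.

model.embeddings.word_embeddings.weight.shape
# output: torch.Size([30522, 768])

position_embeddings are needed because, unlike an LSTM, which processes tokens sequentially and so has the order of every token by construction, BERT processes tokens in parallel. To incorporate the position of each token, we have to add that information from the position_embeddings matrix. Its shape is (256, 768), where the former is the maximum sentence length and the latter is the feature dimension of the word embeddings, so depending on each token's position we retrieve the corresponding vector. In this case we can see that the matrix is learned, but other implementations build it from sines and cosines.

model.embeddings.position_embeddings.weight.shape
# output: torch.Size([256, 768])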
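
For comparison with the learned matrix, here is a minimal sketch of the fixed sine and cosine variant, following the formulation in Attention Is All You Need. It is written in plain Python for readability; the function name is hypothetical, and BERT itself does not use this table.

```python
import math

def sinusoidal_position_embeddings(max_len, dim):
    """Return a max_len x dim table of fixed sine/cosine position vectors."""
    table = []
    for pos in range(max_len):
        row = []
        for i in range(dim):
            # Each even/odd pair of dimensions shares one wavelength.
            angle = pos / (10000 ** ((2 * (i // 2)) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

# Same shape as the learned matrix above: 256 positions, 768 features.
pos_table = sinusoidal_position_embeddings(256, 768)
print(len(pos_table), len(pos_table[0]))
print(pos_table[0][:4])
```

Because the table is a fixed function of the position, it adds no trainable parameters, whereas the learned matrix is limited to the maximum length it was trained with.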

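token_type_embeddings are "redundant" here. They come from the BERT training task in which the semantic similarity between two sentences is evaluated, where this embedding is needed to tell the first sentence from the second. We do not need it, because our classification task has only one input sentence.
Once we have extracted the word embeddings, position embeddings, and type embeddings for every token in a sentence, we simply sum them to get the full sentence embedding. For the first sentence that is:

f1 = (
    torch.index_select(model.embeddings.word_embeddings.weight, 0, inputs_tests['input_ids'][0])  # word embeddings
    + torch.index_select(model.embeddings.position_embeddings.weight, 0,
                         torch.arange(inputs_tests['input_ids'][0].size(0)))  # position embeddings
    + torch.index_select(model.embeddings.token_type_embeddings.weight, 0, inputs_tests['token_type_ids'][0])  # token type embeddings
)

For our mini-batch of 3 sentences, we can get them as follows: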

n_batch = 3
shape_embs = (inputs_tests['input_ids'].shape) + (model.embeddings.word_embeddings.weight.shape[1], )
w_embs_batch = torch.index_select(model.embeddings.word_embeddings.weight, 0, inputs_tests['input_ids'].reshape(1,-1).squeeze(0)).reshape(shape_embs)
pos_embs_batch = torch.index_select(model.embeddings.position_embeddings.weight, 0, 
                                    torch.tensor(range(inputs_tests['input_ids'][1].size(0))).repeat(1, n_batch).squeeze(0)).reshape(shape_embs)
type_embs_batch = torch.index_select(model.embeddings.token_type_embeddings.weight, 0, 
                                     inputs_tests['token_type_ids'].reshape(1,-1).squeeze(0)).reshape(shape_embs)
batch_all_embs = w_embs_batch + pos_embs_batch + type_embs_batch
batch_all_embs.shape # (batch_size, n_words, embedding dim)

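Next comes a LayerNorm step, which helps the model train faster and generalize better. We standardize each token's embedding by the mean and standard deviation of that embedding, so that it has zero mean and unit variance. We then apply a trained weight and bias vector so that it can be shifted to a different mean and variance, which lets the model adapt automatically during training. Because the mean and standard deviation are computed for each example independently of the others, this differs from batch normalization, where the normalization runs across the batch dimension and therefore depends on the other examples in the batch.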

# single example normalization
ex1 = f1[0, :]
ex1_mean = ex1.mean()
ex1_var = (ex1 - ex1_mean).pow(2).mean()  # biased variance of the embedding
norm_example = (ex1 - ex1_mean) / torch.sqrt(ex1_var + 1e-12)
norm_example_centered = model.embeddings.LayerNorm.weight * norm_example + model.embeddings.LayerNorm.bias


def layer_norm(x, w, b):
    # normalize over the last (feature) dimension, then scale by w and shift by b
    mean_x = x.mean(-1, keepdim=True)
    var_x = (x - mean_x).pow(2).mean(-1, keepdim=True)
    x_std = (x - mean_x) / torch.sqrt(var_x + 1e-12)
    shifted_x = w * x_std + b
    return shifted_x

# layer normalization applied to the whole batch
norm_embs = layer_norm(batch_all_embs, model.embeddings.LayerNorm.weight, model.embeddings.LayerNorm.bias)

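Finally we apply Dropout, which replaces some values with zeros with a given dropout probability. Dropout helps reduce overfitting: because we randomly block the signal from some neurons, the network has to find other paths to reduce the loss, so it learns to generalize better instead of relying on particular paths. We can also view dropout as a model ensembling technique, since at every training step we randomly deactivate some neurons and end up with "different" networks, which are finally ensembled at evaluation time.
Note: because we put the model in evaluation mode, all the dropout layers are ignored; they are used only during training. We still include the step for completeness.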

norm_embs_dropout = model.embeddings.dropout(norm_embs)
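
The difference between training and evaluation mode can be sketched in plain Python. This is only an illustration of inverted dropout on a list of floats, with a hypothetical dropout helper and a fixed seed as assumptions; surviving values are rescaled by 1/(1-p) during training, which is also what PyTorch's Dropout layer does.

```python
import random

def dropout(values, p=0.1, training=True, seed=0):
    """Inverted dropout over a list of floats."""
    if not training or p == 0.0:
        return list(values)  # evaluation mode: pass the values through unchanged
    rng = random.Random(seed)
    keep_scale = 1.0 / (1.0 - p)
    # Zero each value with probability p and rescale the survivors,
    # so the expected value of every element stays the same.
    return [0.0 if rng.random() < p else v * keep_scale for v in values]

activations = [1.0] * 10
train_out = dropout(activations, p=0.5, training=True)
eval_out = dropout(activations, p=0.5, training=False)
print(train_out)
print(eval_out)
```

In evaluation mode the input comes back unchanged, which is why the dropout call above has no effect on norm_embs once model.eval() has been called.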

We can check that we get the same result as the model:

embs_model = model.embeddings(inputs_tests['input_ids'], inputs_tests['token_type_ids'])
torch.allclose(embs_model, norm_embs, atol=1e-06) # True

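Encoder
The encoder is where most of the magic happens. There are 12 BertLayers, and the output of each one is fed into the next. This is where attention is used to create different, context-dependent representations of the original embeddings. Within a BertLayer, we first try to understand BertAttention. After deriving the embedding of each word, BERT uses 3 matrices, Key, Query, and Value, to compute attention scores and derive new values for the word embeddings based on the other words in the sentence. This makes BERT context-aware: instead of being fixed and context-independent, the embedding of each word is derived from the other words in the sentence, and the importance of each other word when deriving the new embedding is given by the attention scores. To derive the query and key vectors of each word, we multiply its embedding by trained matrices (separate ones for queries and keys). For example, to derive the query vector for the first word of the first sentence: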

att_head_size = int(model.config.hidden_size/model.config.num_attention_heads)
n_att_heads = model.config.num_attention_heads
norm_embs[0][0, :] @ model.encoder.layer[0].attention.self.query.weight.T[:, :att_head_size] + \
                      model.encoder.layer[0].attention.self.query.bias[:att_head_size]

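Notice that out of the whole query and key matrices we select only the first 64 (=att_head_size) columns (the reason is explained shortly). This is the new embedding dimension of the word after the transformation, and it is smaller than the original embedding dimension of 768. This is done to reduce the computational burden, although longer embeddings might give better performance. In practice it is a trade-off between lower complexity and better performance.
Now we can derive the Query and Key matrices for the whole sentence: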

Q_first_head = norm_embs[0] @ model.encoder.layer[0].attention.self.query.weight.T[:, :att_head_size] + \
               model.encoder.layer[0].attention.self.query.bias[:att_head_size] 
K_first_head = norm_embs[0] @ model.encoder.layer[0].attention.self.key.weight.T[:, :att_head_size] + \
               model.encoder.layer[0].attention.self.key.bias[:att_head_size]

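To compute the attention scores, we multiply the Query matrix by the transposed Key matrix and scale the result by the square root of the new embedding dimension (=64=att_head_size). We also add a modified attention mask. The initial attention mask (inputs_tests['attention_mask'][0]) is a tensor of 1s and 0s, where 1 means the position holds a token and 0 means it is a padding token.
If we subtract the attention mask from 1 and multiply the result by a large negative number, then applying SoftMax effectively sends those negative values to zero and derives the probabilities from the remaining values. Consider the following example:
For a sentence with 3 tokens + 2 paddings, we get the following modified attention mask: [0, 0, 0, -10000, -10000]
Let's apply the SoftMax function:

torch.nn.functional.softmax(torch.tensor([0, 0, 0, -10000, -10000]).float(), dim=-1)
# tensor([0.3333, 0.3333, 0.3333, 0.0000, 0.0000])
mod_attention = (1.0 - inputs_tests['attention_mask'][[0]]) * -10000.0
attention_scores = torch.nn.Softmax(dim=-1)((Q_first_head @ K_first_head.T) / math.sqrt(att_head_size) + mod_attention)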

Let's check that the attention scores we computed match the ones we get from the model. We can obtain the model's attention scores with the following code:

# because we set output_attentions=True, output_hidden_states=True, return_dict=True, we get
# last_hidden_state, pooler_output, the hidden_states of each layer and the attentions of each layer
out_view = model(**inputs_tests)

out_view contains:
last_hidden_state (batch_size, sequence_length, hidden_size): the hidden states output by the last BertLayer
pooler_output (batch_size, hidden_size): the output of the Pooler layer
hidden_states (batch_size, sequence_length, hidden_size): the hidden states output by the model at each BertLayer, plus the initial embeddings
attentions (batch_size, num_heads, sequence_length, sequence_length): one per BertLayer, the attention weights after the attention SoftMax

torch.allclose(attention_scores, out_view[-1][0]['attn'][0, 0, :, :], atol=1e-06) # True
print(attention_scores[0, :])
# tensor([1.0590e-04, 2.1429e-03, .... , 4.8982e-05], grad_fn=<SliceBackward>)

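The first row of the attention score matrix says that, to create the new embedding for the first token, we attend to the first token (itself) with weight 1.0590e-04, to the second token with weight 2.1429e-03, and so on. In other words, multiplying these scores by the vector embeddings of the other tokens yields a new representation of the first token. Rather than using the embeddings directly, however, we use the Value matrix computed below.
The Value matrix is derived in the same way as the Query and Key matrices: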

V_first_head = norm_embs[0] @ model.encoder.layer[0].attention.self.value.weight.T[:, :att_head_size] + \
              model.encoder.layer[0].attention.self.value.bias[:att_head_size]

We then multiply these values by the attention scores to obtain the new context-aware word representations:

new_embed_1 = (attention_scores @ V_first_head)
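
The whole single-head computation (scaled dot product, additive padding mask, SoftMax, weighted sum of values) can also be reproduced on a tiny example without the model. The sketch below uses plain Python lists and made-up 2-dimensional vectors, so the numbers are purely illustrative; the helper names are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, mask):
    """Single-head attention; mask holds 1 for real tokens and 0 for padding."""
    d = len(Q[0])
    out, all_weights = [], []
    for q in Q:
        # Scaled dot product of this query with every key, plus the additive mask.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) + (1.0 - m) * -10000.0
            for k, m in zip(K, mask)
        ]
        weights = softmax(scores)
        all_weights.append(weights)
        # New representation: attention-weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out, all_weights

# Three tokens with 2-dimensional toy vectors; the last token is padding.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
mask = [1, 1, 0]

new_embeds, weights = attention(Q, K, V, mask)
print(weights[0])
print(new_embeds[0])
```

Every row of weights sums to 1 and the padded position receives essentially zero weight, so each new vector is a mixture of the value vectors of the real tokens only.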

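Now you may wonder why we select the first 64 (=att_head_size) elements from the tensors. What we computed above is one head of BERT's attention layer, but there are actually 12 of them. Each attention head creates a different representation of the words (the new_embed_1 matrix). For example, given the sentence "I like to eat pizza in the Italian restaurants", in the first head the word "pizza" might attend mostly to the previous word, to itself, and to the following word, with attention close to zero for the remaining words. In the next head it might attend to all the verbs (like and eat), capturing a different relationship from the first head.
Now, instead of deriving each head separately, we can derive them all together in matrix form: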

Q = norm_embs @ model.encoder.layer[0].attention.self.query.weight.T + model.encoder.layer[0].attention.self.query.bias
K = norm_embs @ model.encoder.layer[0].attention.self.key.weight.T + model.encoder.layer[0].attention.self.key.bias
V = norm_embs @ model.encoder.layer[0].attention.self.value.weight.T + model.encoder.layer[0].attention.self.value.bias
new_x_shape = Q.size()[:-1] + (n_att_heads, att_head_size)
new_x_shape # torch.Size([3, 55, 12, 64])
Q_reshaped = Q.view(*new_x_shape)
K_reshaped = K.view(*new_x_shape)
V_reshaped = V.view(*new_x_shape)
# additive mask broadcast over heads and query positions: 0 for tokens, -10000 for padding
extended_attention_mask = (1.0 - inputs_tests['attention_mask'].unsqueeze(1).unsqueeze(2)) * -10000.0
att_scores = (Q_reshaped.permute(0, 2, 1, 3) @ K_reshaped.permute(0, 2, 1, 3).transpose(-1, -2))
att_scores = (att_scores / math.sqrt(att_head_size)) + extended_attention_mask
attention_probs = torch.nn.Softmax(dim=-1)(att_scores)

The attention for the first example and the first head is the same as the one we derived earlier:

example = 0
head = 0
torch.allclose(attention_scores, attention_probs[example][head]) # True

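We now concatenate the results of the 12 heads and pass them through the stack of linear layers, normalization layers, and dropouts that we already saw in the embeddings section, to get the encoder output of the first layer.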

att_heads = []
for i in range(12):
  att_heads.append(attention_probs[0][i] @ V_reshaped[0, : , i, :])
output_dense = torch.cat(att_heads, 1) @ model.encoder.layer[0].attention.output.dense.weight.T + \
               model.encoder.layer[0].attention.output.dense.bias
output_layernorm = layer_norm(output_dense + norm_embs[0], 
                              model.encoder.layer[0].attention.output.LayerNorm.weight, 
                              model.encoder.layer[0].attention.output.LayerNorm.bias)
interm_dense = torch.nn.functional.gelu(output_layernorm @ model.encoder.layer[0].intermediate.dense.weight.T + \
                                        model.encoder.layer[0].intermediate.dense.bias)
out_dense = interm_dense @ model.encoder.layer[0].output.dense.weight.T + model.encoder.layer[0].output.dense.bias
out_layernorm  = layer_norm(out_dense + output_layernorm, 
                            model.encoder.layer[0].output.LayerNorm.weight, 
                            model.encoder.layer[0].output.LayerNorm.bias)

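For output_dense we simply pass the concatenated attention results through a linear layer. We then need to normalize, but notice that instead of normalizing output_dense right away, we first add it to our initial embeddings. This is called a residual connection. As we increase the depth of a neural network, that is, stack more and more layers, we run into the vanishing/exploding gradient problem. With vanishing gradients the model can no longer learn, because the propagated gradients are close to zero and the early layers stop changing their weights and improving. Exploding gradients are the opposite problem: the weights cannot stabilize because extreme updates eventually make them blow up (tend to infinity). Proper weight initialization and normalization help with this, but it has been observed that even when the network becomes more stable, performance degrades because optimization gets harder. Adding residual connections improves performance, and the network stays easier to optimize even as we keep increasing depth. A residual connection is also used in out_layernorm, which is effectively the output of the first BertLayer. The last thing to note is that when we compute interm_dense, after passing the output of the attention layer through a linear layer, the non-linear GeLU activation function is applied. GeLU is given by:

GELU(x) = x · Φ(x) = 0.5 · x · (1 + erf(x / √2)), where Φ is the cumulative distribution function of the standard normal distribution.

ReLU, given by the formula max(input, 0), is monotonic, convex, and linear in the positive domain. GeLU, by contrast, is non-monotonic, non-convex, and non-linear even in the positive domain, so it can approximate complicated functions more easily.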
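
To see the difference in numbers, the following sketch evaluates both activations at a few points using only the standard library. It uses the exact erf-based form of GELU from the GELU paper listed in the references; the sample points are arbitrary.

```python
import math

def relu(x):
    return max(x, 0.0)

def gelu(x):
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):.4f}")
```

For negative inputs GELU dips slightly below zero and then returns toward it, which is where the non-monotonic behavior shows up, while for large positive inputs it approaches the identity like ReLU.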

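We have now successfully replicated a whole BertLayer. The output of this layer (which has the same shape as the initial embeddings) goes into the next BertLayer, and so on. There are 12 BertLayers in total. Putting it all together, we can obtain the final encoder result for all 3 examples: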

n_batch = 3
tot_n_layers = 12
tot_n_heads = 12
shape_embs = (inputs_tests['input_ids'].shape) + (model.embeddings.word_embeddings.weight.shape[1], )
w_embs_batch = torch.index_select(model.embeddings.word_embeddings.weight,
                                  0, inputs_tests['input_ids'].reshape(1,-1).squeeze(0)).reshape(shape_embs)
pos_embs_batch = torch.index_select(model.embeddings.position_embeddings.weight, 0,
                                    torch.tensor(range(inputs_tests['input_ids'][1].size(0))).repeat(1, n_batch).squeeze(0)).reshape(shape_embs)
type_embs_batch = torch.index_select(model.embeddings.token_type_embeddings.weight, 0,
                                     inputs_tests['token_type_ids'].reshape(1,-1).squeeze(0)).reshape(shape_embs)
batch_all_embs = w_embs_batch + pos_embs_batch + type_embs_batch
normalized_embs = layer_norm(batch_all_embs, model.embeddings.LayerNorm.weight, model.embeddings.LayerNorm.bias)
extended_attention_mask = inputs_tests['attention_mask'].unsqueeze(1).unsqueeze(2)
extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0

# the first layer reads the normalized embeddings; every later layer reads the previous layer's output
layer_input = normalized_embs
for layer_n in range(tot_n_layers):
    layer = model.encoder.layer[layer_n]
    # compute Q, K and V matrices
    Q = layer_input @ layer.attention.self.query.weight.T + layer.attention.self.query.bias
    K = layer_input @ layer.attention.self.key.weight.T + layer.attention.self.key.bias
    V = layer_input @ layer.attention.self.value.weight.T + layer.attention.self.value.bias
    # reshape to (batch_size, n_words, n_heads, head_size)
    new_x_shape = Q.size()[:-1] + (n_att_heads, att_head_size)
    Q_reshaped = Q.view(*new_x_shape)
    K_reshaped = K.view(*new_x_shape)
    V_reshaped = V.view(*new_x_shape)
    # compute attention probabilities
    att_scores = (Q_reshaped.permute(0, 2, 1, 3) @ K_reshaped.permute(0, 2, 1, 3).transpose(-1, -2))
    att_scores = (att_scores / math.sqrt(att_head_size)) + extended_attention_mask
    attention_probs = torch.nn.Softmax(dim=-1)(att_scores)
    # concatenate attention heads
    att_heads = []
    for i in range(tot_n_heads):
        att_heads.append(attention_probs[:, i] @ V_reshaped[:, :, i, :])
    output_dense = torch.cat(att_heads, 2) @ layer.attention.output.dense.weight.T + layer.attention.output.dense.bias
    # normalization + residual connection
    output_layernorm = layer_norm(output_dense + layer_input,
                                  layer.attention.output.LayerNorm.weight,
                                  layer.attention.output.LayerNorm.bias)
    # linear layer + non linear gelu activation
    interm_dense = torch.nn.functional.gelu(output_layernorm @ layer.intermediate.dense.weight.T + layer.intermediate.dense.bias)
    # linear layer
    out_dense = interm_dense @ layer.output.dense.weight.T + layer.output.dense.bias
    # normalization + residual connection
    out_layernorm = layer_norm(out_dense + output_layernorm,
                               layer.output.LayerNorm.weight,
                               layer.output.LayerNorm.bias)
    # the output of this layer becomes the input of the next one
    layer_input = out_layernorm

Note how out_layernorm, the output of each layer, is fed into the next layer.
We can see that this matches the result in out_view:


torch.allclose(out_view[-2][-1], out_layernorm, atol=1e-05) # True

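Pooler
Now we can take the output for the first token of the last BertLayer, i.e. [CLS], pass it through a linear layer, and apply a Tanh activation function to get the pooled output. The reason for using the first token for classification comes from how the model was trained, as the authors of BERT state:
The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks.


out_pooler = torch.tanh(out_layernorm[:, 0] @ model.pooler.dense.weight.T + model.pooler.dense.bias)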

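Classifier
Finally, we create a simple class consisting of a single linear layer, though you could add dropout and other things to it. We assume a binary classification problem here (output_dim=2), but it can have any dimension.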

from torch import nn
class Classifier(nn.Module):
    
    def __init__(self, output_dim=2):
        super(Classifier, self).__init__()
        self.classifier = nn.Linear(model.config.hidden_size, output_dim, bias=True)
    
    def forward(self, x):
        return self.classifier(x)
classif = Classifier()
classif(out_pooler)
tensor([[-0.2918, -0.5782],
        [ 0.2494, -0.1955],
        [ 0.1814,  0.3971]], grad_fn=<AddmmBackward>)

References:

https://arxiv.org/pdf/1606.08415v3.pdf
https://arxiv.org/pdf/1810.04805.pdf
https://jalammar.github.io/illustrated-transformer/
https://github.com/huggingface/transformers/

Author: east
