fastText的源码，看这一篇就够了！

原创王文彬 AI老炮观察站

2018年12月03日 17:51

面对一个新的模型，我们可能会想这个模型谁提出的，可以做什么。然后考虑是否有需要，开始用之前，会问这个模型支持哪些参数，数据格式是什么，至此，可以试试效果。试完效果发现咦还不错，后面可能会横向、纵向比较。使用一段时候开始研究源码，除可以弄清楚原理外，还可以学习很多编程技巧，知其所以然。因此，本解析将会围绕这些展开讨论。

1 fasteText能干什么

1.1 fastText是什么

fastText是Facebook AI Reserch在16年开源的一个词向量及文本分类工具，性能比肩深度学习而且速度更快，能够训练模型“在使用标准多核CPU的情况下10分钟内处理超过10亿个词汇”，特别是与深度模型对比，fastText能将训练时间由数天缩短到几秒钟。并且，它在模型架构上跟word2vec非常相似，两者作者都是Tomas Mikolov。

1.2 fastText支持哪些参数

现在已经知道fastText能够干什么，而且词向量已经有成名的word2vec，所以让我们聚焦在fastText的另外一个功能分类。在使用模型前，比较重要的就是模型提供了什么参数，具体参考fastText源码[1]。

fastText 提供的几个功能

参数	说明
supervised	监督学习，训练分类器
test	评价一个分类器
predict	预测标签
predict-prob	预测标签，并输出预测概率
skipgram	训练skipgram模型
cbow	训练cbow模型
print-vectors	输入一个训练好的模型，打印向量

fastText 分类提供的参数

参数	说明	默认值
-input	training file path
-output	output file path
-verbose	verbosity level	2
-minCount	minimal number of word occurrences	[1]
-minCountLabel	minimal number of label occurrences	[0]
-wordNgrams	word ngram的最大长度	[1]
-bucket	提供word 和char ngram最大值	[2000000]
-minn	char ngram的最小长度	[0]
-maxn	char ngram 的最大长度	[0]
-t	sampling threshold	[0.0001]
-label	labels prefix	[__label__]
-lr	learning rate	[0.1]
-lrUpdateRate	change the rate of updates for the learning rate	[100]
-dim	size of word vectors	[100]
-ws	size of the context window	[5]
-epoch	number of epochs	[5]
-neg	number of negatives sampled	[5]
-loss	loss function {ns, hs, softmax}	[softmax]
-thread	number of threads	[12]
-pretrainedVectors	对于分类问题可以提供与训练的词向量	[]
-saveOutput	whether output params should be saved	[0]

1.3 fastText 输入输出

输入:一行为一条训练数据,每个词可以用空格等常用分隔符分割,对于分类问题来说,在类别前面加"__label__"表示其为该行训练数据的标签,当然可以自定义了,可以使用-label自定义,训练文件支持 UTF-8 格式,字符会按照UTF-8编码分割。

输出:输出支持topK,带概率.

1.4 fastText 为什么如此快

多线程，不加锁，同时更新参数，会带来一点点噪音，但是影响不大。
模型抽象，最后一层都抽象为LR，预先计算sigmoid的值。见model模块分析。
模型本身足够简单。

2 fastText与word2vec的异同

如果有对word2vec原理不清楚的可以看这篇博客word2vec[2]，详细介绍了word2vec的数学原理。

2.1 相同点

了解word2vec的同学，可以发现fasttext的模型结构和思想基本一致，倒是省去了理解的步骤。

2.2 不同点

增加分类
增加word和char ngram

3 源码解析

fasttext整体的结构图如下：$$x_i$$为根据dictionary找到每个词对应的id，根据id取对应的向量。然后，求和平均得到中间隐变量h。最后，经过Negative Sampling、Softmax或者Hierarchical Softmax进行分类。

3.1 源码结构

源码提供了以下几个模块，核心model模块提供各种算法：

3.2 头文件解析

类	说明
args.h	提供主要参数解析
dictionary.h	主要模块，提供计数、映射、ngram
fasttext.h	集合各个类
matrix.h	词向量矩阵
model.h	核心模块，各种算法
real.h	数据类型
utils.h	文件相关的工具类
vector.h	向量抽象类

3.3 主要类fasttext.h

fasttext 是最顶层的模块，它的主要功能是训练和预测，首先是训练功能的调用路径，第一个函数是 train，它的主要作用是初始化参数，启动多线程训练。

//  训练
void FastText::train(const Args args) {
  args_ = std::make_shared<Args>(args);
  dict_ = std::make_shared<Dictionary>(args_);
  if (args_->input == "-") {
    // manage expectations
    throw std::invalid_argument("Cannot use stdin for training!");
  }
  std::ifstream ifs(args_->input);
  if (!ifs.is_open()) {
    throw std::invalid_argument(
        args_->input + " cannot be opened for training!");
  }
  // 字典的构建
  dict_->readFromFile(ifs);
  ifs.close();
  // 加载预训练的词向量
  if (args_->pretrainedVectors.size() != 0) {
    loadVectors(args_->pretrainedVectors);
  } else {
    // 否则随机生成词典，词典的大小等于词的个数加上ngram个个数或者说是参数bucket的大小
    input_ = std::make_shared<Matrix>(dict_->nwords()+args_->bucket, args_->dim);
    input_->uniform(1.0 / args_->dim);
  }
  // 区分是训练词向量还是分了，本质上一样，只是一个目标是词一个是类别
  // 因此输出矩阵大大小不一样
  if (args_->model == model_name::sup) {
    output_ = std::make_shared<Matrix>(dict_->nlabels(), args_->dim);
  } else {
    output_ = std::make_shared<Matrix>(dict_->nwords(), args_->dim);
  }
  output_->zero();
  startThreads();
  model_ = std::make_shared<Model>(input_, output_, args_, 0);
  if (args_->model == model_name::sup) {
    model_->setTargetCounts(dict_->getCounts(entry_type::label));
  } else {
    model_->setTargetCounts(dict_->getCounts(entry_type::word));
  }
}

具体多线程训练部分，按照线程数把文件分成多个部分，每个部分独立更新参数，线程之间并没有加锁。因此，会带来一点噪音，但是对最后的结果没有多大影响。word2vec 实现也没有加锁。

void FastText::trainThread(int32_t threadId) {
  std::ifstream ifs(args_->input);
  // 根据线程id选择要训练的数据，按照字节数分割，不是按照行分割
  utils::seek(ifs, threadId * utils::size(ifs) / args_->thread);
  // 每个线程都会新建一个model，但是model参数是共享的
  Model model(input_, output_, args_, threadId);
  // 区分是词向量还是分类
  if (args_->model == model_name::sup) {
    model.setTargetCounts(dict_->getCounts(entry_type::label));
  } else {
    model.setTargetCounts(dict_->getCounts(entry_type::word));
  }
  const int64_t ntokens = dict_->ntokens();
  int64_t localTokenCount = 0;
  std::vector<int32_t> line, labels;
  while (tokenCount_ < args_->epoch * ntokens) {
    real progress = real(tokenCount_) / (args_->epoch * ntokens);
    real lr = args_->lr * (1.0 - progress);
    // 不同的模型选择不同的训练方式
    if (args_->model == model_name::sup) {
      localTokenCount += dict_->getLine(ifs, line, labels);
      supervised(model, lr, line, labels);
    } else if (args_->model == model_name::cbow) {
      localTokenCount += dict_->getLine(ifs, line, model.rng);
      cbow(model, lr, line);
    } else if (args_->model == model_name::sg) {
      localTokenCount += dict_->getLine(ifs, line, model.rng);
      skipgram(model, lr, line);
    }
    // 迭代过程中，根据迭代次数，降低学习率
    if (localTokenCount > args_->lrUpdateRate) {
      tokenCount_ += localTokenCount;
      localTokenCount = 0;
      if (threadId == 0 && args_->verbose > 1)
        loss_ = model.getLoss();
    }
  }
  if (threadId == 0)
    loss_ = model.getLoss();
  ifs.close();
}

你没有看错，fasttext提供的三个功能，它们的代码很相似，最后的更新后抽象成model.update(line, labels[i], lr)。supervised的输入就是词对应的id、标签、学习率等。model的update见3.5 model分析。

void FastText::supervised(
    Model& model,
    real lr,
    const std::vector<int32_t>& line,
    const std::vector<int32_t>& labels) {
  if (labels.size() == 0 || line.size() == 0) return;
  // 支持一条样本多个标签，随机选择一个标签
  std::uniform_int_distribution<> uniform(0, labels.size() - 1);
  int32_t i = uniform(model.rng);
  // 更新参数
  model.update(line, labels[i], lr);
}
void FastText::cbow(Model& model, real lr,
                    const std::vector<int32_t>& line) {
  std::vector<int32_t> bow;
  std::uniform_int_distribution<> uniform(1, args_->ws);
  for (int32_t w = 0; w < line.size(); w++) {
    int32_t boundary = uniform(model.rng);
    bow.clear();
    // 选择上下文，预测中间词
    for (int32_t c = -boundary; c <= boundary; c++) {
      if (c != 0 && w + c >= 0 && w + c < line.size()) {
        const std::vector<int32_t>& ngrams =
                              dict_->getSubwords(line[w + c]);
        bow.insert(bow.end(), ngrams.cbegin(), ngrams.cend());
      }
    }
    model.update(bow, line[w], lr);
  }
}
void FastText::skipgram(Model& model, real lr,
                        const std::vector<int32_t>& line) {
  std::uniform_int_distribution<> uniform(1, args_->ws);
  for (int32_t w = 0; w < line.size(); w++) {
    int32_t boundary = uniform(model.rng);
    const std::vector<int32_t>& ngrams = dict_->getSubwords(line[w]);
    // 根据中间词，预测上下文
    for (int32_t c = -boundary; c <= boundary; c++) {
      if (c != 0 && w + c >= 0 && w + c < line.size()) {
            model.update(ngrams, line[w + c], lr);
      }
    }
  }
}

当然fasttext.h还提供了预测的代码，相对训练来说就很简单了，主要就是调用了model的predict函数，会在model中介绍。

3.4 主要类 dictionary.h

dictionary类中词的保存格式如下，还保存了类型和char gram。

struct entry {
  std::string word;  // 词
  int64_t count;  // 出现的次数
  entry_type type; // 类型，是词还是类别标签
  std::vector<int32_t> subwords;  //char gram
};

整个dictionary类的主要函数功能说明

int32_t Dictionary::find(const std::string& w, uint32_t h) const {
  int32_t word2intsize = word2int_.size();
  int32_t id = h % word2intsize;
  // 根据word 和 hash值 查找其id
  while (word2int_[id] != -1 && words_[word2int_[id]].word != w) {
    id = (id + 1) % word2intsize;
  }
  return id;
}
void Dictionary::add(const std::string& w) {
  int32_t h = find(w);
  ntokens_++;
  // 将词添加到words_中
  if (word2int_[h] == -1) {
    entry e;
    e.word = w;
    e.count = 1;
    e.type = getType(w);
    words_.push_back(e);
    word2int_[h] = size_++;
  } else {
    words_[word2int_[h]].count++;
  }
}
void Dictionary::getSubwords(const std::string& word,
                           std::vector<int32_t>& ngrams,
                           std::vector<std::string>& substrings) const {
  int32_t i = getId(word);
  ngrams.clear();
  substrings.clear();
  if (i >= 0) {
    ngrams.push_back(i);
    substrings.push_back(words_[i].word);
  }
  if (word != EOS) {
    computeSubwords(BOW + word + EOW, ngrams, &substrings);
  }
}
uint32_t Dictionary::hash(const std::string& str) const {
  uint32_t h = 2166136261;
  for (size_t i = 0; i < str.size(); i++) {
    h = h ^ uint32_t(int8_t(str[i]));
    h = h * 16777619;
  }
  return h;
}
void Dictionary::computeSubwords(const std::string& word,
                               std::vector<int32_t>& ngrams,
                               std::vector<std::string>* substrings) const {
  for (size_t i = 0; i < word.size(); i++) {
    std::string ngram;
    if ((word[i] & 0xC0) == 0x80) continue;
    for (size_t j = i, n = 1; j < word.size() && n <= args_->maxn; n++) {
      ngram.push_back(word[j++]);
      while (j < word.size() && (word[j] & 0xC0) == 0x80) {
        ngram.push_back(word[j++]);
      }
      if (n >= args_->minn && !(n == 1 && (i == 0 || j == word.size()))) {
        int32_t h = hash(ngram) % args_->bucket;
        pushHash(ngrams, h);
        if (substrings) {
          substrings->push_back(ngram);
        }
      }
    }
  }
}
void Dictionary::threshold(int64_t t, int64_t tl) {
  sort(words_.begin(), words_.end(), [](const entry& e1, const entry& e2) {
      if (e1.type != e2.type) return e1.type < e2.type;
      return e1.count > e2.count;
    });
  words_.erase(remove_if(words_.begin(), words_.end(), [&](const entry& e) {
        return (e.type == entry_type::word && e.count < t) ||
               (e.type == entry_type::label && e.count < tl);
      }), words_.end());
  words_.shrink_to_fit();
  size_ = 0;
  nwords_ = 0;
  nlabels_ = 0;
  std::fill(word2int_.begin(), word2int_.end(), -1);
  for (auto it = words_.begin(); it != words_.end(); ++it) {
    int32_t h = find(it->word);
    word2int_[h] = size_++;
    if (it->type == entry_type::word) nwords_++;
    if (it->type == entry_type::label) nlabels_++;
  }
}
void Dictionary::addWordNgrams(std::vector<int32_t>& line,
                               const std::vector<int32_t>& hashes,
                               int32_t n) const {
  for (int32_t i = 0; i < hashes.size(); i++) {
    uint64_t h = hashes[i];
    for (int32_t j = i + 1; j < hashes.size() && j < i + n; j++) {
      h = h * 116049371 + hashes[j];
      pushHash(line, h % args_->bucket);
    }
  }
}
void Dictionary::addSubwords(std::vector<int32_t>& line,
                             const std::string& token,
                             int32_t wid) const {
  if (wid < 0) { // out of vocab
    if (token != EOS) {
      computeSubwords(BOW + token + EOW, line);
    }
  } else {
    if (args_->maxn <= 0) { // in vocab w/o subwords
      line.push_back(wid);
    } else { // in vocab w/ subwords
      const std::vector<int32_t>& ngrams = getSubwords(wid);
      line.insert(line.end(), ngrams.cbegin(), ngrams.cend());
    }
  }
}

3.5 核心类 model.h

盗图一张，很明白清楚的指出了model类做了些什么事情，参数和源码参数保持一致。

所有核心算法都集中到了model类，里面提供了各种实现，包括LR，Huffman、BP等，还是很有意思的。同时这个模块把negativeSampling、hierarchicalSoftmax都抽象成LR。

首先，让我们看下binaryLogistic这个函数，损失函数为常见的交叉熵损失函数。如果是正类，则为f(x)=-log(score)，如果为负类，则为：f(x)=-log(1-score) 现在看下其导数是什么，为了保持和代码一致，都使用源码的符号，推导如下：我们令x=wo->dotRow(hidden, target)，我们都知道sigmoid函数的偏导为socre*(1-socre)，对于正类而言：

我们的目标是使得损失函数最小，所以是梯度下降，再结合学习率，因此hidden_的梯度更新如下：

real Model::binaryLogistic(int32_t target, bool label, real lr) {
  real score = sigmoid(wo_->dotRow(hidden_, target));
  real alpha = lr * (real(label) - score);
  // 参考上文公式推导
  grad_.addRow(*wo_, target, alpha);
  wo_->addRow(hidden_, target, alpha);
  if (label) {
    return -log(score);
  } else {
    return -log(1.0 - score);
  }
}

具体negativeSampling、hierarchicalSoftmax、Softmax的实现

real Model::negativeSampling(int32_t target, real lr) {
  real loss = 0.0;
  grad_.zero();
  for (int32_t n = 0; n <= args_->neg; n++) {
    // 负采样，一个正样本，n个负样本
    if (n == 0) {
      loss += binaryLogistic(target, true, lr);
    } else {
      loss += binaryLogistic(getNegative(target), false, lr);
    }
  }
  return loss;
}
real Model::hierarchicalSoftmax(int32_t target, real lr) {
  real loss = 0.0;
  grad_.zero();
  const std::vector<bool>& binaryCode = codes[target];
  const std::vector<int32_t>& pathToRoot = paths[target];
  // 根据目标点，找到其路径，和对应的二分类的标签
  for (int32_t i = 0; i < pathToRoot.size(); i++) {
    loss += binaryLogistic(pathToRoot[i], binaryCode[i], lr);
  }
  return loss;
}
real Model::softmax(int32_t target, real lr) {
  grad_.zero();
  // 计算和所有的label之间的值，如果为目标值，则对应为1，否则为0
  computeOutputSoftmax();
  for (int32_t i = 0; i < osz_; i++) {
    real label = (i == target) ? 1.0 : 0.0;
    real alpha = lr * (label - output_[i]);
    grad_.addRow(*wo_, i, alpha);
    wo_->addRow(hidden_, i, alpha);
  }
  return -log(output_[target]);
}

看一下隐藏层的计算，需要注意的是：求和平均

void Model::computeHidden(const std::vector<int32_t>& input, Vector& hidden)
    const {
  assert(hidden.size() == hsz_);
  hidden.zero();
  for (auto it = input.cbegin(); it != input.cend(); ++it) {
    if (quant_) {
      hidden.addRow(*qwi_, *it);
    } else {
      hidden.addRow(*wi_, *it);
    }
  }
  // 求和平均
  hidden.mul(1.0 / input.size());
}

预测的代码，只有当用于分类时，其才有用，基本操作，用堆保留最大的几个值。如果为最后一层为hierarchicalSoftmax，这需要遍历整个Huffman树，找其topK，对应源码dfs函数，实现了深度优先搜索。如果为Softmax，直接计算即可，找其topK，对应源码findKBest。

void Model::predict(
    const std::vector<int32_t>& input,
    int32_t k,
    real threshold,
    std::vector<std::pair<real, int32_t>>& heap,
    Vector& hidden,
    Vector& output) const {
  // 判断k值是否符合要求
  if (k == Model::kUnlimitedPredictions) {
    k = osz_;
  } else if (k <= 0) {
    throw std::invalid_argument("k needs to be 1 or higher!");
  }
  // 判断当前模型是否是分类模型
  if (args_->model != model_name::sup) {
    throw std::invalid_argument("Model needs to be supervised for prediction!");
  }
  heap.reserve(k + 1);
  computeHidden(input, hidden);
  // 根据最后一层不同，分别选择不同的遍历方式，找topk
  if (args_->loss == loss_name::hs) {
    dfs(k, threshold, 2 * osz_ - 2, 0.0, heap, hidden);
  } else {
    findKBest(k, threshold, heap, hidden, output);
  }
  std::sort_heap(heap.begin(), heap.end(), comparePairs);
}
void Model::findKBest(
    int32_t k,
    real threshold,
    std::vector<std::pair<real, int32_t>>& heap,
    Vector& hidden,
    Vector& output) const {
  computeOutputSoftmax(hidden, output);
  for (int32_t i = 0; i < osz_; i++) {
    if (output[i] < threshold) {
      continue;
    }
    if (heap.size() == k && std_log(output[i]) < heap.front().first) {
      continue;
    }
    // 把元素放到末尾
    heap.push_back(std::make_pair(std_log(output[i]), i));
    // 调整最小堆
    std::push_heap(heap.begin(), heap.end(), comparePairs);
    if (heap.size() > k) {
      // 堆顶元素pop出堆中
      std::pop_heap(heap.begin(), heap.end(), comparePairs);
      // 删除最后一个元素
      heap.pop_back();
    }
  }
}
void Model::dfs(
    int32_t k,
    real threshold,
    int32_t node,
    real score,
    std::vector<std::pair<real, int32_t>>& heap,
    Vector& hidden) const {
  // 由于取的是对数，所以随着深度的加深，socre是越来越小，
  // 如果中间已经小于阈值，其后续节点都不需要计算
  if (score < std_log(threshold)) {
    return;
  }
  // 同理
  if (heap.size() == k && score < heap.front().first) {
    return;
  }
  // 如果是叶子节点，计算该label的概率，加入到堆中
  if (tree[node].left == -1 && tree[node].right == -1) {
    heap.push_back(std::make_pair(score, node));
    std::push_heap(heap.begin(), heap.end(), comparePairs);
    if (heap.size() > k) {
      std::pop_heap(heap.begin(), heap.end(), comparePairs);
      heap.pop_back();
    }
    return;
  }
  real f;
  if (quant_ && args_->qout) {
    f = qwo_->dotRow(hidden, node - osz_);
  } else {
    f = wo_->dotRow(hidden, node - osz_);
  }
  f = 1. / (1 + std::exp(-f));
  // 根据当前的概率，分别遍历左节点和右节点，属于先根遍历
  dfs(k, threshold, tree[node].left, score + std_log(1.0 - f), heap, hidden);
  dfs(k, threshold, tree[node].right, score + std_log(f), heap, hidden);
}

参数的更新，需要注意一个地方，见注释

void Model::update(const std::vector<int32_t>& input, int32_t target, real lr) {
  assert(target >= 0);
  assert(target < osz_);
  if (input.size() == 0) {
    return;
  }
  computeHidden(input, hidden_);
  if (args_->loss == loss_name::ns) {
    loss_ += negativeSampling(target, lr);
  } else if (args_->loss == loss_name::hs) {
    loss_ += hierarchicalSoftmax(target, lr);
  } else {
    loss_ += softmax(target, lr);
  }
  nexamples_ += 1;
  // 由于是加权平均，所有最后的梯度也需要平均一下
  if (args_->model == model_name::sup) {
    grad_.mul(1.0 / input.size());
  }
}

最后一个，Huffman树的构建，比较有意思，原理参考下面图示，我们知道Huffman构建的过程，选择两个最小的没有使用过的数，构建一个非叶子节点。如果数据是倒序排序会出现什么情况呢。

没错，正如你所见，只要从中间往两边遍历，因为两边的数据都是有序的，选择最小的两个数据，构建中间节点，源码就是这种思路，一看就懂，有木有。

void Model::buildTree(const std::vector<int64_t>& counts) {
  tree.resize(2 * osz_ - 1);
  for (int32_t i = 0; i < 2 * osz_ - 1; i++) {
    tree[i].parent = -1;
    tree[i].left = -1;
    tree[i].right = -1;
    tree[i].count = 1e15;
    tree[i].binary = false;
  }
  for (int32_t i = 0; i < osz_; i++) {
    tree[i].count = counts[i];
  }
  int32_t leaf = osz_ - 1;
  int32_t node = osz_;
  // 中间节点往两边选择最小的点
  for (int32_t i = osz_; i < 2 * osz_ - 1; i++) {
    int32_t mini[2];
    for (int32_t j = 0; j < 2; j++) {
      if (leaf >= 0 && tree[leaf].count < tree[node].count) {
        mini[j] = leaf--;
      } else {
        mini[j] = node++;
      }
    }
    tree[i].left = mini[0];
    tree[i].right = mini[1];
    tree[i].count = tree[mini[0]].count + tree[mini[1]].count;
    tree[mini[0]].parent = i;
    tree[mini[1]].parent = i;
    tree[mini[1]].binary = true;
  }
  for (int32_t i = 0; i < osz_; i++) {
    std::vector<int32_t> path;
    std::vector<bool> code;
    int32_t j = i;
    while (tree[j].parent != -1) {
      path.push_back(tree[j].parent - osz_);
      code.push_back(tree[j].binary);
      j = tree[j].parent;
    }
    paths.push_back(path);
    codes.push_back(code);
  }
}

4 源码修改

4.1 带权重fastText

结构图：

反向传播：

class Attention {
  protected:
    std::vector<real> weight_softmax_;  // 权重softmax值
    std::vector<real> weight_grad_;     // 权重梯度更新
    int64_t size_;
    const int32_t dim_;
  public:
    Attention(int32_t dim): dim_(dim) {}
    explicit Attention(int64_t n, int32_t dim):
                weight_softmax_(n),
                weight_grad_(n),
                size_(n),
                dim_(dim) {}
    // Attention(const Softmax&) = default;
    // Attention& operator=(const Attention&) = delete;
    void zero() {
        std::fill(weight_softmax_.begin(), weight_softmax_.end(), 0.0);
        std::fill(weight_grad_.begin(), weight_grad_.end(), 0.0);
    }
    std::vector<real>& getWeightSoftmax() {
    return weight_softmax_;
    }
    std::vector<real>& getWeightGrad() {
        return weight_grad_;
    }
    void resize(int64_t n) {
        size_ = n;
        weight_softmax_.resize(n);
        weight_grad_.resize(n);
    }
    /*
    * 权重softmax
    */
    void calWeightSoftmax(const Matrix& A, const std::vector<int32_t>& input) {
        if (input.size() > weight_softmax_.size()) {
            weight_softmax_.resize(input.size());
        }
        real sum = 0;
        real max = 0;
        if (input.size() > 0)
            max = A.getWeight(input[0]);
        for ( int64_t i = 0; i < input.size(); ++i ) {
            if ( max < A.getWeight(input[i]))
                max = A.getWeight(input[i]);
        }
        for ( int64_t i = 0; i < input.size(); ++i ) {
             weight_softmax_[i] = std::exp(A.getWeight(input[i]) - max);
             sum += weight_softmax_[i];
        }
        for ( int64_t i = 0; i < input.size(); i++ ) {
             weight_softmax_[i] /= sum;
        }
    }
    /*
    * 计算权重的更新
    */
    void calWeightGrad(const Matrix& A, const std::vector<int32_t>& input,
                        const Vector& grad, const Vector& hidden) {
        if (input.size() > weight_grad_.size()) {
            weight_grad_.resize(input.size());
        }
        // std::fill(weight_grad_.begin(), weight_grad_.end(), 0.0);
        for (int32_t j = 0; j < input.size(); ++j) {
            real sum = 0;
            for (int i = 0; i < dim_; ++i) {
                sum += grad[i]*(A.at(input[j], i) - hidden[i]);
            }
            weight_grad_[j] = weight_softmax_[j]*sum;
        }
        /*
        for (int32_t i = 0; i < input.size(); i++) {
            weight_grad_[i] /= A.size(1);
        }
        */
    }
    void computeHidden(const Matrix& A, const std::vector<int32_t>& input,
                            Vector& hidden) {
        // 计算 权重softmax
        calWeightSoftmax(A, input);
        auto it = input.cbegin();
        auto it_s = weight_softmax_.cbegin();
        for (; it != input.cend() && it_s != weight_softmax_.cend();
                ++it, ++it_s) {
                hidden.addRow(A, *it, *it_s);
        }
    // hidden.mul(1.0 / input.size());
    }
};

4.2 self-attetnion fastText

class Attention {
  protected:
    // 权重参数
    std::vector<real> weight_softmax_;  // 权重softmax值
    // std::vector<real> weight_softmax_grad_mat_;  //  hi 对 wj 的梯度
    std::vector<real> weight_grad_;     // 权重梯度更新
    int64_t size_;
    const int32_t dim_;
    // x self-attention 参数
    std::vector<real> x_softmax_; // x softmax 权重
    std::vector<real> y_;  // yi=sum(wij*xj) x的self-attention
    std::vector<real> hi_xjk_grad_;  //  hi 对 xj,k 的梯度  其中j是固定的
    std::vector<real> xj_grad_;  // xj的梯度
    std::vector<real> max_x_softmax_;
  public:
    Attention(int32_t dim): dim_(dim) {}
    explicit Attention(int64_t n, int32_t dim):
                weight_softmax_(n),
                weight_grad_(n),
                size_(n),
                dim_(dim),
                x_softmax_(n*n),
                y_(n*dim),
                hi_xjk_grad_(dim*dim),
                xj_grad_(dim){}
    // Attention(const Softmax&) = default;
    // Attention& operator=(const Attention&) = delete;
    void zero() {
        std::fill(weight_softmax_.begin(), weight_softmax_.end(), 0.0);
        std::fill(weight_grad_.begin(), weight_grad_.end(), 0.0);
    }
    std::vector<real>& getWeightSoftmax() {
    return weight_softmax_;
    }
    std::vector<real>& getWeightGrad() {
        return weight_grad_;
    }
    void resize(int64_t n) {
        if (n > size_) {
            size_ = n;
            weight_softmax_.resize(n);
            weight_grad_.resize(n);
            x_softmax_.resize(n*n);
            y_.resize(n*dim_);
        }
    }
    /*
    * 权重softmax 归一化
    */
    void calWeightSoftmax(const Matrix& A,
                          const std::vector<int32_t>& input) {
        if (input.size() > weight_softmax_.size()) {
            weight_softmax_.resize(input.size());
        }
        real sum = 0;
        real max = 0;
        if (input.size() > 0)
            max = A.getWeight(input[0]);
        for ( int64_t i = 0; i < input.size(); ++i ) {
            if ( max < A.getWeight(input[i]))
                max = A.getWeight(input[i]);
        }
        for ( int64_t i = 0; i < input.size(); ++i ) {
             weight_softmax_[i] = std::exp(A.getWeight(input[i]) - max);
             sum += weight_softmax_[i];
        }
        for ( int64_t i = 0; i < input.size(); i++ ) {
             weight_softmax_[i] /= sum;
        }
    }
    /*
    * 计算权重的更新
    */
    void calWeightGrad(const Matrix& A,
                       const std::vector<int32_t>& input,
                       const Vector& hidden,
                       const Vector& grad) {
        if (input.size() > weight_grad_.size()) {
            weight_grad_.resize(input.size());
        }
        // std::fill(weight_grad_.begin(), weight_grad_.end(), 0.0);
        for (int32_t j = 0; j < input.size(); ++j) {
            real tmp = 0;
            for (int i = 0; i < dim_; ++i) {
                tmp += grad[i]*(y_[j*dim_ + i]-hidden[i]);
            }
            weight_grad_[j] = weight_softmax_[j]*tmp;
        }
    }
    void computeHidden(const Matrix& A, const std::vector<int32_t>& input,
                            Vector& hidden) {
        // 计算 权重softmax
        calWeightSoftmax(A, input);
        // 计算y值
        calY(A, input);
        hidden.zero();
        for(int32_t i = 0; i < dim_; ++i) {
            for( int32_t j=0; j<input.size(); ++j) {
                hidden[i] += weight_softmax_[j]*y_[j*dim_+i];
            }
        }
    // hidden.mul(1.0 / input.size());
    }
    // 计算self-attention 的权重  x_softmax_[i,j] = e^(xi*xj)/sum(e^xi*xk)
    void calXSoftmax(const Matrix& A, const std::vector<int32_t>& input) {
        if (input.size()*input.size() > x_softmax_.size())
            x_softmax_.resize(input.size()*input.size());
        std::fill(x_softmax_.begin(), x_softmax_.end(), 0.0);
        if ( input.size() > max_x_softmax_.size())
            max_x_softmax_.resize(input.size());
        for(int32_t i = 0; i < input.size(); ++i) {
            for(int32_t j = 0; j < input.size(); ++j) {
                for(int32_t d = 0; d < dim_; ++d){
                    x_softmax_[i*input.size() + j] += A.at(input[i], d)*A.at(input[j], d);
                }
                if ( j == 0)
                    max_x_softmax_[i] = x_softmax_[i*input.size() + j];
                else if (max_x_softmax_[i] < x_softmax_[i*input.size() + j])
                    max_x_softmax_[i] = x_softmax_[i*input.size() + j];
            }
        }
        for(int32_t i = 0; i < input.size(); ++i) {
            real sum = 0;
            for(int32_t j = 0; j < input.size(); ++j) {
                x_softmax_[i*input.size() + j]=
                        std::exp(x_softmax_[i*input.size() + j]
                                  -max_x_softmax_[i]);
                sum += x_softmax_[i*input.size() + j];
            }
            for(int32_t j = 0; j < input.size(); ++j) {
                x_softmax_[i*input.size() + j] /= sum;
            }
        }
    }
    // 计算Y值， y[i] = sum(w[i,j]*xj)
    void calY(const Matrix& A, const std::vector<int32_t>& input) {
        if ( input.size()*dim_ > y_.size())
            y_.resize(input.size()*dim_);
        calXSoftmax(A, input);
        std::fill(y_.begin(), y_.end(), 0.0);
        for(int32_t i = 0; i < input.size(); ++i){
            for(int32_t d = 0; d < dim_; ++d){
                for(int32_t j = 0; j < input.size(); ++j){
                    y_[i*dim_ + d] += x_softmax_[i*input.size() + j]*A.at(input[j], d);
                }
            }
        }
    }
    // 反向传播  hi_xjk_grad_ hi 对 xjk 的梯度 其中j是固定的
    void caHiXjkGrad(const Matrix& A,
                     const std::vector<int32_t>& input, int32_t j) {
        if (dim_*dim_ > hi_xjk_grad_.size())
            hi_xjk_grad_.resize(dim_*dim_);
        std::fill(hi_xjk_grad_.begin(), hi_xjk_grad_.end(), 0.0);
        for (int32_t i = 0; i < dim_; ++i) {
            for (int32_t k = 0; k < dim_; ++k) {
                for (int32_t m = 0; m < input.size(); ++m) {
                    if ( i == k) {
                        hi_xjk_grad_[i*dim_ + k] +=
                      weight_softmax_[m]*x_softmax_[m*input.size() + j];
                    }
                    real sum = 0;
                    sum += A.at(input[m], k)*x_softmax_[m*input.size()
                          + j]*(A.at(input[j], i) - y_[m*dim_ + i]);
                    if (m == j) {
                        sum -= y_[m*dim_ + k]*y_[m*dim_ + i];
                        for (int32_t l = 0; l < input.size(); ++l) {
                            sum += A.at(input[l], k)*x_softmax_[m*input.size()
                                                + l]*A.at(input[l], i);
                        }
                    }
                    hi_xjk_grad_[i*dim_ + k] += weight_softmax_[m]*sum;
                }
            }
        }
    }
    // 反向传播 计算 损失函数对xj的梯度
    void calXGrad(const Matrix& A, const std::vector<int32_t>& input,
                        const Vector& grad, Vector& xgrad, int32_t id) {
        caHiXjkGrad(A, input, id);
        xgrad.zero();
        for (int32_t k = 0; k < dim_; ++k) {
            for (int32_t i = 0; i < dim_; ++i) {
                xgrad[k] += grad[i]*hi_xjk_grad_[i*dim_ + k];
            }
        }
    }
};

5 总结

fastText是一个简单高效的文本分类工具，近两年应用广泛，用简单模型的训练时间得到比肩深度学习的效果。面对这么优秀的开源工具，本文为你详细剖析了其C++源码实现，并在源码基础上做了两个新的尝试：

带权重fasttext
self-attention fastText

这两个尝试前者并不会显著增加训练时间，但是后者会。当然这只是我们一时技痒，加了attention的fastText，就像是给AK47绑上了重机枪的架子，失去了其本来意义。当然，我们希望本文的源码分析能给同行们一些有用的启示，欢迎大家继续互相交流讨论NLP技术。

若有意加入我们团队一起搞事情，把你亮闪闪的简历砸向chenkaijiang001@ke.com 。

继续滑动看下一个