fastText的源码，看这一篇就够了！

欢亿 · 发表于 2024-10-9 14:38:48

fastText的源码，看这一篇就够了！ fastText的源码，看这一篇就够了！王文彬贝壳智搜贝壳智搜我们是来自贝壳找房的工程师，我们热爱技术，热爱算法，我们用技术帮助居住服务者对消费者好，我们持续推进虚拟现实、人工智能、大数据等技术在新居住行业落地。 69篇内容 2018年12月04日 01:51 面对一个新的模型，我们可能会想这个模型谁提出的，可以做什么。然后考虑是否有需要，开始用之前，会问这个模型支持哪些参数，数据格式是什么，至此，可以试试效果。试完效果发现咦还不错，后面可能会横向、纵向比较。使用一段时候开始研究源码，除可以弄清楚原理外，还可以学习很多编程技巧，知其所以然。因此，本解析将会围绕这些展开讨论。1 fasteText能干什么1.1 fastText是什么fastText是Facebook AI Reserch在16年开源的一个词向量及文本分类工具，性能比肩深度学习而且速度更快，能够训练模型“在使用标准多核CPU的情况下10分钟内处理超过10亿个词汇”，特别是与深度模型对比，fastText能将训练时间由数天缩短到几秒钟。并且，它在模型架构上跟word2vec非常相似，两者作者都是Tomas Mikolov。1.2 fastText支持哪些参数现在已经知道fastText能够干什么，而且词向量已经有成名的word2vec，所以让我们聚焦在fastText的另外一个功能分类。在使用模型前，比较重要的就是模型提供了什么参数，具体参考fastText源码[1]。fastText 提供的几个功能参数说明supervised监督学习，训练分类器test评价一个分类器predict预测标签predict-prob预测标签，并输出预测概率skipgram训练skipgram模型cbow训练cbow模型print-vectors输入一个训练好的模型，打印向量fastText 分类提供的参数参数说明默认值-inputtraining file path-outputoutput file path-verboseverbosity level2-minCountminimal number of word occurrences[1]-minCountLabelminimal number of label occurrences[0]-wordNgramsword ngram的最大长度[1]-bucket提供word 和char ngram最大值[2000000]-minnchar ngram的最小长度[0]-maxnchar ngram 的最大长度[0]-tsampling threshold[0.0001]-labellabels prefix[__label__]-lrlearning rate[0.1]-lrUpdateRatechange the rate of updates for the learning rate[100]-dimsize of word vectors[100]-wssize of the context window[5]-epochnumber of epochs[5]-negnumber of negatives sampled[5]-lossloss function {ns, hs, softmax}[softmax]-threadnumber of threads[12]-pretrainedVectors对于分类问题可以提供与训练的词向量[]-saveOutputwhether output params should be saved[0]1.3 fastText 输入输出输入:一行为一条训练数据,每个词可以用空格等常用分隔符分割,对于分类问题来说,在类别前面加"__label__"表示其为该行训练数据的标签,当然可以自定义了,可以使用-label自定义,训练文件支持 UTF-8 格式,字符会按照UTF-8编码分割。输出:输出支持topK,带概率.1.4 fastText 为什么如此快多线程，不加锁，同时更新参数，会带来一点点噪音，但是影响不大。模型抽象，最后一层都抽象为LR，预先计算sigmoid的值。见model模块分析。模型本身足够简单。2 fastText与word2vec的异同如果有对word2vec原理不清楚的可以看这篇博客word2vec[2]，详细介绍了word2vec的数学原理。2.1 相同点了解word2vec的同学，可以发现fasttext的模型结构和思想基本一致，倒是省去了理解的步骤。2.2 不同点增加分类增加word和char ngram3 源码解析fasttext整体的结构图如下：$x_i$为根据dictionary找到每个词对应的id，根据id取对应的向量。然后，求和平均得到中间隐变量h。最后，经过Negative Sampling、Softmax或者Hierarchical Softmax进行分类。3.1 源码结构源码提供了以下几个模块，核心model模块提供各种算法：3.2 头文件解析类说明args.h提供主要参数解析dictionary.h主要模块，提供计数、映射、ngramfasttext.h集合各个类matrix.h词向量矩阵model.h核心模块，各种算法real.h数据类型utils.h文件相关的工具类vector.h向量抽象类3.3 主要类fasttext.hfasttext 是最顶层的模块，它的主要功能是训练和预测，首先是训练功能的调用路径，第一个函数是 train，它的主要作用是初始化参数，启动多线程训练。// 训练void FastText::train(const Args args) { args_ = std::make_shared(args); dict_ = std::make_shared(args_); if (args_->input == "-") { // manage expectations throw std::invalid_argument("Cannot use stdin for training!"); } std::ifstream ifs(args_->input); if (!ifs.is_open()) { throw std::invalid_argument( args_->input + " cannot be opened for training!"); } // 字典的构建 dict_->readFromFile(ifs); ifs.close(); // 加载预训练的词向量 if (args_->pretrainedVectors.size() != 0) { loadVectors(args_->pretrainedVectors); } else { // 否则随机生成词典，词典的大小等于词的个数加上ngram个个数或者说是参数bucket的大小 input_ = std::make_shared(dict_->nwords()+args_->bucket, args_->dim); input_->uniform(1.0 / args_->dim); } // 区分是训练词向量还是分了，本质上一样，只是一个目标是词一个是类别 // 因此输出矩阵大大小不一样 if (args_->model == model_name::sup) { output_ = std::make_shared(dict_->nlabels(), args_->dim); } else { output_ = std::make_shared(dict_->nwords(), args_->dim); } output_->zero(); startThreads(); model_ = std::make_shared(input_, output_, args_, 0); if (args_->model == model_name::sup) { model_->setTargetCounts(dict_->getCounts(entry_type::label)); } else { model_->setTargetCounts(dict_->getCounts(entry_type::word)); }}具体多线程训练部分，按照线程数把文件分成多个部分，每个部分独立更新参数，线程之间并没有加锁。因此，会带来一点噪音，但是对最后的结果没有多大影响。word2vec 实现也没有加锁。void FastText::trainThread(int32_t threadId) { std::ifstream ifs(args_->input); // 根据线程id选择要训练的数据，按照字节数分割，不是按照行分割 utils::seek(ifs, threadId * utils::size(ifs) / args_->thread); // 每个线程都会新建一个model，但是model参数是共享的 Model model(input_, output_, args_, threadId); // 区分是词向量还是分类 if (args_->model == model_name::sup) { model.setTargetCounts(dict_->getCounts(entry_type::label)); } else { model.setTargetCounts(dict_->getCounts(entry_type::word)); } const int64_t ntokens = dict_->ntokens(); int64_t localTokenCount = 0; std::vector line, labels; while (tokenCount_ epoch * ntokens) { real progress = real(tokenCount_) / (args_->epoch * ntokens); real lr = args_->lr * (1.0 - progress); // 不同的模型选择不同的训练方式 if (args_->model == model_name::sup) { localTokenCount += dict_->getLine(ifs, line, labels); supervised(model, lr, line, labels); } else if (args_->model == model_name::cbow) { localTokenCount += dict_->getLine(ifs, line, model.rng); cbow(model, lr, line); } else if (args_->model == model_name::sg) { localTokenCount += dict_->getLine(ifs, line, model.rng); skipgram(model, lr, line); } // 迭代过程中，根据迭代次数，降低学习率 if (localTokenCount > args_->lrUpdateRate) { tokenCount_ += localTokenCount; localTokenCount = 0; if (threadId == 0 & args_->verbose > 1) loss_ = model.getLoss(); } } if (threadId == 0) loss_ = model.getLoss(); ifs.close();}你没有看错，fasttext提供的三个功能，它们的代码很相似，最后的更新后抽象成model.update(line, labels[i], lr)。supervised的输入就是词对应的id、标签、学习率等。model的update见3.5 model分析。void FastText::supervised( Model& model, real lr, const std::vector& line, const std::vector& labels) { if (labels.size() == 0 || line.size() == 0) return; // 支持一条样本多个标签，随机选择一个标签 std::uniform_int_distribution uniform(0, labels.size() - 1); int32_t i = uniform(model.rng); // 更新参数 model.update(line, labels[i], lr);}void FastText::cbow(Model& model, real lr, const std::vector& line) { std::vector bow; std::uniform_int_distribution uniform(1, args_->ws); for (int32_t w = 0; w = 0 & w + c & ngrams = dict_->getSubwords(line[w + c]); bow.insert(bow.end(), ngrams.cbegin(), ngrams.cend()); } } model.update(bow, line[w], lr); }}void FastText::skipgram(Model& model, real lr, const std::vector& line) { std::uniform_int_distribution uniform(1, args_->ws); for (int32_t w = 0; w & ngrams = dict_->getSubwords(line[w]); // 根据中间词，预测上下文 for (int32_t c = -boundary; c = 0 & w + c subwords; //char gram};整个dictionary类的主要函数功能说明int32_t Dictionary::find(const std::string& w, uint32_t h) const { int32_t word2intsize = word2int_.size(); int32_t id = h % word2intsize; // 根据word 和 hash值查找其id while (word2int_[id] != -1 & words_[word2int_[id]].word != w) { id = (id + 1) % word2intsize; } return id;}void Dictionary::add(const std::string& w) { int32_t h = find(w); ntokens_++; // 将词添加到words_中 if (word2int_[h] == -1) { entry e; e.word = w; e.count = 1; e.type = getType(w); words_.push_back(e); word2int_[h] = size_++; } else { words_[word2int_[h]].count++; }}void Dictionary::getSubwords(const std::string& word, std::vector& ngrams, std::vector& substrings) const { int32_t i = getId(word); ngrams.clear(); substrings.clear(); if (i >= 0) { ngrams.push_back(i); substrings.push_back(words_[i].word); } if (word != EOS) { computeSubwords(BOW + word + EOW, ngrams, &substrings); }}uint32_t Dictionary::hash(const std::string& str) const { uint32_t h = 2166136261; for (size_t i = 0; i & ngrams, std::vector* substrings) const { for (size_t i = 0; i maxn; n++) { ngram.push_back(word[j++]); while (j = args_->minn & !(n == 1 & (i == 0 || j == word.size()))) { int32_t h = hash(ngram) % args_->bucket; pushHash(ngrams, h); if (substrings) { substrings->push_back(ngram); } } } }}void Dictionary::threshold(int64_t t, int64_t tl) { sort(words_.begin(), words_.end(), [](const entry& e1, const entry& e2) { if (e1.type != e2.type) return e1.type e2.count; }); words_.erase(remove_if(words_.begin(), words_.end(), [&](const entry& e) { return (e.type == entry_type::word & e.count word); word2int_[h] = size_++; if (it->type == entry_type::word) nwords_++; if (it->type == entry_type::label) nlabels_++; }}void Dictionary::addWordNgrams(std::vector& line, const std::vector& hashes, int32_t n) const { for (int32_t i = 0; i bucket); } }}void Dictionary::addSubwords(std::vector& line, const std::string& token, int32_t wid) const { if (wid maxn & ngrams = getSubwords(wid); line.insert(line.end(), ngrams.cbegin(), ngrams.cend()); } }}3.5 核心类 model.h盗图一张，很明白清楚的指出了model类做了些什么事情，参数和源码参数保持一致。所有核心算法都集中到了model类，里面提供了各种实现，包括LR，Huffman、BP等，还是很有意思的。同时这个模块把negativeSampling、hierarchicalSoftmax都抽象成LR。首先，让我们看下binaryLogistic这个函数，损失函数为常见的交叉熵损失函数。如果是正类，则为f(x)=-log(score)，如果为负类，则为：f(x)=-log(1-score) 现在看下其导数是什么，为了保持和代码一致，都使用源码的符号，推导如下：我们令x=wo->dotRow(hidden, target)，我们都知道sigmoid函数的偏导为socre*(1-socre)，对于正类而言：我们的目标是使得损失函数最小，所以是梯度下降，再结合学习率，因此hidden_的梯度更新如下：real Model::binaryLogistic(int32_t target, bool label, real lr) { real score = sigmoid(wo_->dotRow(hidden_, target)); real alpha = lr * (real(label) - score); // 参考上文公式推导 grad_.addRow(*wo_, target, alpha); wo_->addRow(hidden_, target, alpha); if (label) { return -log(score); } else { return -log(1.0 - score); }}具体negativeSampling、hierarchicalSoftmax、Softmax的实现real Model::negativeSampling(int32_t target, real lr) { real loss = 0.0; grad_.zero(); for (int32_t n = 0; n neg; n++) { // 负采样，一个正样本，n个负样本 if (n == 0) { loss += binaryLogistic(target, true, lr); } else { loss += binaryLogistic(getNegative(target), false, lr); } } return loss;}real Model::hierarchicalSoftmax(int32_t target, real lr) { real loss = 0.0; grad_.zero(); const std::vector& binaryCode = codes[target]; const std::vector& pathToRoot = paths[target]; // 根据目标点，找到其路径，和对应的二分类的标签 for (int32_t i = 0; i addRow(hidden_, i, alpha); } return -log(output_[target]);}看一下隐藏层的计算，需要注意的是：求和平均void Model::computeHidden(const std::vector& input, Vector& hidden) const { assert(hidden.size() == hsz_); hidden.zero(); for (auto it = input.cbegin(); it != input.cend(); ++it) { if (quant_) { hidden.addRow(*qwi_, *it); } else { hidden.addRow(*wi_, *it); } } // 求和平均 hidden.mul(1.0 / input.size());}预测的代码，只有当用于分类时，其才有用，基本操作，用堆保留最大的几个值。如果为最后一层为hierarchicalSoftmax，这需要遍历整个Huffman树，找其topK，对应源码dfs函数，实现了深度优先搜索。如果为Softmax，直接计算即可，找其topK，对应源码findKBest。void Model::predict( const std::vector& input, int32_t k, real threshold, std::vector>& heap, Vector& hidden, Vector& output) const { // 判断k值是否符合要求 if (k == Model::kUnlimitedPredictions) { k = osz_; } else if (k model != model_name::sup) { throw std::invalid_argument("Model needs to be supervised for prediction!"); } heap.reserve(k + 1); computeHidden(input, hidden); // 根据最后一层不同，分别选择不同的遍历方式，找topk if (args_->loss == loss_name::hs) { dfs(k, threshold, 2 * osz_ - 2, 0.0, heap, hidden); } else { findKBest(k, threshold, heap, hidden, output); } std::sort_heap(heap.begin(), heap.end(), comparePairs);}void Model::findKBest( int32_t k, real threshold, std::vector>& heap, Vector& hidden, Vector& output) const { computeOutputSoftmax(hidden, output); for (int32_t i = 0; i k) { // 堆顶元素pop出堆中 std::pop_heap(heap.begin(), heap.end(), comparePairs); // 删除最后一个元素 heap.pop_back(); } }}void Model::dfs( int32_t k, real threshold, int32_t node, real score, std::vector>& heap, Vector& hidden) const { // 由于取的是对数，所以随着深度的加深，socre是越来越小， // 如果中间已经小于阈值，其后续节点都不需要计算 if (score k) { std::pop_heap(heap.begin(), heap.end(), comparePairs); heap.pop_back(); } return; } real f; if (quant_ & args_->qout) { f = qwo_->dotRow(hidden, node - osz_); } else { f = wo_->dotRow(hidden, node - osz_); } f = 1. / (1 + std::exp(-f)); // 根据当前的概率，分别遍历左节点和右节点，属于先根遍历 dfs(k, threshold, tree[node].left, score + std_log(1.0 - f), heap, hidden); dfs(k, threshold, tree[node].right, score + std_log(f), heap, hidden);}参数的更新，需要注意一个地方，见注释void Model::update(const std::vector& input, int32_t target, real lr) { assert(target >= 0); assert(target loss == loss_name::ns) { loss_ += negativeSampling(target, lr); } else if (args_->loss == loss_name::hs) { loss_ += hierarchicalSoftmax(target, lr); } else { loss_ += softmax(target, lr); } nexamples_ += 1; // 由于是加权平均，所有最后的梯度也需要平均一下 if (args_->model == model_name::sup) { grad_.mul(1.0 / input.size()); }}最后一个，Huffman树的构建，比较有意思，原理参考下面图示，我们知道Huffman构建的过程，选择两个最小的没有使用过的数，构建一个非叶子节点。如果数据是倒序排序会出现什么情况呢。没错，正如你所见，只要从中间往两边遍历，因为两边的数据都是有序的，选择最小的两个数据，构建中间节点，源码就是这种思路，一看就懂，有木有。void Model::buildTree(const std::vector& counts) { tree.resize(2 * osz_ - 1); for (int32_t i = 0; i = 0 & tree[leaf].count path; std::vector code; int32_t j = i; while (tree[j].parent != -1) { path.push_back(tree[j].parent - osz_); code.push_back(tree[j].binary); j = tree[j].parent; } paths.push_back(path); codes.push_back(code); }}4 源码修改4.1 带权重fastText结构图：反向传播：class Attention { protected: std::vector weight_softmax_; // 权重softmax值 std::vector weight_grad_; // 权重梯度更新 int64_t size_; const int32_t dim_; public: Attention(int32_t dim): dim_(dim) {} explicit Attention(int64_t n, int32_t dim): weight_softmax_(n), weight_grad_(n), size_(n), dim_(dim) {} // Attention(const Softmax&) = default; // Attention& operator=(const Attention&) = delete; void zero() { std::fill(weight_softmax_.begin(), weight_softmax_.end(), 0.0); std::fill(weight_grad_.begin(), weight_grad_.end(), 0.0); } std::vector& getWeightSoftmax() { return weight_softmax_; } std::vector& getWeightGrad() { return weight_grad_; } void resize(int64_t n) { size_ = n; weight_softmax_.resize(n); weight_grad_.resize(n); } /* * 权重softmax */ void calWeightSoftmax(const Matrix& A, const std::vector& input) { if (input.size() > weight_softmax_.size()) { weight_softmax_.resize(input.size()); } real sum = 0; real max = 0; if (input.size() > 0) max = A.getWeight(input[0]); for ( int64_t i = 0; i & input, const Vector& grad, const Vector& hidden) { if (input.size() > weight_grad_.size()) { weight_grad_.resize(input.size()); } // std::fill(weight_grad_.begin(), weight_grad_.end(), 0.0); for (int32_t j = 0; j & input, Vector& hidden) { // 计算权重softmax calWeightSoftmax(A, input); auto it = input.cbegin(); auto it_s = weight_softmax_.cbegin(); for (; it != input.cend() & it_s != weight_softmax_.cend(); ++it, ++it_s) { hidden.addRow(A, *it, *it_s); } // hidden.mul(1.0 / input.size()); }};4.2 self-attetnion fastTextclass Attention { protected: // 权重参数 std::vector weight_softmax_; // 权重softmax值 // std::vector weight_softmax_grad_mat_; // hi 对 wj 的梯度 std::vector weight_grad_; // 权重梯度更新 int64_t size_; const int32_t dim_; // x self-attention 参数 std::vector x_softmax_; // x softmax 权重 std::vector y_; // yi=sum(wij*xj) x的self-attention std::vector hi_xjk_grad_; // hi 对 xj,k 的梯度其中j是固定的 std::vector xj_grad_; // xj的梯度 std::vector max_x_softmax_; public: Attention(int32_t dim): dim_(dim) {} explicit Attention(int64_t n, int32_t dim): weight_softmax_(n), weight_grad_(n), size_(n), dim_(dim), x_softmax_(n*n), y_(n*dim), hi_xjk_grad_(dim*dim), xj_grad_(dim){} // Attention(const Softmax&) = default; // Attention& operator=(const Attention&) = delete; void zero() { std::fill(weight_softmax_.begin(), weight_softmax_.end(), 0.0); std::fill(weight_grad_.begin(), weight_grad_.end(), 0.0); } std::vector& getWeightSoftmax() { return weight_softmax_; } std::vector& getWeightGrad() { return weight_grad_; } void resize(int64_t n) { if (n > size_) { size_ = n; weight_softmax_.resize(n); weight_grad_.resize(n); x_softmax_.resize(n*n); y_.resize(n*dim_); } } /* * 权重softmax 归一化 */ void calWeightSoftmax(const Matrix& A, const std::vector& input) { if (input.size() > weight_softmax_.size()) { weight_softmax_.resize(input.size()); } real sum = 0; real max = 0; if (input.size() > 0) max = A.getWeight(input[0]); for ( int64_t i = 0; i & input, const Vector& hidden, const Vector& grad) { if (input.size() > weight_grad_.size()) { weight_grad_.resize(input.size()); } // std::fill(weight_grad_.begin(), weight_grad_.end(), 0.0); for (int32_t j = 0; j & input, Vector& hidden) { // 计算权重softmax calWeightSoftmax(A, input); // 计算y值 calY(A, input); hidden.zero(); for(int32_t i = 0; i & input) { if (input.size()*input.size() > x_softmax_.size()) x_softmax_.resize(input.size()*input.size()); std::fill(x_softmax_.begin(), x_softmax_.end(), 0.0); if ( input.size() > max_x_softmax_.size()) max_x_softmax_.resize(input.size()); for(int32_t i = 0; i & input) { if ( input.size()*dim_ > y_.size()) y_.resize(input.size()*dim_); calXSoftmax(A, input); std::fill(y_.begin(), y_.end(), 0.0); for(int32_t i = 0; i & input, int32_t j) { if (dim_*dim_ > hi_xjk_grad_.size()) hi_xjk_grad_.resize(dim_*dim_); std::fill(hi_xjk_grad_.begin(), hi_xjk_grad_.end(), 0.0); for (int32_t i = 0; i & input, const Vector& grad, Vector& xgrad, int32_t id) { caHiXjkGrad(A, input, id); xgrad.zero(); for (int32_t k = 0; k

		自动登录	找回密码
密码			会员注册