Boost Search Engine: Building the Keyword Search Module

Published: 2024-07-28 20:46

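The keyword search module is built on top of the index-building module.

Search module:

Searching happens after the server has built its index: keyword lookups run on the server that holds the finished index.

First, the user's query is split into keywords, and the resulting keywords are stored in an array. We then walk the array, where each element is one search keyword, and call the inverted-index lookup function of the Index module to find the documents related to that keyword. Those documents are stored in a map container named tokens_map.

When a search keyword does occur in the web documents, it corresponds to exactly one inverted list (posting list). You need to understand inverted lists and the members of each struct stored in them.

tokens_map maps a document ID to a struct InvertedElemPrint.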

    struct InvertedElemPrint{
        uint64_t doc_id;
        int weight;
        std::vector<std::string> words;
        InvertedElemPrint():doc_id(0), weight(0){}
    };

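The struct holds the document's ID, its weight (the sum of the weights of all matched keywords), and words, which records which of the keywords appear in this document. The words container can also be used to extract the document snippet, as discussed later.

The weights of different keywords that hit the same document are added together to express how closely the document relates to the query: the larger the total weight, the more relevant the document.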
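The weight accumulation can be tried in isolation. The following is a minimal sketch, not the project's code: DemoInvertedElem and MergeByDoc are hypothetical names introduced only for illustration, standing in for the real inverted-list node type from index.hpp and for the merging loop inside Search().

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for one node of an inverted list
// (the real type is defined in index.hpp).
struct DemoInvertedElem {
    uint64_t doc_id;
    std::string word;
    int weight;
};

// Same layout as the struct in the article.
struct InvertedElemPrint {
    uint64_t doc_id;
    int weight;
    std::vector<std::string> words;
    InvertedElemPrint() : doc_id(0), weight(0) {}
};

// Merge several inverted lists (one per keyword) into one entry per
// document, summing the weights and remembering which keywords hit.
std::unordered_map<uint64_t, InvertedElemPrint>
MergeByDoc(const std::vector<std::vector<DemoInvertedElem>> &lists)
{
    std::unordered_map<uint64_t, InvertedElemPrint> tokens_map;
    for (const auto &list : lists) {
        for (const auto &elem : list) {
            auto &item = tokens_map[elem.doc_id]; // created on first use
            item.doc_id = elem.doc_id;
            item.weight += elem.weight;
            item.words.push_back(elem.word);
        }
    }
    return tokens_map;
}
```

With made-up weights, if "boost" hits document 1 with weight 10 and "filesystem" hits the same document with weight 5, the merged entry for document 1 carries weight 15 and both words.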

std::vector<InvertedElemPrint> inverted_list_all;

Every entry of std::unordered_map<uint64_t, InvertedElemPrint> tokens_map is placed into the inverted_list_all vector, which is then sorted by total weight so that the most relevant results are shown to the user first.

    std::sort(inverted_list_all.begin(), inverted_list_all.end(),
              [](const InvertedElemPrint &e1, const InvertedElemPrint &e2){
                  return e1.weight > e2.weight;
              });

The comparator is a lambda expression; you could just as well write a functor (function object) and pass it to std::sort.
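For comparison, here is a sketch of the functor alternative. WeightGreater and SortByWeight are illustrative names that do not appear in the project; the behaviour is meant to match the lambda (descending by weight).

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

struct InvertedElemPrint {
    uint64_t doc_id;
    int weight;
    std::vector<std::string> words;
    InvertedElemPrint() : doc_id(0), weight(0) {}
};

// Function object equivalent to the lambda: larger weight sorts first.
struct WeightGreater {
    bool operator()(const InvertedElemPrint &e1,
                    const InvertedElemPrint &e2) const
    {
        return e1.weight > e2.weight;
    }
};

// Sort the merged results in descending order of total weight.
void SortByWeight(std::vector<InvertedElemPrint> &list)
{
    std::sort(list.begin(), list.end(), WeightGreater());
}
```

A named functor is mostly useful when the same ordering is needed in several places; for a one-off sort the lambda is shorter.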

    // 4. [Build]: build the JSON string from the results; jsoncpp handles serialization and deserialization
    Json::Value root;
    for(auto &item : inverted_list_all){
        ns_index::DocInfo * doc = index->GetForwardIndex(item.doc_id);
        if(nullptr == doc){
            continue;
        }
        Json::Value elem;
        elem["title"] = doc->title;
        // content is the tag-stripped document text; we only want a part of it. TODO
        elem["desc"] = GetDesc(doc->content, item.words[0]);
        elem["url"]  = doc->url;
        // for debug, to be deleted
        elem["id"] = (int)item.doc_id;
        elem["weight"] = item.weight; // int, written out as text when serialized

        root.append(elem);
    }

    //Json::StyledWriter writer;
    Json::FastWriter writer;
    *json_string = writer.write(root);

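Finally, the sorted vector is turned into a JSON string and handed back to the caller. If you are not familiar with JSON, look up some introductory material first.

Search module code: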

    // query: the search keywords
    // json_string: the search results returned to the user's browser
    void Search(const std::string &query, std::string *json_string)
    {
        // 1. [Tokenize]: split the query into words as the searcher requires
        std::vector<std::string> words;
        ns_util::JiebaUtil::CutString(query, &words);
        // 2. [Lookup]: look each word up in the index; the index was built
        //    case-insensitively, so the search keywords must be lowercased too
        //ns_index::InvertedList inverted_list_all; // holds InvertedElem internally
        std::vector<InvertedElemPrint> inverted_list_all;

        std::unordered_map<uint64_t, InvertedElemPrint> tokens_map;

        for(std::string word : words){
            boost::to_lower(word);

            ns_index::InvertedList *inverted_list = index->GetInvertedList(word);
            if(nullptr == inverted_list){
                continue;
            }
            // An imperfect spot, left to the reader for now: 你/是/一个/好人 (you / are / a / good person) 100
            //inverted_list_all.insert(inverted_list_all.end(), inverted_list->begin(), inverted_list->end());
            for(const auto &elem : *inverted_list){
                auto &item = tokens_map[elem.doc_id]; // operator[]: returns the existing entry, or creates one if absent
                // item is always the print node with this doc_id
                item.doc_id = elem.doc_id;
                item.weight += elem.weight;
                item.words.push_back(elem.word);
            }
        }
        // Iterate by non-const reference so that std::move really moves
        // (moving from a const object silently falls back to a copy).
        for(auto &item : tokens_map){
            inverted_list_all.push_back(std::move(item.second));
        }

        // 3. [Merge and sort]: collect the results and sort by relevance (weight), descending
        //std::sort(inverted_list_all.begin(), inverted_list_all.end(),
        //      [](const ns_index::InvertedElem &e1, const ns_index::InvertedElem &e2){
        //        return e1.weight > e2.weight;
        //        });

        std::sort(inverted_list_all.begin(), inverted_list_all.end(),
                  [](const InvertedElemPrint &e1, const InvertedElemPrint &e2){
                      return e1.weight > e2.weight;
                  });
        // 4. [Build]: build the JSON string from the results; jsoncpp handles serialization and deserialization
        Json::Value root;
        for(auto &item : inverted_list_all){
            ns_index::DocInfo * doc = index->GetForwardIndex(item.doc_id);
            if(nullptr == doc){
                continue;
            }
            Json::Value elem;
            elem["title"] = doc->title;
            // content is the tag-stripped document text; we only want a part of it. TODO
            elem["desc"] = GetDesc(doc->content, item.words[0]);
            elem["url"]  = doc->url;
            // for debug, to be deleted
            elem["id"] = (int)item.doc_id;
            elem["weight"] = item.weight; // int, written out as text when serialized

            root.append(elem);
        }

        //Json::StyledWriter writer;
        Json::FastWriter writer;
        *json_string = writer.write(root);
    }

Document snippets:

I mentioned snippet extraction when introducing struct InvertedElemPrint.

    struct InvertedElemPrint{
        uint64_t doc_id;
        int weight;
        std::vector<std::string> words;
        InvertedElemPrint():doc_id(0), weight(0){}
    };

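To go into more detail: words holds the search keywords submitted by the user that occur in this document. That may be some of them or all of them; it does not matter here.

When extracting the snippet, we use the first keyword in words. Why that one?

Honestly, because it was convenient. Is there a better way? Of course there is, otherwise I would not have raised the question.

So how would it work?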

    for(std::string word : words){
        boost::to_lower(word);

        ns_index::InvertedList *inverted_list = index->GetInvertedList(word);
        if(nullptr == inverted_list){
            continue;
        }
        // An imperfect spot, left to the reader for now: 你/是/一个/好人 (you / are / a / good person) 100
        //inverted_list_all.insert(inverted_list_all.end(), inverted_list->begin(), inverted_list->end());
        for(const auto &elem : *inverted_list){
            auto &item = tokens_map[elem.doc_id]; // operator[]: returns the existing entry, or creates one if absent
            // item is always the print node with this doc_id
            item.doc_id = elem.doc_id;
            item.weight += elem.weight;
            item.words.push_back(elem.word);
        }
    }

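The code above is the part of Search() that fetches the inverted list for each search keyword, so it should look familiar by now. Once you understand Search(), the fix suggests itself: order the keywords by their weights.

We can create a priority queue together with a small struct whose members are a keyword and that keyword's weight, plus a comparison functor compare() that compares by weight. A priority queue is really a max-heap, so its top element is always the one with the largest weight. Because a priority queue cannot be iterated directly, we pop its elements one by one and append each keyword to words, which leaves words ordered by weight.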
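The idea can be sketched as follows. This is only an illustration under assumptions: WordWeight, WordWeightLess and OrderWordsByWeight are hypothetical names, and the sketch assumes the per-document (keyword, weight) pairs have already been collected while walking the inverted lists.

```cpp
#include <queue>
#include <string>
#include <vector>

// Hypothetical pair of a keyword and its weight within one document.
struct WordWeight {
    std::string word;
    int weight;
};

// std::priority_queue keeps the "largest" element on top according to
// this less-than comparison, so the heaviest keyword is popped first.
struct WordWeightLess {
    bool operator()(const WordWeight &a, const WordWeight &b) const
    {
        return a.weight < b.weight;
    }
};

// Return the keywords ordered from highest to lowest weight.
std::vector<std::string> OrderWordsByWeight(const std::vector<WordWeight> &hits)
{
    std::priority_queue<WordWeight, std::vector<WordWeight>, WordWeightLess> heap;
    for (const auto &hit : hits) {
        heap.push(hit);
    }
    std::vector<std::string> words;
    words.reserve(hits.size());
    while (!heap.empty()) {
        words.push_back(heap.top().word);
        heap.pop();
    }
    return words;
}
```

After this step the first element of words is the keyword with the largest weight in that document. Since every element is popped anyway, sorting a plain vector with std::sort would give the same order; the heap simply follows the approach described in the text.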

We then pass the first keyword to GetDesc(), which looks for the text surrounding that keyword.

    std::string GetDesc(const std::string &html_content, const std::string &word)
    {
        // Find the first occurrence of word in html_content, then take 50 bytes
        // before it (or start from begin if there are fewer) and 100 bytes after
        // it (or stop at end), and extract that part.
        const int prev_step = 50;
        const int next_step = 100;
        // 1. Find the first occurrence (case-insensitive).
        //    The bytes are taken as unsigned char because passing a negative
        //    char value to std::tolower is undefined behavior.
        auto iter = std::search(html_content.begin(), html_content.end(), word.begin(), word.end(),
                [](unsigned char x, unsigned char y){
                    return (std::tolower(x) == std::tolower(y));
                });
        if(iter == html_content.end()){
            return "None1";
        }
        int pos = std::distance(html_content.begin(), iter);

        // 2. Compute start and end; std::size_t is unsigned, so int is used here
        int start = 0;
        int end = html_content.size() - 1;
        // If more than 50 characters precede pos, move start forward
        if(pos > start + prev_step) start = pos - prev_step;
        if(pos < end - next_step) end = pos + next_step;

        // 3. Extract the substring and return it
        if(start >= end) return "None2";
        std::string desc = html_content.substr(start, end - start);
        desc += "...";
        return desc;
    }

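GetDesc() is not technically hard: it is a simple string search followed by a substring extraction. How much to extract is a matter of taste, but it should stay practical. The extracted snippet is then placed in the JSON string.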
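To experiment with different window sizes, the same search-and-slice logic can be written as a free function. This is a sketch under assumptions, not the project's GetDesc(): ExtractSnippet and its two size parameters are hypothetical, it uses a half-open byte range [start, end), and it returns an empty string instead of a placeholder when the keyword is not found.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical variant of GetDesc() with a configurable window.
// Returns up to prev_step bytes before the first case-insensitive match
// of word and up to next_step bytes starting at the match.
std::string ExtractSnippet(const std::string &content, const std::string &word,
                           std::size_t prev_step, std::size_t next_step)
{
    auto iter = std::search(content.begin(), content.end(),
                            word.begin(), word.end(),
                            [](unsigned char x, unsigned char y) {
                                return std::tolower(x) == std::tolower(y);
                            });
    if (iter == content.end()) {
        return ""; // keyword not found (or content is empty)
    }
    const std::size_t pos =
        static_cast<std::size_t>(std::distance(content.begin(), iter));
    // Clamp the window to the bounds of the document.
    const std::size_t start = pos > prev_step ? pos - prev_step : 0;
    const std::size_t end = std::min(content.size(), pos + next_step);
    return content.substr(start, end - start);
}
```

For example, ExtractSnippet("Hello Boost Filesystem library", "boost", 3, 8) yields "lo Boost Fi": three bytes before the match and eight bytes from the start of the match.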

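That is the whole process of connecting a user's query to document content. If anything is unclear, feel free to leave a comment; your comments are my greatest reward.

Full code of the search module, search.hpp: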

    #pragma once

    #include "index.hpp"
    #include "util.hpp"
    #include "log.hpp"
    #include <algorithm>
    #include <cctype>
    #include <cstdint>
    #include <iterator>
    #include <string>
    #include <unordered_map>
    #include <vector>
    #include <boost/algorithm/string.hpp> // boost::to_lower (util.hpp may already include it)
    #include <jsoncpp/json/json.h>

    namespace ns_searcher{

        struct InvertedElemPrint{
            uint64_t doc_id;
            int weight;
            std::vector<std::string> words;
            InvertedElemPrint():doc_id(0), weight(0){}
        };

        class Searcher{
            private:
                ns_index::Index *index; // the index the system searches
            public:
                Searcher(){}
                ~Searcher(){}
            public:
                void InitSearcher(const std::string &input)
                {
                    // 1. Get or create the index object
                    index = ns_index::Index::GetInstance();
                    //std::cout << "Got the index singleton..." << std::endl;
                    LOG(NORMAL, "Got the index singleton...");
                    // 2. Build the index through the index object
                    index->BuildIndex(input);
                    //std::cout << "Built the forward and inverted indexes..." << std::endl;
                    LOG(NORMAL, "Built the forward and inverted indexes...");
                }
                // query: the search keywords
                // json_string: the search results returned to the user's browser
                void Search(const std::string &query, std::string *json_string)
                {
                    // 1. [Tokenize]: split the query into words as the searcher requires
                    std::vector<std::string> words;
                    ns_util::JiebaUtil::CutString(query, &words);
                    // 2. [Lookup]: look each word up in the index; the index was built
                    //    case-insensitively, so the search keywords must be lowercased too
                    //ns_index::InvertedList inverted_list_all; // holds InvertedElem internally
                    std::vector<InvertedElemPrint> inverted_list_all;

                    std::unordered_map<uint64_t, InvertedElemPrint> tokens_map;

                    for(std::string word : words){
                        boost::to_lower(word);

                        ns_index::InvertedList *inverted_list = index->GetInvertedList(word);
                        if(nullptr == inverted_list){
                            continue;
                        }
                        // An imperfect spot, left to the reader for now: 你/是/一个/好人 (you / are / a / good person) 100
                        //inverted_list_all.insert(inverted_list_all.end(), inverted_list->begin(), inverted_list->end());
                        for(const auto &elem : *inverted_list){
                            auto &item = tokens_map[elem.doc_id]; // operator[]: returns the existing entry, or creates one if absent
                            // item is always the print node with this doc_id
                            item.doc_id = elem.doc_id;
                            item.weight += elem.weight;
                            item.words.push_back(elem.word);
                        }
                    }
                    // Iterate by non-const reference so that std::move really moves
                    // (moving from a const object silently falls back to a copy).
                    for(auto &item : tokens_map){
                        inverted_list_all.push_back(std::move(item.second));
                    }

                    // 3. [Merge and sort]: collect the results and sort by relevance (weight), descending
                    //std::sort(inverted_list_all.begin(), inverted_list_all.end(),
                    //      [](const ns_index::InvertedElem &e1, const ns_index::InvertedElem &e2){
                    //        return e1.weight > e2.weight;
                    //        });

                    std::sort(inverted_list_all.begin(), inverted_list_all.end(),
                              [](const InvertedElemPrint &e1, const InvertedElemPrint &e2){
                                  return e1.weight > e2.weight;
                              });
                    // 4. [Build]: build the JSON string from the results; jsoncpp handles serialization and deserialization
                    Json::Value root;
                    for(auto &item : inverted_list_all){
                        ns_index::DocInfo * doc = index->GetForwardIndex(item.doc_id);
                        if(nullptr == doc){
                            continue;
                        }
                        Json::Value elem;
                        elem["title"] = doc->title;
                        // content is the tag-stripped document text; we only want a part of it. TODO
                        elem["desc"] = GetDesc(doc->content, item.words[0]);
                        elem["url"]  = doc->url;
                        // for debug, to be deleted
                        elem["id"] = (int)item.doc_id;
                        elem["weight"] = item.weight; // int, written out as text when serialized

                        root.append(elem);
                    }

                    //Json::StyledWriter writer;
                    Json::FastWriter writer;
                    *json_string = writer.write(root);
                }

                std::string GetDesc(const std::string &html_content, const std::string &word)
                {
                    // Find the first occurrence of word in html_content, then take 50 bytes
                    // before it (or start from begin if there are fewer) and 100 bytes after
                    // it (or stop at end), and extract that part.
                    const int prev_step = 50;
                    const int next_step = 100;
                    // 1. Find the first occurrence (case-insensitive).
                    //    The bytes are taken as unsigned char because passing a negative
                    //    char value to std::tolower is undefined behavior.
                    auto iter = std::search(html_content.begin(), html_content.end(), word.begin(), word.end(),
                            [](unsigned char x, unsigned char y){
                                return (std::tolower(x) == std::tolower(y));
                            });
                    if(iter == html_content.end()){
                        return "None1";
                    }
                    int pos = std::distance(html_content.begin(), iter);

                    // 2. Compute start and end; std::size_t is unsigned, so int is used here
                    int start = 0;
                    int end = html_content.size() - 1;
                    // If more than 50 characters precede pos, move start forward
                    if(pos > start + prev_step) start = pos - prev_step;
                    if(pos < end - next_step) end = pos + next_step;

                    // 3. Extract the substring and return it
                    if(start >= end) return "None2";
                    std::string desc = html_content.substr(start, end - start);
                    desc += "...";
                    return desc;
                }
        };
    }