Building a Web Page Index

Published 2020-12-25, updated 2025-07-07, filed under Design

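I often want to search everything inside a folder or a site, but pages link to other pages and folders nest inside folders, and I couldn't think of a search tool that recursively digs out all of that content. So I decided to crawl the data into ES and offer full-text search on top of it, building my own web knowledge base.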

Usage

You need an ES environment first; see my notes on using ES in the post "使用 ES" (Using ES).

HTTP

The HTTP interface is provided by PageCrawlController.
Crawl, POST: localhost:8080/page/crawl

{
	"url": "https://tallate.github.io/c395b48b.html",
	"depth": 2
}

Query, GET: localhost:8080/page/crawl

{
	"searchKey": "使用 ES"
}
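
As a rough illustration of calling the crawl endpoint from code rather than from a REST client, here is a minimal sketch using the JDK's built-in java.net.http client. The class name CrawlRequestSketch, the helper names, and the http:// scheme in front of localhost:8080 are assumptions of mine; only the path and the two JSON fields come from the request above.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

final class CrawlRequestSketch {

    // Assumed base address; the post only gives "localhost:8080".
    static final String BASE = "http://localhost:8080";

    /** Builds the JSON body; assumes the URL contains no quotes or backslashes. */
    static String crawlBody(String url, int depth) {
        return "{\"url\": \"" + url + "\", \"depth\": " + depth + "}";
    }

    /** Builds the POST /page/crawl request without sending it. */
    static HttpRequest crawlRequest(String url, int depth) {
        return HttpRequest.newBuilder(URI.create(BASE + "/page/crawl"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(crawlBody(url, depth)))
                .build();
    }

    public static void main(String[] args) throws Exception {
        HttpRequest request = crawlRequest("https://tallate.github.io/c395b48b.html", 2);
        // Sending only succeeds while the crawler service is running locally.
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

Building the request is kept separate from sending it so the request can be inspected or logged without a running service.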

Postman

localhost:9200/page/page/_search

{
    "from": 0,
    "size": 5,
    "query": {
        "match": {
            "content": "使用 ES"
        }
    }
}

Crawler code

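Reading page content
Pages are HTML, so for convenience I parse them with jsoup. A page contains many tags, and all I need to do is read the text inside them, taking care to filter out tags that carry no useful text, such as style and script.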

Elements allElements = doc.getAllElements();
StringBuilder contentBuilder = new StringBuilder();
for (Element e : allElements) {
    // Skip certain tags, such as style
    if (excludeTags.contains(e.tag().getName())) {
        continue;
    }
    contentBuilder.append(" ").append(e.ownText());
}
page.setContent(contentBuilder.toString());
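
The excludeTags collection is not shown in the snippet above. The following is a minimal sketch of what it might look like, assuming it holds lower-case tag names, together with an optional whitespace cleanup step for the extracted text. The class name ContentCleanupSketch, the exact tag list, and the collapseWhitespace helper are hypothetical and not part of the original code.

```java
import java.util.Locale;
import java.util.Set;

final class ContentCleanupSketch {

    // Assumed contents of excludeTags: elements whose text is not meant for readers.
    static final Set<String> EXCLUDE_TAGS = Set.of("style", "script", "noscript", "template");

    /** Same membership check as in the loop above, taking the tag name as a plain string. */
    static boolean isExcluded(String tagName) {
        return EXCLUDE_TAGS.contains(tagName.toLowerCase(Locale.ROOT));
    }

    /** Collapses runs of whitespace (including non-breaking spaces) into single spaces. */
    static String collapseWhitespace(String text) {
        return text.replaceAll("[\\s\\u00A0]+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(isExcluded("script"));
        System.out.println(collapseWhitespace("  SSL证书选购   -  腾讯云 \n\t 控制台  "));
    }
}
```

Collapsing whitespace before calling page.setContent would keep long runs of spaces out of the indexed content field; whether that is worth doing depends on how the text is displayed later.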

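Recursive reading
Pages are connected to each other through links, so after reading a page, parsing its <a> and <link> elements gives the URLs of the related pages, which can then be read in turn.
The main point of this whole exercise is to pull in as much data as possible for querying; otherwise it would be no different from opening the page and searching it directly.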

// Collect child links to visit recursively
Set<String> childUrls = Sets.newHashSet();
Elements elements = doc.getAllElements();
for (Element element : elements) {
    List<String> as = getChildUrls(element, "a");
    List<String> links = getChildUrls(element, "link");
    childUrls.addAll(as);
    childUrls.addAll(links);
}
// Filter out invalid URLs
childUrls = childUrls.stream()
        .filter(u -> isUrlValid(crawlId, u))
        .collect(Collectors.toSet());
log.info("Child links found, childUrls:{}", childUrls);
page.getAs().addAll(childUrls);
for (String childUrl : childUrls) {
    // Crawl the child page recursively
    threadPool.execute(new CrawlTask(crawlId, childUrl, depth - 1));
}
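
The isUrlValid helper and the stopping condition for depth are not shown above. The sketch below is one way they could work, not the original implementation: hrefs are resolved against the page URL, non-HTTP schemes and fragments are dropped, a shared visited set ensures each URL is scheduled once per crawl, and depth is treated as the number of link hops still allowed. The class CrawlDedupSketch and its methods are hypothetical, and an in-memory link map stands in for real page fetching and the thread pool.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class CrawlDedupSketch {

    // URLs already scheduled in this crawl; newKeySet() is safe to share across pool threads.
    private final Set<String> visited = ConcurrentHashMap.newKeySet();

    /** Resolves href against pageUrl and strips any #fragment; returns null if unusable. */
    static String normalize(String pageUrl, String href) {
        try {
            URI resolved = URI.create(pageUrl).resolve(href.trim());
            String scheme = resolved.getScheme();
            if (scheme == null
                    || !(scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"))) {
                return null; // skips mailto:, javascript:, and similar links
            }
            String s = resolved.toString();
            int hash = s.indexOf('#');
            return hash >= 0 ? s.substring(0, hash) : s;
        } catch (IllegalArgumentException e) {
            return null; // malformed href
        }
    }

    /** True only the first time a URL is seen. */
    boolean isUrlValid(String url) {
        return url != null && visited.add(url);
    }

    /** Depth-bounded traversal; returns the URLs in the order they were crawled. */
    List<String> crawl(Map<String, List<String>> links, String url, int depth) {
        List<String> crawled = new ArrayList<>();
        if (isUrlValid(url)) {
            crawlInto(links, url, depth, crawled);
        }
        return crawled;
    }

    private void crawlInto(Map<String, List<String>> links, String url, int depth,
                           List<String> out) {
        out.add(url);
        if (depth <= 0) {
            return; // hop budget used up: keep this page but do not follow its links
        }
        for (String href : links.getOrDefault(url, List.of())) {
            String child = normalize(url, href);
            if (isUrlValid(child)) {
                crawlInto(links, child, depth - 1, out);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
                "https://example.com/", List.of("a.html", "b.html#top"),
                "https://example.com/a.html", List.of("b.html", "c.html"));
        System.out.println(new CrawlDedupSketch().crawl(links, "https://example.com/", 2));
    }
}
```

One consequence of sharing a single visited set is that whichever path reaches a URL first decides how much depth budget that page gets, so a page first reached through a long path will not be expanded further even if a shorter path to it exists.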

Index-building code

I won't go into the ES client code itself; below are just the requests it issues.
Create a document, POST: localhost:9200/page/page/123

{
    "as": [
        "a",
        "b"
    ],
    "title": "SSL证书选购",
    "url": "https://buy.cloud.tencent.com/ssl?fromSource=ssl",
    "content": "     SSL证书选购 - 腾讯云                 控制台          中国站    中国站  International    tencent 腾讯开放平台 腾讯会议 DNSPod Discuz! 微信公众平台 腾讯优图 腾讯蓝鲸 腾讯企点 腾讯微云 腾讯文档 友情链接  Copyright © 2013-2020 Tencent Cloud. All Rights Reserved. 腾讯云 版权所有 京公网安备 11010802017518 粤B2-20090059-1 域名注册服务机构批复文号：京信信管发〔2018〕156号 鲁通管〔2019〕83号 粤通业函〔2018〕268号 代理域名注册服务机构：烟台帝思普网络科技有限公司（DNSPod） 新网数码  广州云讯信息科技有限公司     中国站    中国站  International"
}

Search documents, GET: localhost:9200/page/page/_search

{
    "from": 0,
    "size": 100,
    "query": {
        "bool": {
            "should": [
                {
                    "multi_match": {
                        "query": "建立网页索引",
                        "fields": [
                            "title"
                        ],
                        "type": "best_fields",
                        "operator": "OR",
                        "slop": 0,
                        "prefix_length": 0,
                        "max_expansions": 50,
                        "zero_terms_query": "NONE",
                        "auto_generate_synonyms_phrase_query": true,
                        "fuzzy_transpositions": true,
                        "boost": 3
                    }
                },
                {
                    "multi_match": {
                        "query": "建立网页索引",
                        "fields": [
                            "content"
                        ],
                        "type": "best_fields",
                        "operator": "OR",
                        "slop": 0,
                        "prefix_length": 0,
                        "max_expansions": 50,
                        "zero_terms_query": "NONE",
                        "auto_generate_synonyms_phrase_query": true,
                        "fuzzy_transpositions": true,
                        "boost": 1
                    }
                }
            ],
            "adjust_pure_negative": true,
            "boost": 1
        }
    },
    "_source": {
        "includes": [],
        "excludes": []
    }
}

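When searching, I prefer matches on title, so I use a bool query with one multi_match clause per field and give the title clause a higher boost (3 versus 1 for content).
Note that the Elasticsearch client version must match the server version; otherwise you may run into various problems, such as parameters that are not supported.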
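
Most of the parameters in the search request above look like defaults spelled out in full, so a shorter body with only the query text, the field, and the boost should behave the same way. The sketch below builds such a body as a plain string, without the ES client, which also sidesteps client version issues for quick experiments. The class SearchQuerySketch and its methods are hypothetical names of mine.

```java
final class SearchQuerySketch {

    /** Escapes the characters that must be escaped inside a JSON string literal. */
    static String jsonEscape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters become \u00XX escapes.
                        sb.append("\\u00")
                          .append(Character.forDigit(c >> 4, 16))
                          .append(Character.forDigit(c & 0xF, 16));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    /** One multi_match clause on a single field with the given boost. */
    static String multiMatch(String searchKey, String field, int boost) {
        return "{\"multi_match\": {\"query\": \"" + jsonEscape(searchKey)
                + "\", \"fields\": [\"" + field + "\"], \"boost\": " + boost + "}}";
    }

    /** bool/should over title (boost 3) and content (boost 1), as in the request above. */
    static String searchBody(String searchKey, int from, int size) {
        return "{\"from\": " + from + ", \"size\": " + size
                + ", \"query\": {\"bool\": {\"should\": ["
                + multiMatch(searchKey, "title", 3) + ", "
                + multiMatch(searchKey, "content", 1) + "]}}}";
    }

    public static void main(String[] args) {
        System.out.println(searchBody("建立网页索引", 0, 100));
    }
}
```

The printed body can be pasted into Postman against localhost:9200/page/page/_search. Escaping the search key matters because it comes from user input and may contain quotes.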