沐风一岸
Published 2026-02-15

Say Goodbye to Unauthorized AI Scraping: A Universal Blocking Guide for Any Website

Foreword

This site does not embed in-page countermeasures against AI scraping. This article was written with the assistance of 豆包GLM-5.

With the rapid spread of generative AI, all kinds of AI crawlers, model-training data harvesters, and AI search bots are scraping the web indiscriminately. Personal blogs, corporate sites, news platforms, documentation sites, and e-commerce stores all face three core risks: original content is republished or paraphrased without authorization, and proving infringement is hard; content is fed into AI model training, eroding its exclusive value; and unthrottled crawling burns server bandwidth and compute while diluting the site's SEO weight, sometimes to the point where AI search results outrank the original site.

I. Core Premise: Block Precisely, Don't Hit Normal Traffic

Before doing anything, be clear about the core principle: we block only dedicated AI crawlers and AI content-harvesting tools. Never block normal indexing by ordinary search engines such as Baidu, Google, or Bing; a misconfiguration there can wipe out the site's traffic entirely.

The following are the main AI crawler identifiers to block today (source: the ai.robots.txt project); every scheme below is configured around this list:

AI crawler UAs mapped to their platforms

1. Amazon: Amazonbot, amazon-kendra, AmazonBuyForMe, Amzn-SearchBot, Amzn-User

2. Anthropic: anthropic-ai, Claude-SearchBot, Claude-User, Claude-Web, ClaudeBot

3. Cohere: cohere-ai, cohere-training-data-crawler

4. DeepSeek: DeepSeekBot

5. Google: Gemini-Deep-Research, Google-Extended, Google-CloudVertexBot, Google-NotebookLM, GoogleAgent-Mariner, Google-Firebase, GoogleOther, GoogleOther-Image, GoogleOther-Video

6. OpenAI: GPTBot, OpenAI, ChatGPT Agent, ChatGPT-User

7. Mistral AI: MistralAI-User, MistralAI-User/1.0

8. Perplexity: Perplexity-User, PerplexityBot

9. Petal: PetalBot

10. You.com: YouBot

11. Zhipu AI: ChatGLM-Spider

12. Pangu LLM: PanguBot

13. AI2: AI2Bot, AI2Bot-DeepResearchEval, Ai2Bot-Dolma

14. aiHit: aiHitBot

15. AddSearch: AddSearchBot

16. Microsoft Azure: AzureAI-SearchBot

17. Amazon Bedrock: bedrockbot

18. bigsur.ai: bigsur.ai

19. Crawl4AI: Crawl4AI

20. Crawlspace: Crawlspace

21. Diffbot: Diffbot

22. Firecrawl: FirecrawlAgent

23. img2dataset: img2dataset

24. iAsk: iAskBot, iaskspider, iaskspider/2.0

25. Klaviyo: KlaviyoAIBot

26. Kunato: KunatoCrawler

27. LAION: laion-huggingface-processor, LAIONDownloader

28. Liner: LinerBot

29. Meta: Meta-ExternalAgent, meta-externalagent, Meta-ExternalFetcher, meta-externalfetcher, meta-webindexer, FacebookBot, facebookexternalhit

30. MyCentralAI: MyCentralAIScraperBot

31. NotebookLM: NotebookLM

32. OAI-Search: OAI-SearchBot

33. omgili: omgili, omgilibot

34. Panscient: Panscient, panscient.com

35. Poggio: Poggio-Citations

36. Poseidon Research: Poseidon Research Crawler

37. QuillBot: QuillBot, quillbot.com

38. SBIntuitions: SBIntuitionsBot

39. Scrapy: Scrapy

40. Shap: ShapBot

41. TerraCotta: TerraCotta

42. Tavily: TavilyBot

43. TwinAgent: TwinAgent

44. Velen: VelenPublicWebCrawler

45. WARDBot: WARDBot

46. Webzio: Webzio-Extended, webzio-extended

47. YaK: YaK

48. Zanista: ZanistaBot

49. Cloudflare: Cloudflare-AutoRAG

50. Datenbank: Datenbank Crawler

51. DuckDuckGo: DuckAssistBot

52. Factset: Factset_spyderbot

53. Phind: PhindBot

54. Andibot: Andibot

55. Anomura: Anomura

56. Apple: Applebot, Applebot-Extended

57. Atlassian: atlassian-bot

58. Awario: Awario

59. Brave: Bravebot

60. Brightbot: Brightbot 1.0

61. BuddyBot: BuddyBot

62. Bytespider: Bytespider

63. CCBot: CCBot

64. Channel3: Channel3Bot

65. Cotoyogi: Cotoyogi

66. Devin: Devin

67. Echobot: Echobot Bot

68. Echobox: EchoboxBot

69. FriendlyCrawler: FriendlyCrawler

70. IbouBot: IbouBot

71. ICC: ICC-Crawler

72. Imagesift: ImagesiftBot

73. imageSpider: imageSpider

74. ISSCyberRisk: ISSCyberRiskCrawler

75. Kagi: kagi-fetcher

76. Kangaroo Bot: Kangaroo Bot

77. LCC: LCC

78. Linguee: Linguee Bot

79. Linkup: LinkupBot

80. Manus: Manus-User

81. netEstate: netEstate Imprint Crawler

82. NovaAct: NovaAct

83. Operator: Operator

84. Qualified: QualifiedBot

85. Sidetrade: Sidetrade indexer bot

86. Spider: Spider

87. Thinkbot: Thinkbot

88. TikTok: TikTokSpider

89. Timpibot: Timpibot

90. wpbot: wpbot

91. WRTNBot: WRTNBot

92. Yandex: YandexAdditional, YandexAdditionalBot

93. Semrush: SemrushBot-OCOB, SemrushBot-SWA

II. Method 1: robots.txt Protocol Blocking

The robots protocol is the convention that well-behaved search engines and crawler tools follow by default. It needs no code changes and no server access, works on any site, and is the zero-barrier first line of defense against AI scraping.

Applicable scenarios

All site types. Especially suited to beginners, shared-hosting users, and anyone without code-modification rights; it covers the large majority of legitimate AI crawlers that honor the protocol.

Steps

  1. Locate the site root: use an FTP client, hosting control panel, or the server's file manager to open the site root (the directory holding the homepage file: where index.html lives for a static site, or the install root for a CMS; commonly named wwwroot, public_html, or web).

  2. Create or edit the robots.txt file: if the root already contains robots.txt, open it for editing; otherwise create a new text file and name it robots.txt (make sure the extension really is .txt, not a hidden double extension like .txt.txt).

  3. Paste the universal blocking rules: add the complete rules below (copy-paste as-is, no changes needed). They already preserve indexing rights for all ordinary search engines and will not hurt the site's normal SEO.

# Block AI crawler User-Agents (source: the ai.robots.txt project)
User-agent: AddSearchBot
User-agent: AI2Bot
User-agent: AI2Bot-DeepResearchEval
User-agent: Ai2Bot-Dolma
User-agent: aiHitBot
User-agent: amazon-kendra
User-agent: Amazonbot
User-agent: AmazonBuyForMe
User-agent: Amzn-SearchBot
User-agent: Amzn-User
User-agent: Andibot
User-agent: Anomura
User-agent: anthropic-ai
User-agent: Applebot
User-agent: Applebot-Extended
User-agent: atlassian-bot
User-agent: Awario
User-agent: AzureAI-SearchBot
User-agent: bedrockbot
User-agent: bigsur.ai
User-agent: Bravebot
User-agent: Brightbot 1.0
User-agent: BuddyBot
User-agent: Bytespider
User-agent: CCBot
User-agent: Channel3Bot
User-agent: ChatGLM-Spider
User-agent: ChatGPT Agent
User-agent: ChatGPT-User
User-agent: Claude-SearchBot
User-agent: Claude-User
User-agent: Claude-Web
User-agent: ClaudeBot
User-agent: Cloudflare-AutoRAG
User-agent: CloudVertexBot
User-agent: cohere-ai
User-agent: cohere-training-data-crawler
User-agent: Cotoyogi
User-agent: Crawl4AI
User-agent: Crawlspace
User-agent: Datenbank Crawler
User-agent: DeepSeekBot
User-agent: Devin
User-agent: Diffbot
User-agent: DuckAssistBot
User-agent: Echobot Bot
User-agent: EchoboxBot
User-agent: FacebookBot
User-agent: facebookexternalhit
User-agent: Factset_spyderbot
User-agent: FirecrawlAgent
User-agent: FriendlyCrawler
User-agent: Gemini-Deep-Research
User-agent: Google-CloudVertexBot
User-agent: Google-Extended
User-agent: Google-Firebase
User-agent: Google-NotebookLM
User-agent: GoogleAgent-Mariner
User-agent: GoogleOther
User-agent: GoogleOther-Image
User-agent: GoogleOther-Video
User-agent: GPTBot
User-agent: iAskBot
User-agent: iaskspider
User-agent: iaskspider/2.0
User-agent: IbouBot
User-agent: ICC-Crawler
User-agent: ImagesiftBot
User-agent: imageSpider
User-agent: img2dataset
User-agent: ISSCyberRiskCrawler
User-agent: kagi-fetcher
User-agent: Kangaroo Bot
User-agent: KlaviyoAIBot
User-agent: KunatoCrawler
User-agent: laion-huggingface-processor
User-agent: LAIONDownloader
User-agent: LCC
User-agent: LinerBot
User-agent: Linguee Bot
User-agent: LinkupBot
User-agent: Manus-User
User-agent: meta-externalagent
User-agent: Meta-ExternalAgent
User-agent: meta-externalfetcher
User-agent: Meta-ExternalFetcher
User-agent: meta-webindexer
User-agent: MistralAI-User
User-agent: MistralAI-User/1.0
User-agent: MyCentralAIScraperBot
User-agent: netEstate Imprint Crawler
User-agent: NotebookLM
User-agent: NovaAct
User-agent: OAI-SearchBot
User-agent: omgili
User-agent: omgilibot
User-agent: OpenAI
User-agent: Operator
User-agent: PanguBot
User-agent: Panscient
User-agent: panscient.com
User-agent: Perplexity-User
User-agent: PerplexityBot
User-agent: PetalBot
User-agent: PhindBot
User-agent: Poggio-Citations
User-agent: Poseidon Research Crawler
User-agent: QualifiedBot
User-agent: QuillBot
User-agent: quillbot.com
User-agent: SBIntuitionsBot
User-agent: Scrapy
User-agent: SemrushBot-OCOB
User-agent: SemrushBot-SWA
User-agent: ShapBot
User-agent: Sidetrade indexer bot
User-agent: Spider
User-agent: TavilyBot
User-agent: TerraCotta
User-agent: Thinkbot
User-agent: TikTokSpider
User-agent: Timpibot
User-agent: TwinAgent
User-agent: VelenPublicWebCrawler
User-agent: WARDBot
User-agent: Webzio-Extended
User-agent: webzio-extended
User-agent: wpbot
User-agent: WRTNBot
User-agent: YaK
User-agent: YandexAdditional
User-agent: YandexAdditionalBot
User-agent: YouBot
User-agent: ZanistaBot
Disallow: /
# Allow all ordinary search engines to crawl and index normally
User-agent: *
Allow: /
  4. Upload and activate: save robots.txt and upload it to the site root, overwriting any existing file. No server restart or other site configuration change is needed; the rules apply as soon as crawlers re-fetch the file.
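Before uploading, the grouping semantics (many `User-agent` lines sharing one `Disallow`) can be sanity-checked offline with Python's standard-library `urllib.robotparser`. The sketch below uses an abbreviated rule set; the full list above parses the same way:

```python
from urllib import robotparser

# Abbreviated version of the rules above: one group of AI UAs sharing a
# single Disallow, followed by the catch-all that keeps normal SEO intact.
RULES = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Bytespider
Disallow: /

User-agent: *
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

# AI crawlers named in the group are denied everywhere...
print(rp.can_fetch("GPTBot", "https://example.com/posts/hello"))     # False
# ...while ordinary search engines fall through to the catch-all.
print(rp.can_fetch("Googlebot", "https://example.com/posts/hello"))  # True
```

If the first call prints True, the group boundaries are wrong (for example, a stray blank line splitting the `User-agent` block from its `Disallow`).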

Customizing the rules for your site

  • To bar AI crawlers only from specific directories (say, articles or docs), change Disallow: / to Disallow: /posts/ or Disallow: /docs/ (matching your own paths);

  • To bar specific file types (images, PDF documents), add rules such as Disallow: /*.jpg$ and Disallow: /*.pdf$ (note that wildcard support varies by crawler);

  • WordPress and other CMS users: instead of uploading the file by hand, you can paste the rules into the robots.txt module of an SEO plugin (Yoast SEO, Rank Math, etc.) in the admin panel, with no server files to touch.

Caveats

  • It keeps out gentlemen, not thieves: robots.txt is an honor-system convention, and compliance is entirely up to the crawler's operator. Reputable vendors (OpenAI, Google, Anthropic) generally do respect it, so this configuration works against them. Malicious crawlers and smaller outfits may ignore robots.txt outright, or even forge their User-Agent (posing as a normal browser or as Googlebot). Against those, this file does nothing; you need blocking at the web server (Nginx/Apache configuration) or at a WAF such as Cloudflare.

  • Some UAs also power search features: the list blocks Applebot. Applebot does feed Apple's AI model training, but it also supplies data for Siri Suggestions and Spotlight Search. Blocking it can keep Apple-device users from quickly finding your site through Safari or on-device search. If traffic from the Apple ecosystem matters to your site, decide for yourself whether to block Applebot.

III. Method 2: HTML Meta Tags + Front-End JS Blocking

This scheme works purely in page code: no server access is required, only the ability to edit the site's pages. It complements the robots protocol by catching some crawlers that ignore robots.txt, though note the JS check only fires against clients that actually execute JavaScript; pure HTTP scrapers never run it. It applies to any page and pairs with Method 1 for two layers of defense.

Applicable scenarios

Users without server administration rights (static hosting, SaaS site builders, restricted shared hosting), single-page applications (SPAs), and cases where only specific pages need protection.

Steps

  1. Find the global header template: locate the file every page loads, i.e. the template containing the <head> tag:

    1. Static HTML sites: the <head> section of every page; add the snippet in bulk with an editor;

    2. CMS sites (WordPress, Typecho, etc.): the header template (header.php, header.ftl, ...) under Appearance → Theme Editor in the admin panel;

    3. Single-page apps (Vue/React): the <head> of the project's index.html entry file.

  2. Add the blocking code: anywhere between <head> and </head>, paste the complete snippet below (copy-paste as-is; exact position does not matter). It has two parts: meta tags that tell compliant crawlers not to snippet or archive the page, and a JS check that blanks the page for matching AI UAs (code generated with GLM-5).

<!-- ==================== Meta tag strategy ==================== -->
<!-- Standard search engines: allow indexing, but forbid snippets (blunts AI summaries), archiving, and image indexing -->
<meta name="robots" content="index, follow, nosnippet, noarchive, noimageindex" />
<!-- Experimental per-engine tags (enforcement still relies mainly on User-Agent blocking) -->
<meta name="googlebot" content="nosnippet" />
<meta name="bingbot" content="nosnippet" />

<script>
(function() {
    // ==================== Blocklist configuration ====================
    const forbiddenAIKeywords = [
        'AddSearchBot', 'AI2Bot', 'AI2Bot-DeepResearchEval', 'Ai2Bot-Dolma',
        'aiHitBot', 'amazon-kendra', 'Amazonbot', 'AmazonBuyForMe',
        'Amzn-SearchBot', 'Amzn-User', 'Andibot', 'Anomura', 'anthropic-ai',
        'Applebot-Extended', // Note: only Applebot-Extended (AI training) is blocked; Applebot (Siri Suggestions/Spotlight) is spared
        'atlassian-bot', 'Awario', 'AzureAI-SearchBot', 'bedrockbot',
        'bigsur.ai', 'Bravebot', 'Brightbot', 'BuddyBot', 'Bytespider',
        'CCBot', 'Channel3Bot', 'ChatGLM-Spider', 'ChatGPT Agent',
        'ChatGPT-User', 'Claude-SearchBot', 'Claude-User', 'Claude-Web',
        'ClaudeBot', 'Cloudflare-AutoRAG', 'CloudVertexBot', 'cohere-ai',
        'cohere-training-data-crawler', 'Cotoyogi', 'Crawl4AI', 'Crawlspace',
        'Datenbank Crawler', 'DeepSeekBot', 'Devin', 'Diffbot', 'DuckAssistBot',
        'Echobot Bot', 'EchoboxBot', 'FacebookBot', 'facebookexternalhit',
        'Factset_spyderbot', 'FirecrawlAgent', 'FriendlyCrawler',
        'Gemini-Deep-Research', 'Google-CloudVertexBot', 'Google-Extended',
        'Google-Firebase', 'Google-NotebookLM', 'GoogleAgent-Mariner',
        'GoogleOther', 'GoogleOther-Image', 'GoogleOther-Video', 'GPTBot',
        'iAskBot', 'iaskspider', 'IbouBot', 'ICC-Crawler', 'ImagesiftBot',
        'imageSpider', 'img2dataset', 'ISSCyberRiskCrawler', 'kagi-fetcher',
        'Kangaroo Bot', 'KlaviyoAIBot', 'KunatoCrawler',
        'laion-huggingface-processor', 'LAIONDownloader', 'LCC', 'LinerBot',
        'Linguee Bot', 'LinkupBot', 'Manus-User', 'meta-externalagent',
        'Meta-ExternalAgent', 'meta-externalfetcher', 'Meta-ExternalFetcher',
        'meta-webindexer', 'MistralAI-User', 'MyCentralAIScraperBot',
        'netEstate Imprint Crawler', 'NotebookLM', 'NovaAct', 'OAI-SearchBot',
        'omgili', 'omgilibot', 'OpenAI', 'Operator', 'PanguBot', 'Panscient',
        'panscient.com', 'Perplexity-User', 'PerplexityBot', 'PetalBot',
        'PhindBot', 'Poggio-Citations', 'Poseidon Research Crawler',
        'QualifiedBot', 'QuillBot', 'quillbot.com', 'SBIntuitionsBot',
        'Scrapy', 'SemrushBot-OCOB', 'SemrushBot-SWA', 'ShapBot',
        'Sidetrade indexer bot', 'Spider', 'TavilyBot', 'TerraCotta',
        'Thinkbot', 'TikTokSpider', 'Timpibot', 'TwinAgent',
        'VelenPublicWebCrawler', 'WARDBot', 'Webzio-Extended', 'webzio-extended',
        'wpbot', 'WRTNBot', 'YaK', 'YandexAdditional', 'YandexAdditionalBot',
        'YouBot', 'ZanistaBot'
    ];

    // Current User-Agent, lowercased for case-insensitive matching
    const ua = navigator.userAgent.toLowerCase();
    
    // Does the UA contain any blocked AI crawler keyword?
    const isAIBot = forbiddenAIKeywords.some(keyword => ua.includes(keyword.toLowerCase()));

    if (isAIBot) {
        // 1. Stop the page load immediately (keeps images and other assets from being fetched)
        if (window.stop) {
            window.stop();
        }
        
        // 2. Replace the document with a 403 notice so the real content never stays visible
        document.documentElement.innerHTML = '<head><title>403 Forbidden</title></head><body><h1>Access Denied</h1><p>AI Crawlers are not allowed on this site.</p></body>';
    }
})();
</script>
  3. Save and deploy: save the modified template/page files, push them to the server, and refresh to confirm. Normal visitors see no difference; when a client matching an AI UA loads the page, the content is replaced with an Access Denied notice, so it cannot be scraped (by crawlers that execute JS).

Platform notes

  • For static sites, use the batch find-and-replace of an editor such as VS Code to insert the snippet into every page's <head>;

  • On SaaS site builders (e.g. 凡科, 建站之星), paste the snippet under Global Settings → Custom Code → Header HTML, with no theme files to modify.
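If you run the backend yourself (but not the web server), the same keyword check can be enforced server-side, where it also catches clients that never execute JavaScript. The following is a minimal sketch as WSGI middleware; the abbreviated blocklist stands in for the full table above, and the function names are illustrative:

```python
# Minimal WSGI middleware sketch: return 403 when the User-Agent matches
# a blocked AI crawler keyword, otherwise pass the request through.
AI_KEYWORDS = [k.lower() for k in (
    "GPTBot", "ClaudeBot", "anthropic-ai", "Bytespider", "CCBot",
    "PerplexityBot", "Google-Extended",  # abbreviated; paste in the full list
)]

def is_ai_bot(user_agent):
    """Case-insensitive substring match, same logic as the JS snippet above."""
    ua = user_agent.lower()
    return any(k in ua for k in AI_KEYWORDS)

def block_ai_middleware(app):
    def wrapper(environ, start_response):
        if is_ai_bot(environ.get("HTTP_USER_AGENT", "")):
            body = b"AI crawlers are not allowed on this site."
            start_response("403 Forbidden",
                           [("Content-Type", "text/plain"),
                            ("Content-Length", str(len(body)))])
            return [body]
        return app(environ, start_response)
    return wrapper
```

Wrap the exported WSGI callable, e.g. `application = block_ai_middleware(application)`; Flask (`app.wsgi_app`) and Django (`get_wsgi_application()`) both expose a callable this can wrap.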

IV. Method 3: Server-Level Blocking (Advanced, Site-Wide)

This scheme blocks AI crawlers at the web-server layer. It has the highest precedence and the broadest reach, stopping crawlers that ignore the robots protocol or bypass front-end code, regardless of the site's software or architecture, as long as you run the matching server environment. Combine it with the two methods above for three layers of defense.

Applicable scenarios

Users with server administration rights: self-hosted sites on cloud or dedicated servers, and corporate sites, news sites, and other high-value content properties that need strong protection.

1. Nginx (the most widely deployed)

  1. Log in to the server and find the site's Nginx configuration: usually a per-site file under /etc/nginx/conf.d/, or the server block inside the main nginx.conf;

  2. Inside the site's server block, add the following (copy-paste as-is):

# Flag requests whose UA matches a known AI crawler; robots.txt itself stays reachable below
set $block 0;

if ($http_user_agent ~* "(AddSearchBot|AI2Bot|AI2Bot\-DeepResearchEval|Ai2Bot\-Dolma|aiHitBot|amazon\-kendra|Amazonbot|AmazonBuyForMe|Amzn\-SearchBot|Amzn\-User|Andibot|Anomura|anthropic\-ai|Applebot|Applebot\-Extended|atlassian\-bot|Awario|AzureAI\-SearchBot|bedrockbot|bigsur\.ai|Bravebot|Brightbot\ 1\.0|BuddyBot|Bytespider|CCBot|Channel3Bot|ChatGLM\-Spider|ChatGPT\ Agent|ChatGPT\-User|Claude\-SearchBot|Claude\-User|Claude\-Web|ClaudeBot|Cloudflare\-AutoRAG|CloudVertexBot|cohere\-ai|cohere\-training\-data\-crawler|Cotoyogi|Crawl4AI|Crawlspace|Datenbank\ Crawler|DeepSeekBot|Devin|Diffbot|DuckAssistBot|Echobot\ Bot|EchoboxBot|FacebookBot|facebookexternalhit|Factset_spyderbot|FirecrawlAgent|FriendlyCrawler|Gemini\-Deep\-Research|Google\-CloudVertexBot|Google\-Extended|Google\-Firebase|Google\-NotebookLM|GoogleAgent\-Mariner|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|iAskBot|iaskspider|iaskspider/2\.0|IbouBot|ICC\-Crawler|ImagesiftBot|imageSpider|img2dataset|ISSCyberRiskCrawler|kagi\-fetcher|Kangaroo\ Bot|KlaviyoAIBot|KunatoCrawler|laion\-huggingface\-processor|LAIONDownloader|LCC|LinerBot|Linguee\ Bot|LinkupBot|Manus\-User|meta\-externalagent|Meta\-ExternalAgent|meta\-externalfetcher|Meta\-ExternalFetcher|meta\-webindexer|MistralAI\-User|MistralAI\-User/1\.0|MyCentralAIScraperBot|netEstate\ Imprint\ Crawler|NotebookLM|NovaAct|OAI\-SearchBot|omgili|omgilibot|OpenAI|Operator|PanguBot|Panscient|panscient\.com|Perplexity\-User|PerplexityBot|PetalBot|PhindBot|Poggio\-Citations|Poseidon\ Research\ Crawler|QualifiedBot|QuillBot|quillbot\.com|SBIntuitionsBot|Scrapy|SemrushBot\-OCOB|SemrushBot\-SWA|ShapBot|Sidetrade\ indexer\ bot|Spider|TavilyBot|TerraCotta|Thinkbot|TikTokSpider|Timpibot|TwinAgent|VelenPublicWebCrawler|WARDBot|Webzio\-Extended|webzio\-extended|wpbot|WRTNBot|YaK|YandexAdditional|YandexAdditionalBot|YouBot|ZanistaBot)") {
    set $block 1;
}

if ($request_uri = "/robots.txt") {
    set $block 0;
}

if ($block) {
    return 403;
}
  3. Save the file, then validate the configuration and apply it:

# Check for syntax errors first; continue only after "test is successful"
nginx -t
# Restart (or use `systemctl reload nginx` for zero downtime); the rules apply immediately
systemctl restart nginx
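Nginx's `~*` operator performs a case-insensitive regex match against the whole User-Agent header. Its behaviour can be previewed offline in Python using an abbreviated version of the alternation (the UA strings below are made-up examples, not real crawler headers):

```python
import re

# Abbreviated form of the nginx alternation; re.IGNORECASE mirrors nginx's ~*
AI_UA_PATTERN = re.compile(
    r"(GPTBot|ClaudeBot|Bytespider|CCBot|PerplexityBot|Google-Extended)",
    re.IGNORECASE,
)

# Matches anywhere in the header, exactly like nginx's regex match:
print(bool(AI_UA_PATTERN.search("Mozilla/5.0 (compatible; gptbot/1.2)")))       # True
print(bool(AI_UA_PATTERN.search("Mozilla/5.0 (Windows NT 10.0) Chrome/120")))   # False
```

This is also a quick way to vet new UA strings before appending them to the production regex.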

2. Apache

  1. Log in to the server and find the .htaccess file in the site root (the per-directory configuration file), or the main httpd.conf;

  2. Add the following (copy-paste as-is):

# Return 403 to known AI crawler UAs, but always allow robots.txt itself
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (AddSearchBot|AI2Bot|AI2Bot\-DeepResearchEval|Ai2Bot\-Dolma|aiHitBot|amazon\-kendra|Amazonbot|AmazonBuyForMe|Amzn\-SearchBot|Amzn\-User|Andibot|Anomura|anthropic\-ai|Applebot|Applebot\-Extended|atlassian\-bot|Awario|AzureAI\-SearchBot|bedrockbot|bigsur\.ai|Bravebot|Brightbot\ 1\.0|BuddyBot|Bytespider|CCBot|Channel3Bot|ChatGLM\-Spider|ChatGPT\ Agent|ChatGPT\-User|Claude\-SearchBot|Claude\-User|Claude\-Web|ClaudeBot|Cloudflare\-AutoRAG|CloudVertexBot|cohere\-ai|cohere\-training\-data\-crawler|Cotoyogi|Crawl4AI|Crawlspace|Datenbank\ Crawler|DeepSeekBot|Devin|Diffbot|DuckAssistBot|Echobot\ Bot|EchoboxBot|FacebookBot|facebookexternalhit|Factset_spyderbot|FirecrawlAgent|FriendlyCrawler|Gemini\-Deep\-Research|Google\-CloudVertexBot|Google\-Extended|Google\-Firebase|Google\-NotebookLM|GoogleAgent\-Mariner|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|iAskBot|iaskspider|iaskspider/2\.0|IbouBot|ICC\-Crawler|ImagesiftBot|imageSpider|img2dataset|ISSCyberRiskCrawler|kagi\-fetcher|Kangaroo\ Bot|KlaviyoAIBot|KunatoCrawler|laion\-huggingface\-processor|LAIONDownloader|LCC|LinerBot|Linguee\ Bot|LinkupBot|Manus\-User|meta\-externalagent|Meta\-ExternalAgent|meta\-externalfetcher|Meta\-ExternalFetcher|meta\-webindexer|MistralAI\-User|MistralAI\-User/1\.0|MyCentralAIScraperBot|netEstate\ Imprint\ Crawler|NotebookLM|NovaAct|OAI\-SearchBot|omgili|omgilibot|OpenAI|Operator|PanguBot|Panscient|panscient\.com|Perplexity\-User|PerplexityBot|PetalBot|PhindBot|Poggio\-Citations|Poseidon\ Research\ Crawler|QualifiedBot|QuillBot|quillbot\.com|SBIntuitionsBot|Scrapy|SemrushBot\-OCOB|SemrushBot\-SWA|ShapBot|Sidetrade\ indexer\ bot|Spider|TavilyBot|TerraCotta|Thinkbot|TikTokSpider|Timpibot|TwinAgent|VelenPublicWebCrawler|WARDBot|Webzio\-Extended|webzio\-extended|wpbot|WRTNBot|YaK|YandexAdditional|YandexAdditionalBot|YouBot|ZanistaBot) [NC]
RewriteRule !^/?robots\.txt$ - [F]
  3. Save the file, then restart Apache to apply:

systemctl restart httpd
# On Debian/Ubuntu-based systems the service is apache2: systemctl restart apache2

V. Method 4: CDN/WAF-Layer Blocking

This scheme blocks at the CDN/WAF the site already uses: no code changes, no server work, just configuration in the CDN console, effective network-wide. AI crawler requests are rejected at the edge, which also saves origin bandwidth. It is one of the simplest and most effective options for any site behind a CDN.

Applicable scenarios

Any site using a CDN/WAF; especially good for beginners, users without server access, and static hosting. Pair it with the robots protocol for edge-plus-origin blocking.

1. Cloudflare (global)

  1. Log in to the Cloudflare dashboard and open the target domain;

  2. In the left menu, go to Security → WAF → Firewall rules (Custom rules on newer dashboards) and click Create rule;

  3. Name it "Block AI crawlers"; for the match condition choose User Agent → contains, adding one condition per identifier from this article's list, OR-ed together;

  4. Set the action to Block and click Deploy. The rule takes effect immediately, and AI crawlers are stopped at Cloudflare's edge without ever reaching the origin;

  5. Optional: under Rules → Transform Rules → Modify Response Header, add a global X-Robots-Tag response header with the value noai, nocache, noaiindex, noarchive for extra signaling (noai/noaiindex are emerging directives, not yet standardized).
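Instead of clicking individual conditions together, Cloudflare's rule builder also accepts a filter expression directly (via Edit expression). A sketch with an abbreviated UA list; extend the `or` chain with the rest of the identifiers:

```
(http.user_agent contains "GPTBot") or
(http.user_agent contains "ClaudeBot") or
(http.user_agent contains "Bytespider") or
(http.user_agent contains "CCBot") or
(http.user_agent contains "PerplexityBot")
```

The `contains` operator is a case-sensitive substring match on the User-Agent header, so keep the casing exactly as the crawlers send it.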

2. Chinese cloud vendors (Alibaba Cloud / Tencent Cloud / Baidu AI Cloud)

  1. Log in to the vendor console, open the CDN/secure-acceleration product, and select the domain;

  2. Under Access Control → UA blacklist/whitelist, choose blacklist, paste in the AI crawler identifiers from this article, and save;

  3. Under Cache Settings → Custom HTTP Response Headers, add X-Robots-Tag with the value noai, nocache, noaiindex, noarchive;

  4. The configuration propagates to all edge nodes in roughly 1-5 minutes, after which AI crawlers are rejected at the CDN and cannot reach the origin content.

VI. Further Hardening: Defense in Depth

  1. Publish a no-AI-scraping notice and copyright statement: in the site footer, below articles, and on the About page, state clearly, for example: "All original content on this site is copyrighted by this site. Reposting or harvesting in any form without authorization is forbidden, as is use for AI model training, AI content generation, or AI indexing; violators will be pursued under the law." This can serve as legal evidence in later infringement claims.

  2. Protect static assets: for images, PDF documents, and audio/video, add hotlink protection in the CDN console on top of the robots rules, restricting asset loads to your own domain so crawlers cannot bulk-fetch them.

  3. Rate-limit crawlers: configure request-rate limits at the server or CDN and ban IPs that hammer the site in short bursts, catching harvesters that hide behind forged UA strings.

  4. Keep the blocklist fresh: new AI User-Agent identifiers appear constantly; refresh the rules every 3-6 months to pick up newly launched crawlers and keep the protection gap-free.
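The rate-limiting idea in item 3 boils down to a sliding window per client IP. Below is a simplified in-memory sketch (real deployments would use the CDN's built-in rate rules or nginx's `limit_req`; the class name and thresholds are illustrative):

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Allow at most `max_requests` per `window` seconds for each client IP."""

    def __init__(self, max_requests=60, window=60.0):
        self.max_requests = max_requests
        self.window = window
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits[ip]
        # Drop timestamps that have slid out of the window.
        while q and now - q[0] >= self.window:
            q.popleft()
        if len(q) >= self.max_requests:
            return False  # over the limit: block (or 429) this request
        q.append(now)
        return True
```

A burst of spoofed-UA requests from one IP then trips the limit even though each individual request looks like a normal browser.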

VII. Pitfalls and Verification

Hard rules

  1. Never block all crawlers: do not add a bare User-agent: * / Disallow: / rule to robots.txt; it forbids every search engine from crawling, de-indexes the whole site, and zeroes out traffic;

  2. Don't over-use wildcards: avoid overly broad patterns in UA blocking, or you will catch legitimate search-engine crawlers and real browsers;

  3. Clear caches: if a change seems not to take effect, clear in order: browser cache → CDN edge cache → site application cache → server cache, then re-verify;

  4. CDN takes precedence: if the site sits behind a CDN, the blocking rules must also be configured in the CDN console; otherwise cached copies are served straight from the edge, and the origin-side rules never see those requests.

Three-step verification (works for any site)

  1. robots check: browse to yourdomain/robots.txt and confirm the rules you added are shown; a robots testing tool can additionally verify the rules parse as crawlers will read them;

  2. Simulated AI crawler: open DevTools (F12) → Network → Network conditions, untick "Use browser default" under User agent, set it to GPTBot, and reload. If the page refuses to load or returns a 403 status code, the block works;

  3. Response header check: in DevTools (F12) → Network, reload, click the document request, and confirm the response headers include X-Robots-Tag: noai, nocache, noaiindex, noarchive.

Troubleshooting

  • Rules not taking effect: check that robots.txt sits in the site root, the server configuration was reloaded, CDN caches were purged, and the UA strings have no typos;

  • Normal traffic blocked: check the blocklist for over-broad UA patterns and remove keywords that overlap with ordinary browsers or search engines;

  • Some AI crawlers still get through: for crawlers that ignore the robots protocol, layer on server/CDN IP bans and rate limits to harden the defense.
