I was wrong about robots.txt (evgeniipendragon.com)
Posted by KarlHeinzSchwuke@feddit.org to Technology@lemmy.world · English · 1 month ago · 22 comments
thedruid@lemmy.world · 1 month ago
So, if I can add something here for everyone's benefit: no search engine really obeys robots.txt. Their publicly acknowledged crawlers do, but they have other crawlers that aren't publicly known and that ignore the file. Google knows every inch of your site, allowed or not. See, just because a search engine says it doesn't know about a page doesn't mean it hasn't crawled it. It just doesn't display the results, based on your settings.
ell1e@leminal.space · 1 month ago
And allowing the public crawler might also have it feed their AI: https://arstechnica.com/tech-policy/2025/07/cloudflare-wants-google-to-change-its-ai-search-crawling-google-likely-wont/
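For anyone wondering why a crawler can simply ignore the file: robots.txt is only advisory, and the check happens inside the crawler's own code. Here is a minimal sketch of what a well-behaved crawler might do, using Python's standard `urllib.robotparser`. The robots.txt content, the `is_allowed` helper, and the example.com URLs are made up for illustration; they are not taken from the article or from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, invented for this example.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
"""


def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if the rules above permit user_agent to fetch url."""
    parser = RobotFileParser()
    # Parse the rules from memory instead of fetching them over the network.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# A polite crawler runs this check before every request.
# A crawler that skips the check is not stopped by anything in the file.
print(is_allowed("Googlebot", "https://example.com/blog/post"))
print(is_allowed("Googlebot", "https://example.com/private/notes"))
print(is_allowed("SomeOtherBot", "https://example.com/blog/post"))
```

Nothing on the server side enforces these rules, so a crawler that never calls something like `is_allowed`, or that identifies itself with a different user agent, gets the pages anyway. Actually blocking it would take server-side measures such as authentication or request filtering.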