GPT stands for Generative Pre-training Transformer, so blocking it does not mean you are excluding only OpenAI. I would not block GPTBot, since it could also help with getting source reference links in the future, or even being suggested within commercial terms. Not sure which niche you're in, though.
My friend, you are quite possibly a bit confused.
Every company that harnesses web data uses a crawler (also referred to as a spider), which automatically crawls web pages.
Now, a legit company with a legit crawler always declares its bot name and the IPs it crawls from.
Google has Googlebot (and several others), Microsoft has Bingbot (and several others), DuckDuckGo has DuckDuckBot.
This is not limited to search engines. Ahrefs has AhrefsBot and AhrefsSiteAudit, Semrush has SemrushBot, and so on.
//
Similar to the above, GPTBot is owned and operated exclusively by OpenAI:
https://platform.openai.com/docs/gptbot
The user-agent token states this explicitly:
Code:
User agent token: GPTBot
Full user-agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)
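If you want to spot these requests in your own logs, here is a rough Python sketch. The function name `crawler_token` and the tuple of tokens are my own for illustration, not anything OpenAI ships. It just does a case-insensitive substring match on the user-agent header.

```python
# Hypothetical helper: find which known crawler token (if any) appears
# in a request's User-Agent header. Names here are illustrative only.
KNOWN_TOKENS = ("GPTBot", "ChatGPT-User")


def crawler_token(user_agent, known_tokens=KNOWN_TOKENS):
    """Return the first known token found in user_agent, or None."""
    lowered = user_agent.lower()
    for token in known_tokens:
        if token.lower() in lowered:
            return token
    return None


gptbot_ua = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
    "compatible; GPTBot/1.0; +https://openai.com/gptbot)"
)
print(crawler_token(gptbot_ua))
print(crawler_token("Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0"))
```

Keep in mind that anyone can put "GPTBot" in a user-agent header, so a match on the string alone does not prove the request came from OpenAI.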
Like every other legit firm, they also publicly list the IPs they use to crawl:
JSON:
{
"creationTime": "2023-11-30T11:51:00.000000",
"prefixes": [
{
"ipv4Prefix": "52.230.152.0/24"
},
{
"ipv4Prefix": "52.233.106.0/24"
}
]
}
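To check whether a request really came from one of those ranges, something like the following Python sketch should work. It uses only the standard `json` and `ipaddress` modules. The function names are hypothetical, and the prefixes are just the snapshot quoted above (dated 2023-11-30), so treat them as an assumption that may be out of date by the time you read this.

```python
import ipaddress
import json

# Snapshot of the published list as quoted in this post; it may have changed.
PUBLISHED = """
{
  "creationTime": "2023-11-30T11:51:00.000000",
  "prefixes": [
    {"ipv4Prefix": "52.230.152.0/24"},
    {"ipv4Prefix": "52.233.106.0/24"}
  ]
}
"""


def load_networks(document):
    """Parse the JSON document into a list of IPv4 network objects."""
    data = json.loads(document)
    return [ipaddress.ip_network(p["ipv4Prefix"]) for p in data["prefixes"]]


def is_published_ip(ip, networks):
    """Return True if ip falls inside any of the given networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


networks = load_networks(PUBLISHED)
print(is_published_ip("52.230.152.17", networks))  # inside the first /24
print(is_published_ip("8.8.8.8", networks))        # not in either range
```

A request that claims to be GPTBot but comes from an IP outside the published ranges is most likely someone else borrowing the name.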
//
So yes, blocking GPTBot only prevents OpenAI from using your content in its training data. If you want to block ChatGPT from accessing your site, you need to block the ChatGPT-User bot (as of writing this, blocking one bot blocks the other).
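If you go the robots.txt route, you can sanity-check your rules before deploying them. Below is a small Python sketch using the standard library's `urllib.robotparser`. The rules and the `example.com` URL are placeholders I made up for illustration, and the `is_blocked` helper is hypothetical. It only tells you what a well-behaved crawler would conclude from the file, since robots.txt is a request and not an enforcement mechanism.

```python
from urllib.robotparser import RobotFileParser

# Example rules that disallow both OpenAI agents and leave everyone else alone.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_blocked(agent_token, url="https://example.com/some-page"):
    """Return True if the rules above disallow agent_token from fetching url."""
    return not parser.can_fetch(agent_token, url)


for agent in ("GPTBot", "ChatGPT-User", "Googlebot"):
    print(agent, "blocked" if is_blocked(agent) else "allowed")
```

Pass the short token (for example `GPTBot`) and not the full user-agent string, since the parser only looks at the part before the first slash.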
P.S. It is Generative "Pre-trained" Transformer, not "Pre-training".
