Can ChatGPT crawl live data from URLs? - No, but it's often not necessary. · Christoph C. Cemper

Can ChatGPT crawl live data from URLs?

Some users would like OpenAI's well-known ChatGPT to fetch live data from websites when they enter URLs, and to incorporate that data into its results. Unfortunately, this is not (yet) possible.

As the name “GPT”, short for Generative Pre-trained Transformer, suggests, it is a static, pre-trained language model. The “pre-trained” aspect is its defining characteristic and rules out any crawling by the model itself.

Countless articles about this innovative AI chat tool, as well as OpenAI’s own documentation, state that the data on which GPT-3.5 (the basis of ChatGPT) was trained contains a version of the internet only up to the end of 2021.

From a use-case perspective, neither the language model nor the web interface is suitable to replace an SEO tool like Google Search Console, where a current version can be retrieved in seconds at the push of a button.

But that’s not necessary anyway.

Why should ChatGPT crawl live?

Of course, search engine optimizers (SEOs) would like to use the most current data when analyzing their content or that of competitors. But ChatGPT is not an SEO crawling tool and has never claimed to be one.

The idea is to use the most up-to-date content from a URL to build on and create your own content. Various content tools offer measurement methods such as keyword density or the somewhat more advanced WF*IDF / TF*IDF (a well-known concept from the 70s, which is used in many text search engines) to provide writers with inspiration for creating more comprehensive articles that should then rank better in search engines. Sometimes there’s talk of “holistic” content.

However, a language model like GPT-3.5, which is used in ChatGPT, already contains particularly holistic content. The model was (simply put) trained on a relatively complete crawl of the internet. But it’s still not a WF*IDF tool for text optimization.
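For readers who have never looked inside such a content tool, here is a minimal Python sketch of the two measures mentioned above, keyword density and plain TF*IDF. It is the textbook variant, not the formula of any particular SEO product, and the function names `tokenize`, `keyword_density` and `tf_idf` as well as the sample texts are my own assumptions for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_density(text, keyword):
    """Share of all words in the text that are exactly the keyword."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return tokens.count(keyword.lower()) / len(tokens)


def tf_idf(documents):
    """Return one {term: score} dict per document (textbook TF*IDF).

    TF  = term count / number of words in the document
    IDF = log(number of documents / documents containing the term)
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # In how many documents does each term occur at least once?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical sample texts, only for illustration.
docs = [
    "folding mattress for camping",
    "camping stove for hiking",
    "folding chair for camping",
]
scores = tf_idf(docs)
```

In this toy corpus, “camping” occurs in every text and therefore gets a score of zero, while “mattress” occurs in only one text and stands out. That is the whole idea: terms that distinguish a text score higher than terms every text shares.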

“Fine-tuning” is the technical term for having a model “learn” additional content. It is possible with other OpenAI offerings, but not (yet) with ChatGPT. Fine-tuning the language model on new content can help improve the quality of the model and of the AI’s output. However, the effort of training the language model comes with significantly higher fees. ChatGPT, as a free tool, does not offer this option at the moment and only responds based on data up until the end of 2021.

Why enter the URL in ChatGPT?

In many respects, we don’t understand why a language model responds the way it does. This is also true of the creators and operators of AI tools like ChatGPT themselves.

When you incorporate an existing, established URL into a prompt, relevant aspects are “inferred” by the model (the technical term is “inference”). That means they are generated as statistically likely text, just as all of the model’s output is.

By entering the URL into the language model, you can set certain anchor points that may refer to an old version of the content under this URL. Perhaps only the relevant words are extracted from the URL. This depends very much on how descriptive (“speaking”) the URL is.

URLs with descriptive names like https://www.my-camping-shop.com/folding-mattresses naturally work better in prompts than cryptic URLs like https://www.coolstuff.com/c12/p422.

But descriptive URLs have worked better in the Google search engine for 20 years, so why shouldn’t more specific information also help with a modern language model like ChatGPT?
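To make the difference between the two example URLs concrete, here is a small Python sketch that lists the readable words a URL carries. This is not how ChatGPT processes a URL internally (the model works on its own tokens); the helper `url_words` is a hypothetical name of mine and only illustrates how much meaningful text each URL offers as an anchor point.

```python
import re
from urllib.parse import urlparse


def url_words(url):
    """Split host and path of a URL into lowercase word-like pieces."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    # Drop a leading "www." because it carries no topical meaning.
    if host.startswith("www."):
        host = host[4:]
    text = f"{host} {parsed.path}".lower()
    # Everything that is not a letter or digit acts as a separator.
    return [word for word in re.split(r"[^a-z0-9]+", text) if word]


descriptive = url_words("https://www.my-camping-shop.com/folding-mattresses")
cryptic = url_words("https://www.coolstuff.com/c12/p422")

print(descriptive)  # ['my', 'camping', 'shop', 'com', 'folding', 'mattresses']
print(cryptic)      # ['coolstuff', 'com', 'c12', 'p422']
```

The descriptive URL yields the words “camping”, “shop”, “folding” and “mattresses”, which already outline the topic. The cryptic URL yields little more than a brand name and two IDs.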

For creating pretty good articles with the “Outrank Article” prompt template from AIPRM for SEO, specifying URLs works surprisingly well. As is often the case, the result can then be significantly improved through follow-up prompts.

Screenshot of the Outrank Article Prompt in AIPRM

ChatGPT sometimes also immediately clarifies that it cannot crawl the internet.

Will ChatGPT replace Google Search?

It turns out that with clever prompts, you can generate amazingly good content that matches what was already available in the past. For most topics, such as folding mattresses, there is only a limited number of subtopics and little need for up-to-date information.

If I ask ChatGPT about the washing machine test winner from January 2023, then I’m using the wrong tool.

The desire for crawling and up-to-date information may come from the fact that for weeks there has been excited speculation about whether ChatGPT could be the Google killer. But once you understand how such a language model is built, and how outdated the content in it is, that speculation doesn’t make sense.

A language model is not a search engine, not an SEO tool, and certainly - in its current form - not a Google replacement for searching for current content.

However, a language model with data up to the end of 2021 is very practical for generating concrete and very comprehensive (“holistic”) answers when the content doesn’t need to be current. It’s well suited to the basics of software development, to crafting small applications, and to programming tutorials. It’s unsuitable for questions about the latest version of configuration files for the Caddy web server.

When viewed this way, many reports about the possible replacement of Google, even from reputable sources, seem very uninformed and, above all, sensationalistic when considered today. For the interested reader, I definitely recommend learning about PaLM, a Google language model with over 500 billion parameters, compared with the 175 billion of GPT-3.

Google PaLM seems far superior to GPT-3/GPT-3.5 - unfortunately, it’s just not freely available. Google has probably been working on “artificial intelligence” and language models like GPT-3 for a decade, as would be expected from a language-based business model.

Diagram from https://lifearchitect.ai/iq-testing-ai/

Can ChatGPT still help me create good content?

The outputs of ChatGPT are controlled solely by the prompt, that is, the input you give it. The better the prompt, the better the output. The more context the language model gets, the better.

Entering URLs into the prompt does not trigger crawling of the URL, but crawling is neither necessary for producing good content, nor was it ever promised. I’m a bit surprised that this is being “revealed” after almost two months.

Some users still swear that crawling takes place because the inference results are just as good. But then you have to ask yourself - for which prompt? For the topic “folding mattresses,” not much has changed in 2022 that would have required retraining the language model.

However, any attempt to get current content, such as the latest sports results, from the static language model is bound to fail.

This text was created using only my own human labor and a cup of coffee. The comma errors were corrected by LanguageTool. The image of the robot that needs to think while typing was created with Midjourney.