How are the new AI text generators going to learn from the internet?

Are SEO writing and domain writing still going to be a thing?

A typewriter by Markus Winkler

AI text generators are blowing up, but they aren't new by any means. They have been around for a long time, and by a long time I mean 5+ years. GPT-2, the predecessor of GPT-3, launched in 2019 and was essentially the game changer in this industry.

They are so popular now not because they're more capable but because of the format in which they are accessed. ChatGPT came in and got more than a million users in less than 5 days. The older models used one-off prompts to generate text, with very little control over what was being generated.

The new ChatGPT interface allows prompts to be more human-like and, more importantly, allows you to chat continuously with it. ChatGPT holds the context of your conversation throughout your interaction with it and then builds on it. It has also been integrated with other OpenAI models, particularly Codex. Codex is a large generative text model, but instead of generating human-readable language, it generates code. This is how ChatGPT can generate code for you for free, although that code has seen minimal use in the professional coding sphere so far. GitHub Copilot also uses the same technology to generate code suggestions for you. And for some extra trivia, GitHub is owned by Microsoft, which is a major investor in OpenAI.

ChatGPT uses a refined descendant of the GPT-3 text model (the GPT-3.5 series) behind the scenes to generate text. It has been modified and fine-tuned, but at its core, the GPT-3 family still governs the output of ChatGPT. GPT-3, which was launched in 2020, has been a game changer but still lacks many features and is notorious for generating wordy content.

How do ChatGPT and other text generators work?

A brief understanding of how these AI tools work is essential to understanding their power and their potential. Generative text models fall into a category called LLMs, short for large language models. LLMs are predictive models that predict language tokens. Tokens are the individual units of a sentence, such as words, pieces of words, commas, apostrophes, and numbers.

Each token is assigned a numeric ID (because the model can only work with numbers). GPT-3's vocabulary, for example, has roughly 50,000 tokens, so you can get a rough sense of what we are dealing with. The model takes your prompt plus the text it has already generated as input, and then predicts the next token in line.
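To make the token idea concrete, here is a toy sketch in Python. It is not how GPT-3 works internally: the tokenizer simply splits on words and punctuation, and the "model" is a plain count of which token follows which. The corpus and the names tokenize, predict_next, and generate are made up for illustration. What it does share with a real LLM is the loop: turn text into token IDs, predict the next ID, append it, and repeat.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    # Split into words and single punctuation marks.
    # Real LLM tokenizers use sub-word pieces instead.
    return re.findall(r"\w+|[^\w\s]", text.lower())

# Tiny made-up corpus standing in for "text from the internet".
corpus = "the cat sat on the mat. the cat ate the fish."
tokens = tokenize(corpus)

# Give every distinct token a numeric ID, in order of first appearance.
vocab = {}
for tok in tokens:
    vocab.setdefault(tok, len(vocab))
id_to_token = {i: t for t, i in vocab.items()}
ids = [vocab[t] for t in tokens]

# Count which token ID follows which (a bigram table, not a neural network).
follow_counts = defaultdict(Counter)
for prev_id, next_id in zip(ids, ids[1:]):
    follow_counts[prev_id][next_id] += 1

def predict_next(token_id):
    # Pick the most frequent follower of the given token.
    return follow_counts[token_id].most_common(1)[0][0]

def generate(prompt, n_new):
    # The prompt must only contain tokens that appear in the corpus.
    out = [vocab[t] for t in tokenize(prompt)]
    for _ in range(n_new):
        # Each new token is predicted from what has been produced so far
        # (here only the last token; an LLM looks at the whole context).
        out.append(predict_next(out[-1]))
    return " ".join(id_to_token[i] for i in out)

print(ids)
print(generate("sat", 2))
```

A real model replaces the count table with a neural network and looks at the whole conversation rather than just the last token, which is what lets it hold context.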

The model is trained and fine-tuned using text data from the internet and other sources. During training, it predicts the next token and is then corrected based on how far its prediction was from the token that actually came next.
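The "predict, then get corrected" step can be sketched with a few lines of Python. This is a simplified illustration, not OpenAI's training code: it assumes a vocabulary of only four tokens and nudges the raw scores (logits) directly, whereas a real model adjusts the network parameters that produce those scores. The learning rate and step count are arbitrary choices.

```python
import math

def softmax(logits):
    # Turn raw scores into probabilities that sum to 1.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    # Loss is small when the correct token gets a high probability.
    return -math.log(softmax(logits)[target])

def training_step(logits, target, lr=0.5):
    # For softmax + cross-entropy, the gradient with respect to each score
    # is (predicted probability - 1) for the correct token and
    # (predicted probability) for every other token.
    probs = softmax(logits)
    return [
        x - lr * (p - (1.0 if i == target else 0.0))
        for i, (x, p) in enumerate(zip(logits, probs))
    ]

logits = [0.0, 0.0, 0.0, 0.0]   # the model starts with no preference
target = 2                      # ID of the token that actually came next

loss_before = cross_entropy(logits, target)
for _ in range(10):
    logits = training_step(logits, target)
loss_after = cross_entropy(logits, target)

print(loss_before, loss_after)
```

After a few corrections the score of the right token rises and the loss falls, which is the whole training process in miniature.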

Deep learning neural networks are used to make these predictions. These networks are very large and require a lot of computation to train and run. The current consensus is that more parameters generally produce more accurate results.

The GPT-3 model has about 175 billion parameters, i.e. 175 billion little values that can be adjusted during training to change the predicted output.
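To get a feel for that number, here is a quick back-of-the-envelope calculation of how much space the parameters alone would take. The bytes-per-parameter figures are assumptions (4 bytes for 32-bit floats, 2 bytes for 16-bit floats); the article's sources do not say how GPT-3's weights are actually stored.

```python
PARAMS = 175_000_000_000  # about 175 billion parameters

def weights_size_gb(n_params, bytes_per_param):
    # Storage needed for the parameters only, in gigabytes (1 GB = 1e9 bytes).
    return n_params * bytes_per_param / 1e9

size_fp32_gb = weights_size_gb(PARAMS, 4)  # assuming 32-bit floats
size_fp16_gb = weights_size_gb(PARAMS, 2)  # assuming 16-bit floats

print(size_fp32_gb, size_fp16_gb)
```

Either way, the weights run to hundreds of gigabytes, which is why these models need so much hardware.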

What data are LLMs trained on?

The models are nothing without the data they have been trained on. They are trained on large datasets that many organizations have compiled. These datasets include books and poetry from all eras, internet blogs and articles, magazine articles, newspaper articles, and anything else that can be represented as text.

This data is then filtered for accuracy and screened for hateful, harmful, and abusive content, so that large models can stay away from this potentially harmful material.

OpenAI, the company behind ChatGPT, has reportedly relied on low-wage workers in Kenya, hired through an outsourcing firm, to help with this task and improve the quality of its training content.

Choosing input data is a serious issue for LLMs.

You cannot just choose the top websites for this. The selected data has to come from very diverse fields. On top of that, the creators of an LLM can be biased and may choose content to their liking.

This serious question is still not being debated enough in the current ecosystem and is going to require further inquiry.

How can internet data be curated?

So data is taken from the internet. Anything taken from the internet is full of errors and has a tendency to be of poor quality.

How can we prevent content from being "bad"?

You reward the "good" content creators with more value.

In short, you favor publications with good reviews, or in other words, a good SEO rating.

The topic I'm explaining here is just good versus bad content, but there are other things to consider, like posting frequency, accuracy, and popularity among readers.

All of these things matter, and on top of that, a democracy of content is necessary. This is a problem that existed before and was solved before, by domain ratings and SEO scores.

This is why the curation of content through magazines and publications is still going to be relevant. But the dynamics are about to change, that is for sure.
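Here is one way the "reward good sources" idea could look in code. It is a hypothetical sketch, not a description of how any LLM provider actually curates data: the source names, the 0-100 domain scores, and the cut-off of 20 are all invented for illustration. Sources below the cut-off are dropped, and the rest are sampled in proportion to their score, so well-rated publications show up more often in training without shutting out smaller ones.

```python
import random

# Hypothetical documents with made-up domain scores (0-100).
documents = [
    {"source": "well-reviewed-magazine.example", "domain_score": 90,
     "text": "A carefully edited long-form article."},
    {"source": "small-personal-blog.example", "domain_score": 60,
     "text": "A niche but accurate tutorial."},
    {"source": "spam-farm.example", "domain_score": 10,
     "text": "Keyword-stuffed filler."},
]

MIN_SCORE = 20  # assumed cut-off below which a source is dropped

def curate(docs, min_score=MIN_SCORE):
    # Keep only documents from sources at or above the cut-off.
    return [d for d in docs if d["domain_score"] >= min_score]

def sampling_weights(docs):
    # Higher-rated sources get a proportionally larger share.
    total = sum(d["domain_score"] for d in docs)
    return [d["domain_score"] / total for d in docs]

def sample_training_batch(docs, k, seed=0):
    # Draw k documents (with replacement) according to their weights.
    rng = random.Random(seed)
    return rng.choices(docs, weights=sampling_weights(docs), k=k)

kept = curate(documents)
weights = sampling_weights(kept)
batch = sample_training_batch(kept, k=5)

print([d["source"] for d in kept])
print(weights)
```

Weighting instead of picking only the top-rated source is what keeps the smaller blog in the mix, which is the "democracy of content" point made above.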

Final Words

I don't think that SEO writing or, more specifically, publication curation will go away. The whole idea of domain rating and SEO score enables content to be valued in a democratic way.

AI models feed on data to generate text. Curating this input data is of utmost concern. We can't just choose any content from the internet; thus, we must create a value system for it. This becomes absolutely essential as the technology moves towards working live with current data.

Writing and blogging may not be over just yet, but there is for sure a big shift in dynamics incoming. No one can know for certain how it will affect the world, but newer technologies thrive by building on resilient older ones.

There is a lot to experiment with and a lot to learn going ahead.
