This is a common quote in spanish, and I feel it explains very well the current state of many of the popular generative AI platforms.

You see, everyone agrees generative AI is changing how we research/consume “knowledge” – note there are caveats here as we are all aware, it is aptly called generative after all. Nonetheless, nowadays everyone can skip the tedious side of data collection, interpretation and synthesis and get the concise data (media,text… etc) they are looking for. It is an amazingly powerful tool indeed.

Naturally every day we hear of a new company with a generative AI tool specializing in a different kind of data (text, audio, video, images …etc): these tools are able to “generate” exactly what we are looking for and naturally have led to enormous publicity. Indeed the growth of this sector feels exponential, and the (over?)valuation of these companies as well.

Economically speaking though there are some factors that are not as mainstream, these companies are able to “generate” exactly what we are looking for, but “this data does not come out of thin air”, it relies on learned patterns from existent datasets.

The core of the issue is that machine learning algorithms in general are data intensive, they require enormous amounts of data (preferably annotated) during training; and the large majority of it is mined from the web (or from user’s social media accounts …etc). The crawling of websites and scraping of data is done much akin to how search engines index the web, with one big difference: A search engine’s job is to direct the user to the website where they are most likely to find the information they are looking for. Here there is an implied economic relationship, website owners are willing to let bots crawl their website in exchange for potential visitors.

In the current state of affairs generative tools provide the data the user is looking for and hence provide value directly to users: The users see the tools as the ultimate value provider. All the information and media providers that were used to train the tool receive no economic return (e.g. no site visits). In fact they may not even receive the credit as a trained pattern could be coming from a multiple of sources.

And hence the quote: “Ganando indulgencias con avemarias ajenas” – It is when you take all credit for work that others have done.

Of course there is immense value provided by the generative tools coming to market, but for a new economic framework under which these tools can flourish, there needs to be an economic return to be made by the information providers, which represent many groups.

For example, what happens if:

  • Books
  • Paintings
  • Songs
  • Website articles
  • Videos
  • …etc

Are used for training machine learning models, but we do not compensate

  • Authors
  • Artists
  • Musicians
  • Content creators…

Clearly there will be no economic incentive for anyone to create value for anyone. The model as it is currently stands is not sustainable, and we have started to see the same comments (i.e. lawsuits …etc) come from some of the groups above, which rightfully claim their work was used without their permission to train the machine learning models.

On the one side, there is a belief by some technology groups that given enough time the tools would be able to extrapolate and generate (new) content hence needing much less training data; or conversely that given such enormous amount of sources used to train the tool it will be very difficult to tell which source was used to train the model. (i.e. no need to compensate authors since the training dataset is proprietary and it’s impossible to tell for sure if their work was used or not). In fact this is part of the strategy many companies are currently using in fighting these lawsuits.

On the other side, the more rational approach is developing an economic framework that is sustainable for the whole supply chain, from the media providers to the generative tools, to the user: such that all parties have an incentive to work together.

This is one area where I feel micro-payments could really shine, in rewarding authors whenever their data is being used to train machine learning models. In the future the ads economy may no longer be sustainable as crawling bots don’t click on adds, or do they?