Firewalling AI-Generated Text; Making GenAI Open Source and Plagiarism Fears
From Abhivardhan, our Chairperson
This is a post authored by Mr Abhivardhan, our Chairperson & Managing Trustee. A complete insight into the technical issue is also available at Visual Legal Analytica.
Here's a reality check on the legality of AI-generated texts.
There is nothing unreasonable about a technical institution discrediting text-based outputs from ChatGPT or Bard.
NYT's case against OpenAI also kind of makes sense.
However, a better way to address the legality is this:
Instead of completely banning the paraphrasing of original texts - which would stop or hamper GenAI innovation, because not every operation of this sort is an act of theft, and there are MSMEs and start-ups beyond the world of OpenAI and Google - people must work on building heuristic and semantic protocols for accepting and rejecting AI-generated texts.
For example, if the Indian Foreign Service one day decides to use text-generating AI in diplomatic and consular communication, it would only need to define the specific semantic and grammar-based protocols it accepts for prompts and text responses. That would go a long way towards a clean slate, and it can be privacy-friendly too.
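As a purely illustrative sketch of what such an acceptance protocol could look like, the snippet below checks a generated draft against a few invented rules. The rules themselves are hypothetical examples made up for this sketch, not any institution's real policy; the point is only that acceptance criteria can be written down and applied mechanically.

```python
import re

# Hypothetical acceptance rules for AI-generated drafts (illustrative only).
# Each rule is a (description, predicate) pair a draft must satisfy.
RULES = [
    ("opens with a formal salutation",
     lambda text: bool(re.match(r"(Dear|Your Excellency)\b", text))),
    ("avoids first-person singular voice",
     lambda text: not re.search(r"\b(I|me|my)\b", text)),
    ("stays under 200 words",
     lambda text: len(text.split()) <= 200),
]

def review_draft(text: str) -> list[str]:
    """Return the descriptions of every rule the draft fails."""
    return [name for name, check in RULES if not check(text)]

draft = "Dear Ambassador, the delegation confirms receipt of the note."
print(review_draft(draft))  # [] -- accepted

casual = "I think we should probably just reply later."
print(review_draft(casual))  # fails the salutation and first-person rules
```

A real deployment would of course need far richer semantic checks than regexes, but the shape - a declared, auditable list of protocols rather than an outright ban - is the argument being made above.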
The concept of citation and referencing will have to change permanently. It cannot work with the traditional system, where sources are cited in a loop. The nature of primary and secondary sources might also change, and the purpose of indexed journals and big books might not stay the same. Blogs, infographics and simulated explanations might replace the older, classical ways of writing - this is already happening.
It does not matter how effective LLMs are at addressing legal prompts or any other domain-specific prompts: the garbage-in-garbage-out problem still persists. Training data should be secured and delineated from the trained models. If open-source GenAI protocols and LLMOps are achieved, then prompt engineering can be narrowed down - which is the only way text-based generative AI can be regulated.
The hard truth - the copyright law changes required would be dramatic across jurisdictions, and they will not happen. One can go and write articles about it and propose regulatory solutions - some of which may even be anti-competitive - but it would not work, and would only lead us into a series of AI winters or tech winters.
On the protection of gated knowledge and content, publishing entities like NYT, Elsevier and many others will have to concede. The disparity is clear; it is simply not being addressed in reality.
Let's take a quick example: someone running a Substack or a Medium account stands to suffer even more than a legacy media or publishing platform when their content is scraped by AI systems such as GPT-4 and other LLMs.
The best way forward is for content protection practices to be decided by consensus, prioritising non-legacy media, publishers and content creators. Yes, content restrictions must be implemented to ensure verbatim content is not reproduced through ChatGPT. But at the same time, if OpenAI cannot be trusted, the best route for any publisher is to enforce open-source standards on data-scraping techniques alongside human-in-the-loop grammatical protocols. Together, these two measures could be the best way to ensure content provenance.
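One restriction of this kind already exists: OpenAI documents a robots.txt opt-out for its GPTBot crawler. The sketch below uses Python's standard urllib.robotparser to check that such a policy refuses GPTBot while leaving other agents unaffected; the domain and path are placeholders.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt policy of the kind OpenAI documents for opting out of
# GPTBot crawling. example.com and the path below are placeholders.
policy = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(policy.splitlines())

# GPTBot is refused; ordinary user agents are not.
print(parser.can_fetch("GPTBot", "https://example.com/essays/latest"))       # False
print(parser.can_fetch("Mozilla/5.0", "https://example.com/essays/latest"))  # True
```

Robots.txt is, of course, only honoured by crawlers that choose to respect it - which is precisely why the paragraph above argues for open standards and consensus rather than relying on any single vendor's goodwill.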
Also, the plagiarism tools people talk about - which might help a lot with non-textual AI-generated content such as art, music and design - might not work in the case of text. There are multiple reasons. While tech-by-design mapping measures are possible - no doubt - they will not remain consistent enough to decide what exactly constitutes an AI-generated, plagiarised form of text. Hence, licensing standards must be the way forward, so that plagiarism checking happens properly under rules, and the protocols of writing and grammar are judged when examining plagiarism. Meanwhile, wherever human writing can be tested effectively using Turnitin, Grammarly and similar tools, those tools remain helpful.
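The inconsistency claimed above can be illustrated with a toy example. A word-shingle overlap check - one simple form of technical mapping - catches verbatim copying reliably but scores a paraphrase of the same sentence at zero, which is why mapping alone cannot decide what counts as AI-plagiarised text. This is an illustrative sketch, not a production plagiarism detector.

```python
def shingles(text: str, k: int = 5) -> set[tuple[str, ...]]:
    """All k-word windows of the text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def verbatim_overlap(source: str, candidate: str, k: int = 5) -> float:
    """Fraction of the candidate's k-shingles appearing verbatim in source."""
    cand = shingles(candidate, k)
    if not cand:
        return 0.0
    return len(cand & shingles(source, k)) / len(cand)

source = "the quick brown fox jumps over the lazy dog near the river bank"
copied = "fox jumps over the lazy dog near the river"
paraphrase = "a fast brown fox leaps across a sleepy dog by the riverside"

print(verbatim_overlap(source, copied))      # 1.0 -- verbatim copying is caught
print(verbatim_overlap(source, paraphrase))  # 0.0 -- paraphrase slips through
```

The gap between those two scores is the whole problem: any threshold strict enough to flag paraphrase will also flag ordinary writing, which is the argument for licensing and protocol-based rules instead.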
Finally, we must work on content portability practices for protecting text-based content, to ensure at a bottom-up (not top-down) level that the copyright protection and justification of protocol-based textual content becomes possible. If we do not achieve that, imitation of ideas may happen without any code or culture of accountability. A taxonomy for building grammatical protocols could be a superb way to enable content provenance for AI-generated text.
This is not a rant. Hope it helps.