Explore the evolution of GenAI architectures, from early GANs to today’s advanced transformers and diffusion models.
What is GenAI?
GenAI models generate text, audio, video, and image output from patterns learned in their training data. Training large general-purpose models from scratch takes enormous resources, but these foundation models can be adapted far more easily by fine-tuning them on a private dataset or augmenting them with retrieval.
Generative AI Types
Large Language Models (LLMs)
LLMs have the potential to significantly improve society because of their capacity to produce text, summarise and translate information, answer enquiries, participate in discussions, and carry out sophisticated activities such as reasoning or solving arithmetic problems.
Image, Video, and Audio Generation
Image and video generation produces new content from descriptive text or image input, while audio generation mimics and produces voices, sounds, and music from input audio or text cues. These methods usually rely on transformer or diffusion models.
Code Generation
GenAI may use text prompts or natural language to generate or recommend new code snippets. Additionally, this technology can effectively test and debug computer programs and convert code between programming languages.
Retrieval Augmented Generation (RAG)
During inference, Retrieval Augmented Generation (RAG) adds current, private, and proprietary data from vector databases to pretrained models. This makes updating and customisation easier and makes it possible to credit the source of generated information.
GenAI Use Cases
GenAI has the ability to improve content creation procedures, transform the creative industries, and spur innovation in a variety of fields and applications.
Creative Content Generation
Produce fresh and captivating text, music, video, and image material for design, advertising, and entertainment.
Augmenting Data
Create synthetic data to enhance training datasets for deep learning and machine learning models, or to aid in enhancing the generalisation and performance of existing models.
Gaming and Simulation
Create virtual worlds, personas, and situations that improve simulation and gaming applications’ realism and interactivity.
Healthcare and Medicine
Create artificial medical data or images to support diagnosis, treatment planning, and medical research.
Natural Language Processing (NLP)
Build dialogue systems, text synthesis, and other natural language processing (NLP) capabilities for chatbots, language translation, and content summarisation applications.
Personalization and Recommendation Systems
Provide tailored product suggestions, ads, or content recommendations based on user preferences and behaviour.
Design Visualisation
Render pictures of apparel possibilities displayed on a customer’s avatar or design concepts for their home or workplace.
Drug Discovery and Materials Science
Identify compounds that satisfy multiple design goals while capturing the intricate physical and chemical relationships that underlie them.
How Does GenAI Work?
Generative AI primarily functions in three stages:
- Training, to develop a foundation model that can serve as the basis for multiple generative AI applications.
- Tuning, to adapt the foundation model to a particular generative AI application.
- Generation, evaluation, and further tuning, to assess the gen AI application's output and continuously improve its quality and accuracy.
Training
The foundation model for generative AI is a deep learning model that forms the base of several generative AI application types. Large language models (LLMs), developed for text generation applications, are currently the most widely used foundation models. However, multimodal foundation models that can support multiple types of content generation are also available, as are foundation models for image, video, sound, and music generation.
To build a foundation model, practitioners train a deep learning algorithm on terabytes of raw, unstructured, unlabelled data, such as data taken from the internet or another massive data source. During training, the algorithm performs and evaluates millions of “fill in the blank” exercises, attempting to predict the next element in a sequence: the next word in a sentence, the next element in an image, or the next command in a line of code. It continuously adjusts itself to reduce the discrepancy between its predictions and the actual data.
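To make that “fill in the blank” objective concrete, here is a minimal, illustrative sketch of next-token prediction in PyTorch. The tiny vocabulary, toy model, and random data are hypothetical stand-ins; real foundation models apply the same loss at vastly larger scale.

```python
import torch
import torch.nn as nn

# Hypothetical toy setup: a tiny vocabulary and an embedding + linear "model".
vocab_size, embed_dim = 100, 32
model = nn.Sequential(nn.Embedding(vocab_size, embed_dim), nn.Linear(embed_dim, vocab_size))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# A toy "corpus": each row is a sequence of token ids.
tokens = torch.randint(0, vocab_size, (8, 16))

for step in range(100):
    inputs, targets = tokens[:, :-1], tokens[:, 1:]       # predict the next token at each position
    logits = model(inputs)                                # shape: (batch, sequence, vocab)
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()                                       # adjust parameters to shrink the gap
    optimizer.step()
```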
A neural network of parameters, or encoded representations of the entities, patterns, and relationships in the data, is the end product of this training process. This network can produce material on its own in response to prompts or inputs.
This training procedure is computationally demanding, time-consuming, and costly: it requires thousands of clustered graphics processing units (GPUs), weeks of processing, and typically millions of dollars. Open-source foundation model projects, such as Meta’s Llama-2, let gen AI developers skip this stage and its expense.
Tuning
A foundation model is a generalist in a metaphorical sense: although it is very knowledgeable about a wide range of content types, it often lacks the fidelity or precision needed to produce particular output types. To do that, the model needs to be tuned for a specific content generation task. There are several ways to accomplish this.
Fine-tuning
Fine-tuning involves feeding the model labelled data specific to the content generation application, such as the queries or prompts the application is likely to receive and the corresponding, correctly formatted answers. For instance, a development team building a customer service chatbot would feed the model hundreds or thousands of documents containing labelled customer service enquiries and their correct responses.
Fine-tuning is labour-intensive, so developers frequently outsource the work to companies with sizeable data-labelling staffs.
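As a rough illustration of what that labelled data looks like and how it is used, here is a minimal supervised fine-tuning sketch in PyTorch. The customer-service pairs, the whitespace “tokenizer”, and the toy model are all hypothetical stand-ins for a real pretrained LLM and its tokenizer; the key idea is that only the response tokens are supervised.

```python
import torch
import torch.nn as nn

# Hypothetical labelled examples: likely queries and their correctly formatted answers.
examples = [
    {"prompt": "How do I reset my password?",
     "response": "Go to Settings > Security and choose Reset password."},
    {"prompt": "Where is my order?",
     "response": "You can track it under Account > Orders."},
]

# Toy whitespace "tokenizer" and model, stand-ins for a real pretrained LLM.
vocab = {w: i for i, w in enumerate(sorted({w for ex in examples
                                            for w in (ex["prompt"] + " " + ex["response"]).split()}))}
vocab_size, embed_dim = len(vocab), 32
model = nn.Sequential(nn.Embedding(vocab_size, embed_dim), nn.Linear(embed_dim, vocab_size))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)

for ex in examples:
    prompt_ids = [vocab[w] for w in ex["prompt"].split()]
    response_ids = [vocab[w] for w in ex["response"].split()]
    ids = torch.tensor([prompt_ids + response_ids])
    # Supervise only the response tokens; prompt positions are masked out of the loss.
    targets = torch.tensor([[-100] * len(prompt_ids) + response_ids])
    logits = model(ids[:, :-1])
    loss = loss_fn(logits.reshape(-1, vocab_size), targets[:, 1:].reshape(-1))
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```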
Reinforcement learning with human feedback (RLHF)
In RLHF, the model is updated for greater accuracy or relevance based on human users’ judgements of its generated material. Often, people “score” different outputs generated in response to the same enquiry; but RLHF can be as simple as having users correct the output of a chatbot or virtual assistant by typing or speaking back to it.
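One way those human scores are put to work is by training a reward model that prefers the outputs humans preferred. Below is a minimal, illustrative sketch of that idea with a pairwise (Bradley-Terry-style) loss; the preference pairs and the tiny scoring network are hypothetical, and production RLHF pipelines add a reinforcement-learning step on top of this.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical human judgements: for the same enquiry, one output was preferred over another.
# The 16-dimensional vectors stand in for embeddings of real model outputs.
pairs = [
    {"chosen": torch.randn(16), "rejected": torch.randn(16)},
    {"chosen": torch.randn(16), "rejected": torch.randn(16)},
]

reward_model = nn.Linear(16, 1)          # toy reward model: embedding -> scalar score
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-2)

for epoch in range(50):
    for pair in pairs:
        r_chosen = reward_model(pair["chosen"])
        r_rejected = reward_model(pair["rejected"])
        # Pairwise loss: push the preferred output's score above the rejected one's.
        loss = -F.logsigmoid(r_chosen - r_rejected).mean()
        optimizer.zero_grad(); loss.backward(); optimizer.step()
```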
Generation, evaluation, more tuning
In order to improve accuracy or relevance, developers and users continuously evaluate the results of their generative AI apps and adjust the model, sometimes as frequently as once a week.
Retrieval augmented generation (RAG) is another way to boost the performance of a gen AI application. RAG is a framework for extending the foundation model with relevant sources outside of the training data, supplementing the parameters and representations in the original model. With RAG, a generative AI app always has access to the most recent data. In addition, the sources RAG retrieves are transparent to users in a way that the knowledge baked into the original foundation model is not.
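A minimal sketch of the RAG idea follows, assuming the documents and the query have already been embedded as vectors (random stand-ins here): retrieve the most similar documents and prepend them to the prompt at inference time, so the model can draw on sources outside of its training data and those sources can be credited.

```python
import numpy as np

# Hypothetical document store: texts plus their embedding vectors.
# In practice a vector database and a real embedding model would supply these.
docs = ["Refund policy: returns accepted within 30 days.",
        "Shipping: orders ship within 2 business days.",
        "Warranty: hardware is covered for one year."]
doc_vectors = np.random.rand(len(docs), 64)

def retrieve(query_vector, k=2):
    # Cosine similarity between the query and every stored document.
    sims = doc_vectors @ query_vector / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector))
    return [docs[i] for i in np.argsort(sims)[::-1][:k]]

query = "How long do I have to return an item?"
query_vector = np.random.rand(64)        # stand-in for an embedding of the query

context = "\n".join(retrieve(query_vector))
prompt = f"Answer using the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
# `prompt` would then be sent to the foundation model, and the retrieved
# sources can be cited alongside the generated answer.
```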
Benefits of generative AI
Increased efficiency is generative AI’s clear, main advantage. Gen AI has the ability to automate or speed up labour-intensive operations, reduce expenses, and free up employees’ time for higher-value work because it can produce information and answers on demand.
However, generative AI has a number of additional advantages for both individuals and businesses.
Enhanced creativity
Through automated brainstorming that produces numerous original content iterations, Gen AI systems can stimulate creativity. In order to overcome creative barriers, writers, painters, designers, and other creators can also use these variations as a starting point or reference.
Improved (and faster) decision-making
Large dataset analysis, pattern recognition, and significant insight extraction are all areas in which generative AI shines. Based on these insights, it then generates hypotheses and suggestions to help researchers, analysts, executives, and other professionals make more informed decisions.
Dynamic personalization
Generative AI can analyse user preferences and history to create personalised content in real time for applications like content generation and recommendation systems, making the user experience more engaging and customised.
Constant availability
Tasks like automated customer service responses and chatbots are always available thanks to generative AI’s constant, fatigue-free operation.
GenAI architectures and how they have evolved
Over the past 12 years or so, deep learning models that can produce content on demand have evolved into truly generative AI architectures. During that time, the milestone model designs have included:
- Variational autoencoders (VAEs), which fuelled advances in anomaly detection, natural language processing, and image recognition.
- Generative adversarial networks (GANs) and diffusion models, which enabled some of the earliest AI solutions for photorealistic image generation and improved the accuracy of earlier applications.
- Transformers, the deep learning model architecture behind today’s top foundation models and generative AI solutions.
Variational autoencoders (VAEs)
Two interconnected neural networks make up an autoencoder, a deep learning model. One network encodes vast amounts of unlabelled, unstructured training input into parameters, and the other network decodes those parameters to reconstruct the content. Autoencoders are technically capable of producing new content, but they are better at compressing data for transmission or storage, and decompressing it for use, than at generating new high-quality output.
Variational autoencoders (VAEs), which were first introduced in 2013, are capable of encoding data similarly to an autoencoder while decoding numerous unique versions of the same material. Over time, a VAE can “zero in” on more precise, higher-fidelity information by being trained to produce variants towards a certain objective. Natural language synthesis and anomaly detection were among the early uses of VAE.
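Here is a minimal, illustrative VAE in PyTorch, assuming flat 784-dimensional inputs (e.g. small flattened images): the encoder outputs a mean and log-variance, the reparameterisation trick samples a latent code, and the decoder reconstructs a variation of the input. The tiny single-layer networks are stand-ins for real architectures.

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, in_dim=784, latent_dim=16):
        super().__init__()
        self.encoder = nn.Linear(in_dim, 2 * latent_dim)   # outputs mean and log-variance
        self.decoder = nn.Linear(latent_dim, in_dim)

    def forward(self, x):
        mu, logvar = self.encoder(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterisation trick
        recon = torch.sigmoid(self.decoder(z))
        # Loss = reconstruction error + KL term pulling the latent code towards a normal prior.
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        recon_loss = nn.functional.binary_cross_entropy(recon, x, reduction="sum")
        return recon, recon_loss + kl

vae = TinyVAE()
x = torch.rand(8, 784)                   # stand-in for a batch of flattened images
recon, loss = vae(x)
loss.backward()                          # gradients for one training step
```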
Generative adversarial networks (GANs)
GANs, introduced in 2014, consist of two neural networks: a generator that creates new material and a discriminator that evaluates its quality. This adversarial setup encourages the model to produce outputs of ever-higher quality.
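A minimal sketch of one adversarial training step in PyTorch, with toy generator and discriminator networks standing in for real image models: the discriminator learns to tell real data from generated data, while the generator learns to fool it.

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 8, 32
generator = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
discriminator = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(16, data_dim)                     # stand-in for a batch of real samples
fake = generator(torch.randn(16, latent_dim))

# Discriminator step: score real samples as 1, generated samples as 0.
d_loss = bce(discriminator(real), torch.ones(16, 1)) + \
         bce(discriminator(fake.detach()), torch.zeros(16, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: try to make the discriminator score fakes as real.
g_loss = bce(discriminator(fake), torch.ones(16, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```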
Although GANs are frequently used to create images and videos, they may also produce realistic, high-quality content in a variety of other fields. They have shown significant effectiveness in jobs like data augmentation, which involves creating fresh, synthetic data to expand the size and diversity of a training data set, and style transfer, which involves changing the style of an image from, say, a photo to a pencil sketch.
Diffusion models
Diffusion models, also introduced in 2014, work by first adding random, unrecognisable noise to the training data. The model is then trained to iteratively reverse that process, removing the noise step by step until the desired output emerges.
Ultimately, diffusion models provide finer control over output, especially for high-quality image generation tools, but they take longer to train than VAEs or GANs. OpenAI’s image-generation tool, DALL-E, is powered by a diffusion model.
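The training idea can be sketched in a few lines of PyTorch: noise is mixed into the data at a random step of a fixed schedule, and a toy network (a stand-in for the large U-Net or transformer used in real systems) is trained to predict that noise so it can later be removed step by step during sampling.

```python
import torch
import torch.nn as nn

steps, data_dim = 100, 32
betas = torch.linspace(1e-4, 0.02, steps)             # noise schedule
alphas_bar = torch.cumprod(1 - betas, dim=0)

denoiser = nn.Sequential(nn.Linear(data_dim + 1, 64), nn.ReLU(), nn.Linear(64, data_dim))
optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

x0 = torch.randn(16, data_dim)                        # stand-in for a batch of training data
t = torch.randint(0, steps, (16,))
noise = torch.randn_like(x0)

# Forward process: mix the clean data with noise according to the schedule at step t.
a = alphas_bar[t].unsqueeze(1)
xt = torch.sqrt(a) * x0 + torch.sqrt(1 - a) * noise

# Train the model to predict the added noise (conditioned on the step), so sampling
# can later reverse the process one denoising step at a time.
pred = denoiser(torch.cat([xt, t.float().unsqueeze(1) / steps], dim=1))
loss = nn.functional.mse_loss(pred, noise)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```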
Transformers
Transformers, first described in the 2017 paper “Attention Is All You Need” by Ashish Vaswani and colleagues, advance the encoder-decoder paradigm, enabling significant improvements both in how foundation models are trained and in the quality and variety of content they can generate. The majority of today’s highly publicised generative AI technologies, such as ChatGPT and GPT-4, Copilot, BERT, Bard, and Midjourney, are based on these models.
Transformers employ the idea of attention (identifying and concentrating on the most crucial information in a sequence) to do the following; a minimal sketch of the attention computation appears after this list:
- Process complete data sequences, such as sentences, rather than just individual words at a time;
- Record the data’s context within the sequence;
- Transform the training data into embeddings that accurately represent the data and its context.
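As an illustration of the attention idea, here is a minimal scaled dot-product attention function in PyTorch, simplified to a single head and without the masking and multi-head machinery real transformers use: each position’s query is compared against all keys, and the resulting weights decide how much of each value to blend in.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # Compare every query against every key to score how relevant each position is.
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    weights = F.softmax(scores, dim=-1)          # attention weights over the whole sequence
    return weights @ v                           # blend values according to those weights

# Toy example: a sequence of 5 tokens with 16-dimensional embeddings.
x = torch.randn(5, 16)
out = scaled_dot_product_attention(x, x, x)      # self-attention: q, k, v all come from x
print(out.shape)                                 # torch.Size([5, 16])
```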