
Bonus Resource: Multimodal LLMs

Note: In response to the growing interest in multimodal LLMs, this section was added at the conclusion of the bootcamp.

At this point, you're likely familiar with large language models (LLMs) and generative AI. Now, let's delve into the exciting world of Google Gemini via this short explainer by Mudit Srivastava.

Google DeepMind introduced Gemini in December 2023, showcasing its potential to revolutionize the capabilities of LLMs. The launch, however, didn't come without its share of critique, especially regarding Google's approach to polishing Gemini's demo for a more impressive presentation. This practice, quite prevalent among developers and innovators, often stirs up a vital discussion on ethics. Setting that aside, one thing is clear: Google's video offers an intriguing peek into the possibilities of multimodal LLMs and an exciting hint at what the future holds for enthusiasts like us in the world of generative AI. Let's dive into the video and see what's in store!

DeepMind's release comprised three model tiers: Gemini Pro, which matches the abilities of GPT-3.5; the more advanced Gemini Ultra, reported to surpass GPT-4 on a range of benchmarks; and Gemini Nano, optimized for on-device use on mobile.

But have you ever wondered what sets multimodal LLMs apart? Unlike the conventional text-only models, these models are unique in their ability to process diverse data types. Let's dive into this intriguing world!

What are Multimodal Models?

Envision an AI that perceives the world not only through text but also through visuals, audio, and beyond. This is the essence of multimodal models. A notable illustration is found in Google DeepMind's research on Google Gemini.

In this example, Gemini showcases its capability for inverse graphics: given a plot, it deduces the code that could have produced it. Doing so requires reading the visual elements of the chart, inferring the underlying mathematical relationship, and reconstructing the plotting code accurately. (Source: Gemini Team, Google (2023). Gemini: A Family of Highly Capable Multimodal Models.)
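
To make this concrete, here is a minimal sketch of how you might pose an inverse-graphics task to a multimodal model through the google-generativeai Python SDK (the Gemini API mentioned later on this page). The file name plot.png, the prompt wording, and the model name are assumptions for illustration; check the current API documentation before relying on them.

```python
# A minimal sketch of posing an "inverse graphics" task to a vision-capable model.
# Assumptions: the google-generativeai SDK is installed, plot.png exists locally,
# and "gemini-pro-vision" is an available model name.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")             # your Gemini API key
model = genai.GenerativeModel("gemini-pro-vision")  # vision-capable Gemini model

plot_image = Image.open("plot.png")                 # the chart to be "reverse engineered"
prompt = ("Here is a plot. Write the matplotlib code that could have produced it, "
          "including the mathematical function you infer from the curve.")

response = model.generate_content([prompt, plot_image])
print(response.text)                                # the model's reconstructed code
```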

Exploring various data modalities

  • Audio as Visuals: Imagine audio waves transformed into visual representations such as mel spectrograms. This conversion offers a new perspective, making audio data visually interpretable (see the first sketch after this list).

  • Speech into Text: When we transcribe speech, we capture the words but lose nuances like the speaker's tone, volume, and pauses. It's a trade-off: we keep the literal content while missing the emotional cues.

  • Images in Textual Form: Here's where it gets interesting. An image can be broken down into a sequence of patch vectors and then treated much like a sequence of text tokens. It's like translating a visual story into a textual narrative (see the second sketch after this list).

  • Videos – Beyond Moving Images: While it's common to see videos as sequences of images, this overlooks the rich layer of audio that accompanies them. Remember, in platforms like TikTok, sound is not just an add-on; it's a vital part of the experience for most users.

  • Text as Images: Something as simple as photographing text turns it into an image. This is a straightforward but effective way of changing data modalities.

  • Data Tables to Visual Charts: Converting tabular data into charts or graphs transforms dry numbers into engaging visuals, enhancing understanding and insight.
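
To make the first bullet ("Audio as Visuals") concrete, here is a minimal sketch that turns a waveform into a mel spectrogram using the librosa library. The file name speech.wav and the parameter choices are assumptions for illustration.

```python
# A minimal sketch of the "audio as visuals" idea: converting a waveform into a
# mel spectrogram image. Assumes librosa and matplotlib are installed and that
# a file named speech.wav exists locally.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

y, sr = librosa.load("speech.wav")                             # waveform and sample rate
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)   # mel-scaled spectrogram
mel_db = librosa.power_to_db(mel, ref=np.max)                  # convert power to decibels

librosa.display.specshow(mel_db, sr=sr, x_axis="time", y_axis="mel")
plt.colorbar(format="%+2.0f dB")
plt.title("Mel spectrogram")
plt.show()
```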
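
And for the "Images in Textual Form" bullet, here is a sketch in the spirit of Vision Transformers: the image is chopped into fixed-size patches and each patch is flattened into a vector, producing a sequence a model can treat much like a sequence of tokens. The file name photo.jpg and the 16x16 patch size are illustrative assumptions.

```python
# A minimal sketch of turning an image into a sequence of patch vectors.
# Assumes numpy and Pillow are installed and photo.jpg exists locally.
import numpy as np
from PIL import Image

patch = 16
img = np.asarray(Image.open("photo.jpg").convert("RGB").resize((224, 224)))  # (224, 224, 3)

h, w, c = img.shape
patches = img.reshape(h // patch, patch, w // patch, patch, c)    # split rows and columns
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

print(patches.shape)  # (196, 768): 196 "visual tokens", each a 768-dimensional vector
```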

Beyond these, think about the potential of other data types. The possibilities would be endless if we could effectively teach models to learn from bitstrings, the foundational elements of digital data. Imagine a model that could seamlessly learn from any data type!

What about data types like graphs, 3D assets, or even sensory data like smell and touch (haptics)? While we haven't delved deeply into these areas yet, the future of MLLMs in these uncharted territories is both exciting and promising!

Bonus Resource: Recommended if you're already aware of the Encoder-Decoder Architecture

To delve deeper into the workings of Google Gemini, it helps to understand its architecture, which is rooted in the encoder-decoder model. Though Google has not elaborated on it in detail in its publications so far, Gemini's design appears to draw from DeepMind's Flamingo, which features separate text and vision encoders.
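
Since Gemini's internals are not fully published, the following is only a toy illustration of the separate-encoder idea: one encoder for image patches, one for text tokens, and a decoder that attends over both. It is a minimal PyTorch sketch with made-up dimensions and module names, not a description of Gemini or Flamingo themselves.

```python
# A toy "separate encoders + shared decoder" multimodal model, for intuition only.
import torch
import torch.nn as nn

class TinyMultimodalModel(nn.Module):
    def __init__(self, vocab_size=32000, d_model=256, patch_dim=768):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)   # text tokens -> vectors
        self.vision_proj = nn.Linear(patch_dim, d_model)      # image patch features -> same space
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.vision_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)          # predict the next text token

    def forward(self, text_ids, image_patches, target_ids):
        text_memory = self.text_encoder(self.text_embed(text_ids))
        vision_memory = self.vision_encoder(self.vision_proj(image_patches))
        memory = torch.cat([vision_memory, text_memory], dim=1)  # fuse both modalities
        decoded = self.decoder(self.text_embed(target_ids), memory)
        return self.lm_head(decoded)

# Smoke test with random inputs: 8 text tokens and 16 image patches, batch size 1.
model = TinyMultimodalModel()
logits = model(torch.randint(0, 32000, (1, 8)),
               torch.randn(1, 16, 768),
               torch.randint(0, 32000, (1, 8)))
print(logits.shape)  # torch.Size([1, 8, 32000])
```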

Papers for Further Reading

Interested in Gemini's applications for Android? Explore here.

And for those of you who are developers, the Gemini API is now accessible on Kaggle. Explore it here.

This section draws upon the valuable insights from an informative write-up on MLLMs by Chip Huyen. In Multimodal Large Language Models (MLLMs), different data types are translated and interchanged, opening up a realm of possibilities.

Following our module on Transformers in this bootcamp, we've also included a Bonus Resource – a live session on Pathway led by Dr Vijay Srinivas Agneeswaran (Sr. Director, ML Research at Microsoft). This session is ideal for those interested in delving deeper into the role of computer vision in LLMs and exploring Microsoft's advancements in vision transformers.

  • Wu, S., Fei, H., Qu, L., Ji, W., & Chua, T.-S. (2023). NExT-GPT: Any-to-Any Multimodal LLM. NExT++, School of Computing, National University of Singapore.
  • Gemini Team. (2023). Gemini: A Family of Highly Capable Multimodal Models. Google.
  • Yin, S., Fu, C., Zhao, S., Li, K., Sun, X., Xu, T., & Chen, E. (2023). A Survey on Multimodal Large Language Models. USTC & Tencent YouTu Lab.