AIAA👐
Lecture 02
Neural networks quick recap + AI in Computer Vision part 1 👁️
Welcome 👩‍🎤🧑‍🎤👨‍🎤
By the end of this lecture, we'll have learnt about:
The theoretical:
- Introduction to the unit
- Quick revisions on data modalities and neural networks
- AI applications in Computer Vision part 01
- - Image classification, Object detection, Image segmentation, Keypoint detection
The practical:
- test (inference) models on Hugging Face
- test (inference) models from TensorFlow.js
- research models adopted by Apple Core ML
- test (inference) YOLOv11 on your laptop / in a Google Colab notebook
First of all, don't forget to confirm your attendance on
SEAtS App!
What is this unit about?
State-of-the-art AI deployment in professional ecosystems, with applications and advanced analytics in a wide range of domains including:
- Large language models (week 01) 🗣️
- AI in Computer Vision (week 02, 03) 👁️
- AI in audio and music (week 04, 05) 🎵
- AI in 3D and gaming (week 06, 07) 🎮
- AI in affective computing (week 08) 🧠
- AI in recommender systems (week 09) 🤖
- Multi-modal AI (week 10) 🌀
- Explainable AI (week 11) 🔍
This is a very fun, practice-based unit that will get you ready for your final thesis project and beyond.
- We'll play around with a LOT of pre-trained and ready-to-use models during each lecture. 🤗
- We expect you to spend 16 hours each week outside lecture time on the readings📖, hands-on practice & homework✋ for this unit.
- It will be a very rewarding experience if you put in the effort. We will support you all the way! 🫵
Any questions so far?
About me
Xiaowan Yi (pronounced "sh-iao one e")
- I was born in Chengdu, China. My city is famous for pandas 🐼, Taoism ☯️ and spicy food 🌶️.
- I live in Surrey Quays now.
- I'm completing my PhD research in AI&Music at QMUL.
- I make sound and you can find some of my works here.
- I play drums 🥁 for electronic and groovy music.
Recap on "data modality"
Data modalities🕶️
Information that we can gather from the world and store in digital systems as "data" comes mainly from four modalities:
- 👁️Image (picture, video)
- 📝Text (written language)
- 🎵Audio (music, speech)
- 🧩Tabular (this week's weather in degrees Celsius, everyone's birthday in this class, sensor data, etc.)
Can you think of any information that is not from the four categories?
Data modalities🕶️
- Each data modality has its own characteristics and challenges for representation, processing and analysis.
- We structure our unit syllabus based on the different data modalities, each week focusing on one modality (with a few exceptions).
Modalities covered in each week:
- Large language models (week 01) 🗣️: text, multi-modal
- AI in Computer Vision (week 02, 03) 👁️: images, multi-modal
- AI in audio and music (week 04, 05) 🎵: audio, multi-modal
- AI in 3D and gaming (week 06, 07) 🎮: images, multi-modal
- AI in affective computing (week 08) 🧠: text, audio, image
- AI in recommender systems (week 09) 🤖: tabular, multi-modal
- Multi-modal AI (week 10) 🌀: well... all of the above
- Explainable AI (week 11) 🔍: modality invariant
Recap on neural networks
Recap on neural networks 🤖
- Neural networks are a type of machine learning model inspired by the structure and function of biological neural networks.
- Key points:
- - Architectures
- - How to train a neural network?
- - How to inference a neural network?
Recommended reading:
- - Neural Networks and Deep Learning by Michael Nielsen
- - Deep Learning Book by Ian Goodfellow, Yoshua Bengio and Aaron Courville
Neural network architecture 🎳
- Neural nets are composed of layers.
- There are different types of layers (fully connected, convolutional, recurrent, attention, etc.)
- Here is a list of layers implemented by PyTorch (have you used PyTorch before?)
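To make "layers" concrete, here is a minimal sketch (assuming PyTorch is installed; the sizes are arbitrary) of an MLP built by stacking fully connected layers:

```python
import torch
import torch.nn as nn

# A minimal MLP: fully connected (Linear) layers stacked with ReLU activations.
mlp = nn.Sequential(
    nn.Flatten(),         # (batch, 1, 28, 28) -> (batch, 784)
    nn.Linear(784, 128),  # fully connected layer
    nn.ReLU(),            # non-linear activation
    nn.Linear(128, 10),   # output layer: 10 class scores
)

x = torch.randn(32, 1, 28, 28)  # a dummy batch of 32 greyscale images
print(mlp(x).shape)             # torch.Size([32, 10])
```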
Questions for checking understanding 🤔
- What is an MLP?
- What is a CNN?
- What is an RNN?
- What is a Transformer?
🤔 How to inference a neural network? The process:
- - feed input data into the model and get the output from the model
- - also called the forward pass
- - also called 'testing' or 'evaluation'
- - inference does not involve backpropagation
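In PyTorch, inference boils down to a few lines. A minimal sketch (the tiny model and random input are placeholders, not a real trained network):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # stand-in for a trained network
model.eval()             # evaluation mode (affects dropout, batch norm, etc.)

x = torch.randn(1, 4)    # one input sample
with torch.no_grad():    # forward pass only: no gradients, no backpropagation
    y = model(x)
print(y)                 # the model output, e.g. raw class scores
```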
!No need to hard memorise!
🤔 How to train a neural network? The process:
- - feed input data into the model and get the output from the model
- - compare the model output with the ground truth and calculate the loss
- - use backpropagation to update the model parameters
- - repeat the above steps until the model converges
- - this whole process is called training
- - training involves both forward pass and backward pass
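These steps map almost one-to-one onto PyTorch code. A minimal sketch with a placeholder model and random data:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)                      # stand-in model
criterion = nn.CrossEntropyLoss()            # loss function
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

x = torch.randn(32, 4)                       # dummy input data
y = torch.randint(0, 2, (32,))               # dummy ground-truth labels

for epoch in range(10):                      # repeat until the model converges
    out = model(x)                           # forward pass
    loss = criterion(out, y)                 # compare output with ground truth
    optimizer.zero_grad()                    # clear old gradients
    loss.backward()                          # backward pass (backpropagation)
    optimizer.step()                         # update the model parameters
```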
🤔 How to train a neural network? The dataset and splits:
- - training is done on a training dataset
- - the trained model is then evaluated on a validation dataset and/or a test dataset
- - the trained model is then deployed for inference on new data
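One common way to create these splits in PyTorch (the dummy data and the 80/10/10 ratio are just for illustration):

```python
import torch
from torch.utils.data import TensorDataset, random_split

# A dummy dataset of 1000 labelled samples.
dataset = TensorDataset(torch.randn(1000, 4), torch.randint(0, 2, (1000,)))

# 80% training, 10% validation, 10% test.
train_set, val_set, test_set = random_split(dataset, [800, 100, 100])
```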
🤔 How to train a neural network? The hyperparameters:
- - an optimisation algorithm (e.g. SGD, Adam, etc.)
- - a loss function (e.g. cross-entropy, MSE, etc.)
- - a learning rate (e.g. 0.001, 0.01, etc.)
- - a batch size (e.g. 32, 64, etc.)
- - a number of epochs (e.g. 10, 20, etc.)
- - a validation set to monitor overfitting
- - regularisation techniques (e.g. dropout, weight decay, etc.)
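In code, these are simply the values you pass in when wiring up training. A sketch using the example values above:

```python
import torch
import torch.nn as nn

learning_rate = 0.001   # learning rate
batch_size = 32         # batch size
num_epochs = 10         # number of epochs

model = nn.Sequential(
    nn.Linear(4, 16),
    nn.ReLU(),
    nn.Dropout(p=0.5),                 # regularisation: dropout
    nn.Linear(16, 2),
)
criterion = nn.CrossEntropyLoss()      # loss function
optimizer = torch.optim.Adam(          # optimisation algorithm
    model.parameters(),
    lr=learning_rate,
    weight_decay=1e-4,                 # regularisation: weight decay
)
```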
!No need to hard memorise! This knowledge is best internalised through hands-on practice.
📙 Terminologies:
1️⃣Pre-trained
- - means a ready-to-use model
- - means a model that has been trained on a dataset (often large) and can be fine-tuned on a smaller dataset for a specific task in the future
2️⃣Fine-tuning
- - means further training a pre-trained model on a smaller dataset for a specific task
- - means updating the model parameters using backpropagation on the smaller dataset
- - means the model is adapted to the specific task and the dataset
- - means the model is not trained from scratch
3️⃣Transfer learning
- - means using a pre-trained model for a different but related task
- - also means the model is not trained from scratch
- - fine-tuning is a special case of transfer learning
- - transfer learning can be done without fine-tuning (e.g. using the pre-trained model as a feature extractor and adding new layers to the model for training on the new task)
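A sketch of that last case with torchvision (assumptions: a ResNet-18 backbone pre-trained on ImageNet, and a made-up 5-class target task). The pre-trained layers are frozen and used as a fixed feature extractor; only a new output layer is trained:

```python
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

# Load a model pre-trained on ImageNet.
model = resnet18(weights=ResNet18_Weights.DEFAULT)

# Freeze the pre-trained layers: use them as a fixed feature extractor.
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer with a new one for our task (5 classes here).
# Only this layer's parameters will be updated during training.
model.fc = nn.Linear(model.fc.in_features, 5)
```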
AI in Computer Vision part 01 👁️
Computer Vision: a field of AI that teaches computers to "see" and interpret the visual world from images and videos, and derive meaningful information from them.
First question: how does a model in our computer "see" an image? 👁️
- Model "sees" an image as numbers!
- Digital images are made of pixels.
- Each pixel in the image is represented by a number (or a set of numbers for color images).
👁️ Images represented by numbers:
- Two numbers for its width and height (how many pixels).
e.g. 3840 x 2160 for 4K resolution
- Sometimes another number for how many color channels there are.
e.g. 256 x 256 x 3 for an RGB color image
Here is one way to numberify digital images:
- Three numbers for each pixel representing the RGB values in color images, e.g. [128, 0, 128] for purple 🟪
- One number for each pixel representing the greyscale value in grey images, e.g. [128] for a medium grey 🩶
- Put together these color-indicating numbers for all pixels in a multi-dimensional array (also called tensor in deep learning) to represent the image.
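You can check this yourself with Pillow and NumPy (photo.jpg is a placeholder filename):

```python
import numpy as np
from PIL import Image

img = np.array(Image.open("photo.jpg"))  # the image as an array of numbers
print(img.shape)  # e.g. (2160, 3840, 3): height x width x RGB channels
print(img[0, 0])  # the top-left pixel, e.g. [128   0 128] for purple
```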
After "seeing" the images, what meaningful information a model can derive from them? 👁️
👁️ Basic computer vision tasks:
- Image classification
- Object detection
- Image segmentation
- Keypoint detection
👁️ Basic computer vision tasks, characterised by the model (neural network) outputs:
- Image classification: outputs a class label.
- Object detection: outputs bounding boxes and class labels.
- Image segmentation: outputs pixel-wise masks and class labels.
- Keypoint detection: outputs keypoint coordinates and class labels.
👁️ In more human-friendly language, what exactly is the "meaningful information" these tasks derive from images?
- Image classification: what is in the image?
- Object detection: what is in the image and where is it roughly?
- Image segmentation: what is in the image and where is it at the finest level?
- Keypoint detection: where are the important points/coordinates in the image?
👁️ Let's go to Hugging Face (have you explored this platform before?) for web-based, ready-to-use models for:
- Image classification
- Object detection
- Image segmentation
- Keypoint detection
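If you'd rather call these models from code than from the web demos, the Hugging Face transformers pipeline API wraps the same tasks. A minimal sketch (a default pre-trained model is downloaded automatically for each task; cat.jpg is a placeholder):

```python
from transformers import pipeline

# Each task name loads a ready-to-use pre-trained model.
classifier = pipeline("image-classification")
detector = pipeline("object-detection")
segmenter = pipeline("image-segmentation")

print(classifier("cat.jpg"))  # e.g. [{'label': 'tabby cat', 'score': 0.92}, ...]
```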
👁️ Let's go to the TensorFlow.js GitHub repo (have you explored this platform before?) for models and demos for:
- Image classification
- Object detection
- Image segmentation
- Hand/Pose/Facial Keypoint detection
fun AI time 🎉
Artworks that use face detection
- Pareidolia
- - facial detection is applied to grains of sand. A fully automated robot search engine examines the grains of sand in situ. When the machine finds a face in one of the grains, the portrait is recorded.
- Hello
- - a large-scale kinetic sculpture in the form of an ancient Greek architectural pillar, which observes its surroundings as if nodding to visitors, moving like a mutated snake.
Check out computer vision models adopted by Apple Core ML
These are industry-level models.
What attributes are important for models to be industry standard?
- Performance in accuracy/precision/robustness, size, inference time (speed), etc.
✋Practice:
- Pick one model that can do one of the four basic computer vision tasks, and research the following:
- - What is the model name?
- - Is there a public code repository, e.g. on GitHub?
- - What dataset is the model trained on?
- - What is the size of the model?
- - Any inference time (speed) information?
- - Is the model trained from scratch or fine-tuned from another model?
- - What is the model's performance like, and on what evaluation metrics?
👐 Let's share our findings after 25 mins!👐
YOLO on Ultralytics
- an advanced computer vision model family, first introduced by Joseph Redmon et al. in 2015 and now maintained and developed (with different versions) by Ultralytics, known for its high speed, accuracy, and versatility.
- deployed in various industry applications including autonomous vehicles, robotics, text recognition, visual inspection systems, and more.
- the latest version is YOLOv11, though YOLOv8 is also widely used.
- What is the full name of YOLO?
- A gentle introduction from the main developer
YOLO on Ultralytics
Hands-on:
- Inference YOLOv11 in a Google Colab notebook.
- Set up YOLOv11 on your laptop and inference YOLOv11 from your laptop.
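To get you started, a minimal inference sketch with the ultralytics Python package (the yolo11n.pt weights download automatically on first run; bus.jpg is a placeholder image path):

```python
from ultralytics import YOLO

model = YOLO("yolo11n.pt")  # a small pre-trained YOLOv11 detection model
results = model("bus.jpg")  # run inference on an image

results[0].show()           # display the image with predicted bounding boxes
print(results[0].boxes)     # box coordinates, class labels and confidences
```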
📘 Keep a cool dev note of:
- - Error messages you have encountered
- - Words that you don't understand
- - Things that you find interesting
YOLO on Ultralytics
Homework:
- Inference YOLOv11 on your laptop for all 4 tasks, and keep a record of the inference times (see the timing sketch after this list).
- Set up YOLOv11 on the workstation and inference YOLOv11 from the workstation utilising the GPU; keep a record of the inference times.
- SEND ME your dev note and inference time records by next Monday!
- [Optional but highly rewarding] Fine-tune YOLOv11 on the workstation, using a custom dataset (think of an application that is interesting to you! Check public datasets on Roboflow for inspiration).
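For the inference time records, each Ultralytics result object carries per-image timings; a sketch (attribute names as in the Ultralytics docs, so double-check against your installed version):

```python
from ultralytics import YOLO

model = YOLO("yolo11n.pt")
results = model("bus.jpg")

# Per-image timings in milliseconds: preprocess, inference, postprocess.
print(results[0].speed)  # e.g. {'preprocess': 1.2, 'inference': 45.3, ...}
```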
🕶️ What we have learnt today:
- Recap on data modalities and neural networks
- Image classification, Object detection, Image segmentation, Keypoint detection
- Played around with pre-trained computer vision models on Hugging Face, TensorFlow.js and Apple Core ML
- Inspected and ran inference on an example of an industry-level computer vision model: YOLOv11
We'll see you next Monday same time and same place!