AIAA👐
Lecture 02
Neural networks quick recap + AI in Computer Vision part 1 👁️
Welcome 👩‍🎤🧑‍🎤👨‍🎤
By the end of this lecture, we'll have learnt about:
The theoretical:
- Introduction to the unit
- Quick revisions on data modalities and neural networks
- AI applications in Computer Vision part 01
- - Image classification, Object detection, Image segmentation, Keypoint detection
The practical:
- test (inference) models on Hugging Face
- test (inference) models from TensorFlow.js
- research models adopted by Apple Core ML
- test (inference) YOLOv11 on your laptop / in a Google Colab notebook
First of all, don't forget to confirm your attendance on
SEAtS App!
What is this unit about?
State-of-the-art AI deployment in professional ecosystems, with applications and advanced analytics in a wide range of domains including:
- Large language models (week 01) 🗣️
- AI in Computer Vision (week 02, 03) 👁️
- AI in audio and music (week 04, 05) 🎵
- AI in 3D and gaming (week 06, 07) 🎮
- AI in affective computing (week 08) 🧠
- AI in recommender systems (week 09) 🤖
- Multi-modal AI (week 10) 🌀
- Explainable AI (week 11) 🔍
This is a very fun, practice-based unit that will get you ready for your final thesis project and beyond.
- We'll play around with a LOT of pre-trained and ready-to-use models during each lecture. 🤗
- We expect you to spend 16 hours each week outside lecture time on the readings📖, hands-on practice & homework✋ for this unit.
- It will be a very rewarding experience if you put in the effort. We will support you all the way! 🫵
Any questions so far?
About me
Xiaowan Yi (pronounced "sh-iao one e")
- I was born in Chengdu, China. My city is famous for pandas 🐼, Taoism ☯️ and spicy food 🌶️.
- I live in Surrey Quays now.
- I'm completing my PhD research in AI&Music at QMUL.
- I make sound and you can find some of my works here.
- I play drums 🥁 for electronic and groovy music.
Recap on "data modality"
Data modalities🕶️
Information that we can gather from the world and store in digital systems as "data" comes mainly from four modalities:
- 👁️Image (picture, video)
- 📝Text (written language)
- 🎵Audio (music, speech)
- 🧩Tabular (this week's weather in degrees Celsius, everyone's birthday in this class, sensor data, etc.)
Can you think of any information that is not from the four categories?
Data modalities🕶️
- Each data modality has its own characteristics and challenges for representation, processing and analysis.
- We structure our unit syllabus based on the different data modalities, each week focusing on one modality (with a few exceptions).
Modalities covered in each week:
- Large language models (week 01) 🗣️: text, multi-modal
- AI in Computer Vision (week 02, 03) 👁️: images, multi-modal
- AI in audio and music (week 04, 05) 🎵: audio, multi-modal
- AI in 3D and gaming (week 06, 07) 🎮: images, multi-modal
- AI in affective computing (week 08) 🧠: text, audio, image
- AI in recommender systems (week 09) 🤖: tabular, multi-modal
- Multi-modal AI (week 10) 🌀: well... all of the above
- Explainable AI (week 11) 🔍: modality invariant
Recap on neural networks
Recap on neural networks 🤖
- Neural networks are a type of machine learning model inspired by the structure and function of biological neural networks.
- Key points:
- - Architectures
- - How to train a neural network?
- - How to inference a neural network?
Recommended reading:
- - Neural Networks and Deep Learning by Michael Nielsen
- - Deep Learning Book by Ian Goodfellow, Yoshua Bengio and Aaron Courville
Neural network architecture 🎳
- Neural nets are composed of layers.
- There are different types of layers (fully connected, convolutional, recurrent, attention, etc.)
- Here is a list of layers implemented by PyTorch (have you used PyTorch before?)
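To make "layers" concrete, here is a minimal sketch (assuming PyTorch is installed; the sizes are arbitrary) of an MLP built by stacking fully connected layers:

```python
import torch
import torch.nn as nn

# A minimal MLP: fully connected (Linear) layers stacked with ReLU activations.
mlp = nn.Sequential(
    nn.Flatten(),         # (batch, 1, 28, 28) -> (batch, 784)
    nn.Linear(784, 128),  # fully connected layer
    nn.ReLU(),            # non-linear activation
    nn.Linear(128, 10),   # output layer: 10 class scores
)

x = torch.randn(32, 1, 28, 28)  # a dummy batch of 32 greyscale images
print(mlp(x).shape)             # torch.Size([32, 10])
```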
Questions for checking understanding 🤔
- What is an MLP?
- What is a CNN?
- What is an RNN?
- What is a Transformer?
🤔 How to inference a neural network? The process:
- - feed input data into the model and get the output from the model
- - also called the forward pass
- - also called 'testing' or 'evaluation'
- - inference does not involve backpropagation
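In PyTorch, inference boils down to a few lines. A minimal sketch (the tiny model and random input are placeholders, not a real trained network):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # stand-in for a trained network
model.eval()             # evaluation mode (affects dropout, batch norm, etc.)

x = torch.randn(1, 4)    # one input sample
with torch.no_grad():    # forward pass only: no gradients, no backpropagation
    y = model(x)
print(y)                 # the model output, e.g. raw class scores
```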
!No need to hard memorise!
🤔 How to train a neural network? The process:
- - feed input data into the model and get the output from the model
- - compare the model output with the ground truth and calculate the loss
- - use backpropagation to update the model parameters
- - repeat the above steps until the model converges
- - this whole process is called training
- - training involves both forward pass and backward pass
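These steps map almost one-to-one onto PyTorch code. A minimal sketch with a placeholder model and random data:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)                      # stand-in model
criterion = nn.CrossEntropyLoss()            # loss function
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

x = torch.randn(32, 4)                       # dummy input data
y = torch.randint(0, 2, (32,))               # dummy ground-truth labels

for epoch in range(10):                      # repeat until the model converges
    out = model(x)                           # forward pass
    loss = criterion(out, y)                 # compare output with ground truth
    optimizer.zero_grad()                    # clear old gradients
    loss.backward()                          # backward pass (backpropagation)
    optimizer.step()                         # update the model parameters
```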
🤔 How to train a neural network? The dataset and splits:
- - training is done on a training dataset
- - the trained model is then evaluated on a validation dataset and/or a test dataset
- - the trained model is then deployed for inference on new data
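One common way to create these splits in PyTorch (the dummy data and the 80/10/10 ratio are just for illustration):

```python
import torch
from torch.utils.data import TensorDataset, random_split

# A dummy dataset of 1000 labelled samples.
dataset = TensorDataset(torch.randn(1000, 4), torch.randint(0, 2, (1000,)))

# 80% training, 10% validation, 10% test.
train_set, val_set, test_set = random_split(dataset, [800, 100, 100])
```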
🤔 How to train a neural network? The hyperparameters:
- - an optimisation algorithm (e.g. SGD, Adam, etc.)
- - a loss function (e.g. cross-entropy, MSE, etc.)
- - a learning rate (e.g. 0.001, 0.01, etc.)
- - a batch size (e.g. 32, 64, etc.)
- - a number of epochs (e.g. 10, 20, etc.)
- - a validation set to monitor overfitting
- - regularisation techniques (e.g. dropout, weight decay, etc.)
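In code, these are simply the values you pass in when wiring up training. A sketch using the example values above:

```python
import torch
import torch.nn as nn

learning_rate = 0.001   # learning rate
batch_size = 32         # batch size
num_epochs = 10         # number of epochs

model = nn.Sequential(
    nn.Linear(4, 16),
    nn.ReLU(),
    nn.Dropout(p=0.5),                 # regularisation: dropout
    nn.Linear(16, 2),
)
criterion = nn.CrossEntropyLoss()      # loss function
optimizer = torch.optim.Adam(          # optimisation algorithm
    model.parameters(),
    lr=learning_rate,
    weight_decay=1e-4,                 # regularisation: weight decay
)
```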
!No need to hard memorise! This knowledge is best internalised through hands-on practice.
📙 Terminologies:
1️⃣Pre-trained
- - means a ready-to-use model
- - means a model that has been trained on a dataset (often large) and can be fine-tuned on a smaller dataset for a specific task in the future
2️⃣Fine-tuning
- - means further training a pre-trained model on a smaller dataset for a specific task
- - means updating the model parameters using backpropagation on the smaller dataset
- - means the model is adapted to the specific task and the dataset
- - means the model is not trained from scratch
3️⃣Transfer learning
- - means using a pre-trained model for a different but related task
- - also means the model is not trained from scratch
- - fine-tuning is a special case of transfer learning
- - transfer learning can be done without fine-tuning (e.g. using the pre-trained model as a feature extractor and adding new layers to the model for training on the new task)
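A sketch of that last case with torchvision (assumptions: a ResNet-18 backbone pre-trained on ImageNet, and a made-up 5-class target task). The pre-trained layers are frozen and used as a fixed feature extractor; only a new output layer is trained:

```python
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

# Load a model pre-trained on ImageNet.
model = resnet18(weights=ResNet18_Weights.DEFAULT)

# Freeze the pre-trained layers: use them as a fixed feature extractor.
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer with a new one for our task (5 classes here).
# Only this layer's parameters will be updated during training.
model.fc = nn.Linear(model.fc.in_features, 5)
```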
AI in Computer Vision part 01 👁️
Computer Vision: a field of AI that teaches computers to "see" and interpret the visual world from images and videos, and derive meaningful information from them.
First question: how does a model in our computer "see" an image? 👁️
- Model "sees" an image as numbers!
- Digital images are made of pixels.
- Each pixel in the image is represented by a number (or a set of numbers for color images).
👁️ Images represented by numbers:
- Two numbers for its width and height (how many pixels).
e.g. 3840 x 2160 for 4K resolution
- Sometimes another number for how many color channels there are.
e.g. 256 x 256 x 3 for an RGB color image
Here is one way to numberify digital images:
- Three numbers for each pixel representing the RGB values in color images, e.g. [128, 0, 128] for purple 🟪
- One number for each pixel representing the greyscale value in grey images, e.g. [128] for a medium grey 🩶
- Put together these color-indicating numbers for all pixels in a multi-dimensional array (also called tensor in deep learning) to represent the image.
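You can check this yourself with Pillow and NumPy (photo.jpg is a placeholder filename):

```python
import numpy as np
from PIL import Image

img = np.array(Image.open("photo.jpg"))  # the image as an array of numbers
print(img.shape)  # e.g. (2160, 3840, 3): height x width x RGB channels
print(img[0, 0])  # the top-left pixel, e.g. [128   0 128] for purple
```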
After "seeing" the images, what meaningful information a model can derive from them? 👁️
👁️ Basic computer vision tasks:
- Image classification
- Object detection
- Image segmentation
- Keypoint detection
👁️ Basic computer vision tasks, characterised by the model (neural network) outputs:
- Image classification: outputs a class label.
- Object detection: outputs bounding boxes and class labels.
- Image segmentation: outputs pixel-wise masks and class labels.
- Keypoint detection: outputs keypoint coordinates and class labels.
👁️ In more human-friendly language, what exactly is the "meaningful information" these tasks derive from images?
- Image classification: what is in the image?
- Object detection: what is in the image and where is it roughly?
- Image segmentation: what is in the image and where is it at the finest level?
- Keypoint detection: where are the important points/coordinates in the image?
👁️ Let's go to Hugging Face (have you explored this platform before?) for web-based, ready-to-use models for:
- Image classification
- Object detection
- Image segmentation
- Keypoint detection
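If you'd rather call these models from code than from the web demos, the Hugging Face transformers pipeline API wraps the same tasks. A minimal sketch (a default pre-trained model is downloaded automatically for each task; cat.jpg is a placeholder):

```python
from transformers import pipeline

# Each task name loads a ready-to-use pre-trained model.
classifier = pipeline("image-classification")
detector = pipeline("object-detection")
segmenter = pipeline("image-segmentation")

print(classifier("cat.jpg"))  # e.g. [{'label': 'tabby cat', 'score': 0.92}, ...]
```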
👁️ Let's go to the TensorFlow.js GitHub repo (have you explored this platform before?) for models and demos for:
- Image classification
- Object detection
- Image segmentation
- Hand/Pose/Facial Keypoint detection
fun AI time 🎉
Artworks that use face detection
- Pareidolia
- - facial detection is applied to grains of sand. A fully automated robot search engine examines the grains of sand in situ. When the machine finds a face in one of the grains, the portrait is recorded.
- Hello
- - a large-scale kinetic sculpture in the form of an ancient Greek architectural pillar, which observes its surroundings as if nodding to visitors, moving like a mutated snake.
Check out computer vision models adopted by Apple Core ML
These are industry-level models.
What attributes are important for models to be industry standard?
- Performance in accuracy/precision/robustness, size, inference time (speed), etc.
✋Practice:
- Pick one model that can do one of the four basic computer vision tasks, and research the following:
- - What is the model name?
- - Is there a public code repository, e.g. on GitHub?
- - What dataset is the model trained on?
- - What is the size of the model?
- - Any inference time (speed) information?
- - Is the model trained from scratch or fine-tuned from another model?
- - What is the model's performance like, and on what evaluation metrics?
👐 Let's share our findings after 25 mins!👐
YOLO on Ultralytics
- an advanced computer vision model family, first introduced by Joseph Redmon et al. in 2015 and now maintained and developed (with different versions) by Ultralytics, known for its high speed, accuracy, and versatility.
- deployed in various industry applications including autonomous vehicles, robotics, text recognition, visual inspection systems, and more.
- the latest version is YOLOv11, though YOLOv8 is also widely used.
- What is the full name of YOLO?
- A gentle introduction from the main developer
YOLO on Ultralytics
Hands-on:
- Inference YOLOv11 in a Google Colab notebook.
- Set up YOLOv11 on your laptop and inference YOLOv11 from your laptop.
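To get you started, a minimal inference sketch with the ultralytics Python package (the yolo11n.pt weights download automatically on first run; bus.jpg is a placeholder image path):

```python
from ultralytics import YOLO

model = YOLO("yolo11n.pt")  # a small pre-trained YOLOv11 detection model
results = model("bus.jpg")  # run inference on an image

results[0].show()           # display the image with predicted bounding boxes
print(results[0].boxes)     # box coordinates, class labels and confidences
```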
📘 Keep a cool dev note of:
- - Error messages you have encountered
- - Words that you don't understand
- - Things that you find interesting
YOLO on Ultralytics
Homework:
- Inference YOLOv11 on your laptop for all 4 tasks, and keep a record of the inference times (see the timing sketch after this list).
- Set up YOLOv11 on the workstation and inference YOLOv11 from the workstation utilising the GPU; keep a record of the inference times.
- SEND ME your dev note and inference time records by next Monday!
- [Optional but highly rewarding] Fine-tune YOLOv11 on the workstation, using a custom dataset (think of an application that is interesting to you! Check public datasets on Roboflow for inspiration).
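For the inference time records, each Ultralytics result object carries per-image timings; a sketch (attribute names as in the Ultralytics docs, so double-check against your installed version):

```python
from ultralytics import YOLO

model = YOLO("yolo11n.pt")
results = model("bus.jpg")

# Per-image timings in milliseconds: preprocess, inference, postprocess.
print(results[0].speed)  # e.g. {'preprocess': 1.2, 'inference': 45.3, ...}
```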
🕶️ What we have learnt today:
- Recap on data modalities and neural networks
- Image classification, Object detection, Image segmentation, Keypoint detection
- Played around with pre-trained computer vision models on Hugging Face, TensorFlow.js and Apple Core ML
- Inspected and ran inference on an example of an industry-level computer vision model: YOLOv11
We'll see you next Monday same time and same place!