Face Recognition and Neural Style Transfer

What is Face Recognition?
What is Neural Style Transfer?
- What are Deep ConvNets Learning?
- Cost Function
  - Content Cost Function
  - Style Cost Function
1D and 3D Generalizations
- 1D Generalization
- 3D Generalization
Summary

What is Face Recognition?

Face recognition is the task of identifying or verifying a person’s identity using their facial features. It can be broken down into three main categories:

Face Detection: Locate faces in an image (bounding box).
Face Verification: Check if two faces are of the same person (1:1 comparison).
Face Recognition/Identification: Identify a person from a database (1:N comparison).

Real-World Applications

Smartphone unlock (Face ID)
Security surveillance
Online proctoring
Social media tagging (e.g., Facebook)

One Shot Learning

Traditional classification algorithms require many training examples per class. However, in face recognition:

We might only have one image per person.
The task becomes: Can the model recognize a face it has seen only once?

This is known as One-Shot Learning.

Problem Setup

Instead of learning to classify, the model learns similarity between pairs of images.
A distance function is trained to return a small value for two images of the same person and a large value for images of different people.
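As a minimal sketch of this setup, assume a distance function `d` is already available. Here it is a hypothetical stand-in (squared Euclidean distance on raw vectors), not a trained model. One-shot recognition then reduces to comparing a query against the single stored example per person. The names `database` and `recognize` and the threshold value are illustrative.

```python
import numpy as np

def d(x1, x2):
    """Stand-in distance: squared Euclidean distance between two vectors.
    In a real system this would be computed from a trained network's outputs."""
    diff = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return float(np.sum(diff ** 2))

def recognize(query, database, threshold):
    """Return the name whose single stored example is closest to `query`,
    or None if even the closest example is not within `threshold`."""
    best_name, best_dist = None, float("inf")
    for name, example in database.items():
        dist = d(query, example)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist < threshold else None

# One stored example per person (toy 3-dimensional "images").
database = {"alice": [0.0, 0.0, 1.0], "bob": [1.0, 0.0, 0.0]}
print(recognize([0.1, 0.0, 0.9], database, threshold=0.5))  # close to alice's example
print(recognize([5.0, 5.0, 5.0], database, threshold=0.5))  # far from everyone
```

Adding a new person only requires adding one entry to `database`; nothing is retrained, which is the practical appeal of learning a similarity function instead of a classifier.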

Siamese Network

A Siamese Network consists of two identical ConvNets (with shared weights) that compare two inputs.

Architecture Overview

Two inputs: $x_{1}$ and $x_{2}$
Same CNN maps both to feature vectors $f (x_{1})$ and $f (x_{2})$
A distance metric (e.g., L2 norm) is applied:

$d (x_{1}, x_{2}) = ∥ f (x_{1}) - f (x_{2}) ∥_{2}^{2}$
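The shared-weight idea can be illustrated with a toy example. In this sketch the CNN is replaced by a single hypothetical linear layer with ReLU (the matrix `W` and the input size are assumptions made for illustration); the point is only that both inputs pass through the same function `f` before the squared L2 distance is taken.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical shared weights: maps an 8-dim input to a 4-dim embedding.
W = rng.normal(size=(4, 8))

def f(x):
    """Toy embedding: one linear layer followed by ReLU (stands in for the CNN)."""
    return np.maximum(W @ x, 0.0)

def distance(x1, x2):
    """d(x1, x2) = || f(x1) - f(x2) ||_2^2, with the same W used for both inputs."""
    diff = f(x1) - f(x2)
    return float(diff @ diff)

x1 = rng.normal(size=8)
x2 = rng.normal(size=8)
print(distance(x1, x1))  # identical inputs give 0.0
print(distance(x1, x2))  # non-negative value for two different inputs
```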

Loss Function

A contrastive loss or triplet loss trains the network to make distances small for images of the same identity and large for images of different identities.

Triplet Loss

Triplet Loss is a powerful loss function for learning embeddings. It relies on triplets:

Anchor (A): A known image
Positive (P): Image of the same identity
Negative (N): Image of a different identity

We want:

$∥ f (A) - f (P) ∥_{2}^{2} + α < ∥ f (A) - f (N) ∥_{2}^{2}$

Where:

$f (x)$ is the embedding function (ConvNet output)
$α$ is a margin to separate positive and negative pairs

Loss Function

The Triplet Loss is:

$L (A, P, N) = \max (∥ f (A) - f (P) ∥_{2}^{2} - ∥ f (A) - f (N) ∥_{2}^{2} + α, 0)$
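A minimal NumPy sketch of this formula for a single triplet of precomputed embeddings is shown here. The margin value `alpha=0.2` and the toy vectors are assumptions chosen for illustration.

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """max(||f(A) - f(P)||^2 - ||f(A) - f(N)||^2 + alpha, 0) for one triplet."""
    pos_dist = np.sum((f_a - f_p) ** 2)  # anchor-positive squared distance
    neg_dist = np.sum((f_a - f_n) ** 2)  # anchor-negative squared distance
    return float(max(pos_dist - neg_dist + alpha, 0.0))

f_a = np.array([1.0, 0.0])
f_p = np.array([0.9, 0.1])   # close to the anchor
f_n = np.array([-1.0, 0.0])  # far from the anchor

print(triplet_loss(f_a, f_p, f_n))  # 0.0: the margin is already satisfied
print(triplet_loss(f_a, f_n, f_p))  # positive: roles swapped, so the constraint is violated
```

The loss is zero whenever the negative is already at least `alpha` farther from the anchor than the positive, so such triplets contribute no gradient.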

Important Notes

Semi-hard negative mining improves convergence: choose negatives that are farther from the anchor than the positive but still within the margin $α$, so they are hard but not too hard.
Embeddings are often normalized to unit length.
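Both notes can be sketched in a few lines. The selection rule below uses the common definition of a semi-hard negative, $d(A, P) < d(A, N) < d(A, P) + α$; the function names, the margin value, and the toy vectors are assumptions for illustration.

```python
import numpy as np

def l2_normalize(e, eps=1e-12):
    """Scale each embedding (last axis) to unit Euclidean length."""
    norms = np.linalg.norm(e, axis=-1, keepdims=True)
    return e / np.maximum(norms, eps)

def semi_hard_negatives(f_a, f_p, negatives, alpha=0.2):
    """Indices of negatives N with d(A,P) < d(A,N) < d(A,P) + alpha."""
    d_ap = np.sum((f_a - f_p) ** 2)
    d_an = np.sum((negatives - f_a) ** 2, axis=1)
    return np.flatnonzero((d_an > d_ap) & (d_an < d_ap + alpha))

f_a = l2_normalize(np.array([1.0, 0.0]))
f_p = l2_normalize(np.array([1.0, 0.1]))
negatives = l2_normalize(np.array([
    [1.0, 0.05],   # closer to the anchor than the positive: too hard
    [1.0, 0.3],    # slightly farther than the positive: semi-hard
    [-1.0, 0.0],   # far beyond the margin: easy
]))

print(semi_hard_negatives(f_a, f_p, negatives))  # only index 1 qualifies
```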

Face Verification and Binary Classification

Once we have embeddings from a trained network (e.g., using triplet loss), we can perform face verification as a binary classification task.

Verification Pipeline

Encode both face images to embeddings.
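Compute the Euclidean distance (or cosine similarity) between the two embeddings.
If the distance is below a threshold $θ$ (or the similarity is above one) $\Rightarrow$ same person.

The threshold $θ$ is chosen from the trade-off between false positive rate and true positive rate, read off an ROC curve computed on a validation set.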
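The pipeline and the threshold choice can be sketched as follows. The validation distances and labels are made up, and the selection rule in `pick_threshold` (highest true positive rate subject to a cap on the false positive rate) is one possible rule chosen for illustration, not the only way to read an ROC curve.

```python
import numpy as np

def verify(e1, e2, theta):
    """Same person if the Euclidean distance between embeddings is below theta."""
    return bool(np.linalg.norm(e1 - e2) < theta)

def roc_points(distances, same, thresholds):
    """(threshold, FPR, TPR) for each candidate threshold on labeled validation pairs."""
    distances = np.asarray(distances, dtype=float)
    same = np.asarray(same, dtype=bool)
    points = []
    for t in thresholds:
        accepted = distances < t
        tpr = np.mean(accepted[same])    # fraction of same-person pairs accepted
        fpr = np.mean(accepted[~same])   # fraction of different-person pairs accepted
        points.append((float(t), float(fpr), float(tpr)))
    return points

def pick_threshold(points, max_fpr):
    """Threshold with the highest TPR among those whose FPR is at most max_fpr."""
    ok = [p for p in points if p[1] <= max_fpr]
    return max(ok, key=lambda p: p[2])[0] if ok else None

# Hypothetical validation pairs: distance and whether the pair is the same person.
distances = [0.2, 0.4, 0.8, 0.6, 1.1, 1.3]
same = [True, True, True, False, False, False]
points = roc_points(distances, same, thresholds=[0.3, 0.5, 0.7, 0.9, 1.2])

theta = pick_threshold(points, max_fpr=0.0)
print(theta)
print(verify(np.array([0.0, 0.0]), np.array([0.3, 0.0]), theta))
```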

What is Neural Style Transfer?

Neural Style Transfer is the task of synthesizing an image that:

Preserves the content of a content image
Adopts the style of a style image

It uses a pre-trained ConvNet (such as VGG19) to extract content and style representations.

Let:

$C$ be the content image
$S$ be the style image
$G$ be the generated image

Then we optimize $G$ to minimize a cost function:

$J (G) = α J_{\text{content}} (C, G) + β J_{\text{style}} (S, G)$

What are Deep ConvNets Learning?

Deep ConvNets learn hierarchical representations:

Early layers: edges, colors, textures
Mid layers: shapes, motifs
Later layers: object-level concepts

In NST, content is read from a deeper layer, while style is captured from several layers, including the shallower ones.

Cost Function

The total cost is:

$J (G) = α J_{\text{content}} (C, G) + β J_{\text{style}} (S, G)$

Where:

$α$ : weight for content preservation
$β$ : weight for style transfer
Typically: $α = 1$, $β = 10^{3}$ to $10^{4}$

Content Cost Function

Let $a^{[l] (C)}$ and $a^{[l] (G)}$ be activations at layer $l$ for the content and generated images.

Then content cost is:

$J_{\text{content}} (C, G) = \frac{1}{2} ∥ a^{[l] (C)} - a^{[l] (G)} ∥_{2}^{2}$

Use a deeper layer (e.g., conv4_2) for this.
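A minimal NumPy sketch of the content cost, assuming the layer activations have already been extracted as arrays of shape `(n_H, n_W, n_C)`. The shapes and values below are made up for illustration.

```python
import numpy as np

def content_cost(a_C, a_G):
    """J_content = 1/2 * || a_C - a_G ||_2^2, summed over every activation entry."""
    return 0.5 * float(np.sum((a_C - a_G) ** 2))

# Hypothetical activations at the chosen layer, shape (n_H, n_W, n_C).
a_C = np.zeros((4, 4, 3))
a_G = np.zeros((4, 4, 3))
print(content_cost(a_C, a_G))  # 0.0: identical activations

a_G[0, 0, 0] = 2.0
print(content_cost(a_C, a_G))  # 2.0: one entry differs by 2, and 0.5 * 2^2 = 2
```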

Style Cost Function

Style is captured by correlations between feature maps using a Gram matrix.

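Let $a^{[l] (S)}$ be the activations at layer $l$ for the style image, unrolled so that $a_{ik}^{[l]}$ is the activation of channel $i$ at spatial position $k$. The Gram matrix is:

$G_{ij}^{[l]} = \sum_{k} a_{ik}^{[l]} a_{jk}^{[l]}$

It is computed in the same way for the generated image, giving $G^{[l] (S)}$ and $G^{[l] (G)}$.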

Style cost is:

$J_{\text{style}}^{[l]} (S, G) = \frac{1}{(2 n_{H} n_{W} n_{C})^{2}} ∥ G^{[l] (S)} - G^{[l] (G)} ∥_{F}^{2}$

Then sum over multiple layers:

$J_{\text{style}} (S, G) = \sum_{l} λ^{[l]} J_{\text{style}}^{[l]} (S, G)$
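The Gram matrix, the per-layer style cost, and the weighted sum over layers can be sketched in NumPy as follows. This assumes activations are stored channels-last with shape `(n_H, n_W, n_C)`; the function names, toy shapes, and layer weights are illustrative.

```python
import numpy as np

def gram_matrix(a):
    """a: (n_H, n_W, n_C). Returns (n_C, n_C) with G[i, j] = sum_k a[i, k] * a[j, k],
    where k runs over the n_H * n_W spatial positions."""
    n_H, n_W, n_C = a.shape
    unrolled = a.reshape(n_H * n_W, n_C).T  # shape (n_C, n_H * n_W)
    return unrolled @ unrolled.T

def layer_style_cost(a_S, a_G):
    """Squared Frobenius distance between Gram matrices, scaled by 1 / (2 n_H n_W n_C)^2."""
    n_H, n_W, n_C = a_S.shape
    diff = gram_matrix(a_S) - gram_matrix(a_G)
    return float(np.sum(diff ** 2)) / (2 * n_H * n_W * n_C) ** 2

def style_cost(acts_S, acts_G, lambdas):
    """Weighted sum of per-layer style costs."""
    return sum(lam * layer_style_cost(s, g)
               for s, g, lam in zip(acts_S, acts_G, lambdas))

# Toy single-channel example: Gram matrices are [[4]] and [[0]].
a_S = np.ones((2, 2, 1))
a_G = np.zeros((2, 2, 1))
print(layer_style_cost(a_S, a_G))                   # 16 / (2*2*2*1)^2 = 0.25
print(style_cost([a_S, a_S], [a_G, a_S], [0.5, 0.5]))  # 0.5 * 0.25 + 0.5 * 0 = 0.125
```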

1D and 3D Generalizations

1D Generalization

Neural style transfer principles can be applied to audio signals:

1D convolution over waveform
Preserve temporal content, apply style of another sound

3D Generalization

Applied to volumetric data such as:

3D MRI scans
3D point clouds
Transfer spatial styles across 3D volumes

These require 3D convolutional layers, with Gram matrices computed over volumetric rather than 2D feature maps.
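One straightforward way to extend the Gram matrix to 1D and 3D feature maps is to flatten every spatial axis into a single position axis before taking channel correlations. The sketch below assumes a channels-last layout; it is an illustration of that idea, not necessarily the custom calculation a given 1D or 3D style-transfer method uses.

```python
import numpy as np

def gram_matrix_nd(a):
    """a: (..., n_C) with any number of leading spatial axes, channels last.
    Returns the (n_C, n_C) matrix of channel correlations summed over all positions."""
    n_C = a.shape[-1]
    unrolled = a.reshape(-1, n_C)  # shape (positions, n_C)
    return unrolled.T @ unrolled

rng = np.random.default_rng(0)
audio_features = rng.normal(size=(100, 8))       # 1D: (time steps, channels)
volume_features = rng.normal(size=(4, 4, 4, 8))  # 3D: (depth, height, width, channels)

print(gram_matrix_nd(audio_features).shape)   # (8, 8)
print(gram_matrix_nd(volume_features).shape)  # (8, 8)
```

Because the result depends only on the number of channels, the style cost formula stays unchanged apart from its normalization constant, which would need to account for the extra or missing spatial dimensions.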

Summary

Face Recognition uses embedding learning (Triplet loss, Siamese networks).
One-shot learning lets a model recognize a person from a single example by learning a similarity function instead of a classifier.
Neural Style Transfer uses a pre-trained CNN to blend a content image and a style image by minimizing a weighted sum of content and style costs.
Both applications showcase the expressive power of deep convolutional networks beyond classic classification.