Building a Personalized Portrait Generator with Stable Diffusion, IP-Adapter, and LoRA

Published June 25, 2025

node: v12.16.2
Python: 3.12.13
PyTorch: 2.6.0
Diffusers: 0.32.2
Transformers: 4.49.0
Gradio: 5.22.0

Background

At first, my motivation was very practical. I wanted to generate a clean professional headshot or portrait that I could use as a profile picture. Going to a photo studio would be expensive for me, especially for something I mainly wanted to use online. I also looked at some existing AI headshot generation services, but after comparing the available options, many of them were still not cheap enough for the kind of experiment I wanted to run.

The goal of the final product was not just to generate a random good-looking portrait. The generated image still needed to look like the original person. At the same time, I wanted the system to support two useful visual directions: a professional style for profile pictures, resumes, or LinkedIn-like use cases, and a casual style for more relaxed personal profiles. That made it a problem of both identity preservation and controllable style generation.

Goal

The project is a local Gradio application for personalized portrait generation. A user uploads a selfie, chooses a style, and receives a generated 512x512 portrait.

The intended model stack has three layers:

  1. Stable Diffusion / Realistic Vision as the base image generator.
  2. IP-Adapter-FaceID with InsightFace embeddings to preserve the person’s face identity.
  3. A custom LoRA adapter trained on professional and casual portrait examples to steer the final style.

This separation is important. Stable Diffusion is responsible for generating a realistic image. IP-Adapter-FaceID helps keep the generated face close to the uploaded selfie. LoRA provides a lightweight way to guide the final image toward a specific style without retraining the full diffusion model.

In other words, the system should answer three questions at the same time:

  • Does the image look realistic?
  • Does the face still look like the input person?
  • Does the style match the selected direction?

High-Level Architecture

The runtime flow is straightforward:

User selfie
  |
  v
Gradio UI
  |
  v
InsightFace face detection and embedding
  |
  v
IP-Adapter-FaceID
  |
  v
Stable Diffusion pipeline + Style LoRA
  |
  v
Generated professional or casual portrait

The Gradio UI stays thin. It only collects the input image and the style selection, then displays the output. The model initialization and generation logic sit behind the UI.

For face identity, the uploaded image is processed with InsightFace. The app detects the face, extracts a normalized face embedding, and passes that embedding into IP-Adapter-FaceID. This gives the generation pipeline a stronger identity signal than prompt text alone.

For style control, the app uses prompt wording plus a trained LoRA. The base prompt describes a realistic portrait with natural lighting, sharp focus, and realistic skin texture. Then the app appends either a professional-style prompt or a casual-style prompt.
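As a minimal sketch of the prompt assembly described above: the exact wording the app uses is not given in this post, so the strings and the names BASE_PROMPT, STYLE_PROMPTS, and build_prompt below are assumptions for illustration only.

```python
# Hypothetical prompt fragments; the real app's wording may differ.
BASE_PROMPT = (
    "portrait photo of a person, natural lighting, "
    "sharp focus, realistic skin texture"
)

STYLE_PROMPTS = {
    "professional": "wearing professional attire",
    "casual": "wearing a t-shirt",
}


def build_prompt(style: str) -> str:
    """Append the style-specific fragment to the shared base prompt."""
    if style not in STYLE_PROMPTS:
        raise ValueError(f"unknown style: {style!r}")
    return f"{BASE_PROMPT}, {STYLE_PROMPTS[style]}"


print(build_prompt("professional"))
```

Keeping the base prompt and the style fragments in one place makes it easier to see which part of the text changes when the user switches styles.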

Why LoRA

LoRA was a practical fit for this project because it allowed me to train a small style adapter instead of fine-tuning the full Stable Diffusion model.

The dataset format was intentionally simple:

[
  {
    "file_path": "dataset/example.jpg",
    "label": 1
  }
]

In this project, label: 1 represents professional style, and label: 0 represents casual style. The dataset contained 709 records, with 332 professional examples and 377 casual examples, roughly a 47/53 split. This was balanced enough for a small prototype and easy to inspect manually.
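A quick way to check the class balance of a file in this format is to count labels after loading it. The sketch below uses only the standard library; LABEL_NAMES and count_styles are hypothetical helper names, and the inline sample stands in for the real dataset file.

```python
import json
from collections import Counter

# Label convention from the dataset description: 1 = professional, 0 = casual.
LABEL_NAMES = {1: "professional", 0: "casual"}


def count_styles(records):
    """Count how many records fall under each style label."""
    counts = Counter()
    for record in records:
        label = record["label"]
        if label not in LABEL_NAMES:
            raise ValueError(f"unexpected label {label!r} in {record['file_path']}")
        counts[LABEL_NAMES[label]] += 1
    return dict(counts)


# Inline sample in the same shape as the dataset file.
sample = json.loads('[{"file_path": "dataset/example.jpg", "label": 1}]')
print(count_styles(sample))  # {'professional': 1}
```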

During training, the base model, VAE, text encoder, and original UNet weights stay frozen. LoRA adapters are attached to the UNet attention projection layers, and only those adapter weights are trained. Images are encoded into latent space, random diffusion noise is added, and the UNet learns to predict the noise under a style-specific prompt.
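To give a feel for why training only the adapters is cheap, the sketch below compares the weight count of one dense projection with the weight count of a rank-r LoRA pair attached to it. LoRA learns two small factors, A of shape (r, d_in) and B of shape (d_out, r), while the original matrix stays frozen. The layer width and rank used here are illustrative assumptions, not the values used in this project.

```python
def full_param_count(d_in: int, d_out: int) -> int:
    """Weights in one dense projection matrix of shape (d_out, d_in)."""
    return d_in * d_out


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Weights in the two low-rank factors: A is (rank, d_in), B is (d_out, rank)."""
    return rank * (d_in + d_out)


d, r = 320, 4  # illustrative sizes, not taken from the project
full = full_param_count(d, d)
lora = lora_param_count(d, d, r)
print(full, lora, f"{lora / full:.1%}")  # 102400 2560 2.5%
```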

For example, professional images are paired with a prompt like:

portrait photo of a person with professional attire, <s1>

Casual images are paired with:

portrait photo of a person with t-shirt, <s2>
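One way to pair each dataset record with its prompt is a plain lookup on the label. This is a sketch under the label convention above; TRAIN_PROMPTS and caption_for are hypothetical names, and the actual training script may build captions differently.

```python
# Training prompts as quoted above, keyed by dataset label.
TRAIN_PROMPTS = {
    1: "portrait photo of a person with professional attire, <s1>",
    0: "portrait photo of a person with t-shirt, <s2>",
}


def caption_for(record: dict) -> str:
    """Return the style-specific training prompt for one dataset record."""
    return TRAIN_PROMPTS[record["label"]]


print(caption_for({"file_path": "dataset/example.jpg", "label": 1}))
```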

The training objective is the standard diffusion noise-prediction MSE loss. The final result is a small LoRA weight file that can be loaded into the generation pipeline when needed.
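Written out, the standard noise-prediction objective looks like the following. The notation is the usual one from the latent diffusion literature rather than anything specific to this project: z_0 is the VAE latent of a training image, c is the text embedding of the style prompt, t is a random timestep, and epsilon_theta is the UNet, where only the LoRA weights inside theta receive gradients.

```latex
\mathcal{L}
  = \mathbb{E}_{z_0,\, c,\, \epsilon \sim \mathcal{N}(0, I),\, t}
    \left[ \left\| \epsilon - \epsilon_\theta(z_t, t, c) \right\|_2^2 \right],
\qquad
z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon
```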

This is what makes LoRA useful here: the base model remains general, while the LoRA acts like a lightweight style layer that can be loaded, swapped, or adjusted separately.

Evaluation

The evaluation script focuses mainly on identity preservation. It uses the same general runtime stack: Stable Diffusion, IP-Adapter-FaceID, InsightFace, and the trained LoRA.

The flow is:

  1. Select a subject image.
  2. Extract the input face embedding.
  3. Generate one professional image and one casual image.
  4. Run face detection again on the generated images.
  5. Compare the input and generated face embeddings using cosine similarity.

This gives a rough measurement of whether the generated portrait still resembles the input person. It does not fully measure style quality. For that, visual inspection is still necessary.
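The comparison step can be sketched in a few lines. The version below is a plain-Python cosine similarity over two embedding vectors; in the real script the vectors would come from InsightFace, and cosine_similarity is a hypothetical helper name rather than a function from the project.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real face embeddings.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Values near 1 mean the two embeddings point in nearly the same direction. When both embeddings are already L2-normalized, the score reduces to a plain dot product.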

A style LoRA can reach a low validation loss and still create unwanted artifacts, overfit to training poses, or weaken identity preservation. For this kind of project, validation loss is useful, but generated samples matter more.

What Worked Well

The strongest part of the project is the conceptual split.

Identity, style, and generation are handled by different components. That makes the system easier to debug. If the face no longer looks like the input person, the issue is likely around face detection, embedding extraction, or IP-Adapter settings. If the style is weak, the issue is likely around the LoRA, dataset quality, prompt design, or training parameters. If the image quality is poor, the base model, scheduler, inference steps, or negative prompt may need adjustment.

Another good decision was using existing libraries instead of building everything from scratch. Diffusers handles the Stable Diffusion pipeline and LoRA loading. InsightFace handles face embeddings. Gradio makes the prototype easy to run locally. This kept the project focused on system integration and experimentation instead of low-level model implementation.

Problems and Lessons Learned

The biggest weakness was operational consistency. The code architecture was mostly in place, but the runtime depended on several local model artifacts. Some expected paths did not match the downloader script, and some model files were missing from the checkout. That means the app could not run end-to-end unless the correct model assets were restored manually.
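A small startup check would have surfaced this earlier. The sketch below lists which expected model files are absent under a given root directory. The file names in REQUIRED_ASSETS are placeholders, not the project's real paths, and missing_assets is a hypothetical helper.

```python
from pathlib import Path

# Placeholder asset paths; the project's real file names are not shown here.
REQUIRED_ASSETS = [
    "models/base_model.safetensors",
    "models/ip_adapter_faceid.bin",
    "models/style_lora.safetensors",
]


def missing_assets(root, required=REQUIRED_ASSETS):
    """Return the relative paths from `required` that are not files under `root`."""
    root = Path(root)
    return [rel for rel in required if not (root / rel).is_file()]


# Report what is missing relative to the current directory.
print(missing_assets("."))
```

Running a check like this before loading any model turns a confusing mid-pipeline failure into one clear message about which files to restore.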

There were also duplicated app variants and experimental files. This is common in fast-moving prototypes because the project evolves quickly, but it makes the final repo harder to understand. A cleaner version should keep one canonical app entry point and move shared logic into modules such as config.py, face.py, generation.py, and models_runtime.py.

Another issue was configuration. Model paths, inference steps, image size, guidance scale, and LoRA paths were hard-coded in Python files. That is acceptable for a local prototype, but a more maintainable version should use a small config layer or environment variables.
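A small config layer could look something like the sketch below: one frozen dataclass with defaults, overridable through environment variables. The field names, the PORTRAIT_* variable names, and the default values other than the 512 image size are assumptions for illustration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class AppConfig:
    lora_path: str = "models/style_lora.safetensors"  # placeholder path
    image_size: int = 512                              # output size stated in the Goal section
    num_inference_steps: int = 30                      # assumed default
    guidance_scale: float = 7.5                        # assumed default


def load_config(env=None) -> AppConfig:
    """Build the config from defaults, overridden by environment variables."""
    env = os.environ if env is None else env
    defaults = AppConfig()
    return AppConfig(
        lora_path=env.get("PORTRAIT_LORA_PATH", defaults.lora_path),
        image_size=int(env.get("PORTRAIT_IMAGE_SIZE", defaults.image_size)),
        num_inference_steps=int(env.get("PORTRAIT_STEPS", defaults.num_inference_steps)),
        guidance_scale=float(env.get("PORTRAIT_GUIDANCE", defaults.guidance_scale)),
    )


print(load_config({"PORTRAIT_STEPS": "40"}))
```

Passing a plain dict instead of os.environ keeps the loader easy to test without touching the real environment.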

Finally, the style control was binary: professional or casual. A future version could expose a continuous slider, blending the two styles with different weights. However, that should be validated visually because prompt weighting and LoRA scaling do not always behave linearly.
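The simplest possible mapping from a slider to two style weights is a linear split, sketched below. This is only a starting point for experiments: blend_weights is a hypothetical helper, and as noted above, the visual result is unlikely to change linearly with these numbers.

```python
def blend_weights(slider: float) -> dict:
    """Map a slider in [0, 1] to per-style weights (0 = casual, 1 = professional)."""
    if not 0.0 <= slider <= 1.0:
        raise ValueError("slider must be between 0 and 1")
    return {"casual": 1.0 - slider, "professional": slider}


print(blend_weights(0.25))  # {'casual': 0.75, 'professional': 0.25}
```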

Final Thoughts

The final architecture shows how different AI components can work together: Stable Diffusion generates the image, IP-Adapter-FaceID preserves identity, and LoRA adds controllable style. Each part has a clear responsibility, which makes the system easier to understand and improve.

The most important lesson is that a good ML prototype needs both model-side thinking and engineering-side discipline. Training the LoRA is only one part of the project. The app also needs consistent model paths, clean configuration, a reliable evaluation script, and a clear separation between active code and experiments.

As a prototype, this project already has a solid foundation. With better asset management, a cleaner runtime structure, and more systematic visual evaluation, it could become a stronger local personalized portrait generation tool.
