Forking the Hugging Face diffusers GitHub Repository to Debug FLUX.2 [klein] 4B on Google Colab Pro

1. Overview

I ran the same kind of debugging session described in this blog post, this time with FLUX.2 [klein] 4B.

2. Sample Google Colab Page for Trying the Debug Run

I prepared a sample Google Colab page at this link for trying the debug run.

I forked the Hugging Face diffusers GitHub repository and created a branch for investigation, investigate_flux.2_klein_4B. On this branch, a breakpoint is set just before the denoising loop in Flux2KleinPipeline.__call__, as shown in the diff below.

diff --git a/src/diffusers/pipelines/flux2/pipeline_flux2_klein.py b/src/diffusers/pipelines/flux2/pipeline_flux2_klein.py
index 936d2c380..b88543efe 100644
--- a/src/diffusers/pipelines/flux2/pipeline_flux2_klein.py
+++ b/src/diffusers/pipelines/flux2/pipeline_flux2_klein.py
@@ -818,7 +818,9 @@ class Flux2KleinPipeline(DiffusionPipeline, Flux2LoraLoaderMixin):
         )
         num_warmup_steps = max(len(timesteps) - num_inference_steps * self.scheduler.order, 0)
         self._num_timesteps = len(timesteps)
-
+
+        import ipdb; ipdb.set_trace()
+
         # 7. Denoising loop
         # We set the index here to remove DtoH sync, helpful especially during compilation.
         # Check out more details here: https://github.com/huggingface/diffusers/pull/11696
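As a variation on the hard-coded ipdb.set_trace() call, Python's built-in breakpoint() can be pointed at ipdb through the PYTHONBREAKPOINT environment variable, which makes it possible to switch the breakpoint off without editing the source again. The sketch below is not part of the branch; denoise_setup is a hypothetical stand-in for the code that runs just before the denoising loop.

```python
import os


def denoise_setup(num_inference_steps: int) -> int:
    """Hypothetical stand-in for the code just before the denoising loop."""
    # breakpoint() calls sys.breakpointhook(), which reads PYTHONBREAKPOINT
    # every time it is reached:
    #   "ipdb.set_trace" -> drop into ipdb (ipdb must be installed)
    #   "0"              -> do nothing and continue
    breakpoint()
    return num_inference_steps


# Disable the breakpoint for a normal, non-interactive run.
os.environ["PYTHONBREAKPOINT"] = "0"
steps = denoise_setup(4)
print(steps)
```

Setting os.environ["PYTHONBREAKPOINT"] = "ipdb.set_trace" before calling the function would bring back the interactive ipdb prompt.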

3. Example Debug Run Log

If you run the code cells on the linked page in order, running the seventh code cell drops you into the ipdb debugger, as in the log below.

In the log below, I check the contents of the timesteps tensor and then the size of latents, the multidimensional tensor that holds the image information.

> /content/diffusers/src/diffusers/pipelines/flux2/pipeline_flux2_klein.py(827)__call__()
    826         # Check out more details here: https://github.com/huggingface/diffusers/pull/11696
--> 827         self.scheduler.set_begin_index(0)
    828         with self.progress_bar(total=num_inference_steps) as progress_bar:

...

ipdb> p timesteps
tensor([1000.0000,  967.3840,  908.1439,  767.2000], device='cuda:0')

...

ipdb> n
> /content/diffusers/src/diffusers/pipelines/flux2/pipeline_flux2_klein.py(875)__call__()
    874                 latents_dtype = latents.dtype
--> 875                 latents = self.scheduler.step(noise_pred, t, latents, return_dict=False)[0]
    876 

ipdb> p latents.shape
torch.Size([1, 4096, 128])

...

ipdb> until 900
> /content/diffusers/src/diffusers/pipelines/flux2/pipeline_flux2_klein.py(900)__call__()
    899 
--> 900         latents = self._unpack_latents_with_ids(latents, latent_ids)
    901 

ipdb> p latents.shape
torch.Size([1, 4096, 128])
ipdb> n
> /content/diffusers/src/diffusers/pipelines/flux2/pipeline_flux2_klein.py(902)__call__()
    901 
--> 902         latents_bn_mean = self.vae.bn.running_mean.view(1, -1, 1, 1).to(latents.device, latents.dtype)
    903         latents_bn_std = torch.sqrt(self.vae.bn.running_var.view(1, -1, 1, 1) + self.vae.config.batch_norm_eps).to(

ipdb> p latents.shape
torch.Size([1, 128, 64, 64])

...

ipdb> n
> /content/diffusers/src/diffusers/pipelines/flux2/pipeline_flux2_klein.py(907)__call__()
    906         latents = latents * latents_bn_std + latents_bn_mean
--> 907         latents = self._unpatchify_latents(latents)
    908         if output_type == "latent":

ipdb> p latents.shape
torch.Size([1, 128, 64, 64])
ipdb> n
> /content/diffusers/src/diffusers/pipelines/flux2/pipeline_flux2_klein.py(908)__call__()
    907         latents = self._unpatchify_latents(latents)
--> 908         if output_type == "latent":
    909             image = latents

ipdb> p latents.shape
torch.Size([1, 32, 128, 128])
ipdb> n
> /content/diffusers/src/diffusers/pipelines/flux2/pipeline_flux2_klein.py(911)__call__()
    910         else:
--> 911             image = self.vae.decode(latents, return_dict=False)[0]
    912             image = self.image_processor.postprocess(image, output_type=output_type)

ipdb> n
> /content/diffusers/src/diffusers/pipelines/flux2/pipeline_flux2_klein.py(912)__call__()
    911             image = self.vae.decode(latents, return_dict=False)[0]
--> 912             image = self.image_processor.postprocess(image, output_type=output_type)
    913 

ipdb> whatis image
<class 'torch.Tensor'>
ipdb> p image.shape
torch.Size([1, 3, 1024, 1024])
ipdb> n
> /content/diffusers/src/diffusers/pipelines/flux2/pipeline_flux2_klein.py(915)__call__()
    914         # Offload all models
--> 915         self.maybe_free_model_hooks()
    916 

ipdb> whatis image
<class 'list'>
ipdb> pp image
[<PIL.Image.Image image mode=RGB size=1024x1024 at 0x78A8260D47D0>]
ipdb> c

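The following line in the log above converts the layout of the latent tensor after the denoising loop from torch.Size([1, 4096, 128]) to torch.Size([1, 128, 64, 64]). The height and width positions in latent space had been flattened into a single sequence of length 4096; using latent_ids, the method rearranges that sequence into a grid of 64 x 64 = 4096 positions along height and width.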

--> 900         latents = self._unpack_latents_with_ids(latents, latent_ids)
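The rearrangement from a token sequence to a grid can be sketched in PyTorch as follows. This is a simplified illustration, not the pipeline's own _unpack_latents_with_ids: the helper name unpack_tokens_with_ids is hypothetical, and the ids are assumed to hold only a (row, column) pair per token, whereas the pipeline's ids carry four coordinates.

```python
import torch


def unpack_tokens_with_ids(tokens: torch.Tensor, hw_ids: torch.Tensor,
                           height: int, width: int) -> torch.Tensor:
    """Scatter (B, L, C) tokens into a (B, C, H, W) grid using (row, col) ids.

    Hypothetical helper: hw_ids has shape (L, 2), which is a simplification.
    """
    batch, seq_len, channels = tokens.shape
    flat_index = hw_ids[:, 0] * width + hw_ids[:, 1]  # row-major position of each token
    grid = tokens.new_zeros(batch, height * width, channels)
    grid[:, flat_index] = tokens                      # place each token at its position
    return grid.permute(0, 2, 1).reshape(batch, channels, height, width)


height = width = 64
rows, cols = torch.meshgrid(torch.arange(height), torch.arange(width), indexing="ij")
hw_ids = torch.stack([rows.flatten(), cols.flatten()], dim=1)  # (4096, 2)
tokens = torch.randn(1, height * width, 128)

latents = unpack_tokens_with_ids(tokens, hw_ids, height, width)
print(latents.shape)  # torch.Size([1, 128, 64, 64])
```

Because each token is placed according to its id rather than its position in the sequence, the result would be the same even if the tokens and their ids were shuffled together.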

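The following line in the log above converts the grid layout from torch.Size([1, 128, 64, 64]) to torch.Size([1, 32, 128, 128]). The 128-dimensional vector at each position is reduced to a quarter of its size, 32 channels, and the height and width are doubled.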

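torch.Size([1, 32, 128, 128]) is the latent-space representation of the image before the VAE (Variational Autoencoder) turns it back into an image. Inside the denoising loop, this representation was processed with each 2 x 2 patch grouped together, so the height and width were each halved.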

--> 907         latents = self._unpatchify_latents(latents)
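The same change of shape can be reproduced with PyTorch's pixel_shuffle and pixel_unshuffle functions. This is only an illustration of the 2 x 2 patch idea; the channel ordering used by the pipeline's _unpatchify_latents is not confirmed here and may differ from the one pixel_shuffle uses.

```python
import torch
import torch.nn.functional as F

patchified = torch.randn(1, 128, 64, 64)  # shape seen before _unpatchify_latents

# pixel_shuffle moves groups of 4 channels into 2 x 2 spatial blocks:
# (B, C * 4, H, W) -> (B, C, H * 2, W * 2)
unpatchified = F.pixel_shuffle(patchified, upscale_factor=2)
print(unpatchified.shape)  # torch.Size([1, 32, 128, 128])

# pixel_unshuffle is the inverse: it folds each 2 x 2 block back into channels.
restored = F.pixel_unshuffle(unpatchified, downscale_factor=2)
print(torch.equal(restored, patchified))  # True
```

Both operations only move values around, so the total number of elements (128 x 64 x 64 = 32 x 128 x 128) stays the same.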

The following line in the log above runs the VAE decoder and converts the latent representation into image data with three RGB channels at 1024 x 1024 pixels. The latents tensor before decoding has size torch.Size([1, 32, 128, 128]), and the image tensor after decoding has size torch.Size([1, 3, 1024, 1024]).

--> 911             image = self.vae.decode(latents, return_dict=False)[0]
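The shapes seen in this log fit together by simple arithmetic, which the small helper below traces from the pixel size down to the packed token sequence. The helper trace_shapes is hypothetical, and its default values (a VAE downscale factor of 8 and 32 latent channels) are inferred from the shapes in the log, not read from the model configuration.

```python
def trace_shapes(height: int, width: int, vae_factor: int = 8,
                 latent_channels: int = 32, patch: int = 2) -> dict:
    """Return the tensor shapes at each stage for one image (batch size 1).

    vae_factor=8 and latent_channels=32 are inferred from the debug log,
    not read from the model configuration.
    """
    lat_h, lat_w = height // vae_factor, width // vae_factor
    vae_latents = (1, latent_channels, lat_h, lat_w)
    patchified = (1, latent_channels * patch * patch, lat_h // patch, lat_w // patch)
    packed = (1, patchified[2] * patchified[3], patchified[1])
    return {
        "image": (1, 3, height, width),
        "vae_latents": vae_latents,
        "patchified": patchified,
        "packed": packed,
    }


for name, shape in trace_shapes(1024, 1024).items():
    print(f"{name:12s} {shape}")
```

For a 1024 x 1024 image this gives (1, 32, 128, 128), (1, 128, 64, 64), and (1, 4096, 128), matching the sizes printed in the ipdb session.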

To continue execution without stepping through the debugger, type c at the ipdb> prompt and press Enter.

> /content/diffusers/src/diffusers/pipelines/flux2/pipeline_flux2_klein.py(827)__call__()
    826         # Check out more details here: https://github.com/huggingface/diffusers/pull/11696
--> 827         self.scheduler.set_begin_index(0)
    828         with self.progress_bar(total=num_inference_steps) as progress_bar:

ipdb> c
100%
 4/4 [00:08<00:00,  1.24s/it]

4. Output Images

The image below was generated by running the code cells on the linked page in order and then running the seventh code cell, shown below.

The prompt is A cat holding a sign that says "Gifu AI Study Group".

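import torch

# pipe is the Flux2KleinPipeline instance created in the earlier cells of the notebook.
device = "cuda"
prompt = 'A cat holding a sign that says "Gifu AI Study Group"'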

image = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    guidance_scale=1.0,
    num_inference_steps=4,
    generator=torch.Generator(device=device).manual_seed(0)
).images[0]

image.save("flux-klein.png")

The image below was generated by running the code cells on the linked page in order and then running the eighth code cell, shown below.

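This cell takes both a prompt and an image as input. The input image is a photo of a mountain trail that I took previously, and the prompt asks the model to use it as the background. In this image, the word Study in Gifu AI Study Group came out slightly distorted.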

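import torch
from diffusers.utils import load_image

# pipe is the Flux2KleinPipeline instance created in the earlier cells of the notebook.
device = "cuda"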

url = 'https://www.leafwindow.com/wordpress-05/wp-content/uploads/2023/12/IMG_6492-20.jpg'
image = load_image(url)

prompt = """
Use the provided image as the background.
A cat holding a sign that says "Gifu AI Study Group" in the foreground,
realistic lighting, seamless integration, high detail
"""

image = pipe(
    prompt=prompt,
    image=image,
    height=1024,
    width=1024,
    guidance_scale=1.0,
    num_inference_steps=4,
    generator=torch.Generator(device=device).manual_seed(0)
).images[0]

image.save("flux-klein-with-input-image.png")

5. Example Debug Run Log: Checking What Happens When an Input Image Is Set

Using a debug run, I checked how latent_model_input, which is passed to the Transformer as hidden_states in the following code inside Flux2KleinPipeline.__call__, changes when an input image is specified.

                if image_latents is not None:
                    latent_model_input = torch.cat([latents, image_latents], dim=1).to(self.transformer.dtype)
                    latent_image_ids = torch.cat([latent_ids, image_latent_ids], dim=1)

                with self.transformer.cache_context("cond"):
                    noise_pred = self.transformer(
                        hidden_states=latent_model_input,  # (B, image_seq_len, C)
                        timestep=timestep / 1000,
                        guidance=None,
                        encoder_hidden_states=prompt_embeds,
                        txt_ids=text_ids,  # B, text_seq_len, 4
                        img_ids=latent_image_ids,  # B, image_seq_len, 4
                        joint_attention_kwargs=self.attention_kwargs,
                        return_dict=False,
                    )[0]

                noise_pred = noise_pred[:, : latents.size(1) :]

As the debug log below shows, latent_model_input goes from torch.Size([1, 4096, 128]) to torch.Size([1, 5946, 128]) after the following line runs.

latent_model_input = torch.cat([latents, image_latents], dim=1).to(self.transformer.dtype)

latents has size torch.Size([1, 4096, 128]) and image_latents has size torch.Size([1, 1850, 128]); the result is their concatenation along dimension 1 (4096 + 1850 = 5946).

ipdb> n
> /content/diffusers/src/diffusers/pipelines/flux2/pipeline_flux2_klein.py(841)__call__()
    840                 latent_model_input = latents.to(self.transformer.dtype)
--> 841                 latent_image_ids = latent_ids
    842 

ipdb> p latent_model_input.shape
torch.Size([1, 4096, 128])
ipdb> n
> /content/diffusers/src/diffusers/pipelines/flux2/pipeline_flux2_klein.py(843)__call__()
    842 
--> 843                 if image_latents is not None:
    844                     latent_model_input = torch.cat([latents, image_latents], dim=1).to(self.transformer.dtype)

ipdb> n
> /content/diffusers/src/diffusers/pipelines/flux2/pipeline_flux2_klein.py(844)__call__()
    843                 if image_latents is not None:
--> 844                     latent_model_input = torch.cat([latents, image_latents], dim=1).to(self.transformer.dtype)
    845                     latent_image_ids = torch.cat([latent_ids, image_latent_ids], dim=1)

ipdb> n
> /content/diffusers/src/diffusers/pipelines/flux2/pipeline_flux2_klein.py(845)__call__()
    844                     latent_model_input = torch.cat([latents, image_latents], dim=1).to(self.transformer.dtype)
--> 845                     latent_image_ids = torch.cat([latent_ids, image_latent_ids], dim=1)
    846 

ipdb> p latent_model_input.shape
torch.Size([1, 5946, 128])
ipdb> p latents.shape
torch.Size([1, 4096, 128])
ipdb> p image_latents.shape
torch.Size([1, 1850, 128])
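The concatenation and the later slicing of noise_pred can be reproduced with dummy tensors of the same sizes. In this sketch the transformer call is replaced by a placeholder tensor, on the assumption that the model returns one prediction per input token; the real output is not computed here.

```python
import torch

latents = torch.randn(1, 4096, 128)        # noisy tokens for the image being generated
image_latents = torch.randn(1, 1850, 128)  # tokens from the input image

# Concatenate along the sequence dimension (dim=1); batch and channel sizes must match.
latent_model_input = torch.cat([latents, image_latents], dim=1)
print(latent_model_input.shape)  # torch.Size([1, 5946, 128])

# Placeholder for the transformer output (assumed: one prediction per input token).
noise_pred = torch.zeros_like(latent_model_input)

# Keep only the predictions for the generated-image tokens, which is the effect of
# noise_pred[:, : latents.size(1) :] in the pipeline code.
noise_pred = noise_pred[:, : latents.size(1)]
print(noise_pred.shape)  # torch.Size([1, 4096, 128])
```

The input-image tokens therefore act as extra context for the transformer, while the scheduler step only receives predictions for the 4096 tokens of the image being generated.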

This size differed from the one given in the comments of prepare_image_latents, the method that creates image_latents, so I also set a breakpoint at the top of prepare_image_latents and traced how the size of image_latents is determined.

The input image used here has size torch.Size([1, 3, 800, 592]). The height and width are each divided by 16, giving 50 and 37, and 50 x 37 = 1850, which appears to be how image_latents ends up as torch.Size([1, 1850, 128]).

ipdb> n
> /content/diffusers/src/diffusers/pipelines/flux2/pipeline_flux2_klein.py(528)prepare_image_latents()
    527         # Pack each latent and concatenate
--> 528         packed_latents = []
    529         for latent in image_latents:

ipdb> p image_latents[0].shape
torch.Size([1, 128, 50, 37])
ipdb> n
> /content/diffusers/src/diffusers/pipelines/flux2/pipeline_flux2_klein.py(529)prepare_image_latents()
    528         packed_latents = []
--> 529         for latent in image_latents:
    530             # latent: (1, 128, 32, 32)

ipdb> n
> /content/diffusers/src/diffusers/pipelines/flux2/pipeline_flux2_klein.py(531)prepare_image_latents()
    530             # latent: (1, 128, 32, 32)
--> 531             packed = self._pack_latents(latent)  # (1, 1024, 128)
    532             packed = packed.squeeze(0)  # (1024, 128) - remove batch dim

ipdb> n
> /content/diffusers/src/diffusers/pipelines/flux2/pipeline_flux2_klein.py(532)prepare_image_latents()
    531             packed = self._pack_latents(latent)  # (1, 1024, 128)
--> 532             packed = packed.squeeze(0)  # (1024, 128) - remove batch dim
    533             packed_latents.append(packed)

ipdb> p packed.shape
torch.Size([1, 1850, 128])
ipdb> p images[0].shape
torch.Size([1, 3, 800, 592])
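The token counts observed in this session can be checked with a few lines of arithmetic. The helper image_token_count below is hypothetical, and the downscale factor of 16 is simply the ratio observed in the log (800 to 50, 592 to 37), not a value read from the pipeline configuration.

```python
def image_token_count(height: int, width: int, downscale: int = 16) -> tuple[int, int, int]:
    """Return (rows, cols, tokens) for an input image of the given pixel size.

    downscale=16 is the overall factor observed in the log; it is an
    assumption here, not a value read from the pipeline configuration.
    """
    if height % downscale or width % downscale:
        raise ValueError("height and width are assumed to be multiples of the downscale factor")
    rows, cols = height // downscale, width // downscale
    return rows, cols, rows * cols


print(image_token_count(800, 592))    # (50, 37, 1850)
print(image_token_count(512, 512))    # (32, 32, 1024)
print(image_token_count(1024, 1024))  # (64, 64, 4096)
```

Under this assumption, the (1, 128, 32, 32) and (1, 1024, 128) sizes in the prepare_image_latents comments would correspond to a 512 x 512 input image, which is presumably why they differ from the 1850 tokens seen for the 800 x 592 photo.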
