Investigating FLUX.2 [klein] 4B: Pipeline Structure and Scheduler

1. Overview

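I ran the code cells in the linked Google Colab notebook and used print statements to inspect the structure of the FLUX.2 Klein 4B Pipeline. I also ran the pipeline under a debugger, following the same procedure as in the linked article, to investigate what the Scheduler does.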

2. Structure of the FLUX.2 Klein 4B Pipeline in diffusers

Run the code cells in the linked Google Colab notebook in order. Running the code cell below creates an instance of the Pipeline that manages the multi-stage image generation process of FLUX.2 Klein 4B.

import torch
from diffusers import Flux2KleinPipeline

dtype = torch.bfloat16
pipe = Flux2KleinPipeline.from_pretrained("black-forest-labs/FLUX.2-klein-4B", torch_dtype=dtype)

Run the code cell below to print the structure of the Pipeline instance.

print(pipe)

The output was as follows.

Flux2KleinPipeline {
  "_class_name": "Flux2KleinPipeline",
  "_diffusers_version": "0.37.0.dev0",
  "_name_or_path": "black-forest-labs/FLUX.2-klein-4B",
  "is_distilled": true,
  "scheduler": [
    "diffusers",
    "FlowMatchEulerDiscreteScheduler"
  ],
  "text_encoder": [
    "transformers",
    "Qwen3ForCausalLM"
  ],
  "tokenizer": [
    "transformers",
    "Qwen2Tokenizer"
  ],
  "transformer": [
    "diffusers",
    "Flux2Transformer2DModel"
  ],
  "vae": [
    "diffusers",
    "AutoencoderKLFlux2"
  ]
}

Taking a prompt as input, the process of generating an image from noise proceeds in the following order.

Prompt
  ↓
Qwen2Tokenizer
  ↓
Qwen3ForCausalLM (Text Encoder)
  ↓
Text Embeddings (fixed, shared across all steps)
  ↓
Latent (initial noise)
  ↓
[Iterate]
  Flux2Transformer2DModel (conditional prediction)
  + FlowMatchEulerDiscreteScheduler (update)
  ↓
Latent (converged)
  ↓
AutoencoderKLFlux2.decode
  ↓
Image

3. Components of the FLUX.2 Klein 4B Pipeline in diffusers

3.1. transformer

"transformer": [
  "diffusers",
  "Flux2Transformer2DModel"
]

This is the main computation of image generation. This class is applied repeatedly inside the denoising loop.

3.2. scheduler

"scheduler": [
  "diffusers",
  "FlowMatchEulerDiscreteScheduler"
]

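This class is also applied repeatedly inside the denoising loop. The code below, inside the denoising loop, takes the transformer output noise_pred, the time t, and the pre-update latents (the image information in latent space), and returns the updated latents.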

latents = self.scheduler.step(noise_pred, t, latents, return_dict=False)[0]

Although the variable is named noise_pred, the pipeline uses a Flow Matching scheduler, so the transformer output noise_pred holds a multidimensional array corresponding to the velocity that updates latents via latents = latents + dt * velocity.
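
The update latents = latents + dt * velocity can be sketched in plain Python. This is only an illustration: the helper euler_step and the three-element lists are made-up stand-ins for the scheduler and the real tensors, not part of diffusers.

```python
def euler_step(latents, velocity, dt):
    """Return latents + dt * velocity, element by element."""
    return [x + dt * v for x, v in zip(latents, velocity)]


latents = [0.5, -1.0, 2.0]   # made-up stand-in for the latent tensor
velocity = [1.0, 0.0, -2.0]  # made-up stand-in for noise_pred
dt = -0.25                   # negative, because sigma decreases from 1 toward 0

updated = euler_step(latents, velocity, dt)
print(updated)  # [0.25, -1.0, 2.5]
```

Because dt is negative, each element moves against the direction of its velocity component.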

3.3. text encoder

"text_encoder": [
  "transformers",
  "Qwen3ForCausalLM"
]

The text encoder is Qwen3.

3.4. tokenizer

"tokenizer": [
  "transformers",
  "Qwen2Tokenizer"
]

A Qwen-format tokenizer is used.

3.5. vae

"vae": [
  "diffusers",
  "AutoencoderKLFlux2"
]

A VAE dedicated to FLUX.2 is used.

3.6. About is_distilled

The output above shows "is_distilled": true. I confirmed that it becomes false when the Pipeline instance is created with FLUX.2-klein-base-4B instead of FLUX.2-klein-4B.

import torch
from diffusers import Flux2KleinPipeline

dtype = torch.bfloat16
pipe = Flux2KleinPipeline.from_pretrained("black-forest-labs/FLUX.2-klein-base-4B", torch_dtype=dtype)

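is_distilled is referenced in the Flux2KleinPipeline class as shown below. It comes into play when a guidance_scale greater than 1.0 is specified to control the generated image. With FLUX.2-klein-4B, where is_distilled is true, a guidance_scale greater than 1.0 appears to be ignored. With FLUX.2-klein-base-4B, where is_distilled is false, setting guidance_scale above 1.0 appears to produce images that follow the prompt more faithfully, although the quality of the generated images appears to degrade.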

        if guidance_scale > 1.0 and self.config.is_distilled:
            logger.warning(f"Guidance scale {guidance_scale} is ignored for step-wise distilled models.")

    @property
    def guidance_scale(self):
        return self._guidance_scale

    @property
    def do_classifier_free_guidance(self):
        return self._guidance_scale > 1 and not self.config.is_distilled
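
The gating logic in the excerpt above can be tried in isolation. The free function below is a hypothetical stand-in that mirrors the do_classifier_free_guidance property; it is not the pipeline's API, and it needs no model download.

```python
def do_classifier_free_guidance(guidance_scale: float, is_distilled: bool) -> bool:
    """Mirror of the property shown above, written as a plain function."""
    return guidance_scale > 1 and not is_distilled


# Enumerate the combinations discussed in the text.
results = {}
for is_distilled in (True, False):
    for guidance_scale in (1.0, 4.0):
        enabled = do_classifier_free_guidance(guidance_scale, is_distilled)
        results[(guidance_scale, is_distilled)] = enabled
        print(f"guidance_scale={guidance_scale}, is_distilled={is_distilled} -> {enabled}")
```

Only the combination guidance_scale > 1 with is_distilled false enables guidance, which matches the warning that the scale is ignored for distilled models.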

4. Investigating what the FLUX.2 scheduler does

Following the same procedure as in the linked article, I ran a script that generates an image with the FLUX.2 Pipeline in diffusers under a debugger and investigated what the scheduler does.

Unlike the linked article, I set a breakpoint immediately before the call to self.scheduler.step(noise_pred, t, latents, return_dict=False) in the denoising loop, as shown below.

diff --git a/src/diffusers/pipelines/flux2/pipeline_flux2_klein.py b/src/diffusers/pipelines/flux2/pipeline_flux2_klein.py
index 936d2c380..841db95fb 100644
--- a/src/diffusers/pipelines/flux2/pipeline_flux2_klein.py
+++ b/src/diffusers/pipelines/flux2/pipeline_flux2_klein.py
@@ -870,6 +870,9 @@ class Flux2KleinPipeline(DiffusionPipeline, Flux2LoraLoaderMixin):

                 # compute the previous noisy sample x_t -> x_t-1
                 latents_dtype = latents.dtype
+
+                import ipdb; ipdb.set_trace()
+
                 latents = self.scheduler.step(noise_pred, t, latents, return_dict=False)[0]

                 if latents.dtype != latents_dtype:

I investigated with num_inference_steps=4, that is, generating the image in 4 steps, as in the code cell below.

device = "cuda"
prompt = 'A cat holding a sign that says "Gifu AI Study Group"'

image = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    guidance_scale=1.0,
    num_inference_steps=4,
    generator=torch.Generator(device=device).manual_seed(0)
).images[0]

image.save("flux-klein.png")

When self.scheduler.step(noise_pred, t, latents, return_dict=False) runs and the step method of FlowMatchEulerDiscreteScheduler shown below is called, I confirmed that the highlighted lines below are executed. (Some of the code in the parts elided with "..." is also executed.)

    def step(
        self,
        model_output: torch.FloatTensor,
        timestep: float | torch.FloatTensor,
        sample: torch.FloatTensor,
        ...
        return_dict: bool = True,
    ) -> FlowMatchEulerDiscreteSchedulerOutput | tuple:
        """
        ...
        Args:
            model_output (`torch.FloatTensor`):
                The direct output from learned diffusion model.
            timestep (`float`):
                The current discrete timestep in the diffusion chain.
            sample (`torch.FloatTensor`):
                A current instance of a sample created by the diffusion process.
            ...
            return_dict (`bool`, defaults to `True`):
                Whether or not to return a
            ...
        """
        ...

        if per_token_timesteps is not None:
            ...
        else:
            sigma_idx = self.step_index
            sigma = self.sigmas[sigma_idx]
            sigma_next = self.sigmas[sigma_idx + 1]

            current_sigma = sigma
            next_sigma = sigma_next
            dt = sigma_next - sigma

        if self.config.stochastic_sampling:
            x0 = sample - current_sigma * model_output
            noise = torch.randn_like(sample)
            prev_sample = (1.0 - next_sigma) * x0 + next_sigma * noise
        else:
            prev_sample = sample + dt * model_output

        # upon completion increase step index by one
        self._step_index += 1
        if per_token_timesteps is None:
            # Cast sample back to model compatible dtype
            prev_sample = prev_sample.to(model_output.dtype)

        if not return_dict:
            return (prev_sample,)

        return FlowMatchEulerDiscreteSchedulerOutput(prev_sample=prev_sample)

sigma = self.sigmas[sigma_idx] and sigma_next = self.sigmas[sigma_idx + 1] read consecutive values stored in self.sigmas, and dt = sigma_next - sigma sets dt to their difference.

sigma_idx is 0 the first time the step method is called, and self._step_index += 1 increments it by 1 on every call.

prev_sample = sample + dt * model_output adds dt * model_output to sample, the pre-update latents, to compute prev_sample, the updated latents. model_output is the multidimensional array output by the transformer and corresponds to the velocity with which the Flow Matching scheduler updates latents.

The values of the sigmas array of FlowMatchEulerDiscreteScheduler are set by the code below in the __call__ method of Flux2KleinPipeline, before the denoising loop starts.

        # 6. Prepare timesteps
        sigmas = np.linspace(1.0, 1 / num_inference_steps, num_inference_steps) if sigmas is None else sigmas
        if hasattr(self.scheduler.config, "use_flow_sigmas") and self.scheduler.config.use_flow_sigmas:
            sigmas = None
        image_seq_len = latents.shape[1]
        mu = compute_empirical_mu(image_seq_len=image_seq_len, num_steps=num_inference_steps)
        timesteps, num_inference_steps = retrieve_timesteps(
            self.scheduler,
            num_inference_steps,
            device,
            sigmas=sigmas,
            mu=mu,
        )

The first statement of the code above sets sigmas to [1. , 0.75, 0.5 , 0.25]. num_inference_steps is an argument of the Pipeline's __call__ method; here it is assumed to be 4.
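
The initial sigmas can be reproduced with NumPy alone. This is a minimal sketch of the np.linspace call only; the helper name initial_sigmas is my own, and nothing from diffusers is used.

```python
import numpy as np


def initial_sigmas(num_inference_steps: int) -> np.ndarray:
    """Same np.linspace call as in the pipeline excerpt above."""
    return np.linspace(1.0, 1 / num_inference_steps, num_inference_steps)


sigmas = initial_sigmas(4)
print(sigmas.tolist())  # [1.0, 0.75, 0.5, 0.25]

# The schedule always starts at 1.0 and ends at 1 / num_inference_steps,
# so it never reaches 0 at this stage.
for n in (1, 2, 3):
    print(n, initial_sigmas(n).tolist())
```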

The retrieve_timesteps function calls scheduler.set_timesteps(sigmas=sigmas, device=device, **kwargs), and the code below is executed inside that method.

        # 2. Perform timestep shifting. Either no shifting is applied, or resolution-dependent shifting of
        #    "exponential" or "linear" type is applied
        if self.config.use_dynamic_shifting:
            sigmas = self.time_shift(mu, 1.0, sigmas)

By running under the debugger, I confirmed that the sigmas array is updated to [1. , 0.96738404, 0.90814394, 0.76719993] after the code above executes. The debug log is shown below.

ipdb> n
> /content/diffusers/src/diffusers/schedulers/scheduling_flow_match_euler_discrete.py(348)set_timesteps()
    347         if self.config.use_dynamic_shifting:
--> 348             sigmas = self.time_shift(mu, 1.0, sigmas)
    349         else:

ipdb> p sigmas
array([1.  , 0.75, 0.5 , 0.25], dtype=float32)
ipdb> p mu
2.291179894115571
ipdb> n
> /content/diffusers/src/diffusers/schedulers/scheduling_flow_match_euler_discrete.py(353)set_timesteps()
    352         # 3. If required, stretch the sigmas schedule to terminate at the configured `shift_terminal` value
--> 353         if self.config.shift_terminal:
    354             sigmas = self.stretch_shift_to_terminal(sigmas)

ipdb> p sigmas
array([1.        , 0.96738404, 0.90814394, 0.76719993], dtype=float32)
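
The shifted values can be checked outside the debugger. The sketch below assumes the scheduler applies the "exponential" type of shift, computed as exp(mu) / (exp(mu) + (1/t - 1)^sigma); the function name time_shift_exponential is my own stand-in, not the library's API. The values of mu and sigmas are copied from the log above.

```python
import math

import numpy as np


def time_shift_exponential(mu: float, sigma: float, t: np.ndarray) -> np.ndarray:
    """Assumed form of the exponential time shift (see lead-in)."""
    return math.exp(mu) / (math.exp(mu) + (1 / t - 1) ** sigma)


mu = 2.291179894115571  # value printed by `p mu` in the log
sigmas = np.array([1.0, 0.75, 0.5, 0.25], dtype=np.float32)

shifted = time_shift_exponential(mu, 1.0, sigmas)
print(shifted)
```

Under this assumption the printed values agree with the array in the log to about four decimal places, and t = 1.0 maps to itself, so the schedule still starts at 1.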

Over the 4 calls to scheduler.step, dt took the values -0.0326, -0.0592, -0.1409, and -0.7672, that is, negative values of increasing absolute value. Since -0.0326 - 0.0592 - 0.1409 - 0.7672 = -0.9999, the dt values sum to approximately -1. sigma was 1.0 on the first call, and sigma_next was 0 on the last (fourth) call.
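
These dt values follow directly from consecutive differences of the sigmas. The sketch below uses the shifted sigmas from the debug log with a trailing 0.0 appended, matching the observation that sigma_next was 0 on the last call.

```python
# Shifted sigmas from the debug log, plus the terminal 0.0 seen on the last step.
sigmas = [1.0, 0.96738404, 0.90814394, 0.76719993, 0.0]

# dt = sigma_next - sigma for each of the 4 steps
dts = [sigmas[i + 1] - sigmas[i] for i in range(len(sigmas) - 1)]

for i, dt in enumerate(dts):
    print(f"step {i}: dt = {dt:.4f}")
# step 0: dt = -0.0326
# step 1: dt = -0.0592
# step 2: dt = -0.1409
# step 3: dt = -0.7672

print(f"sum = {sum(dts):.4f}")  # sum = -1.0000
```

The unrounded differences telescope to sigmas[-1] - sigmas[0] = -1; the -0.9999 in the text comes from adding the rounded values.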

The computation of the FLUX.2 scheduler corresponds to evaluating the integral below.

$$
x(0) = x(1) + \int_1^0 v(x(t), t) dt
$$

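Here, $x(1)$ is the latent of the initial noise (the image information in latent space) and $x(0)$ is the latent of the generated image. Integrating dt from 1 to 0 gives -1, as shown below, which is why the dt values of the 4 steps above sum to approximately -1.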

$$
\int_1^0 dt = -1
$$

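Across the 4 steps, the updates are small while the latent is still close to the initial noise, and the fourth and final step takes a dt of large absolute value to bring the latent close to the generated image.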
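
A toy example shows why the sum of dt matters. It assumes a straight-line path x(t) = (1 - t) * x0 + t * noise, whose velocity noise - x0 is constant, and uses that exact velocity in place of the transformer. The vectors x0 and noise are made-up values. The real model predicts a velocity that depends on the latents and the time, so this is only an illustration of the Euler integration.

```python
x0 = [0.2, -0.7, 1.5]     # made-up "clean" latent, i.e. x(0)
noise = [1.1, 0.4, -0.9]  # made-up initial noise, i.e. x(1)

# Same non-uniform schedule as in the 4-step run above.
sigmas = [1.0, 0.96738404, 0.90814394, 0.76719993, 0.0]


def velocity(x, t):
    """Exact velocity of the straight-line path; constant in x and t."""
    return [n - c for n, c in zip(noise, x0)]


x = list(noise)  # start from x(1)
for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
    dt = sigma_next - sigma  # negative step
    v = velocity(x, sigma)
    x = [xi + dt * vi for xi, vi in zip(x, v)]  # Euler update

print(x)  # close to x0, up to floating-point error
```

Because the dt values sum to -1, the accumulated update is -(noise - x0), so the loop ends at x0 regardless of how the four steps are spaced.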

Under the conditions used here, the second argument t of scheduler.step(noise_pred, t, latents, return_dict=False) was never referenced inside the step method.

5. Checking what the FLUX.2 scheduler does

I ran the image generation code cell below with num_inference_steps set to 1, 2, 3, 4, 6, and 10 and checked the generated images. I also checked the values of the scheduler's sigmas and timesteps in each case.

The script below fixes the random seed with generator=torch.Generator(device=device).manual_seed(0), so the initial latent (the image information in latent space) drawn from random numbers is the same in every run.

device = "cuda"
prompt = 'A cat holding a sign that says "Gifu AI Study Group"'

image = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    guidance_scale=1.0,
    num_inference_steps=4,
    generator=torch.Generator(device=device).manual_seed(0)
).images[0]

image.save("flux-klein.png")

Under these conditions, the quality of the generated image did not seem to change much once num_inference_steps was 3 or more.

5.1. num_inference_steps=1

sigmas = [1., 0.]
timesteps = [1000.]

5.2. num_inference_steps=2

sigmas = [1.0000, 0.9091, 0.0000]
timesteps = [1000.0000, 909.1107]

5.3. num_inference_steps=3

sigmas = [1.0000, 0.9521, 0.8326, 0.0000]
timesteps = [1000.0000, 952.1271, 832.5565]

5.4. num_inference_steps=4

sigmas = [1.0000, 0.9674, 0.9081, 0.7672, 0.0000]
timesteps = [1000.0000, 967.3840, 908.1439, 767.2000]

5.5. num_inference_steps=6

sigmas = [1.0000, 0.9799, 0.9513, 0.9072, 0.8301, 0.6615, 0.0000]
timesteps = [1000.0000, 979.9441, 951.3246, 907.1679, 830.1074, 661.5250]

5.6. num_inference_steps=10

sigmas = [1.0000, 0.9885, 0.9745, 0.9570, 0.9347, 0.9052, 0.8642, 0.8036, 0.7047, 0.5148, 0.0000]
timesteps = [1000.0000, 988.4957, 974.4825, 957.0386, 934.7291, 905.1879, 864.2187, 803.6000, 704.7356, 514.7510]