GANji

Real-Looking Fake Kanji

Exploring AI Models for Kanji Generation


Author: Chandon Hamel

Organization: Anderson College of Business and Computing, Regis University

Date: March 9, 2025

Repository: GitHub


Introduction

This project combines two of my main interests: deep learning and the Japanese language. As a learner of Japanese, it took me a long time to fully appreciate Kanji characters and the vital role they play in the language as a whole. To me, as to many others before studying them, Kanji looked like a jumble of lines and shapes. However, there is a hidden structure and logic to them that reveal their meaning and utility. I believe this makes Kanji a particularly interesting and challenging target for generative AI models. And because I have learned a fair number of Kanji, I can easily evaluate by eye how well the models mimic these characters.

Generative AI models like DALL·E, Midjourney, and others have revolutionized the field of AI image generation, making it possible to create stunning visual outputs from text prompts. Curious about how these models work, I decided to explore these techniques myself and learn how to apply them to create something both visually compelling and technical: fake Kanji generation. Once I found a suitable dataset of Kanji images to practice training these generative models, I knew this would be a project I would enjoy.

After reading Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow by Aurelien Geron, I was introduced to three key generative modeling techniques: Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and Denoising Diffusion Probabilistic Models (DDPMs). This inspired me to focus the project on evaluating and comparing the performance of these techniques in generating realistic Kanji characters.

Datasets and Preprocessing

Three datasets were used in this project: the main Kanji Dataset retrieved from Kaggle, as well as MNIST and CIFAR-10. The Kanji Dataset served as the primary dataset for training all three generative models, as it contained approximately 10,000 black-and-white images of Kanji characters in a standardized font. These images were well-suited for evaluation due to their consistent structure and relatively small resolution, making them manageable for training on personal hardware.

The MNIST and CIFAR-10 datasets were used exclusively with the GAN as secondary datasets for testing purposes. These datasets helped validate the GAN implementation when initial attempts at generating Kanji produced poor results due to my own implementation mistakes. Since MNIST (handwritten digits) and CIFAR-10 (low-resolution color images of objects) are commonly used in generative modeling and are easier to model well, they provided a more reliable baseline for debugging and ensuring the GAN was working correctly. Once the GAN produced satisfactory results on these datasets, I returned to the Kanji Dataset for further experimentation.

Kanji Dataset Preprocessing

To prepare the data for training, the Kanji images were preprocessed using a custom PyTorch Dataset class. This applied a composite transformation to the data to prepare it for input into the models. First, all Kanji images were resized to 64x64 pixels to ensure uniform input size across the models and accommodate deeper pooling layers in the neural networks. Next, pixel values were normalized to meet the requirements of each model: [0, 1] for the VAE's binary cross-entropy objective and [-1, 1] for the GAN and DDPM.

A random horizontal flip was applied to the Kanji images as a data augmentation technique. Most Kanji characters are asymmetrical, so flipped images are no longer valid characters; even so, the augmentation effectively doubled the training dataset size and introduced variation that improved the models’ generalization. This trade-off between introducing distortion and increasing data diversity was deemed worthwhile.

See the KanjiDataset class and the transform variable in VAE/main.py.
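
As a rough illustration, here is a minimal sketch of what that dataset and transform could look like; the directory layout, flip probability, and exact normalization are assumptions, not the repository’s actual code.

```python
# Minimal sketch of a KanjiDataset and its transform (details are assumptions).
import os
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

transform = transforms.Compose([
    transforms.Resize((64, 64)),             # uniform 64x64 input size
    transforms.RandomHorizontalFlip(p=0.5),  # augmentation described above
    transforms.ToTensor(),                   # grayscale -> float tensor in [0, 1]
])

class KanjiDataset(Dataset):
    def __init__(self, root, transform=None):
        self.paths = [os.path.join(root, f) for f in sorted(os.listdir(root))]
        self.transform = transform

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        img = Image.open(self.paths[idx]).convert("L")  # one grayscale channel
        return self.transform(img) if self.transform else img
```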

CIFAR-10 and MNIST Preprocessing

When using CIFAR-10 and MNIST datasets for GAN testing, a similar preprocessing approach was applied. CIFAR-10 images, being natural color images, were normalized with three channels ([R, G, B]) to the range [-1, 1], while MNIST grayscale images used a single channel. Both datasets were resized to 64x64 pixels to match the input size expected by the GAN architecture. These preprocessing steps ensured compatibility between the datasets and the implemented GAN, enabling seamless transitions between debugging and Kanji experiments.
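
A sketch of how that normalization might be expressed with torchvision; a mean and standard deviation of 0.5 per channel maps pixels into [-1, 1], though the repository’s exact values are not shown here.

```python
from torchvision import datasets, transforms

# CIFAR-10: three channels ([R, G, B]) normalized to [-1, 1]
cifar_tf = transforms.Compose([
    transforms.Resize((64, 64)),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

# MNIST: a single grayscale channel, also normalized to [-1, 1]
mnist_tf = transforms.Compose([
    transforms.Resize((64, 64)),
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),
])

cifar = datasets.CIFAR10("data", train=True, download=True, transform=cifar_tf)
mnist = datasets.MNIST("data", train=True, download=True, transform=mnist_tf)
```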

Data Loading

The preprocessed images were loaded using a PyTorch DataLoader, which enabled batch processing and efficient feeding of data to the models. For the Kanji Dataset, a custom Dataset class was used to load the images from disk, perform transformations, and apply horizontal flips dynamically during training. For CIFAR-10 and MNIST, the respective PyTorch datasets provided built-in functionality to handle download and loading. The DataLoader was configured with multi-threaded workers (num_workers=4), memory pinning (pin_memory=True), and prefetching (prefetch_factor=2) to optimize the throughput of data pipelines, minimizing I/O bottlenecks during training.

See the dataloader variable in VAE/main.py.
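
The DataLoader settings named above translate directly into PyTorch; the batch size below is an assumption, as the text does not state it.

```python
from torch.utils.data import DataLoader

dataloader = DataLoader(
    dataset,            # e.g. the KanjiDataset sketched earlier
    batch_size=64,      # assumed; not specified in the text
    shuffle=True,
    num_workers=4,      # multi-process loading
    pin_memory=True,    # faster host-to-GPU transfers
    prefetch_factor=2,  # batches prefetched per worker
)
```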

Sample Kanji Data

Model Creation

As mentioned previously, three techniques were used in this generative AI exploration. I started with the most straightforward approach and moved to more complex techniques once satisfactory results were achieved. This progression allowed for early successes, though it soon gave way to periods of exasperation as the complexity increased.

The first model I attempted, the Variational Autoencoder (VAE), produced decent results almost immediately, within about three hours of starting the project. Its relatively simple architecture and objective made it an excellent starting point for the exploration.

The Generative Adversarial Network (GAN), on the other hand, was much more finicky. I spent many hours over several days troubleshooting why it wouldn’t produce good results, even on the simpler test datasets like MNIST and CIFAR-10. Eventually, I found the issue with my configuration and made the necessary adjustments. While the process was frustrating, it taught me valuable lessons about GAN instability, debugging complex models, and the importance of iterative introduction of complexity.

Finally, I implemented the Denoising Diffusion Probabilistic Model (DDPM). While its results are promising, I am not yet completely satisfied with my implementation and believe there is significant room for improvement. There are certainly improvements I could make to all of my models, but the DDPM currently feels furthest from satisfactory, and I look forward to further exploring and refining this technique in the future.

Variational AutoEncoder (VAE)

Overview

The Variational AutoEncoder (VAE) is a generative model that combines elements of probability and deep learning to create structured outputs. It is composed of two main components: an encoder, which compresses input data into a continuous latent space, and a decoder, which reconstructs the original input based on sampled points from this latent space. Unlike traditional (non-variational) autoencoders, VAEs incorporate a probabilistic latent representation. Specifically, the encoder outputs parameters, typically the mean (mu) and log-variance (logvar), that define a Gaussian distribution, allowing the decoder to sample diverse yet meaningful variations.

VAEs were an ideal starting point for this project because of their simplicity and well-defined architecture. Unlike other generative techniques, VAEs are straightforward to implement and train, requiring only a combination of reconstruction loss (e.g., binary cross-entropy) and regularization via the Kullback-Leibler (KL) divergence in the latent space. These properties also make VAEs relatively stable during training, even with limited computational resources. The early success obtained from implementing a VAE built confidence and set the foundation for exploring more complex generative models later in the project.

Architecture

Two main design patterns were implemented for the VAE architecture: one without residual connections and one with a ResNet-like design. The simpler architecture without residual connections served as a good starting point, but it quickly became evident that the ResNet-like architecture performed significantly better. The ResNet-inspired design was chosen based on prior success with similar architectures for classification tasks and because of its ease of implementation. Additionally, the encoder-decoder nature of the VAE naturally benefits from a symmetrical structure centered around the latent space, which was straightforward to implement using this design paradigm.

Both the encoder and decoder were composed of seven residual blocks. The encoder blocks progressively reduced the spatial dimensions of the input, with convolutional channels increasing from 64 to 512, while the decoder symmetrically reversed this process, with channels decreasing from 512 to 64. Each residual block included two convolutional layers with a kernel size of 3x3, batch normalization, and ReLU activations, as well as a shortcut connection to stabilize training and improve gradient flow. Downsampling in the encoder was achieved via strided convolutions, while upsampling in the decoder used transposed convolutions with similar strides and an output padding of 1 where required.
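
A minimal sketch of one such encoder block, assuming a standard ResNet pattern; the exact block in the repository may differ.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Encoder-style residual block: two 3x3 convs with BN and ReLU, plus a shortcut."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # 1x1 shortcut when the shape changes (stride > 1 or channel growth)
        self.shortcut = nn.Identity()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        return torch.relu(self.body(x) + self.shortcut(x))
```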

At the bottleneck, the encoder flattened the output of the final residual block into a 1D vector, which was passed through two fully connected layers to produce the mean (mu) and log-variance (logvar) vectors for the latent space. These vectors defined the Gaussian distribution from which latent variable samples were drawn during training using the reparameterization trick. The latent dimension was set to 256 for this project, which was found to work well.
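
The reparameterization trick itself is only a few lines; a standard sketch:

```python
import torch

def reparameterize(mu, logvar):
    # z = mu + sigma * eps with eps ~ N(0, I); sampling stays differentiable
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + eps * std
```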

The decoder mirrored the encoder architecture, with its input being the flattened latent vector transformed back into a 3D tensor using a fully connected layer followed by an unflattening operation. The decoder’s residual blocks progressively upsampled the spatial dimensions while reducing the number of channels symmetrically. A final series of convolutional layers processed the decoded output to match the original image size, with the final layer producing a single grayscale channel.

This ResNet-inspired VAE architecture demonstrated stability during training and effectively captured the complex structures of Kanji for generating plausible reconstructions. Its symmetric encoder-decoder design, combined with residual connections, proved crucial for the model’s success.

Files for code: see VAE/main.py in the repository.

Training process

Training of the Variational AutoEncoder (VAE) was implemented using PyTorch Lightning for efficient and modular workflow management. The VAE is trained to minimize a combination of two loss terms: the reconstruction loss and the Kullback-Leibler (KL) divergence. The reconstruction loss, calculated using binary cross-entropy with logits, measures how accurately the decoder reconstructs the input images from their latent representations. The KL divergence regularizes the latent space by encouraging the learned distribution to match a Gaussian prior. The total loss for each batch is the sum of these two terms, ensuring both high-quality reconstructions and a well-structured latent space.
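
A sketch of that loss, using the standard closed-form KL term for a Gaussian posterior against a standard-normal prior; the reduction choice is an assumption.

```python
import torch
import torch.nn.functional as F

def vae_loss(recon_logits, x, mu, logvar):
    # Reconstruction: binary cross-entropy with logits against pixels in [0, 1]
    recon = F.binary_cross_entropy_with_logits(recon_logits, x, reduction="sum")
    # KL divergence between N(mu, sigma^2) and the N(0, I) prior
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```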

For optimization, the Adam optimizer was used, with learning rate scheduling handled by a OneCycleLR scheduler. This scheduler smoothly adjusts the learning rate over the training process, starting small, increasing to a peak of 1e-3, and then gradually decreasing, which helps stabilize training and improve convergence.
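
In a LightningModule this might look like the following; the total step count is inferred from the trainer, and the remaining arguments are assumptions.

```python
import torch

def configure_optimizers(self):
    opt = torch.optim.Adam(self.parameters(), lr=1e-3)
    sched = torch.optim.lr_scheduler.OneCycleLR(
        opt,
        max_lr=1e-3,  # peak learning rate noted above
        total_steps=self.trainer.estimated_stepping_batches,
    )
    return {"optimizer": opt,
            "lr_scheduler": {"scheduler": sched, "interval": "step"}}
```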

During training, the model logs the total training loss (train_loss) for tracking performance across epochs. The model also generates sample reconstructions from random latent vectors at regular intervals (every “sample_every” epochs). These samples are saved as grids of images, providing visual feedback on the VAE’s ability to produce realistic and diverse Kanji-like characters.

The training loop is managed using PyTorch Lightning’s Trainer class, which automates key tasks like device management, checkpointing, and logging. bf16-mixed precision was used to speed up training while reducing memory usage on compatible GPUs.
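
A minimal Trainer invocation matching that description; max_epochs reflects the 100-epoch runs shown in the results, and the other arguments are defaults or assumptions.

```python
import pytorch_lightning as pl

trainer = pl.Trainer(
    max_epochs=100,
    precision="bf16-mixed",  # mixed precision on compatible GPUs
    accelerator="auto",
)
trainer.fit(vae_model, dataloader)  # vae_model: the LightningModule VAE
```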

Files for code: see VAE/main.py in the repository.

Hurdles and Lessons Learned

As stated previously, this method worked well early on, and there weren’t many significant hurdles. The main challenge was ensuring that the Conv2d and ConvTranspose2d layers produced tensors of the correct size, upsampling and downsampling as intended. The equation provided in the PyTorch documentation for calculating the height and width of the output tensor proved invaluable in resolving these issues.
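
For reference, the formulas in question (per spatial dimension, with padding p, dilation d, kernel size k, and stride s):

```latex
\text{Conv2d:} \quad
H_{\text{out}} = \left\lfloor \frac{H_{\text{in}} + 2p - d\,(k - 1) - 1}{s} + 1 \right\rfloor

\text{ConvTranspose2d:} \quad
H_{\text{out}} = (H_{\text{in}} - 1)\,s - 2p + d\,(k - 1) + \text{output\_padding} + 1
```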

Additionally, the blocks created for the encoder and decoder were not only effective for the VAE but also served as a solid foundation for designing the architectures of the GAN and DDPM later in the project. This reusability of components significantly streamlined the implementation of the other models.

Results

VAE epoch 10 VAE epoch 20 VAE epoch 100

These images were inverted during the sampling process, a post-processing step that was only applied to the VAE samples. When testing the GAN with the CIFAR-10 dataset, I found this inversion to be undesirable, so it was omitted for other models.

After just 10 epochs, the sampled images already begin to resemble Kanji. While they are still quite blurred, the foundational shapes of the characters are evident. By epoch 20, the images show significant improvement, with much clearer structures, though some areas remain blurry and lack fine detail. After 100 epochs, the reconstructions have improved considerably, and the images appear well-formed and structured.

Although the final outputs are not convincing enough to fool someone into believing they are part of the training dataset, the amount of structure the model can reconstruct from the latent space is impressive.

Generative Adversarial Network (GAN)

Overview

Generative Adversarial Networks (GANs) are a class of generative models that consist of two neural networks, a generator and a discriminator, working in opposition to improve each other’s performance. The generator creates synthetic images from random latent vectors, attempting to mimic the real data distribution. The discriminator evaluates images and attempts to distinguish between real and generated (fake) samples. This adversarial training setup encourages the generator to produce increasingly realistic images while the discriminator improves its ability to identify fake samples.

GANs were chosen as the second generative modeling technique for this project due to their ability to create high-quality images after sufficient training. However, their instability during training, sensitivity to hyperparameters, and reliance on balanced updates to the generator and discriminator make them more challenging to work with compared to Variational AutoEncoders (VAEs). Debugging these issues required testing with simpler datasets, such as MNIST and CIFAR-10, before returning to the Kanji dataset for experimentation. Despite these hurdles, GANs proved capable of creating visually compelling and structured Kanji-like images.

Architecture

The GAN implementation for this project used both ResNet-inspired and vanilla convolutional architectures for the generator and discriminator. These architectures were configurable using command-line arguments to allow for flexible experimentation. Below is an explanation of the key components:

Generator Architecture:

Discriminator Architecture:

The ResNet-inspired versions generally yielded better results, offering improved stability and a stronger capability to capture complex features within the data. All Conv2d and ConvTranspose2d layers utilized a kernel size of 3x3. Additionally, BatchNorm2d was intentionally excluded from the discriminator to ensure its predictions remained independent of batch-level statistics.
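
As a small illustration of that last point, a discriminator downsampling block without BatchNorm2d might look like this; the activation and stride choices are assumptions.

```python
import torch.nn as nn

def disc_block(in_ch, out_ch):
    # No BatchNorm2d: predictions stay independent of batch-level statistics
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),  # 3x3 kernel, stride-2 downsample
        nn.LeakyReLU(0.2, inplace=True),
    )
```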

Files for code:

Training Process

The GAN was trained in an adversarial manner, alternating between updates to the generator and the discriminator.

Training the Discriminator:

Training the Generator:

To enhance the discriminator’s ability to teach the generator, the discriminator was trained for multiple iterations per batch compared to the generator. This ratio, tunable via the --disc_ratio argument, was set to a default of 5 iterations per generator update during training.

Optimization was performed using the Adam optimizer, with a learning rate of 1e-4 and beta parameters (0.0, 0.99) for the generator.
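
Putting the last two paragraphs together, a sketch of one adversarial update with hinge loss (one of the losses mentioned later in this section); the discriminator’s optimizer settings, the latent size, and the generator/discriminator modules themselves are assumptions.

```python
import torch

opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.0, 0.99))
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4, betas=(0.0, 0.99))

def adversarial_step(real, latent_dim=128, disc_ratio=5):
    # Train the discriminator disc_ratio times per generator update
    for _ in range(disc_ratio):
        z = torch.randn(real.size(0), latent_dim, device=real.device)
        fake = generator(z).detach()
        d_loss = (torch.relu(1.0 - discriminator(real)).mean()
                  + torch.relu(1.0 + discriminator(fake)).mean())  # hinge loss
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator tries to raise the discriminator's score on its fakes
    z = torch.randn(real.size(0), latent_dim, device=real.device)
    g_loss = -discriminator(generator(z)).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```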

To monitor progress, samples were generated and saved at regular intervals (--sample_every epochs) by passing random latent vectors through the generator. These samples were stored as image grids to visually track improvements in image quality over time. Additionally, the training losses for both the generator (g_loss) and discriminator (d_loss) were logged during training for performance evaluation.

The GAN training was implemented using PyTorch Lightning’s Trainer class to automate key tasks such as logging, checkpointing, and hardware management. This modular setup enabled efficient experimentation with different architectures, datasets (Kanji, MNIST, CIFAR-10), and loss functions. Bf16-mixed precision was not used.

Files for code:

Hurdles and Lessons Learned

When transitioning to the GAN, I attempted to apply lessons from the VAE too quickly, assuming that previously successful techniques would work universally. However, features such as the OneCycleLR learning rate scheduler and bf16-mixed precision proved detrimental to the GAN’s performance. These choices prevented the model from converging and led to generated images that did not resemble the training data. Identifying these problems required several days of experimentation, which was further exacerbated by the long training times of the GAN, limiting the number of experiments I could run each day.

The lengthy training times were primarily caused by the additional overhead introduced by training the discriminator multiple times per batch (--disc_ratio) and the computational cost of loss calculations such as hinge loss and Wasserstein loss with gradient penalties. Each epoch took considerable time, making rapid iteration difficult and requiring a more organized approach to debugging.

To improve my workflow, I introduced better experimentation methods, such as starting a log of experiments. This log allowed me to track the changes I was making more effectively and focus on small, incremental adjustments instead of sweeping modifications. I found that making too many changes at once prevented me from pinpointing the cause of any improvements or regressions. Keeping detailed records and methodically testing single modifications became critical, especially with the long training times involved.

Ultimately, after days of frustration, I identified the culprits inhibiting model performance and found a configuration that worked. Through this process, I learned an important lesson: avoid adding unnecessary features or complexity early on. Gradually ramping up complexity provides a stable baseline and makes debugging much easier. Starting with simpler configurations would have made it far easier to identify the issues with the learning rate scheduler and mixed precision from the outset.

Results

GAN epoch 1 GAN epoch 5 GAN epoch 25 GAN epoch 100

Even after 1 epoch, the model starts to recognize where the majority of the lines in the Kanji are located. However, as training progressed, the discriminator began to overpower the generator, causing the generator to “play it safe” by producing mostly black images. This imbalance persisted for a short period until, by epoch 25, the generator began to recover and produce patterns resembling lines, albeit still far from structured Kanji.

By epoch 100, the images generated by the GAN closely resemble Kanji, with some even appearing realistic enough to be mistaken for input images. The lines are well-formed and structured, and the outputs capture the overall essence of Kanji characters.

I am happy with these results, especially considering how quickly they were achieved after resolving the initial issues with the training configuration. While I did attempt further iterations and adjustments to improve the GAN’s performance, the results did not improve significantly beyond this point. I believe I got lucky and struck a good balance between generator and discriminator early on after fixing the issues preventing convergence.

Denoising Diffusion Probabilistic Model (DDPM)

Overview

The Denoising Diffusion Probabilistic Model (DDPM) is a generative model that progressively transforms random noise into coherent images through an iterative denoising process. During training, Gaussian noise is added to input images according to a randomly sampled timestep and a predefined beta schedule, which controls the amount of noise introduced at each step. The model is trained to predict and remove the noise at each step in this process. When generating new images, the DDPM works in reverse, starting with pure noise and iteratively refining it by predicting and removing noise, step by step, until a coherent image emerges.
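
The forward (noising) step has a convenient closed form from Ho et al. (2020); a sketch, where alpha_bar is the cumulative product of (1 - beta) over timesteps:

```python
import torch

def q_sample(x0, t, alpha_bar):
    # Noise a clean image x0 directly to timestep t:
    # x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
    noise = torch.randn_like(x0)
    ab = alpha_bar[t].view(-1, 1, 1, 1)  # broadcast over (B, C, H, W)
    return ab.sqrt() * x0 + (1 - ab).sqrt() * noise, noise
```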

This method was chosen as the final model in the project because it was the least familiar to me at the outset. Implementing the DDPM served as the “icing on the cake” to the exploration of generative AI. Even though my understanding of the technique was initially limited, I was able to produce encouraging results, albeit with room for improvement.

Architecture

The DDPM implementation utilizes a U-Net-like architecture as the core denoising model, tailored for the iterative reverse diffusion process. The U-Net consists of encoder and decoder paths with skip connections to preserve and share spatial information between layers. Below are the key components of the architecture:

Input Layer:

Downsampling Path (Encoder):

Bottleneck:

Upsampling Path (Decoder):

Output Layer:

This U-Net design makes the model well-suited for predicting and removing noise effectively at each timestep. The sinusoidal embeddings provide temporal information, ensuring the model understands which timestep it is operating on.

Convolutional layers in the network used a kernel size of 3x3, and a group count of 4 was found to work well for the GroupNorm layers.
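
Two small pieces of that description in code: the sinusoidal timestep embedding (standard form, assuming an even embedding width) and a conv block using GroupNorm with 4 groups (the activation is an assumption).

```python
import math
import torch
import torch.nn as nn

def timestep_embedding(t, dim):
    # Sinusoidal embedding of integer timesteps t: shape (B,) -> (B, dim)
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

conv_block = nn.Sequential(
    nn.Conv2d(64, 64, 3, padding=1),  # 3x3 kernel as described
    nn.GroupNorm(4, 64),              # 4 groups worked well per the text
    nn.SiLU(),                        # activation choice is an assumption
)
```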

File for code:

Training Process

Training the DDPM centers on minimizing the mean squared error (MSE) between the predicted noise and the true Gaussian noise added during the forward diffusion process. This encourages the model to accurately predict and remove noise during the reverse process. Below is a breakdown of the training process; a code sketch of the core training step follows the outline:

Forward Diffusion Process:

Reverse Denoising Process (Training Step):

Optimization Details:

Sampling Process:

Efficiency Considerations:
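
As referenced above, a sketch of the core training step: the model, optimizer, and timestep count are assumptions, and q_sample is the forward-diffusion helper sketched earlier.

```python
import torch
import torch.nn.functional as F

def ddpm_training_step(model, optimizer, x0, alpha_bar, num_timesteps=1000):
    # Pick a random timestep per image, noise the batch, and regress the noise
    t = torch.randint(0, num_timesteps, (x0.size(0),), device=x0.device)
    x_t, noise = q_sample(x0, t, alpha_bar)   # forward diffusion (see earlier sketch)
    pred = model(x_t, t)                      # U-Net predicts the added noise
    loss = F.mse_loss(pred, noise)            # MSE objective described above
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```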

Files for code:

Hurdles and Lessons Learned

Training the DDPM presented several challenges, particularly in achieving stability with the generated images. One recurring issue was the model producing high-quality images during training, only to revert to random noise a few epochs later. After careful experimentation, I determined that the optimizer may have been making overly aggressive parameter updates, which destabilized the training process. Introducing a cosine annealing learning rate scheduler resolved this issue, allowing the model to converge more smoothly and produce stable samples by the end of training.
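
The scheduler change amounts to one line; T_max here is an assumption (one cosine cycle over the whole run), and optimizer and num_epochs are assumed to already exist.

```python
import torch

scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)
```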

Additionally, switching from a linear beta schedule to a cosine beta schedule for the diffusion process led to further improvements by providing a more gradual and predictable noise progression across timesteps. This adjustment helped the model better learn to reverse the noising dynamics, improving the quality of generated images.
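
A sketch of the cosine beta schedule from Nichol & Dhariwal (2021), with their small offset s:

```python
import math
import torch

def cosine_beta_schedule(T, s=0.008):
    steps = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos(((steps / T) + s) / (1 + s) * math.pi / 2) ** 2
    alpha_bar = f / f[0]                          # cumulative alpha products
    betas = 1 - (alpha_bar[1:] / alpha_bar[:-1])  # recover per-step betas
    return betas.clamp(max=0.999).float()         # cap betas as in the paper
```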

Inspired by an example repository I was referencing, I attempted to enhance the model by incorporating triplet attention mechanisms between each block in the U-Net. While implementing this feature was an interesting exercise, it did not result in noticeable improvements to the model’s performance.

Despite the DDPM’s larger and deeper architecture compared to the GAN, training was significantly faster, and it was easier to iterate on improvements.

DDPM Addendum

After returning to the DDPM to produce FID scores, I found an issue in the post-processing that produced noise-like artifacts in the sampled images. Because the DDPM predicts noise rather than the pixel values themselves, the sampled values were not confined to the range [-1, 1]. When a value exceeded this range and went through the post-processing steps, particularly the cast to uint8, it wrapped around to the other end of the pixel-value range; for example, values just below 0 became values near 255. Introducing a clamp operation alleviated the issue and produced much cleaner images, as seen below.
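
The fix itself is small; a sketch of the corrected post-processing, assuming samples nominally in [-1, 1]:

```python
import torch

def to_uint8(x):
    x = x.clamp(-1.0, 1.0)       # the missing step: stops out-of-range wraparound
    x = (x + 1.0) / 2.0 * 255.0  # rescale [-1, 1] -> [0, 255]
    return x.to(torch.uint8)
```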

This issue was not obvious to detect, and resolving the artifacts took much troubleshooting. I initially thought the problem lay with the model, as the samples appeared to contain noise left over from the denoising process. Through this troubleshooting I learned new techniques I might not have otherwise, such as implementing a cosine beta schedule and testing self-attention mechanisms.

Results

DDPM epoch 1 DDPM epoch 4 DDPM epoch 24 DDPM epoch 96

After the first epoch, the model has not yet learned how to remove noise effectively, resulting in outputs that are indistinguishable from random noise. By epoch 4, the model still produces mostly noise, but the outputs begin to show variation depending on the initial random conditions, indicating the model is starting to respond differently to different noise inputs.

By epoch 24, Kanji-like shapes start to emerge in some samples, though the results remain inconsistent. Some denoising runs yield promising images with visible structure, while others fail to remove enough noise to produce noteworthy results.

After 96 epochs, the DDPM generates images that are quite promising, though not as polished as those produced by the GAN. The lines in the images are cleaner, and the overall structure of the generated characters often resembles Kanji more closely. However, residual noise remains in some areas, and the denoising process does not always fully clean up the image. Despite this, the results demonstrate the DDPM’s potential, and I believe there are further architectural and hyperparameter improvements that could elevate the quality of the outputs significantly.

Comparative Analysis

Each of the three generative models (VAE, GAN, and DDPM) offers unique strengths and challenges for Kanji generation. While the GAN currently produces the most polished results, the DDPM shows the greatest potential for refinement. The VAE, as the simplest and most stable model, served as an excellent starting point for the exploration but is limited by its inherent architecture and training constraints. Below is a deeper comparison of the three models across various dimensions:

Output Quality

Training Stability and Workflow

Potential for Improvement

Summary

The VAE was the easiest to implement and provided a strong foundation for exploring generative AI techniques. While its results were decent, they were limited by the latent space’s lossy reconstruction process. The GAN, while producing the cleanest outputs, required significant effort to troubleshoot and train due to its inherent instability and slow training times. The DDPM, with its faster iteration cycles and promising results, shows the greatest potential for refinement. If its residual noise can be eliminated, the DDPM could achieve results superior to the GAN, combining clean structural fidelity with the robustness of iterative denoising.

Challenges and Lessons Learned

Throughout this project, each generative model presented unique challenges and opportunities for learning. The VAE offered early success, serving as a stable and straightforward starting point. However, adapting lessons from the VAE to the more complex GAN revealed critical hurdles, such as instability and sensitivity to hyperparameters. The GAN’s reliance on balanced adversarial updates and longer training times significantly slowed experimentation, highlighting the importance of carefully scaling complexity and maintaining detailed logs of changes. This experience emphasized the value of incremental adjustments and systematic debugging, allowing me to pinpoint issues like the impact of learning rate scheduling and mixed precision on GAN performance.

The DDPM introduced a different set of challenges, particularly in stabilizing its outputs. Although the model repeatedly reverted to noise during training, applying a cosine annealing learning rate and beta schedules dramatically improved convergence. Training the DDPM was notably faster and more predictable than the GAN, offering a more iterative workflow. This ease of experimentation provided valuable insights into optimizing architectural components, such as embedding timestep information, while reinforcing the importance of refining features gradually.

Across these models, I learned to prioritize simplicity in early configurations, document experiments meticulously, and adapt methodologies based on the unique demands of each model. These lessons will serve as a foundation for future explorations in generative AI.

Applications and Future Work

While the Kanji images generated by these models may not have immediate practical applications, the methods and foundational knowledge gained through this project are invaluable for future explorations in AI image generation. The experience of implementing, troubleshooting, and refining three distinct generative techniques has greatly expanded my understanding of advanced machine learning concepts.

Key technical details such as Wasserstein distance, hinge loss, beta schedules, and sinusoidal position embeddings were unfamiliar to me at the start of this project but have now become integral to my repertoire. These tools and techniques are not only useful for improving generative models but also applicable to broader deep-learning tasks, from optimizing architectures to enhancing training stability.

Looking ahead, there are several avenues for extending this work. Refining the DDPM architecture, particularly in eliminating residual noise and improving sample consistency, could elevate its performance further. Incorporating attention mechanisms or experimenting with alternative sampling methods might yield significant improvements. Additionally, applying these techniques to other structured or artistic datasets, such as handwritten text in different languages, abstract art, or faces, could uncover new insights and creative possibilities.

Beyond technical improvements, these models could eventually contribute to applications like artistic creation, data augmentation, or even assisting language learners by generating visually consistent characters for educational materials. As AI image generation techniques continue to evolve, having a strong technical foundation will enable me to tackle more ambitious projects and explore creative applications with greater depth and confidence.

Conclusion

This project explored three generative modeling techniques (VAEs, GANs, and DDPMs) to synthesize Kanji-like images, offering valuable insights into their strengths, limitations, and workflows. Each model presented unique challenges, from the simplicity and stability of the VAE to the instability and fine-tuning requirements of the GAN, and finally to the iterative refinement and potential of the DDPM. While the generated images may not yet have practical applications, the technical knowledge and experience gained, ranging from architectural design to training stability, form a strong foundation for future generative AI projects.

The results demonstrated the GAN’s ability to produce the cleanest Kanji-like images, but the DDPM’s structured denoising process shows greater promise with further refinement. As generative AI continues to evolve, these methods, tools, and lessons learned will help tackle more ambitious tasks and push the boundaries of what these models can achieve.

Credits

GitHub Repositories

GAN related repositories:

pytorch-spectral-normalization-gan

PyTorch-GAN

Diffusion related repositories:

Diffusion-Models-pytorch

denoising-diffusion-pytorch

Triplet Attention:

triplet-attention

Useful Book

Aurelien Geron. 2019. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems (2nd ed.). O’Reilly Media, Inc.

Useful Papers

Variational AutoEncoder (VAE):

Kingma, D. P., & Welling, M. (2014). Auto-Encoding Variational Bayes. International Conference on Learning Representations (ICLR).

Generative Adversarial Network (GAN):

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative Adversarial Networks. Advances in Neural Information Processing Systems (NeurIPS).

Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., & Courville, A. (2017). Improved Training of Wasserstein GANs. Advances in Neural Information Processing Systems (NeurIPS).

Denoising Diffusion Probabilistic Model (DDPM):

Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. Advances in Neural Information Processing Systems (NeurIPS).

Nichol, A. Q., & Dhariwal, P. (2021). Improved Denoising Diffusion Probabilistic Models. International Conference on Machine Learning (ICML).

U-Net Architecture (used in DDPM):

Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. Medical Image Computing and Computer-Assisted Intervention (MICCAI).

Sinusoidal Position Embeddings (used in DDPM):

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems (NeurIPS).