My Journey into Training Transformer Models on FOREX: Lessons and Insights
Earlier this year, I completed a postgraduate program in Artificial Intelligence and Machine Learning from the University of Texas, offered in collaboration with Great Learning. After the program, I embarked on a hands-on project to train a model using the transformer architecture, a journey that has been both fascinating and challenging.
My objective was to gain practical experience by building a model that could predict the next set of candles on a FOREX chart, specifically EUR/USD at 10-minute intervals. This project allowed me to explore many aspects of AI model development, from data acquisition to model training on GPU clusters.
Key Highlights of the Journey:
1. Data Acquisition: The first hurdle was gathering enough data for training. I aimed to collect EUR/USD 10-minute interval data going back as far as possible, yet even after pulling from every available source I could gather just under a million candles. This underscored how hard it is to obtain large, high-quality datasets for financial markets. Despite the limitation, I decided the available data would suffice for the project's scope. Along the way I also encountered and resolved technical challenges in data processing, such as knowing when to use multiprocessing versus multithreading, an essential skill for handling large datasets efficiently (a minimal sketch of this distinction follows after this list).
2. Transformer Model Code Base: To implement the transformer architecture, I turned to Andrej Karpathy's nanoGPT code base, which served as an excellent starting point. This open-source project provided a solid foundation for building and training transformer models, and the experience reinforced the value of leveraging existing resources and focusing on refining and extending them to meet specific project needs. Here is a link to his tutorial on building GPT from scratch: https://youtu.be/kCc8FmEb1nY?si=2tArxqZlTM0hP952
3. GPU for Training: I initially attempted to train the model on a CPU and quickly realized this was not feasible because of the time required. The next logical step was a GPU-based environment. Google Colab became my go-to platform because of its ease of use, and I recommend it especially for those less familiar with setting up cloud environments and CUDA; a minimal sketch of the CPU-to-GPU switch follows after this list. However, as the model grew more complex, so did its need for GPU memory. That led me to upgrade to Colab Pro, where I trimmed the dataset to fit within the available memory. Even so, I soon hit the limits of GPU availability and memory, prompting a shift to other platforms.
4. Hyperparameter Tuning: One of the most critical aspects of training a transformer model is tuning its hyperparameters. I experimented with batch_size, block_size, n_embed, n_head, and n_layer to observe their impact on training efficiency and model accuracy, and through this iterative process reduced the model's loss from 5.0 to approximately 1.0, a significant improvement. I also noticed a form of hallucination when the model generates a context longer than the block size it was trained on (see the configuration sketch after this list). However, the constraints of my training environment on Google Colab, such as insufficient A100 GPU memory and frequent disconnections, made it hard to maintain a consistent workflow. That realization led me to explore more stable and powerful computing environments.
5. Exploring GPU Providers: As my need for reliable GPU resources grew, I explored platforms such as Hyperstack and Vast.ai. These platforms offered bare-metal GPU access, but the time and cost of setting up the environment on those machines were considerable, which highlighted the trade-offs between cost, setup time, and performance. A breakthrough came when I received $1,000 in Azure credits through Microsoft's startup program, Microsoft for Startups Founders Hub, which allowed me to scale my resources effectively. Even with this support, I ran into the financial reality of high computational costs: setting up the environment alone consumed a significant portion of the credits.
6. Training on Single vs. Multiple GPUs: When my Azure GPU quota was increased to allow four GPUs, I encountered the complexities of multi-GPU training. Unlike single-GPU training, where the model and data are simply loaded onto one device, multi-GPU training requires careful thought about how to distribute both. I experimented with data parallelism and model parallelism, eventually settling on Distributed Data Parallel (DDP) as the most effective strategy. DDP let me spread the training data across the GPUs while each GPU worked on a synchronized copy of the model, which significantly accelerated training and improved the model's performance, demonstrating the power of parallel computing in deep learning. The minimal code sketches mentioned in the items above follow; after them, here is a result of the model:
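For item 1, here is a minimal sketch of the distinction I mean: threads for I/O-bound work like downloading data, processes for CPU-bound work like crunching candles. The names in it (fetch_chunk, normalize_candles, CHUNK_URLS, CSV_FILES) are hypothetical placeholders, not my actual pipeline.

```python
# Minimal sketch: threads for I/O-bound downloads, processes for CPU-bound
# candle processing. All names here are hypothetical placeholders.
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

import pandas as pd

CHUNK_URLS = []   # hypothetical: URLs of raw EUR/USD 10-minute data chunks
CSV_FILES = []    # hypothetical: paths of downloaded CSV files

def fetch_chunk(url):
    # I/O-bound: the worker mostly waits on the network, so threads are
    # enough despite the GIL.
    return pd.read_csv(url)

def normalize_candles(path):
    # CPU-bound: per-row number crunching, so separate processes sidestep the GIL.
    df = pd.read_csv(path)
    cols = ["open", "high", "low", "close"]
    return (df[cols] - df["close"].mean()) / df["close"].std()

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=8) as pool:    # I/O-bound -> threads
        raw_chunks = list(pool.map(fetch_chunk, CHUNK_URLS))

    with ProcessPoolExecutor(max_workers=4) as pool:   # CPU-bound -> processes
        processed = list(pool.map(normalize_candles, CSV_FILES))
```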
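For item 3, a minimal sketch of the CPU-to-GPU switch in PyTorch; the tensor shape is illustrative, and the memory print-out is simply the kind of number I kept watching as the model grew.

```python
import torch

# Pick the GPU when one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Illustrative (batch, block, embedding)-sized tensor, moved to the chosen device.
x = torch.randn(64, 256, 384).to(device)

if device == "cuda":
    # Watching these numbers is what drove the moves from free Colab
    # to Colab Pro and then to dedicated GPU providers.
    print(torch.cuda.get_device_name(0))
    print(f"{torch.cuda.memory_allocated() / 1e9:.2f} GB allocated")
```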
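For item 4, the knobs I was turning, written as a nanoGPT-style configuration. The values are illustrative rather than my final settings, and the context-cropping helper mirrors what nanoGPT's generation loop does, which is the boundary where I saw the hallucination behaviour.

```python
import torch

# The knobs named in item 4, nanoGPT-style. Values are illustrative only,
# not the settings that brought the loss down to ~1.0.
batch_size = 64     # sequences per training step; the main driver of GPU memory use
block_size = 256    # context length (in candles) the model is trained on
n_embed = 384       # embedding width (nanoGPT itself spells this n_embd)
n_head = 6          # attention heads; n_embed must divide evenly by n_head
n_layer = 6         # number of stacked transformer blocks

def crop_context(idx: torch.Tensor) -> torch.Tensor:
    # nanoGPT-style generation keeps only the last block_size tokens of the
    # running context; feeding the model a longer context than it was trained
    # on is where the "hallucination" described above shows up.
    return idx if idx.size(1) <= block_size else idx[:, -block_size:]
```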
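For item 6, a minimal sketch of the DDP pattern I describe, assuming a single node launched with torchrun across four GPUs; the tiny linear model and random tensors are stand-ins for the real transformer and candle dataset.

```python
# Minimal DDP sketch. Launch with: torchrun --nproc_per_node=4 train_ddp.py
# The tiny model and random data are stand-ins for the real transformer and dataset.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    dist.init_process_group(backend="nccl")          # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])       # set by torchrun
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(4, 4).cuda(local_rank)   # stand-in for the transformer
    model = DDP(model, device_ids=[local_rank])      # gradients sync across GPUs

    data = TensorDataset(torch.randn(1024, 4), torch.randn(1024, 4))
    sampler = DistributedSampler(data)               # each rank gets a distinct shard
    loader = DataLoader(data, batch_size=64, sampler=sampler)

    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
    for epoch in range(2):
        sampler.set_epoch(epoch)                     # reshuffle the shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            loss = torch.nn.functional.mse_loss(model(x), y)
            opt.zero_grad()
            loss.backward()                          # DDP all-reduces gradients here
            opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```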
What’s Next?
I hope these insights are helpful, especially for those just starting their journey into deep learning. Moving forward, I plan to delve deeper into the effect of learning rates on model performance. Although my Azure credits are depleted, I'm optimistic about securing more resources to continue this work. In the meantime, I'll be sharing real-time predictions from the model for the EUR/USD 10-minute chart on my Twitter/X and Instagram pages @abiodunaremung.
Stay tuned for more updates, and feel free to connect if you’re interested in discussing AI, machine learning, or transformer models!