Transformers vs. LSTM for Stock Price Time Series Prediction

Overview

So I thought I should continue my discussion of the stock time series problem I began in my first blog post. In that first post, we used LSTM and CNN-LSTM model architectures to predict future stock prices using only previous prices as inputs, and we did fairly well. Our mean absolute percentage error (MAPE) was at or below three percent and things were looking good. I mentioned in that post that I would discuss ways to improve this figure, and initially I had meant to do so by creating an ensemble model from LSTM nets with different hyperparameters and then averaging their predictions to arrive at (hopefully) something better. But I must admit I got a little sidetracked pursuing a different and, I think, more interesting line of inquiry. In one word, that new approach is Transformers.

What is a Transformer Model Architecture?
Before we get into problem specifics I thought a little exposition would be nice. It should be noted, however, that transformers are an advanced machine learning concept that I'm just getting my feet wet in, and therefore I am far from a subject matter expert. That being said, teaching (or trying to) is an integral part of learning, so in the spirit of progress, here it goes.

At a very high level, a transformer is a feed-forward neural network architecture that leverages a "self-attention mechanism" to cut training times while remaining highly predictive. How I have been conceptualizing this is as follows: transformers can do the work of RNNs without all that recurrence slowing things down. It works like this: a transformer's multi-head attention mechanism (implemented by tf.keras.layers.MultiHeadAttention in TensorFlow) allows the model to keep track of every data point relevant to a specific other data point. Put in the context of our problem, if we're trying to predict tomorrow's stock price, today's price is clearly relevant, but other prices are as well. For instance, this week's price action could be very similar to something that happened three months ago, and therefore it is beneficial for us if our model can "remember" that price action. This need for "memory" was one reason we chose to use recurrent neural networks and LSTM (long short-term memory) models in the first place. By carrying a hidden state forward from one time step to the next, RNNs can weight previous values, giving us this memory effect. What makes transformers different is that their self-attention mechanism gives them this same "memory" effect without the recurrence: the whole sequence is fed in at once, and self-attention scores how relevant every other data point is at each step (say, it weights the price action from three months ago that was similar to what we're currently seeing). Because the sequence is processed in parallel rather than one step at a time, transformers typically train faster than RNNs, and the transformer built in this post also needs far fewer parameters than our LSTM, a win-win. For a more cogent and expert explanation of transformers I'll link the original paper describing them at the bottom of this post. But for now, on to our problem.
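To make the "weighting relevant data points" idea concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is not the Keras implementation: it assumes identity query, key, and value projections (a real layer learns these), uses a single head, and the `toy` sequence is made up for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(sequence):
    """Scaled dot-product self-attention with identity projections.

    `sequence` is a list of equal-length vectors, one per time step.
    A real layer would first map each vector through learned query, key,
    and value matrices; using the raw vectors keeps the sketch short.
    Returns (outputs, weights); weights[i][j] is how much step i attends
    to step j.
    """
    dim = len(sequence[0])
    outputs, weights = [], []
    for query in sequence:
        # similarity of this step to every step, scaled by sqrt(dim)
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in sequence]
        row = softmax(scores)
        weights.append(row)
        # output is the attention-weighted average of all steps
        outputs.append([sum(w * value[d] for w, value in zip(row, sequence))
                        for d in range(dim)])
    return outputs, weights

# Made-up "price action" features: steps 0 and 2 look alike, step 1 does not.
toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(toy)
```

In this toy case step 0 puts more weight on step 2 (which looks like it) than on step 1, and every step is scored against every other step in a single pass, with no recurrence involved.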

The Stock Time Series Problem
The question this investigation is trying to answer can be formulated simply as, "Can we predict stock prices over the next five days, using nothing but previous stock prices?" As I alluded to above, we have already partially answered this question with a tentative yes. However, the models we've built, while surprisingly adept, are far from perfect or even particularly useful: being 3.5% off on average is not enough to feel confident trading on. Part of this is because the problem is simply hard; the data is noisy and very nearly random. They don't call the stock market a random walk for nothing, it turns out. But we are undeterred, so we are going to plow ahead and tick off the following steps in pursuit of a better model:

Establish a reliable baseline LSTM model to evaluate our transformer's performance against.
Build a transformer model.
Train and evaluate our models.
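Before building anything, it may help to see what "five days in, five days out" looks like as data. The following is a minimal sketch of slicing a price series into training pairs; the function name `make_windows` and the placeholder prices are my own, and whether the data-preparation code in the linked notebook works exactly this way is an assumption. The window sizes match the `n_timesteps`, `n_features`, and `n_outputs` values used in the model code.

```python
def make_windows(prices, n_timesteps=5, n_outputs=5):
    """Slice a price series into (input window, target window) pairs."""
    X, y = [], []
    last_start = len(prices) - n_timesteps - n_outputs
    for start in range(last_start + 1):
        split = start + n_timesteps
        # one feature (the price) per time step -> shape (n_timesteps, 1)
        X.append([[p] for p in prices[start:split]])
        # the next n_outputs prices are the prediction target
        y.append(list(prices[split:split + n_outputs]))
    return X, y

# 12 placeholder prices for illustration only, not real market data
prices = [float(p) for p in range(100, 112)]
X, y = make_windows(prices)
```

Each input is a 5x1 window of past prices and each target is the 5 prices that follow it, so a series of 12 prices yields 3 training pairs.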

Part One: The LSTM
Let's build our baseline model. From earlier experimentation, I've found that three layers work very well for this problem: an LSTM layer with 200 neurons followed by two dense layers, one with 50 neurons and an output layer with one neuron per forecast day (5 in the code below). Here's the code to implement this:

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM

def build_lstm(etl: "ETL", epochs=25, batch_size=32) -> tuple[tf.keras.Model, tf.keras.callbacks.History]:
  """
  Builds, compiles, and fits our LSTM baseline model.
  """
  # ETL is the data-preparation class (not shown here); it must expose
  # X_train, y_train, X_test, and y_test.
  # 5 past prices in, 1 feature (the price) per step, 5 future prices out
  n_timesteps, n_features, n_outputs = 5, 1, 5
  # stop if validation loss stalls for 10 epochs and keep the best weights
  callbacks = [tf.keras.callbacks.EarlyStopping(patience=10,
                                                restore_best_weights=True)]
  model = Sequential()
  model.add(LSTM(200, activation='relu',
                 input_shape=(n_timesteps, n_features)))
  model.add(Dense(50, activation='relu'))
  model.add(Dense(n_outputs))
  print('compiling baseline model...')
  model.compile(optimizer='adam', loss='mse', metrics=['mae', 'mape'])
  print('fitting model...')
  # the test split doubles as validation data here
  history = model.fit(etl.X_train, etl.y_train,
                      batch_size=batch_size,
                      epochs=epochs,
                      validation_data=(etl.X_test, etl.y_test),
                      verbose=1,
                      callbacks=callbacks)
  return model, history

And here is the model summary associated with this model once it's compiled:

So this is the baseline we're going to use to evaluate our transformer. We'll discuss its results in the evaluation section, but for now let's talk about our transformer architecture.
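The size of this baseline can be checked by hand. The sketch below uses the standard parameter-count formulas for an LSTM layer and a dense layer, with the layer sizes taken from the code above (200 LSTM units on 1 input feature, then dense layers of 50 and 5 units). The helper names are my own.

```python
def lstm_params(units, n_features):
    """4 gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (n_features + units) + units)

def dense_params(n_inputs, n_units):
    """Weight matrix plus one bias per unit."""
    return n_inputs * n_units + n_units

lstm = lstm_params(units=200, n_features=1)   # 161,600
hidden = dense_params(200, 50)                # 10,050
output = dense_params(50, 5)                  # 255
total = lstm + hidden + output                # 171,905
```

Almost all of the roughly 171,000 parameters sit in the LSTM layer's recurrent weights, which grow with the square of the number of units.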

Part Two: The Transformer
Now, for our transformer architecture we will be using the construction recommended in the Keras documentation (I'll link it at the end of the post). That setup is built for classification, so we'll make a small change: we'll change the final output layer's activation function from softmax to relu, and our loss function to mean squared error. Aside from that, I've set the hyperparameters to what I found to work best for this problem via experimentation. Here's what we get:

import tensorflow as tf
from tensorflow.keras import layers

def transformer_encoder(inputs, head_size, num_heads, ff_dim,
                        dropout=0, attention_axes=None):
  """
  Creates a single transformer block.
  """
  # Attention part: layer norm, self-attention, dropout, residual connection
  x = layers.LayerNormalization(epsilon=1e-6)(inputs)
  x = layers.MultiHeadAttention(
      key_dim=head_size, num_heads=num_heads, dropout=dropout,
      attention_axes=attention_axes
      )(x, x)
  x = layers.Dropout(dropout)(x)
  res = x + inputs

  # Feed Forward Part
  x = layers.LayerNormalization(epsilon=1e-6)(res)
  x = layers.Conv1D(filters=ff_dim, kernel_size=1, activation="relu")(x)
  x = layers.Dropout(dropout)(x)
  x = layers.Conv1D(filters=inputs.shape[-1], kernel_size=1)(x)
  return x + res

def build_transformer(head_size,
                      num_heads,
                      ff_dim,
                      num_trans_blocks,
                      mlp_units, dropout=0, mlp_dropout=0,
                      attention_axes=None) -> tf.keras.Model:
  """
  Creates final model by building many transformer blocks.
  """
  n_timesteps, n_features, n_outputs = 5, 1, 5
  inputs = tf.keras.Input(shape=(n_timesteps, n_features))
  x = inputs
  for _ in range(num_trans_blocks):
    x = transformer_encoder(x, head_size, num_heads, ff_dim, dropout,
                            attention_axes)

  # channels_first averages over the last axis, leaving one value per time step
  x = layers.GlobalAveragePooling1D(data_format="channels_first")(x)
  for dim in mlp_units:
    x = layers.Dense(dim, activation="relu")(x)
    x = layers.Dropout(mlp_dropout)(x)

  # relu instead of softmax: this is regression, not classification
  outputs = layers.Dense(n_outputs, activation='relu')(x)
  return tf.keras.Model(inputs, outputs)

transformer = build_transformer(head_size=128, num_heads=4, ff_dim=2,
                                num_trans_blocks=4, mlp_units=[256],
                                mlp_dropout=0.10, dropout=0.10,
                                attention_axes=1)

transformer.compile(
    loss="mse",
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    metrics=["mae", 'mape'],
)

callbacks = [tf.keras.callbacks.EarlyStopping(patience=10,
                                              restore_best_weights=True)]

# `data` is assumed to be an ETL instance like the one passed to build_lstm
t_hist = transformer.fit(data.X_train, data.y_train, batch_size=32,
                         epochs=25, validation_data=(data.X_test, data.y_test),
                         verbose=1, callbacks=callbacks)

And this code will build a model that looks something like this (repeated num_trans_blocks times).

This is one transformer block; our model stacks four of these on top of one another.

Finally, our transformer model has 17,205 parameters in total, a little over one tenth of our LSTM's. Now let's get training.
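The 17,205 figure can be reproduced by hand from the hyperparameters above. This is a sketch that assumes the default Keras parameterization: biases everywhere, a value dimension equal to `key_dim` in the attention layer, and a pooling step that leaves 5 values (one per time step) for the dense head. The helper names are my own.

```python
def mha_params(input_dim, num_heads, key_dim):
    """Query, key, and value projections plus the output projection, all with biases."""
    proj = input_dim * num_heads * key_dim + num_heads * key_dim
    out = num_heads * key_dim * input_dim + input_dim
    return 3 * proj + out

def block_params(input_dim, num_heads, key_dim, ff_dim):
    """One transformer block: attention, two layer norms, two 1x1 convolutions."""
    layer_norms = 2 * (2 * input_dim)        # gamma and beta, twice
    conv_in = input_dim * ff_dim + ff_dim    # Conv1D with kernel_size=1
    conv_out = ff_dim * input_dim + input_dim
    return (mha_params(input_dim, num_heads, key_dim)
            + layer_norms + conv_in + conv_out)

block = block_params(input_dim=1, num_heads=4, key_dim=128, ff_dim=2)  # 3,596
# Dense(256) on the 5 pooled values, then the 5-unit output layer
head = (5 * 256 + 256) + (256 * 5 + 5)                                 # 2,821
total = 4 * block + head                                               # 17,205
```

Because each time step carries a single feature, the attention projections stay tiny, which is where most of the savings relative to the LSTM come from.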

Part Three: Training and Evaluation
We will train both nets for 25 epochs with an Adam optimizer. First, let's discuss our baseline model's performance on the test data. What I'd like to note first is that the LSTM is remarkably consistent: if you train it 10 consecutive times, it will give you predictions that perform very nearly the same on the test set in all ten cases. It also trains more quickly than I expected, 143 seconds in all. It does well on inference time too, taking only twenty-two seconds. This is all the more impressive since, in this comparison at least, the LSTM is the heavyweight, weighing in at a portly 171,000+ parameters.

A visualization of our LSTM's predictions.

In terms of its predictive capability, we find the LSTM to be skillful (not terribly surprising; it's our baseline for a reason). It scores a MAPE of 2.44% on our test set. All in all, the LSTM is a consistent, easily trained model that is skillful at predicting stock time series data. Its one limitation is that it is large, and thus cannot easily be scaled. However, when it comes to stock prices there really isn't enough data to build extremely deep models; if you did, you'd actually start to lose performance. Thus, I don't think the heavy parameter count really hurts the LSTM very much.
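Since MAPE is the yardstick throughout this comparison, here is a minimal sketch of how it is computed. The numbers are placeholders for illustration, not values from the experiment, and the function assumes strictly positive actual prices (Keras's built-in 'mape' metric additionally guards against division by values near zero).

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    # absolute error of each prediction as a fraction of the true value
    errors = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)

# Placeholder numbers for illustration only
actual = [100.0, 200.0, 400.0]
predicted = [98.0, 204.0, 390.0]
score = mape(actual, predicted)
```

Here the three predictions are off by 2%, 2%, and 2.5%, so the MAPE is their average, a little under 2.2%.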

Here's the transformer's graph. This one is taken from a model with a MAPE of 2.8%; I failed to get a visualization for the golden models.

Next, let's discuss the transformer. The first thing I'd like to note is that the transformer is more unstable than the LSTM. What do I mean by that? I trained the same model (read: same hyperparameters) many, many times. The result? I trained the best model I've built to predict stock prices, coming in with a MAPE of 2.37%. I built another with a MAPE of 2.41%. Both of these, you will note, are better than our baseline, which consistently weighed in around 2.45%-2.5%. But sadly, the transformer was not able to replicate that performance consistently, even with the same hyperparameters as those two golden training runs. What's more, it wasn't as if the transformer was only doing slightly worse, either. There were times its MAPE topped out over 3%. This is a problem if we're trying to build and then retrain models every month or quarter, say. You might have a powerful transformer model, retrain it, and be left with a pile of junk. So in that respect the transformer was not ideal.
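One way to put a number on this kind of instability is to retrain several times and summarize the spread of test MAPEs. The sketch below shows only the bookkeeping: `repeat_runs` and `fake_train_and_score` are hypothetical names, and the stand-in scorer returns made-up values rather than results from this experiment. In practice the callable would seed the framework (for example with tf.keras.utils.set_random_seed), build and fit a fresh model, and return its test MAPE.

```python
import random
import statistics

def repeat_runs(train_and_score, n_runs=10):
    """Call train_and_score(seed) n_runs times and summarize the test MAPEs."""
    scores = [train_and_score(seed) for seed in range(n_runs)]
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
        "worst": max(scores),
    }

def fake_train_and_score(seed):
    """Stand-in for real training: returns a made-up MAPE between 2.3 and 3.1."""
    return random.Random(seed).uniform(2.3, 3.1)

summary = repeat_runs(fake_train_and_score, n_runs=10)
```

Comparing the standard deviation and the worst-case score for each architecture says more about retraining risk than a single best run does.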

So, where did the transformer really shine? Parameter count. It weighs in at a little over a tenth of the LSTM's parameter count. This translated into only a marginally faster training run, though: 138 seconds to the LSTM's 143. And, surprisingly, inference was faster on the LSTM: the transformer took 25 seconds in all, against the LSTM's 22.

Finally, in terms of the relative variance of predictions vs. the actual values, in other words how much uncertainty is baked into our predictions, the LSTM edged out the transformer: 2.4% to 2.6%.

Conclusion

First of all, it should be noted that the transformer's performance was really impressive. With only ~17,000 parameters it was able to keep up with an LSTM with over 170,000. That's no mean feat. However, it was unpredictable and unstable in the face of retraining. I don't know if this is a general feature of transformers or a problem with my implementation and usage of them, but it was definitely very noticeable. This may not be an issue if you're training a large language model; English doesn't change so rapidly that you have to put out a new model every couple of weeks. But for predicting stock prices, the ability to retrain as new data flows in, without having to worry that your previously very skillful model is now simply mediocre, becomes very important. For that reason, and because our parameter savings didn't speed up training all that much, for my money I'd still go for the LSTM. Plus, we don't need to scale these models to billions of parameters, the regime where transformers really shine. 1B parameters vs. 10B is a very different choice than 17,000 vs. 170,000.

So while I really enjoyed working with transformers for the first time, I don't think they're terribly well suited to this particular problem. However, while I was experimenting a thought occurred to me which I think I'll explore in a subsequent post: by combining the Transformer and LSTM architectures, can we get the best of both worlds? More reproducibility, lower parameter counts, and better predictions? For that, stay tuned. Thanks for reading!

Links:

The Original Paper on Transformers: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

My Colab Notebook for source code: https://github.com/maym5/lstm_vs_transformer/blob/main/lstm_vs__transformer.ipynb

Keras Transformer docs: https://keras.io/examples/timeseries/timeseries_transformer_classification/
