Using Apple’s Financial News to Predict its Stock Price Movement

Apr 26, 2021

Introduction

My name is Rui Ding, and I’m currently a first-year MSI graduate student at the School of Information, University of Michigan. In this Natural Language Processing course project, I propose a Long Short-Term Memory (LSTM) model that predicts the stock price movement of Apple Inc. from related financial news. The project offers insight into how financial news relates to stock prices, which gives investors a unique perspective for making decisions. The model achieved 60.34% accuracy on the test dataset, better than a random classifier (49.23%) and a logistic regression baseline (56.97%). The best result so far was produced by Ding X. et al. (2015), who reported 65.08% accuracy on their test dataset. My model is not competitive with that result, but it still beats random guessing and the baseline.

Data

Financial news related to Apple Inc. was scraped from the official websites of CNBC, the Guardian, and Reuters. The dataset contains the headlines and contents of 20,231 news reports published between July 19, 2012 and January 27, 2020. Sample data is shown in Table 1.

The stock price data of Apple Inc. is available on Yahoo Finance. We use the same date range, July 19, 2012 to January 27, 2020, to match the financial news dataset. Shifting the adjusted close price by one day and taking the difference gives the rise or fall of the stock price for each day relative to the previous trading day. These values are then converted into binary labels (1 for a rise or no change, 0 for a fall), as shown in Table 2.

import pandas as pd

stock_price_df = pd.read_csv('AAPL_stock_price.csv')
# Day-over-day change in adjusted close; the first row has no previous day.
labels = stock_price_df['Adj Close'] - stock_price_df['Adj Close'].shift(1)
labels = labels.dropna()
# 1 = price rose or stayed flat, 0 = price fell.
labels[labels >= 0] = 1
labels[labels < 0] = 0
stock_price_df['labels'] = labels
stock_price_df = stock_price_df[['Date', 'labels']]
stock_price_df
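
To make the labeling rule concrete, here is a minimal sketch in plain Python. The prices are made up for illustration, and make_labels is a hypothetical helper, not part of the original notebook.

```python
def make_labels(adj_close):
    """Return 1 if the price rose or stayed flat vs. the previous day, else 0.

    The first day has no previous day, so it gets no label.
    """
    return [1 if today - prev >= 0 else 0
            for prev, today in zip(adj_close, adj_close[1:])]


# Made-up adjusted close prices for five consecutive trading days.
prices = [100.0, 101.5, 101.0, 101.0, 103.2]
labels = make_labels(prices)
print(labels)  # [1, 0, 1, 1]
```

Five prices give four labels, which is why the first row is dropped in the pandas version.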

We merged these two tables on the date to form the final dataset. Because the stock market only trades on weekdays, 17,578 news items remained after merging. Each news item is paired with the label of the trading day on which it was published, as shown in Table 3.

merge_df = pd.merge(stock_price_df, apple_df, on='Date', how='inner').dropna().set_index('Date')
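
The effect of the inner join can be sketched in plain Python: only news whose date also has a stock label survives, so weekend and holiday news is dropped. The rows and label values below are made up, and inner_join_on_date is a hypothetical helper used only for illustration.

```python
def inner_join_on_date(labels_by_date, news_rows):
    """Keep only news rows whose date has a stock label (inner join on date)."""
    return [(date, labels_by_date[date], headline)
            for date, headline in news_rows
            if date in labels_by_date]


# Made-up example: 2020-01-25 was a Saturday, so it has no label.
labels_by_date = {"2020-01-24": 1, "2020-01-27": 0}
news_rows = [
    ("2020-01-24", "headline A"),
    ("2020-01-25", "headline B"),
    ("2020-01-27", "headline C"),
    ("2020-01-27", "headline D"),
]
merged = inner_join_on_date(labels_by_date, news_rows)
print(len(merged))  # 3
```

Several news items can share one trading day, and each of them receives that day's label.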

Baseline

Before introducing our models, two baseline methods are carried out to serve as a reference.

Random Classifier

Random labels were generated as predictions of the stock movement on each day. We ran the random classifier three times and averaged the results. The performance is shown in Table 4.
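
A random baseline of this kind can be sketched as follows. This is a minimal illustration assuming uniform 0/1 guesses; the labels are made up, and random_baseline_accuracy is a hypothetical name rather than the code used in the project.

```python
import random


def random_baseline_accuracy(y_true, seed):
    """Accuracy of uniformly random 0/1 guesses against the true labels."""
    rng = random.Random(seed)
    guesses = [rng.randint(0, 1) for _ in y_true]
    correct = sum(g == t for g, t in zip(guesses, y_true))
    return correct / len(y_true)


# Made-up labels; random guessing lands near 50% for any label sequence.
y_true = [1, 0, 1, 1, 0] * 2000
runs = [random_baseline_accuracy(y_true, seed) for seed in (0, 1, 2)]
print(sum(runs) / len(runs))
```

Averaging several runs reduces the noise of a single random draw, which matters more on a small test set.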

Logistic Regression

The second baseline is logistic regression without any feature extraction. The corpus is simply converted into a word index representation and fed to a logistic regression model to predict the labels.

Methods

Since a stock price is a classic time series, the price on a given day can be influenced by factors from previous days, weeks, or even months. Our model therefore needs to capture past features for future prediction. Recurrent Neural Networks (RNNs) are designed to “remember” features from the past, so Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), and other RNN models may be a good choice for training on text.

RNN, LSTM, and GRU model architecture diagram

LSTM With Sequence Encoding

Sequence encoding is one of the most common ways to preprocess a text corpus in NLP. For each sample, we first tokenize the news and then transform it into a sequence of word indices using Tokenizer. The resulting sequences have different lengths, so we set the maxlen argument of pad_sequences to 5000 to give every sequence the same length. For the labels, we use the to_categorical function to convert them into one-hot encoding. The output of the RNN layer is passed to a dense layer. The last stage is a fully connected layer with softmax activation, whose output is the probability distribution over the labels.

# Imports assume the Keras 2.x API that was current when this was written.
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.utils import np_utils

# max_features, maxlen and nb_classes are hyper-parameters defined elsewhere.
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(train_contents_list)
sequences_train = tokenizer.texts_to_sequences(train_contents_list)
sequences_test = tokenizer.texts_to_sequences(test_contents_list)
X_train = sequence.pad_sequences(sequences_train, maxlen=maxlen)
X_test = sequence.pad_sequences(sequences_test, maxlen=maxlen)
Y_train = np_utils.to_categorical(y_train, nb_classes)
Y_test = np_utils.to_categorical(y_test, nb_classes)
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)
print('Y_train shape:', Y_train.shape)
print('Y_test shape:', Y_test.shape)

X_train shape: (1500, 5000)
X_test shape: (290, 5000)
Y_train shape: (1500, 2)
Y_test shape: (290, 2)
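
What these preprocessing steps do can be sketched in plain Python. This is a simplified illustration, not the Keras implementation: it assumes frequent words get small indices, index 0 is reserved for padding, and sequences are padded or truncated on the left. All function names here are hypothetical.

```python
from collections import Counter


def build_word_index(texts, num_words):
    """Map the most frequent words to indices 1..num_words-1 (0 is padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i + 1
            for i, (word, _) in enumerate(counts.most_common(num_words - 1))}


def texts_to_sequences(texts, word_index):
    """Replace each known word with its index; unknown words are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]


def pad(seq, maxlen):
    """Left-pad with zeros, or keep only the last maxlen items."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq


def one_hot(label, nb_classes):
    """Turn a class id into a vector with a single 1.0 at that position."""
    return [1.0 if i == label else 0.0 for i in range(nb_classes)]


texts = ["apple shares rise", "apple shares fall after apple event"]
word_index = build_word_index(texts, num_words=100)
X = [pad(s, maxlen=5) for s in texts_to_sequences(texts, word_index)]
Y = [one_hot(label, 2) for label in (1, 0)]
print(X)
print(Y)
```

After padding, every row has the same length, which is what lets the 1500 training samples be stacked into a single (1500, 5000) array.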

Since an LSTM takes float tensors as input, we use an Embedding layer to turn the positive integers (word indices) into dense vectors of fixed size. The output is then fed into the LSTM layer, which is connected to a dense layer.

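from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout

model = Sequential()
# Turn each word index into a 128-dimensional dense vector.
model.add(Embedding(max_features, 128))
model.add(LSTM(128, dropout=0.5))
# model.add(Embedding(max_features, 256))
# model.add(LSTM(256, dropout=0.25))
model.add(Dense(32, activation="relu"))
model.add(Dropout(0.5))
# Two output units: probability of a fall (0) and of a rise (1).
model.add(Dense(2, activation="softmax"))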
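
Conceptually, an embedding layer is a trainable lookup table from word index to vector. The sketch below shows only the lookup step in plain Python, with small random vectors standing in for learned weights; the names and sizes are hypothetical.

```python
import random


def make_embedding_table(vocab_size, dim, seed=0):
    """One small random vector per word index; training would adjust these."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(vocab_size)]


def embed(sequence, table):
    """Look up the vector for every index in the padded sequence."""
    return [table[idx] for idx in sequence]


table = make_embedding_table(vocab_size=10, dim=4)
vectors = embed([0, 0, 1, 2, 3], table)
print(len(vectors), len(vectors[0]))  # 5 4
```

A padded sequence of 5 indices becomes 5 vectors of 4 floats each, which is the kind of float input the LSTM layer expects.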

After 10 epochs of training, the model got 60.34% accuracy on the test data.

score, acc = model.evaluate(X_test, Y_test, batch_size=batch_size)
print('Test accuracy:', acc)

10/10 [==============================] - 1s 91ms/step - loss: 1.2400 - accuracy: 0.6034
Test accuracy: 0.6034482717514038

I also tried removing the dense layer to see how a simplified model performs. That model got 57.59% accuracy on the test data, which is worse than the one with the dense layer.

score, acc = model.evaluate(X_test, Y_test, batch_size=batch_size)
print('Test accuracy:', acc)

10/10 [==============================] - 1s 90ms/step - loss: 1.1231 - accuracy: 0.5759
Test accuracy: 0.5758620500564575

We also tried GRU models with and without the additional dense layer. They did not perform as well as the LSTM model.

LSTM With word2vec Embedding

Although the sequence encoding method gives a decent result, we also tried pre-trained word2vec embeddings to see whether they improve the model’s performance, since pre-trained word2vec captures the general semantics of words.

Here we used the pre-trained word2vec embedding file GoogleNews-vectors-negative300-SLIM.bin. This slim model file contains 299,567 words, each represented by a 300-dimensional vector.

import gensim
w2v_mod = gensim.models.KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300-SLIM.bin", binary=True)

Then we average the word embeddings in each news item to get its news embedding.

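import numpy as np

embedding_array = np.zeros((len(train_contents_list), 300))
for i in range(len(train_contents_list)):
    sentence_array = np.zeros(300)
    word_count = 0
    tokens = train_contents_list[i].split()
    for word in tokens:
        try:
            sentence_array += w2v_mod[word]
            word_count += 1
        except KeyError:
            # Skip words that are not in the word2vec vocabulary.
            continue
    # Guard against news with no in-vocabulary words (avoids dividing by zero).
    if word_count > 0:
        sentence_array /= word_count
    embedding_array[i] = sentence_array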
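
The averaging step can be checked by hand on a toy example. The sketch below uses a plain dictionary of made-up 2-dimensional vectors in place of the 300-dimensional word2vec model; average_embedding is a hypothetical helper.

```python
def average_embedding(text, vectors, dim):
    """Mean of the vectors of all in-vocabulary words; zeros if none match."""
    total = [0.0] * dim
    count = 0
    for word in text.split():
        if word in vectors:
            total = [t + v for t, v in zip(total, vectors[word])]
            count += 1
    if count == 0:
        return total
    return [t / count for t in total]


# Made-up 2-dimensional vectors standing in for the 300-dimensional ones.
toy_vectors = {"apple": [1.0, 0.0], "rises": [0.0, 1.0]}
print(average_embedding("apple stock rises", toy_vectors, dim=2))   # [0.5, 0.5]
print(average_embedding("unknown words only", toy_vectors, dim=2))  # [0.0, 0.0]
```

Out-of-vocabulary words such as "stock" in this toy vocabulary do not count toward the average, and a text with no known words maps to the zero vector.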

An LSTM takes a 3D input of shape (num_samples, num_time_steps, num_features), so I created a helper function, create_dataset, to reshape the input.

def create_dataset(X, y, time_steps=1):
    Xs, ys = [], []
    for i in range(len(X) - time_steps):
        # Window of time_steps consecutive days, paired with the next day's label.
        v = X[i:i + time_steps, :]
        Xs.append(v)
        ys.append(y[i + time_steps])
    return np.array(Xs), np.array(ys)
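
The windowing logic is easier to follow on a tiny input. This sketch reimplements the same idea with plain Python lists and made-up data; make_windows is a hypothetical stand-in for create_dataset.

```python
def make_windows(X, y, time_steps):
    """Pair each run of time_steps consecutive days with the next day's label."""
    Xs, ys = [], []
    for i in range(len(X) - time_steps):
        Xs.append(X[i:i + time_steps])
        ys.append(y[i + time_steps])
    return Xs, ys


# Ten made-up days, each with a 1-dimensional "news embedding".
X = [[float(day)] for day in range(10)]
y = [day % 2 for day in range(10)]
Xs, ys = make_windows(X, y, time_steps=7)
print(len(Xs), len(Xs[0]), len(Xs[0][0]))  # 3 7 1
print(ys)  # [1, 0, 1]
```

Each dataset loses time_steps samples this way, which is why 1500 training rows become 1493 windows when time_steps is 7.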

At first, I set time_steps = 7, which means the model makes each prediction from the previous 7 days of data.

TIME_STEPS = 7
X_train, Y_train = create_dataset(X_train, Y_train, TIME_STEPS)
X_test, Y_test = create_dataset(X_test, Y_test, TIME_STEPS)
print('X_train.shape: ', X_train.shape)
print('Y_train.shape: ', Y_train.shape)
print('X_test.shape: ', X_test.shape)
print('Y_test.shape: ', Y_test.shape)

X_train.shape:  (1493, 7, 300)
Y_train.shape:  (1493, 2)
X_test.shape:  (283, 7, 300)
Y_test.shape:  (283, 2)

The model stacks two LSTM layers, each followed by dropout, and connects them to a dense softmax output layer.

model = Sequential()
model.add(LSTM(128, return_sequences = True))
model.add(Dropout(0.2))
model.add(LSTM(128))
model.add(Dropout(0.2))
# model.add(Dense(32))
# model.add(Dropout(0.2))
model.add(Dense(2, activation="softmax"))

Finally the model got 57.60% accuracy on the test data set.

score, acc = model.evaluate(X_test, Y_test, batch_size=batch_size)
print('Test accuracy:', acc)

36/36 [==============================] - 1s 2ms/step - loss: 0.6873 - accuracy: 0.5760
Test accuracy: 0.5759717226028442

I then changed time_steps to 30 to see whether a longer window would improve the model. This time the model got 58.46% accuracy on the test data set, which is better than the 7-day model.

score, acc = model.evaluate(X_test, Y_test, batch_size=batch_size)
print('Test accuracy:', acc)

33/33 [==============================] - 1s 3ms/step - loss: 0.6866 - accuracy: 0.5846
Test accuracy: 0.5846154093742371

Evaluation & Results

This project uses accuracy on the test set and F1 score to evaluate the different models and choose the best one, as most researchers in this field do. Accuracy is the percentage of predicted labels that match the true labels. The F1 score is the harmonic mean of precision and recall and gives an overall performance score for a candidate model.
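
Both metrics can be computed directly from predictions. The sketch below is a minimal plain-Python version that treats label 1 (a rise) as the positive class; the labels and predictions are made up and are not results from this project.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1


# Made-up labels and predictions, not results from this project.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1, 0, 1]
print(binary_metrics(y_true, y_pred))
```

In this toy case there are 4 true positives, 1 false positive, and 1 false negative, so precision and recall are both 4/5 while accuracy is 6/8.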

For each type of model, we present only the best result achieved on the test data. The results are shown in the same table as the baseline methods for comparison.

First, all the models using recurrent neural networks perform better than logistic regression.

In the sequence encoding method, we compare two pairs of model structures:

Model with an additional dense layer vs. No dense layer
RNN using LSTM Layer vs. RNN using GRU Layer

The model with an additional dense layer performs better than the one without it on both accuracy and F1 score, for both LSTM and GRU. The LSTM model performs better than the GRU model on both accuracy and F1 score, with or without the dense layer.

In the word2vec method, we compare a model that learns from the previous 7 days with one that learns from the previous 30 days. The latter scored higher on both accuracy and F1 score.

In short, the best model in terms of accuracy is the sequence encoding LSTM model, which achieved 60.34% accuracy on the test data. The best model in terms of F1 score is the word2vec LSTM model with a 30-day time step, which achieved 0.7379 on the test data. Since accuracy matters more in stock price prediction, we consider the former the better model.
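
One reason to weigh accuracy alongside F1: when rising days are the majority, a trivial classifier that always predicts a rise has perfect recall and therefore a high F1 score, even though its accuracy is only the share of rising days. The sketch below works through that arithmetic. The 58% share is a hypothetical number chosen for illustration, not a measurement from this dataset, and always_up_scores is a hypothetical helper.

```python
def always_up_scores(share_up):
    """Accuracy and F1 of a classifier that predicts a rise (1) every day."""
    accuracy = share_up   # correct exactly on the rising days
    precision = share_up  # every prediction is 1; share_up of them are right
    recall = 1.0          # no rising day is ever missed
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1


accuracy, f1 = always_up_scores(0.58)  # 0.58 is a hypothetical share
print(round(accuracy, 4), round(f1, 4))  # 0.58 0.7342
```

A high F1 score on its own therefore does not show that a model has learned anything from the news, which is worth checking before comparing models on F1.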

Discussion

From the results shown above, we can see:

The random classifier performs poorly. Since it did not learn anything about the corpus, both its accuracy and its F1 score are close to 0.5.

Logistic regression trained on the title corpus and the content corpus scored higher than the random classifier. Its scores are a little above 0.5, which suggests the models learned some useful features through training. However, logistic regression did not achieve very high performance, because it treats the stock price movement of each day as an independent event, which is counterintuitive. A stock price is a typical time series: the price on a given day can be influenced by factors from the previous days, weeks, or even months. Our model therefore needs to capture past features for future prediction.

Both of our recurrent neural network methods perform better than logistic regression. Interestingly, the sequence encoding model performs better than the word2vec model. I did not expect this, because I thought the word2vec representation would improve performance. The relatively small data size of this project may be the cause.

Across all the models and hyper-parameters I tried, even the best-performing model is still not close to the top level in this field. In my view, the reasons may be as follows:

No high-level features are extracted. Not every word in the financial news is a useful feature. Economically relevant features need to be extracted for training.
The data size is too small for deep learning. Although I tried my best to collect news spanning 8 years from 2012 to 2020, the data size is still too small to train a strong deep learning model.
No sentiment-analysis. Sentiment in financial news is closely related to the rise and fall of stock prices. A positive piece of news usually means a rise in stocks in the days ahead, and vice versa.

What’s Next

Although our model got a decent result, it still has a lot of room for improvement. As mentioned above, the following future work could improve its performance:

Extract high-level features using convolutional and max-pooling layers. As mentioned in the literature, CNNs perform better than RNNs at detecting the latent dynamics of the stock market.
Collect more data. As easy as it sounds, this is actually the most difficult part of stock prediction, since there are only about 255 trading days in each year. Data augmentation methods might be applicable here.
Include sentiment analysis information. Sentiment in financial news is closely related to the rise and fall of stock prices. A positive piece of news usually means a rise in stocks in the days ahead, and vice versa. Sentiment analysis of media may provide additional useful information.