Future stock prices prediction based on the historical data using simplified linear regression

Posted on Thu 06 October 2016 in data analysis

In this post I want to give a simplified explanation of what the linear regression model is and how to apply it to data prediction using Python and some open Python libraries (including scikit-learn).

Supervised learning is one of the major categories of Machine Learning algorithms. "Supervised" means we already have a dataset in which the "correct answers" are given. For example, we have stock data with open and close values for the past few years, and we want to predict future values (prices or indexes). Supervised learning is subdivided into regression problems and classification problems. A regression problem means we are trying to predict a continuous-valued output (like a stock value).

This post describes a Machine Learning project that tries to predict stock data using the linear regression algorithm. Linear regression is the most basic and most commonly used form of predictive analysis. All input data are put into the matrix X, where each column represents one data example, and the "answers" are put into the vector y. Then we can advance the hypothesis:

\begin{equation} h_{\theta}(X) = \theta^T X, \end{equation}

where $X$ is a $(n+1) \times m$ matrix, $\theta$ is the vector of feature weights (parameters) of length $(n+1)$, $m$ is the number of examples, and $n$ is the number of features (parameters).

Since our goal is to train the model as close as possible to real values, we should minimize the cost function $J(\theta)$:

\begin{equation} J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(X^{(i)}) - y^{(i)})^2. \end{equation}
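
To make the notation concrete, here is a minimal sketch (not part of the original project) of how this cost function could be computed with numpy, assuming X is stored with one example per column as described above:

import numpy as np

def cost(theta, X, y):
    """Compute J(theta) for linear regression.

    theta -- vector of length n+1 (feature weights)
    X     -- (n+1) x m matrix, one training example per column
    y     -- vector of length m with the "correct answers"
    """
    m = X.shape[1]
    errors = np.dot(theta, X) - y            # h_theta for every example, minus the answers
    return np.dot(errors, errors) / (2 * m)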

This is a simplified explanation of the linear regression model. Now we can test it on real data.

Please pay attention! This project is intended for studying the application of linear regression. If you want to use this project in real life, all responsibility lies with you! The author doesn't take any responsibility for the results of using this project for real-life stock predictions.

I assume everybody knows what time-dependent stock data looks like. Here is an example using Facebook stock data:

In [16]:
X = pd.read_csv(os.path.join(settings.PROCESSED_DIR, settings.PROCESSED_X), header=0)
y = pd.read_csv(os.path.join(settings.PROCESSED_DIR, settings.PROCESSED_Y), header=0)
X['Date'] = pd.to_datetime(X['Date'], format='%Y-%m-%d')
plt.xticks(rotation=45)
plt.plot_date(X['Date'], y, fmt='b-', xdate=True, ydate=False)
plt.ylabel('Close prices')
plt.title('Historical chart', y=1.1)
plt.suptitle('Facebook Inc(NASDAQ:FB)', y=0.97)
plt.grid()
plt.show()
date = X.loc[:, ['Date']]
X['Date2num'] = X['Date'].apply(lambda x: mdates.date2num(x))
del X['Date']

The historical stock data is available from Google Finance.

The following libraries are required for the project:
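
Judging from the code used throughout the post, the imports are presumably along these lines (settings is the project's own module holding the paths of the processed CSV files):

import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from sklearn.linear_model import LinearRegression

import settings  # project-specific module with the paths of the processed CSV files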

To train the model I split the data into two sets, one for training and one for testing. Usually, 30% of the data is used for the test set and 70% for the training set.

In [17]:
# m - number of examples
# n - number of features
m, n = X.shape

# test set is 30%, train set is 70%
# (cast the split boundary to int so it can be used as an index label)
split = int(np.floor(m*0.3))

X_test = X.loc[:split]
X_train = X.loc[split+1:]

y_test = y.loc[:split]
y_train = y.loc[split+1:]

date_test = date.loc[:split]
date_train = date.loc[split+1:]

# Create linear regression object
lr = LinearRegression()

# Train the model using the training sets
lr.fit(X_train, y_train)

# The coefficients
print('Coefficients: \n', lr.coef_)
# The mean square error
print("Residual sum of squares: %.2f"
      % np.mean((lr.predict(X_test) - y_test) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % lr.score(X_test, y_test))

# Plot outputs
plt.xticks(rotation=45)
plt.plot_date(date_test, y_test, fmt='b-', xdate=True, ydate=False, label='Real value')
plt.plot_date(date_test, lr.predict(X_test), fmt='r-', xdate=True, ydate=False, label='Predicted value')
plt.legend(loc='upper center')
plt.ylabel('Close prices')
plt.title('Facebook Inc(NASDAQ:FB)')
plt.grid()
Coefficients: 
 [[ 0.06672737]]
Residual sum of squares: 32.99
Variance score: 0.79

As you can see, the residual sum of squares on the test examples is fairly large, but the variance score (the R² coefficient of determination returned by lr.score) looks good, so the model fits the data reasonably well. The model looks even better when plotted over the whole dataset:

In [19]:
plt.xticks(rotation=45)
plt.plot_date(date, y, fmt='b-', xdate=True, ydate=False, label='Real value')
plt.plot_date(date, lr.predict(X), fmt='r-', xdate=True, ydate=False, label='Predicted value')
plt.legend(loc='upper center')
plt.ylabel('Close prices')
plt.title('Facebook Inc(NASDAQ:FB)')
plt.grid()

This simplified model might already be useful for long-term predictions.

For short-term predictions I need more accuracy. What can I do to fit the model better? For a start, I add the open prices to the matrix X.
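
The cell that rebuilds X and redoes the train/test split with the new feature is not shown; a minimal sketch of what it could look like (assuming the processed CSV already contains an Open column, as the later cells suggest) is:

# Hypothetical sketch, not part of the original notebook:
# keep the open price together with the numeric date and redo the same 30/70 split.
X = X.loc[:, ['Open', 'Date2num']]

m = X.shape[0]
split = int(np.floor(m*0.3))

X_test = X.loc[:split]
X_train = X.loc[split+1:]
y_test = y.loc[:split]
y_train = y.loc[split+1:]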

In [20]:
# Create linear regression object
lr = LinearRegression()

# Train the model using the training sets
lr.fit(X_train, y_train)

# The coefficients
print('Coefficients: \n', lr.coef_)
# The mean square error
print("Residual sum of squares: %.2f"
      % np.mean((lr.predict(X_test) - y_test) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % lr.score(X_test, y_test))

# Plot outputs
plt.xticks(rotation=45)
plt.plot_date(date_test, y_test, fmt='b-', xdate=True, ydate=False, label='Real value')
plt.plot_date(date_test, lr.predict(X_test), fmt='r-', xdate=True, ydate=False, label='Predicted value')
plt.legend(loc='upper center')
plt.ylabel('Close prices')
plt.title('Facebook Inc(NASDAQ:FB)')
plt.grid()
Coefficients: 
 [[  9.86449898e-01   9.84514636e-04]]
Residual sum of squares: 1.98
Variance score: 0.99

So, the residual sum of squares looks much better, and so does the variance score. Actually, the variance score looks great, almost ideal (the best possible score is 1.0).

Just to satisfy my curiosity, I added more features, such as the open and close values of the previous day, the open and close values of two days before, and so on.
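
The preprocessing that builds these lagged columns is not shown in the post; a possible sketch using pandas shift() is below. The column names match those used in the next cell, but the Close column and the sort order of the rows are assumptions.

# Hypothetical sketch: derive lagged open/close features from raw 'Open' and 'Close' columns.
# Assumes the rows are sorted oldest-first; use shift(-lag) if they are sorted newest-first.
for lag in range(1, 6):
    suffix = 'Prev' if lag == 1 else 'Prev%d' % lag
    X['Open' + suffix] = X['Open'].shift(lag)
    X['Close' + suffix] = X['Close'].shift(lag)

# the first rows now contain NaN values and have to be dropped
X = X.dropna().reset_index(drop=True)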

In [27]:
def read():
    X = pd.read_csv(os.path.join(settings.PROCESSED_DIR, settings.PROCESSED_X), header=0)
    y = pd.read_csv(os.path.join(settings.PROCESSED_DIR, settings.PROCESSED_Y), header=0)
    X['Date'] = pd.to_datetime(X['Date'], format='%Y-%m-%d')
    return X, y


def modify(X, columns):
    columns.append('Date2num')
    returnX = X.loc[:, columns]
    return returnX


def predict(X, date):
    # note: y is taken from the enclosing (notebook) scope
    m, n = X.shape

    # same 30/70 split as before (cast the boundary to int so it can be used as a label)
    split = int(np.floor(m*0.3))

    X_test = X.loc[:split]
    X_train = X.loc[split+1:]

    y_test = y.loc[:split]
    y_train = y.loc[split+1:]

    date_test = date.loc[:split]
    date_train = date.loc[split+1:]

    # Create linear regression object
    lr = LinearRegression()

    # Train the model using the training sets
    lr.fit(X_train, y_train)

    # The mean square error
    print("Number of variables: %d; Residual sum of squares: %.2f; Variance score: %.2f"
          % (n, np.mean((lr.predict(X_test) - y_test) ** 2), lr.score(X_test, y_test)))
    

X, y = read()
X['Date2num'] = X['Date'].apply(lambda x: mdates.date2num(x))
date = X.loc[:, ['Date']]
del X['Date']

# predict with one variable - time
X1 = modify(X, [])
predict(X1, date)

# predict with two variables - time and open price the same day
predict(modify(X, ['Open']), date)

predict(modify(X, ['Open','OpenPrev','ClosePrev']), date)

predict(modify(X, ['Open','OpenPrev','ClosePrev','OpenPrev2','ClosePrev2']), date)

predict(modify(X, ['Open','OpenPrev','ClosePrev','OpenPrev2','ClosePrev2',
                   'OpenPrev3','ClosePrev3']), date)

predict(modify(X, ['Open','OpenPrev','ClosePrev','OpenPrev2','ClosePrev2',
                   'OpenPrev3','ClosePrev3','OpenPrev4','ClosePrev4']), date)

predict(modify(X, ['Open','OpenPrev','ClosePrev','OpenPrev2','ClosePrev2',
                   'OpenPrev3','ClosePrev3','OpenPrev4','ClosePrev4','OpenPrev5','ClosePrev5']), date)
Number of variables: 1; Residual sum of squares: 31.02; Variance score: 0.80
Number of variables: 2; Residual sum of squares: 1.99; Variance score: 0.99
Number of variables: 4; Residual sum of squares: 1.94; Variance score: 0.99
Number of variables: 6; Residual sum of squares: 1.93; Variance score: 0.99
Number of variables: 8; Residual sum of squares: 1.95; Variance score: 0.99
Number of variables: 10; Residual sum of squares: 1.96; Variance score: 0.99
Number of variables: 12; Residual sum of squares: 1.97; Variance score: 0.99

So, the variance score is close to its best value in every case, and the residual sums of squares are similar to each other. As you can see, the case with 6 variables has the lowest sum of squares. Therefore, we can empirically choose the best number of variables and use it for future predictions.

You may download the project from GitHub and try it yourself.