Introduction to QuantConnect
This is written for those who have some knowledge of Python and basic finance terms. If anything is unclear, ask Claude 3.5 Sonnet or ask in #general.
QuantConnect is the platform where we run our algorithms. You can use C#, but we'll be using Python for the sake of simplicity. Opening an account is pretty easy, but you'll have to link a credit card. Don't worry about it too much: you won't be charged unless you explicitly sign up for paid tiers.
Diving straight in, you will be greeted with this code when you start a blank QuantConnect file:
    # region imports
    from AlgorithmImports import *
    # endregion

    class HipsterApricotHyena(QCAlgorithm):

        def initialize(self):
            self.set_start_date(2023, 7, 13)
            self.set_cash(100000)
            self.add_equity("SPY", Resolution.MINUTE)
            self.add_equity("BND", Resolution.MINUTE)
            self.add_equity("AAPL", Resolution.MINUTE)

        def on_data(self, data: Slice):
            if not self.portfolio.invested:
                self.set_holdings("SPY", 0.33)
                self.set_holdings("BND", 0.33)
                self.set_holdings("AAPL", 0.33)
There are three functions you need to care about: initialize(), on_data(), and train().
- initialize() is where you declare all your details: starting cash, symbols, test period, train period, etc. More settings available here.
- on_data() is the code that gets looped every unit of time (e.g. a day) for the test period. This is where you want to apply your strategy.
- train() is a helper we write ourselves and call from initialize(), so it runs before on_data() and trains your model. For larger models (e.g. RNNs) you might want to pretrain it somewhere else and call it directly in on_data(), but that comes later.
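To make the order of calls concrete, here is a toy sketch in plain Python. ToyAlgorithm and run_backtest are made-up stand-ins (they are not QuantConnect classes); on the real platform the engine plays the role of run_backtest for you.

```python
class ToyAlgorithm:
    """Hypothetical stand-in for a QCAlgorithm subclass (not the real base class)."""

    def __init__(self):
        self.calls = []   # records the order in which the methods run
        self.model = None

    def initialize(self):
        self.calls.append("initialize")
        # train once, up front, before any data arrives
        self.model = self.train(history=[1, 2, 3])

    def train(self, history):
        self.calls.append("train")
        return {"trained_on": len(history)}  # placeholder "model"

    def on_data(self, data):
        self.calls.append(f"on_data({data})")


def run_backtest(algo, bars):
    """Hypothetical driver: calls initialize() once, then on_data() per bar."""
    algo.initialize()
    for bar in bars:
        algo.on_data(bar)
    return algo.calls


calls = run_backtest(ToyAlgorithm(), bars=["day1", "day2", "day3"])
print(calls)
```

The takeaway is only the ordering: setup and training happen once, and on_data() then runs once per bar.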
We'll be using SPY in this tutorial. We'll also change how it's represented in the initialize() function just to make it easier to deal with.
    # region imports
    from AlgorithmImports import *
    # endregion

    class HipsterApricotHyena(QCAlgorithm):

        def initialize(self):
            self.set_start_date(2023, 7, 13)
            self.set_end_date(2024, 12, 31)
            self.set_cash(100000)

            # this lets us use it to compare it against a buy-and-hold strategy later
            self.symbol = self.AddEquity("SPY", Resolution.Daily).Symbol
            self.set_benchmark(self.symbol)

        def on_data(self, data: Slice):
            pass  # placeholder so the class is valid; we fill this in later
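- AddEquity() subscribes your algorithm to data for the particular symbol (SPY in this case)
- The Resolution parameter defines the frequency of data - each data point is a bar summarizing that period (open, high, low, close, and volume), not an averaged value
- self.set_benchmark() tells QuantConnect which symbol to compare your results against. The buy-and-hold line on the chart comes from plot_market(), which we'll define later.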
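As a rough illustration of what a coarser resolution means, the sketch below rolls finer-grained prices up into daily open/high/low/close bars. The prices and the to_daily_bars helper are invented for this example; QuantConnect does this aggregation for you.

```python
# (day, price) pairs standing in for intraday data; the values are made up
intraday_prices = [
    ("2024-01-02", 100.0), ("2024-01-02", 103.0), ("2024-01-02", 99.0), ("2024-01-02", 101.0),
    ("2024-01-03", 101.5), ("2024-01-03", 104.0), ("2024-01-03", 102.0),
]


def to_daily_bars(points):
    """Hypothetical helper: collapse (day, price) pairs into one OHLC bar per day."""
    bars = {}
    for day, price in points:
        if day not in bars:
            # the first price of the day opens the bar
            bars[day] = {"open": price, "high": price, "low": price, "close": price}
        else:
            bar = bars[day]
            bar["high"] = max(bar["high"], price)
            bar["low"] = min(bar["low"], price)
            bar["close"] = price  # the last price seen so far closes the bar
    return bars


daily = to_daily_bars(intraday_prices)
for day, bar in daily.items():
    print(day, bar)
```

Note that the high and low are the extremes of the day, which is why (high + low) / 2 later in this tutorial is a midpoint rather than a true average price.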
Broadly speaking, the algorithm can be broken down into two components:
- Prediction
- Strategy
Prediction refers to obtaining as accurate as possible a value for a specific metric - price, volatility, etc. The better your prediction, the more confidence you can place on your model, and the larger the sum you can entrust to your algorithm.
Strategy, on the other hand, refers to what you do with the value. For instance, you might know that SPY (selling for 582.19 at time of writing) might reach 600 tomorrow, then drop to 200 the day after. The most straightforward thing to do would be to buy it immediately, sell tomorrow, and short it for the day after. If you're not certain of your model's prediction, you might hedge it with call/put options to cap your potential losses. If you have supreme confidence in your prediction, you might sell a ton of put options the first day and call options the second. The number of combinations and permutations of things you can do only increases from here, and that's for you to study in your own time. Resources are linked at the bottom.
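Here is the arithmetic behind the buy-then-short example, per share. It is a simplified sketch that assumes the predictions come true exactly and ignores fees, borrowing costs, and slippage.

```python
buy_price = 582.19        # today's price from the example
price_tomorrow = 600.0    # predicted price tomorrow
price_day_after = 200.0   # predicted price the day after

# Leg 1: buy today, sell tomorrow
long_profit = price_tomorrow - buy_price

# Leg 2: short tomorrow, buy back (cover) the day after
short_profit = price_tomorrow - price_day_after

total_profit = long_profit + short_profit
print(f"long: {long_profit:.2f}, short: {short_profit:.2f}, total: {total_profit:.2f}")
```

Almost all of the profit comes from the short leg, which is also the leg that hurts the most if the prediction is wrong.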
In this tutorial, we will be implementing logistic regression with momentum on SPY. Add these imports at the top of your file:
    import pandas as pd
    import numpy as np
    from sklearn.linear_model import LogisticRegression
Modeling SPY
A model, in essence, is a learned function that maps input 'X' to prediction 'Y'. In this tutorial, we'll be predicting the future direction given the relative differences between the past three days. To do that, we first define the training period with self.History() in initialize():
    def initialize(self):
        self.set_start_date(2023, 7, 13)
        self.set_end_date(2024, 12, 31)
        self.set_cash(100000)

        self.symbol = self.AddEquity("SPY", Resolution.Daily).Symbol
        self.set_benchmark(self.symbol)

        history = self.History(self.symbol, 1800, Resolution.Daily)
        self.model = self.train(history)
Looking at the three parameters we have: self.symbol, 1800, and Resolution.Daily
- self.symbol and Resolution.Daily are self-explanatory - while you can use other symbols and resolutions, we'll keep it the same as the test data
- 1800 is the number of daily bars requested for the training period. Again, you can further define it and increase specificity if you so wish.
- self.model = self.train(history) is where you keep the trained model to be used at test time. This is what we'll be working on next.
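The point of requesting history inside initialize() is that the training data comes from before the test period. The sketch below shows that split on plain dates. It assumes the history request returns the bars leading up to the backtest start; split_train_test is a hypothetical helper, not a QuantConnect function.

```python
from datetime import date, timedelta


def split_train_test(days, backtest_start, n_train):
    """Hypothetical helper: last n_train days before the start, and the days from the start on."""
    before = [d for d in days if d < backtest_start]
    test = [d for d in days if d >= backtest_start]
    return before[-n_train:], test


# ten consecutive calendar days standing in for trading days
days = [date(2023, 7, 8) + timedelta(days=i) for i in range(10)]

train_days, test_days = split_train_test(days, backtest_start=date(2023, 7, 13), n_train=3)
print("train:", train_days)
print("test: ", test_days)
```

Because every training day is earlier than every test day, the model never sees the prices it is later asked to predict.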
It's important to note that data is king. Andrej Karpathy states this very well:
The first step to training a neural net is to not touch any neural net code at all and instead begin by thoroughly inspecting your data. This step is critical. I like to spend copious amount of time (measured in units of hours) scanning through thousands of examples, understanding their distribution and looking for patterns.
Therefore, we'll always start by looking at the data. To do this, we'll move to the research.ipynb located in the same directory.
Here's the initialization code. We will skip the details for now and look directly at the data:
    # Add this to the top of your research.ipynb
    qb = QuantBook()
    spy = qb.AddEquity("SPY")
    df = qb.History(qb.Securities.Keys, 1800, Resolution.Daily)
    df
Next comes playing around with pandas. While we highly encourage you to experiment, that's for you to do in your own time. Here, we are using the relative changes between days, then normalizing them.
    # trailing average values for the past three days
    df['average'] = (df['high'] + df['low']) / 2
    df['lag_1'] = df['average'].shift(1)
    df['lag_2'] = df['average'].shift(2)
    df['lag_3'] = df['average'].shift(3)

    # interday differences
    df['diff_1'] = df['average'] - df['lag_1']
    df['diff_2'] = df['lag_1'] - df['lag_2']
    df['diff_3'] = df['lag_2'] - df['lag_3']

    # interday gradients
    df['grad_1'] = df['diff_1'] / df['lag_1']  # relative to the previous day
    df['grad_2'] = df['diff_2'] / df['lag_2']
    df['grad_3'] = df['diff_3'] / df['lag_3']

    def normalize(series):
        mean = series.mean()
        std = series.std()
        return (series - mean) / std

    df['grad_1_norm'] = normalize(df['grad_1'])
    df['grad_2_norm'] = normalize(df['grad_2'])
    df['grad_3_norm'] = normalize(df['grad_3'])
This constitutes the 'X' element of the function. For Y, we make another column describing whether the stock moves up or down the next day.
    # 1 if it moves up, 0 if it stays the same/moves down
    df['target'] = np.where(df['average'].shift(-1) > df['average'], 1, 0)
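To see what these columns actually contain, here is the same recipe on five made-up days (not real SPY data), small enough to check by hand. Normalization is left out to keep the numbers readable.

```python
import numpy as np
import pandas as pd

# invented highs and lows for five days; the midpoints are 100, 110, 99, 109, 98
df = pd.DataFrame({
    "high": [101.0, 111.0, 100.0, 110.0, 99.0],
    "low":  [99.0, 109.0, 98.0, 108.0, 97.0],
})

df["average"] = (df["high"] + df["low"]) / 2
for k in (1, 2, 3):
    df[f"lag_{k}"] = df["average"].shift(k)

# relative day-over-day changes, each one day further back
df["grad_1"] = (df["average"] - df["lag_1"]) / df["lag_1"]
df["grad_2"] = (df["lag_1"] - df["lag_2"]) / df["lag_2"]
df["grad_3"] = (df["lag_2"] - df["lag_3"]) / df["lag_3"]

# 1 if the next day's average is higher, else 0
df["target"] = np.where(df["average"].shift(-1) > df["average"], 1, 0)

print(df[["average", "grad_1", "grad_2", "grad_3", "target"]])

# rows that have all three features available
usable = df.dropna(subset=["grad_1", "grad_2", "grad_3"])
print(len(usable), "usable rows out of", len(df))
```

Two things are worth noticing. The first three rows are missing at least one feature because there is not enough history behind them. The last row gets a target of 0 only because there is no next day to compare against, so it is not a real label.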
We can just leave this as a pandas DataFrame for later. Now we can go back to the main algorithm and look at the train() function:
    def train(self, history):
        df = pd.DataFrame(history)

        # trailing average values for the past three days
        df['average'] = (df['high'] + df['low']) / 2
        df['lag_1'] = df['average'].shift(1)
        df['lag_2'] = df['average'].shift(2)
        df['lag_3'] = df['average'].shift(3)

        # interday differences
        df['diff_1'] = df['average'] - df['lag_1']
        df['diff_2'] = df['lag_1'] - df['lag_2']
        df['diff_3'] = df['lag_2'] - df['lag_3']

        # interday gradients
        df['grad_1'] = df['diff_1'] / df['lag_1']  # relative to the previous day
        df['grad_2'] = df['diff_2'] / df['lag_2']
        df['grad_3'] = df['diff_3'] / df['lag_3']

        def normalize(series):
            mean = series.mean()
            std = series.std()
            return (series - mean) / std

        df['grad_1_norm'] = normalize(df['grad_1'])
        df['grad_2_norm'] = normalize(df['grad_2'])
        df['grad_3_norm'] = normalize(df['grad_3'])

        # target
        df['target'] = np.where(df['average'].shift(-1) > df['average'], 1, 0)

        if df.empty:
            self.Log("Training data is empty. Cannot train model.")
            return None
And for the final step of this section, we create the X and Y features, initialize the model, and train it. Almost everything is abstracted away by the libraries, so it's worthwhile to manually write it out every now and then.
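        # dropna() removes the rows left empty when we shifted the data;
        # dropping them from df (not from X alone) keeps X and y the same length
        df = df.dropna(subset=['grad_1_norm', 'grad_2_norm', 'grad_3_norm'])
        X = df[['grad_1_norm', 'grad_2_norm', 'grad_3_norm']]
        y = df['target']  # 1-D labels, which is what fit() expects

        model = LogisticRegression()
        model.fit(X, y)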
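One detail that is easy to get wrong here: scikit-learn's fit() needs X and y to have the same number of rows. The toy example below (made-up numbers) shows what happens to the row counts when the empty rows are dropped from X only, versus from the shared DataFrame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "grad_1_norm": [np.nan, 0.5, -0.2, 0.1],  # first row is empty, as after a shift
    "target":      [1, 0, 1, 0],
})

# Dropping NaNs from X alone leaves y one row longer than X
X_only = df[["grad_1_norm"]].dropna()
y_all = df["target"]
print(len(X_only), len(y_all))

# Dropping the rows from the shared DataFrame first keeps both aligned
clean = df.dropna(subset=["grad_1_norm"])
X = clean[["grad_1_norm"]]
y = clean["target"]
print(len(X), len(y))
```

Filtering the DataFrame once and then selecting X and y from it is the simplest way to keep the two in step.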
This is how your function should look altogether now:
    def train(self, history):
        df = pd.DataFrame(history)

        # trailing average values for the past three days
        df['average'] = (df['high'] + df['low']) / 2
        df['lag_1'] = df['average'].shift(1)
        df['lag_2'] = df['average'].shift(2)
        df['lag_3'] = df['average'].shift(3)

        # interday differences
        df['diff_1'] = df['average'] - df['lag_1']
        df['diff_2'] = df['lag_1'] - df['lag_2']
        df['diff_3'] = df['lag_2'] - df['lag_3']

        # interday gradients
        df['grad_1'] = df['diff_1'] / df['lag_1']  # relative to the previous day
        df['grad_2'] = df['diff_2'] / df['lag_2']
        df['grad_3'] = df['diff_3'] / df['lag_3']

        def normalize(series):
            mean = series.mean()
            std = series.std()
            return (series - mean) / std

        df['grad_1_norm'] = normalize(df['grad_1'])
        df['grad_2_norm'] = normalize(df['grad_2'])
        df['grad_3_norm'] = normalize(df['grad_3'])

        # target
        df['target'] = np.where(df['average'].shift(-1) > df['average'], 1, 0)

        # dropna() removes the rows left empty when we shifted the data;
        # dropping them from df (not from X alone) keeps X and y the same length
        df = df.dropna(subset=['grad_1_norm', 'grad_2_norm', 'grad_3_norm'])

        if df.empty:
            self.Log("Training data is empty. Cannot train model.")
            return None

        X = df[['grad_1_norm', 'grad_2_norm', 'grad_3_norm']]
        y = df['target']  # 1-D labels, which is what fit() expects

        model = LogisticRegression()
        model.fit(X, y)

        return model
Using the Prediction
Now we move on to the on_data() function, which will be run when we want to backtest or trade.
Let's first add error checking - just paste it at the top of the function:
    if not data.ContainsKey(self.symbol):
        return
    if self.model is None:
        self.Log("Model is not trained. Skipping prediction.")
        return
Predicting the Movement
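Again, we want to predict the movement based on the previous days' movements. So we take the previous days' data, preprocess it with prepare_data() (the feature code from train(), factored into its own method in the full listing at the end), and apply the model. This is pretty standard Python, so I won't bore you with the details: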
    history = self.History(self.symbol, 10, Resolution.Daily)  # past 10 days' worth of data
    df = self.prepare_data(history)
    if df.empty:
        self.Log("DataFrame is empty. Skipping this OnData call.")
        return

    # obtaining input data, making prediction
    latest_data = df.iloc[[-1]][['grad_1_norm', 'grad_2_norm', 'grad_3_norm']]
    # predict() would only return the 0/1 label; predict_proba() returns
    # [P(down), P(up)] per row, so [0][1] is the probability of an up move
    pred = self.model.predict_proba(latest_data)[0][1]
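scikit-learn classifiers offer two ways to get a prediction: predict() returns a hard class label, while predict_proba() returns a probability per class. The sketch below shows the difference on a tiny invented dataset (a single feature, nothing to do with SPY).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# made-up data: negative inputs are labelled 0 (down), positive ones 1 (up)
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0, 0, 1, 1])

model = LogisticRegression()
model.fit(X, y)

sample = np.array([[3.0]])
label = model.predict(sample)[0]             # hard class label: 0 or 1
prob_up = model.predict_proba(sample)[0][1]  # probability of class 1

# an input right on the fence between the two classes
borderline = np.array([[0.0]])
borderline_prob = model.predict_proba(borderline)[0][1]

print(label, round(prob_up, 3), round(borderline_prob, 3))
```

The borderline input still gets a 0 or 1 from predict(), even though its probability sits close to 0.5. If you want to act only on confident predictions, the probability is the value to compare against a threshold.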
Making a Decision
And finally, making a decision. I'll show you the code first, then explain it:
    buy_threshold = 0.6
    sell_threshold = 0.4
    portfolio_weight = 0.7

    if pred >= buy_threshold:
        self.SetHoldings(self.symbol, portfolio_weight)
        purchase_price = self.Portfolio[self.symbol].Price
    elif pred <= sell_threshold:
        self.liquidate()
    else:
        pass
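Stripped of the QuantConnect calls, the decision rule is a function from a prediction to one of three actions. The sketch below assumes the prediction is a probability between 0 and 1; decide is a hypothetical helper written for illustration.

```python
def decide(prob_up, buy_threshold=0.6, sell_threshold=0.4):
    """Hypothetical helper: map a probability of an up move to an action."""
    if prob_up >= buy_threshold:
        return "buy"
    if prob_up <= sell_threshold:
        return "sell"
    return "hold"  # not confident either way, so do nothing


for p in (0.0, 0.4, 0.5, 0.6, 1.0):
    print(p, decide(p))
```

The gap between the two thresholds is a "do nothing" zone. Note that a hard 0/1 label can never land in that zone, so the thresholds only matter when the prediction is a probability.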
In essence, every strategy can be boiled down to choosing:
- When to buy
- When to sell
And there are many ways in which you can make this decision. You can use price, volatility, momentum, P/B, and essentially any financial metric to decide, as well as whatever information you deem important and high-signal enough to support your decision. Here, we use momentum, which is based on a stock's tendency to continue moving in its current direction.
We'll skip the discussion on the code and focus on the functions involved:
- self.SetHoldings() calculates the number of asset units to buy or sell so that the position reaches the given portfolio weight - as in, a fraction of your total portfolio value (a weight of 0.5 on a $100 portfolio targets $50 of the asset).
- self.Portfolio[self.symbol].Price gets the current market price of the symbol, which we store as a stand-in for the purchase price.
- self.liquidate() liquidates everything. You can pass parameters in to further define how much and what symbols you'd like to liquidate, but here, we are dumping everything if the prediction falls to 0.4 or below.
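The portfolio-weight arithmetic can be sketched in a few lines. target_shares is a hypothetical helper, and it is a simplification: the real method may also account for things like fees and existing positions, which this sketch ignores.

```python
import math


def target_shares(portfolio_value, weight, price):
    """Hypothetical helper: whole shares needed to hold `weight` of the portfolio."""
    return math.floor(portfolio_value * weight / price)


# the tutorial's numbers: $100,000 portfolio, weight 0.7, SPY at 582.19
shares = target_shares(100000, 0.7, 582.19)
print(shares, "shares, worth", round(shares * 582.19, 2))

# the small example from the text: 0.5 of $100 at $10 a share
print(target_shares(100, 0.5, 10))
```

Because shares are bought in whole units here, the position ends up slightly under the target weight.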
As your strategies become more sophisticated, you'll be able to use more functions and features. QuantConnect lacks good documentation, so reading other people's code can give you better insight into the action space.
Putting Everything Together
And now, putting everything (everything) together:
    # region imports
    from AlgorithmImports import *
    import pandas as pd
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    # endregion

    class HipsterApricotHyena(QCAlgorithm):

        def initialize(self):
            self.set_start_date(2023, 7, 13)
            self.set_end_date(2024, 12, 31)
            self.set_cash(100000)
            self.symbol = self.AddEquity("SPY", Resolution.DAILY).Symbol

            # benchmark graph. very useful for visualizations
            self.set_benchmark(self.symbol)
            self.cap = 100000
            self.benchmark_chart = []

            history = self.History(self.symbol, 1800, Resolution.DAILY)
            self.model = self.train(history)

        def on_data(self, data: Slice):
            # some checking to prevent errors
            if not data.ContainsKey(self.symbol):
                return
            if self.model is None:
                self.Log("Model is not trained. Skipping prediction.")
                return

            self.plot_market()

            # ================================================================
            # predicting the movement
            # ================================================================
            # obtaining history
            history = self.History(self.symbol, 10, Resolution.Daily)  # past 10 days' worth of data
            df = self.prepare_data(history)
            if df.empty:
                self.Log("DataFrame is empty. Skipping this OnData call.")
                return

            # obtaining input data, making prediction
            latest_data = df.iloc[[-1]][['grad_1_norm', 'grad_2_norm', 'grad_3_norm']]
            # predict() would only return the 0/1 label; predict_proba() returns
            # [P(down), P(up)] per row, so [0][1] is the probability of an up move
            pred = self.model.predict_proba(latest_data)[0][1]
            self.Log(f"Date: {self.Time.date()}, Predicted Movement: {pred}")

            # ================================================================
            # using the prediction
            # ================================================================
            buy_threshold = 0.6
            sell_threshold = 0.4
            portfolio_weight = 0.7

            if pred >= buy_threshold:
                self.SetHoldings(self.symbol, portfolio_weight)
                purchase_price = self.Portfolio[self.symbol].Price
                # self.PlaceStopMarketOrder(purchase_price, direction="long")  # hedging
            elif pred <= sell_threshold:
                self.liquidate()
            else:
                pass

        def train(self, history):
            df = self.prepare_data(history)
            if df.empty:
                # nothing to fit on; on_data() checks for a missing model
                return None

            X = df[['grad_1_norm', 'grad_2_norm', 'grad_3_norm']]
            y = df['target']  # 1-D labels, which is what fit() expects

            model = LogisticRegression()
            model.fit(X, y)

            return model

        def prepare_data(self, history):
            df = pd.DataFrame(history)

            # trailing average values for the past three days
            df['average'] = (df['high'] + df['low']) / 2
            df['lag_1'] = df['average'].shift(1)
            df['lag_2'] = df['average'].shift(2)
            df['lag_3'] = df['average'].shift(3)

            # interday differences
            df['diff_1'] = df['average'] - df['lag_1']
            df['diff_2'] = df['lag_1'] - df['lag_2']
            df['diff_3'] = df['lag_2'] - df['lag_3']

            # interday gradients
            df['grad_1'] = df['diff_1'] / df['lag_1']  # relative to the previous day
            df['grad_2'] = df['diff_2'] / df['lag_2']
            df['grad_3'] = df['diff_3'] / df['lag_3']

            def normalize(series):
                mean = series.mean()
                std = series.std()
                return (series - mean) / std

            df['grad_1_norm'] = normalize(df['grad_1'])
            df['grad_2_norm'] = normalize(df['grad_2'])
            df['grad_3_norm'] = normalize(df['grad_3'])

            # target
            df['target'] = np.where(df['average'].shift(-1) > df['average'], 1, 0)

            df = df.dropna(subset=['grad_1', 'grad_2', 'grad_3', 'target'])

            if df.empty:
                self.Log("Training data is empty. Cannot train model.")
                return pd.DataFrame()

            return df

        def PlaceStopMarketOrder(self, purchase_price, direction):
            if direction == "long":
                stop_price = purchase_price * 0.99  # 1% below purchase price
                self.stopMarketTicket = self.StopMarketOrder(self.symbol, -self.Portfolio[self.symbol].Quantity, stop_price)
                self.Log(f"Placed stop market order to sell at ${stop_price:.2f} (1% below purchase price).")
            elif direction == "short":
                stop_price = purchase_price * 1.01  # 1% above purchase price
                self.stopMarketTicket = self.StopMarketOrder(self.symbol, -self.Portfolio[self.symbol].Quantity, stop_price)
                self.Log(f"Placed stop market order to cover at ${stop_price:.2f} (1% above purchase price).")

        def plot_market(self):
            # plot the market on the Strategy Equity chart with your portfolio
            hist = self.History([self.symbol], 252, Resolution.Daily)['close'].unstack(level=0).dropna()
            self.benchmark_chart.append(hist[self.symbol].iloc[-1])
            benchmark_perf = self.benchmark_chart[-1] / self.benchmark_chart[0] * self.cap
            self.Plot("Strategy Equity", "Buy & Hold", benchmark_perf)
And there you have it - a working trading algorithm (albeit one that underperforms the benchmark)! We've also included a stop-loss helper for hedging (commented out in on_data()) and the plot_market() function so you can see how you perform compared to just buying and holding. For any more questions, ask in Discord.
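If you want to check the numbers behind those two helpers by hand, the arithmetic is short. The functions below are hypothetical plain-Python versions written for illustration (no QuantConnect calls), using made-up prices.

```python
def stop_price(purchase_price, direction, pct=0.01):
    """Stop level `pct` away from the purchase price, on the losing side."""
    if direction == "long":
        return purchase_price * (1 - pct)  # sell if the price falls this far
    if direction == "short":
        return purchase_price * (1 + pct)  # cover if the price rises this far
    raise ValueError(f"unknown direction: {direction}")


def buy_and_hold_value(closes, starting_cash):
    """Value over time of starting_cash fully invested at the first close."""
    return [close / closes[0] * starting_cash for close in closes]


print(stop_price(100.0, "long"), stop_price(100.0, "short"))

closes = [100.0, 110.0, 99.0]  # invented closing prices
curve = buy_and_hold_value(closes, starting_cash=100000)
print([round(v, 2) for v in curve])
```

The buy-and-hold curve is just the price series rescaled to start at your starting cash, which is what makes it directly comparable to the strategy's equity curve.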
Acknowledgements
- Claude 3.5 Sonnet
- @michaelbol