Candlestick patterns for the S&P 500¶
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from collections import Counter
from pathlib import Path
from collections import namedtuple
from scipy import stats
from BSquant import load_data
from BSquant import process_data
from BSquant import cs_pattern_recognition
from BSquant import cs_performance
from BSquant import plot_cs_performance
from BSquant import compute_trading_strategy_performance
pd.set_option("display.max_columns", None)
%load_ext autoreload
%autoreload 2
Loading data for the S&P500¶
# Define the path to your ticker file
ticker_file = "./../data/SP500_tickers_one_per_line.txt"
notebooks_dir = Path("./../notebooks")
ticker_file_path = notebooks_dir.parent / "data" / ticker_file
tickers = []
# Open the ticker file with a context manager and read each line, adding it to the list of tickers
with open(ticker_file_path, "r") as file:
for line in file:
ticker = line.strip() # strip newline characters and whitespace
tickers.append(ticker) # add the cleaned ticker to the list
print("Number of tickers (may include multiple tickers per stock) is", len(tickers))
print("Number of unique tickers is:", set(tickers).__len__())
Number of tickers (may include multiple tickers per stock) is 503
Number of unique tickers is: 503
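For reference, pathlib offers a more compact way to read such a one-ticker-per-line file; an equivalent alternative to the loop above, reusing the ticker_file_path defined earlier:

# read the whole file at once and split on line breaks, dropping empty lines
tickers = [line.strip() for line in ticker_file_path.read_text().splitlines() if line.strip()]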
# enumerated tickers in the ticker_file
for i, ticker in enumerate(tickers):
print(f"{i+1}:{ticker}")
1:MSFT 2:AAPL 3:AMZN 4:NVDA 5:GOOGL 6:META 7:GOOG 8:TSLA 9:BRK.B 10:UNH 11:LLY 12:JPM 13:XOM 14:V 15:AVGO 16:JNJ 17:PG 18:MA 19:HD 20:ADBE 21:COST 22:MRK 23:CVX 24:ABBV 25:CRM 26:PEP 27:KO 28:WMT 29:BAC 30:ACN 31:NFLX 32:MCD 33:LIN 34:CSCO 35:AMD 36:TMO 37:INTC 38:ORCL 39:ABT 40:CMCSA 41:PFE 42:DIS 43:WFC 44:VZ 45:INTU 46:DHR 47:PM 48:IBM 49:AMGN 50:QCOM 51:NOW 52:TXN 53:COP 54:UNP 55:SPGI 56:NKE 57:GE 58:BA 59:HON 60:CAT 61:AMAT 62:RTX 63:T 64:NEE 65:LOW 66:SBUX 67:ELV 68:GS 69:BKNG 70:UPS 71:ISRG 72:PLD 73:MDT 74:BLK 75:BMY 76:TJX 77:MS 78:LMT 79:SYK 80:DE 81:AXP 82:MMC 83:AMT 84:MDLZ 85:PGR 86:GILD 87:LRCX 88:ADP 89:CB 90:ADI 91:VRTX 92:SCHW 93:ETN 94:PANW 95:C 96:REGN 97:CVS 98:MU 99:SNPS 100:BSX 101:ZTS 102:BX 103:FI 104:CME 105:TMUS 106:CI 107:SO 108:EQIX 109:MO 110:KLAC 111:CDNS 112:SLB 113:EOG 114:DUK 115:BDX 116:NOC 117:ITW 118:AON 119:SHW 120:ICE 121:CL 122:CSX 123:MCK 124:PYPL 125:WM 126:TGT 127:CMG 128:APD 129:HUM 130:FDX 131:MPC 132:USB 133:ORLY 134:MCO 135:PSX 136:ROP 137:GD 138:PH 139:ANET 140:MMM 141:APH 142:PXD 143:MSI 144:ABNB 145:AJG 146:FCX 147:PNC 148:TDG 149:NXPI 150:LULU 151:TT 152:CCI 153:EMR 154:MAR 155:HCA 156:NSC 157:WELL 158:ECL 159:PCAR 160:CTAS 161:AZO 162:AIG 163:ADSK 164:NEM 165:SRE 166:MCHP 167:AFL 168:WMB 169:DXCM 170:ROST 171:VLO 172:HLT 173:CPRT 174:CARR 175:GM 176:TFC 177:COF 178:NUE 179:DLR 180:KMB 181:TRV 182:MSCI 183:EW 184:TEL 185:MNST 186:PSA 187:AEP 188:SPG 189:CHTR 190:F 191:MET 192:OKE 193:CNC 194:ADM 195:OXY 196:IQV 197:PAYX 198:CEG 199:DHI 200:HES 201:STZ 202:IDXX 203:EXC 204:O 205:D 206:A 207:BK 208:GIS 209:SYY 210:DOW 211:AMP 212:LHX 213:ALL 214:JCI 215:PCG 216:AME 217:CTSH 218:PRU 219:OTIS 220:KVUE 221:YUM 222:VRSK 223:GWW 224:ODFL 225:FIS 226:IT 227:FAST 228:FTNT 229:KMI 230:EA 231:COR 232:BIIB 233:BKR 234:CSGP 235:XEL 236:PPG 237:HAL 238:RSG 239:DD 240:URI 241:LEN 242:CTVA 243:KDP 244:CMI 245:ROK 246:PEG 247:ED 248:ACGL 249:ON 250:GPN 251:VICI 252:EL 253:KR 254:IR 255:DVN 256:DG 257:MLM 258:VMC 259:CDW 260:HSY 261:KHC 262:FANG 263:EXR 264:PWR 265:CAH 266:FICO 267:GEHC 268:SBAC 269:EFX 270:WEC 271:MPWR 272:WST 273:WTW 274:DLTR 275:MRNA 276:EIX 277:AWK 278:ANSS 279:HPQ 280:RCL 281:XYL 282:TTWO 283:CBRE 284:ZBH 285:FTV 286:KEYS 287:AVB 288:LYB 289:HIG 290:MTD 291:DAL 292:CHD 293:APTV 294:DFS 295:STT 296:WBD 297:RMD 298:WY 299:BR 300:TROW 301:TSCO 302:EBAY 303:HPE 304:GLW 305:DTE 306:ETR 307:MTB 308:ULTA 309:MOH 310:WAB 311:ES 312:HWM 313:TRGP 314:NVR 315:AEE 316:CTRA 317:STE 318:RJF 319:DOV 320:FITB 321:EQR 322:PHM 323:NTAP 324:LH 325:IFF 326:CBOE 327:INVH 328:PPL 329:FE 330:VRSN 331:TDY 332:DRI 333:NDAQ 334:EXPE 335:PTC 336:GRMN 337:IRM 338:GPC 339:STLD 340:VTR 341:BAX 342:CNP 343:EXPD 344:FLT 345:CLX 346:EG 347:BRO 348:AKAM 349:HOLX 350:BALL 351:FDS 352:TYL 353:ARE 354:LVS 355:VLTO 356:ATO 357:FSLR 358:COO 359:WAT 360:CMS 361:BG 362:PFG 363:NTRS 364:MKC 365:AXON 366:CINF 367:HBAN 368:ILMN 369:HUBB 370:J 371:OMC 372:AVY 373:RF 374:SWKS 375:WDC 376:MRO 377:DGX 378:ALGN 379:STX 380:LUV 381:TXT 382:PKG 383:IEX 384:JBHT 385:CCL 386:EPAM 387:WRB 388:LDOS 389:SNA 390:CF 391:EQT 392:MAA 393:LW 394:WBA 395:ALB 396:TER 397:SWK 398:AMCR 399:K 400:DPZ 401:CE 402:BBY 403:ENPH 404:ESS 405:MAS 406:POOL 407:SYF 408:CAG 409:TSN 410:PODD 411:L 412:UAL 413:CFG 414:IP 415:LNT 416:NDSN 417:HST 418:GEN 419:ZBRA 420:MOS 421:LYV 422:LKQ 423:IPG 424:KIM 425:EVRG 426:KEY 427:JKHY 428:TRMB 429:AES 430:TAP 431:ROL 432:SJM 433:MGM 434:APA 435:RVTY 436:VTRS 437:NRG 438:GL 439:TFX 440:BF.B 441:PNR 442:CDAY 443:WRK 444:NI 445:REG 446:FFIV 447:UDR 448:KMX 449:INCY 450:CRL 
451:EMN 452:TECH 453:CPT 454:CHRW 455:PEAK 456:CZR 457:QRVO 458:HII 459:AOS 460:ETSY 461:ALLE 462:JNPR 463:MTCH 464:MKTX 465:AIZ 466:PAYC 467:HRL 468:RHI 469:HSIC 470:UHS 471:PNW 472:NWSA 473:WYNN 474:AAL 475:BXP 476:BWA 477:CPB 478:FOXA 479:BBWI 480:TPR 481:GNRC 482:BEN 483:CTLT 484:FRT 485:PARA 486:FMC 487:XRAY 488:IVZ 489:BIO 490:NCLH 491:CMA 492:WHR 493:HAS 494:VFC 495:DVA 496:ZION 497:RL 498:SEE 499:ALK 500:MHK 501:SEDG 502:FOX 503:NWS
Remove rows with missing data¶
data_filename = "SP500_daily_data_1980_to_2023.csv.gz"
notebooks_dir = Path("./../notebooks")
data_file_path = notebooks_dir.parent / "data" / data_filename
print(data_file_path)
../data/SP500_daily_data_1980_to_2023.csv.gz
%time
# feed the output of load_data directly into process_data; load all the data we have
df = process_data(load_data(file_path=data_file_path, compression="gzip"))
CPU times: user 1 µs, sys: 1 µs, total: 2 µs Wall time: 4.53 µs
df.shape
(3169549, 13)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3169549 entries, 0 to 3169548
Data columns (total 13 columns):
 #   Column                Dtype
---  ------                -----
 0   ticker                object
 1   date                  datetime64[ns]
 2   prc                   float64
 3   vol                   int64
 4   close                 float64
 5   low                   float64
 6   high                  float64
 7   open                  float64
 8   price_vol             float64
 9   intraday_return       float64
 10  sign_intraday_return  int64
 11  next_intraday_return  float64
 12  sign_next_day_return  Int64
dtypes: Int64(1), datetime64[ns](1), float64(8), int64(2), object(1)
memory usage: 317.4+ MB
df.head()
  | ticker | date | prc | vol | close | low | high | open | price_vol | intraday_return | sign_intraday_return | next_intraday_return | sign_next_day_return
---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | AMT | 1992-06-15 | 7.375 | 26700 | 7.375 | 7.375 | 7.500 | 7.375 | 196912.5 | 0.000000 | 0 | 0.000000 | 0 |
1 | AMT | 1992-06-16 | 7.375 | 6300 | 7.375 | 7.375 | 7.500 | 7.375 | 46462.5 | 0.000000 | 0 | -0.016949 | -1 |
2 | AMT | 1992-06-17 | 7.250 | 16400 | 7.250 | 7.250 | 7.375 | 7.375 | 118900.0 | -0.016949 | -1 | 0.000000 | 0 |
3 | AMT | 1992-06-18 | 7.250 | 6900 | 7.250 | 7.250 | 7.375 | 7.250 | 50025.0 | 0.000000 | 0 | 0.016949 | 1 |
4 | AMT | 1992-06-19 | 7.500 | 13700 | 7.500 | 7.250 | 7.500 | 7.375 | 102750.0 | 0.016949 | 1 | 0.000000 | 0 |
df["ticker"].unique().__len__()
501
We have data for 501 stocks, although our ticker file contains 503 tickers. We obtained the tickers from a public source, and their ticker formatting might differ from CRSP's. This is not significant for our context; however, it is worthwhile making sure the metadata of tradable securities is updated regularly.
Let us now compute how many days of data we have per stock.¶
Companies can be added to and removed from the S&P 500 index. Some startups that were not previously listed might prosper and develop into companies large enough to be included in the index, while others may be outcompeted, cease to exist, be acquired, split up, or taken private, and hence be excluded and/or delisted. Let us investigate how many days of data we have for each stock. While doing so, we take an interesting detour related to Python performance:
Python is primarily considered an interpreted language. Python code is executed by an interpreter, which reads the code at runtime and executes it line by line. This process is different from compiled languages, where the source code is transformed into machine code or bytecode before execution, typically resulting in an executable file. However, at a more detailed level, Python code is indeed compiled under the hood. More precisely, when Python code is executed, it is compiled into bytecode, which is a lower-level, platform-independent representation of the source code. This bytecode is then interpreted by the Python Virtual Machine (PVM); unlike in a purely compiled language such as C or C++, however, it is not turned into a standalone executable file. This process is automatic and transparent to the user, making Python feel like a purely interpreted language. Tools and third-party packages do exist that can package Python programs along with an interpreter into a standalone executable, but this is an additional step beyond Python's standard behavior.
The important point is that executing bytecode through the PVM imposes an overhead which costs time. Hence, Python is considered "slow". However, you can use C and C++ code within Python to leverage performance benefits. This is a common practice for computationally heavy tasks where the execution speed of Python is a bottleneck. Integrating C or C++ code into Python can significantly improve the performance of certain operations, especially those that are CPU-bound, such as numerical computations, data processing, and more. This, however, requires more detailed knowledge of CPython's internals, is not straightforward, and is a topic for another repository.
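As a small taste of what such an integration looks like, the standard-library ctypes module can call into an already-compiled C library directly; a minimal sketch, assuming a typical Linux or macOS system where the C math library can be located:

import ctypes
import ctypes.util

# locate and load the C math library, then declare sqrt's C signature
libm = ctypes.CDLL(ctypes.util.find_library("m"))
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]

print(libm.sqrt(2.0))  # 1.4142135623730951, computed in compiled C code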
However, that does not mean we cannot speed up our code. In particular, we can make use of libraries that are written, at least partially, in C and are available in Python, such as numpy. As pandas makes use of numpy, it is often possible to enjoy better performance, especially when computing in-memory as we do with pandas. Thus, for the sake of performance, it is generally good advice to "write high-level code thinking low-level", and the following is meant to demonstrate this.

To compute the number of days of data we have for each of the S&P 500 members, a straightforward (but slow) method is to loop through each ticker, filter the data frame for that ticker, and count the number of rows. This is executed below.
%%time
days_per_ticker = {}
for ticker in tickers:
    days_per_ticker[ticker] = df.query("ticker == @ticker").shape[
        0
    ]  # takes about 23 seconds in total (see the timing below)
    # days_per_ticker[ticker] = df[df['ticker'] == ticker].shape[0]  # the boolean-mask variant takes about 59.6 seconds, i.e. roughly three times slower still.
CPU times: user 23 s, sys: 0 ns, total: 23 s Wall time: 23 s
As you can see, this step took approximately 23 seconds on the machine this code was executed on. Making use of the native pandas .groupby() method, which is implemented in C, and storing the results in a dictionary, achieves the same task in about 115 ms, i.e. the computation is roughly 200 times, or two orders of magnitude, faster.
%%time
days_per_ticker = df.groupby("ticker").size().to_dict();
CPU times: user 115 ms, sys: 0 ns, total: 115 ms Wall time: 115 ms
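For reference, an equivalent one-liner uses value_counts(), which additionally sorts the result by count in descending order:

# same dictionary as above, built from the sorted per-ticker counts
days_per_ticker = df["ticker"].value_counts().to_dict()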
We now investigate how the length of each stock's history (in days) is distributed¶
plt.figure(figsize=(7, 7))
plt.hist(list(days_per_ticker.values()), bins=30)
plt.show()
# Counter objects are a part of the collections module in Python's standard library.
# They are specialized dictionary subclasses designed to count hashable objects.
# A Counter is a collection where elements are stored as dictionary keys and their counts are stored
# as dictionary values.
Counter(list(days_per_ticker.values())).most_common(3)[
0
] # Counter(list(days_per_ticker.values())).most_common(3)[0][0] then extracts the history length that occurs most often.
# 80 stocks contain 7944 days of data
(7944, 80)
How are the stocks weighted with respect to the one with the longest history in the portfolio?¶
max_days_ticker = max(
days_per_ticker, key=days_per_ticker.get
) # find the ticker with the maximum number of days
max_days = days_per_ticker[
max_days_ticker
] # retrieve the value (number of days) for this ticker
print(
f"The ticker with the maximum number of days is: {max_days_ticker}, with {max_days} days."
)
max_days = max(days_per_ticker.values()) # find the maximum number of days
weights_per_ticker = {
ticker: days / max_days for ticker, days in days_per_ticker.items()
} # Calculate the weight for each ticker, z-transform the weights should you wish to use them in ML applications
weights_per_ticker
The ticker with the maximum number of days is: LEN, with 13157 days.
{'A': 0.5693547161206962, 'AAL': 0.2824352055939804, 'AAPL': 0.603253021205442, 'ABBV': 0.2103823059968078, 'ABNB': 0.05837196929391199, 'ABT': 0.6037850573839021, 'ACGL': 0.4512426845025462, 'ACN': 0.5907121684274531, 'ADBE': 0.6034810367104964, 'ADI': 0.6037850573839021, 'ADM': 0.6037850573839021, 'ADP': 0.5999087937979782, 'ADSK': 0.5292999923994831, 'AEE': 0.555749790985787, 'AEP': 0.6037850573839021, 'AES': 0.5204073877023637, 'AFL': 0.6037850573839021, 'AIG': 0.6037850573839021, 'AIZ': 0.5872159306832865, 'AJG': 0.6027969901953333, 'AKAM': 0.46218742874515467, 'ALB': 0.5702667781409135, 'ALGN': 0.4382458007144486, 'ALK': 0.6037850573839021, 'ALL': 0.5988447214410579, 'ALLE': 0.34057915938283806, 'AMAT': 0.6037090522155507, 'AMCR': 0.2427605077145246, 'AMD': 0.6036330470471992, 'AME': 0.6033290263737934, 'AMGN': 0.6036330470471992, 'AMP': 0.4795166071292848, 'AMT': 0.5665425248916927, 'AMZN': 0.5093106331230524, 'ANET': 0.29429201185680626, 'ANSS': 0.5261837804970738, 'AON': 0.26936231663753135, 'AOS': 0.5493653568442655, 'APA': 0.6037090522155507, 'APD': 0.6037850573839021, 'APH': 0.6034810367104964, 'APTV': 0.1912290035722429, 'ARE': 0.5086265866078893, 'ATO': 0.6036330470471992, 'AVB': 0.4891692635099187, 'AVGO': 0.3213498517899217, 'AVY': 0.6037090522155507, 'AWK': 0.5027741886448278, 'AXON': 0.1264726001368093, 'AXP': 0.6037090522155507, 'AZO': 0.6037850573839021, 'BA': 0.6037850573839021, 'BAC': 0.6037850573839021, 'BALL': 0.03139013452914798, 'BAX': 0.6037850573839021, 'BBWI': 0.04613513718932887, 'BBY': 0.6038610625522536, 'BDX': 0.6037850573839021, 'BEN': 0.6040890780573079, 'BG': 0.5619822147906057, 'BIIB': 0.38458615185832634, 'BIO': 0.7769248308885004, 'BK': 0.6038610625522536, 'BKNG': 0.30584479744622634, 'BKR': 0.487193129132781, 'BLK': 0.5967165767272175, 'BMY': 0.6037850573839021, 'BR': 0.5843277342859314, 'BRO': 0.4841529223987231, 'BSX': 0.6037850573839021, 'BWA': 0.5812875275518735, 'BX': 0.31678954168883483, 'BXP': 0.5078665349243748, 'C': 0.6027209850269818, 'CAG': 0.6037850573839021, 'CAH': 0.560842137265334, 'CARR': 0.07159686858706392, 'CAT': 0.6037850573839021, 'CB': 0.6036330470471992, 'CBOE': 0.2591776240784373, 'CBRE': 0.11066352511970814, 'CCI': 0.5576499201945733, 'CCL': 0.6037090522155507, 'CDAY': 0.1086873907425705, 'CDNS': 0.3474956297028198, 'CDW': 0.20110967545793115, 'CE': 0.5166071292847914, 'CEG': 0.2825112107623318, 'CF': 0.5511894808847002, 'CFG': 0.17732005776392795, 'CHD': 0.6037090522155507, 'CHRW': 0.501102074941096, 'CHTR': 0.4561830204453903, 'CI': 0.6037090522155507, 'CINF': 0.6035570418788477, 'CL': 0.6037850573839021, 'CLX': 0.6037850573839021, 'CMA': 0.6037090522155507, 'CMCSA': 0.6016569126700616, 'CME': 0.5359124420460591, 'CMG': 0.40404347495629706, 'CMI': 0.5437409743862582, 'CMS': 0.6037850573839021, 'CNC': 0.5809835068784678, 'CNP': 0.7176407995743711, 'COF': 0.5570418788477617, 'COO': 0.6033290263737934, 'COP': 0.5590940183932508, 'COR': 0.4849129740822376, 'COST': 0.5403967469787946, 'CPB': 0.6038610625522536, 'CPRT': 0.568974690278939, 'CPT': 0.5825036102454967, 'CRL': 0.5372805350763852, 'CRM': 0.5099186744698639, 'CSCO': 0.6034810367104964, 'CSGP': 0.466823744014593, 'CSX': 0.6037850573839021, 'CTAS': 0.6037090522155507, 'CTLT': 0.18020825416128297, 'CTRA': 0.27399863190696966, 'CTSH': 0.4882572014897013, 'CTVA': 0.08770996427757087, 'CVS': 0.5210154290491753, 'CVX': 0.42517291175799954, 'CZR': 0.2549973398191077, 'D': 0.6037850573839021, 'DAL': 0.5741430417268374, 'DD': 0.5704187884776165, 'DE': 0.6037850573839021, 'DFS': 
0.5542296876187581, 'DG': 0.5080185452610777, 'DGX': 0.515695067264574, 'DHI': 0.5365204833928707, 'DHR': 0.6037850573839021, 'DIS': 0.6037850573839021, 'DLR': 0.36672493729573613, 'DLTR': 0.5512654860530516, 'DOV': 0.6037850573839021, 'DOW': 0.5736870107167288, 'DPZ': 0.3725773352587976, 'DRI': 0.547085201793722, 'DTE': 0.6037850573839021, 'DUK': 0.6037090522155507, 'DVA': 0.44409819867751005, 'DVN': 0.6037850573839021, 'DXCM': 0.3579843429353196, 'EA': 0.3443794178004104, 'EBAY': 0.5080945504294292, 'ECL': 0.6037090522155507, 'ED': 0.6037850573839021, 'EFX': 0.6037850573839021, 'EG': 0.009272630538876643, 'EIX': 0.53386030250057, 'EL': 0.5378125712548454, 'ELV': 0.21631070912822073, 'EMN': 0.573915026221783, 'EMR': 0.6037850573839021, 'ENPH': 0.22474728281523143, 'EOG': 0.6037090522155507, 'EPAM': 0.22748346887588355, 'EQIX': 0.44706240024321653, 'EQR': 0.5814395378885764, 'EQT': 0.6037850573839021, 'ES': 0.2752907197689443, 'ESS': 0.5657064680398267, 'ETN': 0.6037850573839021, 'ETR': 0.6037090522155507, 'ETSY': 0.16667933419472525, 'EVRG': 0.13544121000228015, 'EW': 0.45405487573154973, 'EXC': 0.5735350003800258, 'EXPD': 0.6027969901953333, 'EXPE': 0.4234247928859162, 'EXR': 0.3709052215550657, 'F': 0.6041650832256593, 'FANG': 0.2148666109295432, 'FAST': 0.6036330470471992, 'FCX': 0.733753895264878, 'FDS': 0.5253477236452079, 'FDX': 0.6037850573839021, 'FE': 0.4998859922474728, 'FFIV': 0.4698639507486509, 'FI': 0.27460667325378124, 'FICO': 0.27491069392718703, 'FIS': 0.3427073040966786, 'FITB': 0.6034810367104964, 'FLT': 0.48042866914950216, 'FMC': 0.6037090522155507, 'FOX': 0.4114159762863875, 'FOXA': 0.20095766512122826, 'FRT': 0.6038610625522536, 'FSLR': 0.3273542600896861, 'FTNT': 0.26997035798434293, 'FTV': 0.14334574751083073, 'GD': 0.6038610625522536, 'GE': 0.6037090522155507, 'GEHC': 0.018925286919510526, 'GEN': 0.35775632743026525, 'GILD': 0.603253021205442, 'GIS': 0.6037850573839021, 'GL': 0.217374781485141, 'GLW': 0.6038610625522536, 'GM': 0.5752831192521092, 'GNRC': 0.2656380633883104, 'GOOG': 0.37037318537660563, 'GOOGL': 0.1864406779661017, 'GPC': 0.6037850573839021, 'GPN': 0.4380937903777457, 'GRMN': 0.48962529452002734, 'GS': 0.47168807478908564, 'GWW': 0.6037850573839021, 'HAL': 0.6037090522155507, 'HAS': 0.6037850573839021, 'HBAN': 0.6035570418788477, 'HCA': 0.4008512578855362, 'HD': 0.603253021205442, 'HES': 0.3376149578171316, 'HIG': 0.5362164627194649, 'HII': 0.3636087253933267, 'HLT': 0.48658508778596943, 'HOLX': 0.6035570418788477, 'HON': 0.6036330470471992, 'HPE': 0.15611461579387398, 'HPQ': 0.41438017785209397, 'HRL': 0.6037090522155507, 'HSIC': 0.5384966177700083, 'HST': 0.4100478832560614, 'HSY': 0.6037850573839021, 'HUBB': 0.15330242456487042, 'HUM': 0.6037850573839021, 'HWM': 0.12077221251045071, 'IBM': 0.6034050315421449, 'ICE': 0.35243596564566393, 'IDXX': 0.6037090522155507, 'IEX': 0.603253021205442, 'IFF': 0.6037090522155507, 'ILMN': 0.44782245192673104, 'INCY': 0.5349243748574903, 'INTC': 0.6034810367104964, 'INTU': 0.589344075397127, 'INVH': 0.13224899293151934, 'IP': 0.6036330470471992, 'IPG': 0.6037850573839021, 'IQV': 0.11704795926122977, 'IR': 0.6035570418788477, 'IRM': 0.47214410579919436, 'ISRG': 0.4503306224823288, 'IT': 0.6427757087481949, 'ITW': 0.6037850573839021, 'IVZ': 0.3177016037090522, 'J': 0.26700615641863645, 'JBHT': 0.6037090522155507, 'JCI': 0.6037850573839021, 'JKHY': 0.6037090522155507, 'JNJ': 0.6037850573839021, 'JNPR': 0.4687238732233792, 'JPM': 0.6037850573839021, 'K': 0.6037850573839021, 'KDP': 0.10481112715664666, 'KEY': 
0.41460819335714827, 'KEYS': 0.39097058599984796, 'KHC': 0.1624990499353956, 'KIM': 0.6038610625522536, 'KLAC': 0.6034810367104964, 'KMB': 0.6037090522155507, 'KMI': 0.39233867903017405, 'KMX': 0.5146309949076537, 'KO': 0.6037850573839021, 'KR': 0.6037850573839021, 'KVUE': 0.012616857946340352, 'L': 0.5079425400927263, 'LDOS': 0.19624534468343846, 'LEN': 1.0, 'LH': 0.5485292999923995, 'LHX': 0.08618986091054191, 'LIN': 0.3017405183552482, 'LKQ': 0.2220110967545793, 'LLY': 0.6037090522155507, 'LMT': 0.5511134757163487, 'LNT': 0.491601428897165, 'LOW': 0.6035570418788477, 'LRCX': 0.6036330470471992, 'LULU': 0.31428137113323706, 'LUV': 0.6033290263737934, 'LVS': 0.3642927719084898, 'LW': 0.136429277190849, 'LYB': 0.2556053811659193, 'LYV': 0.3446834384738162, 'MA': 0.41726837424944896, 'MAA': 0.5723949228547541, 'MAR': 0.5887360340503154, 'MAS': 0.6037850573839021, 'MCD': 0.6037850573839021, 'MCHP': 0.5887360340503154, 'MCK': 0.6021889488485217, 'MCO': 0.4636315269438322, 'MDLZ': 0.2150186212662461, 'MDT': 0.6037090522155507, 'MET': 0.4840009120620202, 'META': 0.2255073344987459, 'MGM': 0.4985178992171468, 'MHK': 0.49798586303868664, 'MKC': 0.8810519115299841, 'MKTX': 0.36634491145397885, 'MLM': 0.5704947936459679, 'MMC': 0.6037850573839021, 'MMM': 0.6037090522155507, 'MNST': 0.334802766588128, 'MO': 0.6035570418788477, 'MOH': 0.39218666869347113, 'MOS': 0.37174127840693166, 'MPC': 0.23903625446530363, 'MPWR': 0.40753971270046363, 'MRK': 0.6035570418788477, 'MRNA': 0.16667933419472525, 'MRO': 0.6036330470471992, 'MS': 0.583567682602417, 'MSCI': 0.25811355172151706, 'MSFT': 0.6034810367104964, 'MSI': 0.5175951964733602, 'MTB': 0.5007220490993387, 'MTCH': 0.2575055103747055, 'MTD': 0.49958197157406703, 'MU': 0.6037090522155507, 'NCLH': 0.20947024397659042, 'NDAQ': 0.3613285703427833, 'NDSN': 0.6037090522155507, 'NEE': 0.25872159306832865, 'NEM': 0.6037850573839021, 'NFLX': 0.41324010032682224, 'NI': 0.6037850573839021, 'NKE': 0.6037850573839021, 'NOC': 0.6037850573839021, 'NOW': 0.36345671505662386, 'NRG': 0.4765524055635783, 'NSC': 0.6037850573839021, 'NTAP': 0.5372805350763852, 'NTRS': 0.6036330470471992, 'NUE': 0.6049251349091739, 'NVDA': 0.4768564262369841, 'NVR': 0.6002128144713841, 'NWS': 0.681994375617542, 'NWSA': 0.2870715208634187, 'NXPI': 0.25636543284943375, 'O': 0.5998327886296269, 'ODFL': 0.5733829900433229, 'OKE': 0.6037850573839021, 'OMC': 0.6037850573839021, 'ON': 0.16728737554153683, 'ORCL': 0.6036330470471992, 'ORLY': 0.5869119100098806, 'OTIS': 0.07159686858706392, 'OXY': 0.6037850573839021, 'PANW': 0.21889488485216996, 'PARA': 0.10899141141597629, 'PAYC': 0.26746218742874517, 'PAYX': 0.6035570418788477, 'PCAR': 0.6038610625522536, 'PCG': 0.6037850573839021, 'PEAK': 0.3123052367560994, 'PEG': 0.6037850573839021, 'PEP': 0.6037850573839021, 'PFE': 0.6034050315421449, 'PFG': 0.5600820855818196, 'PG': 0.6037850573839021, 'PGR': 0.603937067720605, 'PH': 0.6037090522155507, 'PHM': 0.6037850573839021, 'PKG': 0.4574751083073649, 'PLD': 0.48772516531124116, 'PM': 0.4174963897545033, 'PNC': 0.603253021205442, 'PNR': 0.532492209470244, 'PNW': 0.6037850573839021, 'PODD': 0.31823363988751235, 'POOL': 0.5369765144029794, 'PPG': 0.6037850573839021, 'PPL': 0.6037850573839021, 'PRU': 0.44554229687618757, 'PSA': 0.599528767956221, 'PSX': 0.3340427149046135, 'PTC': 0.5418408451774721, 'PWR': 0.5365204833928707, 'PXD': 0.5050543436953713, 'PYPL': 0.17389982518811278, 'QCOM': 0.6035570418788477, 'QRVO': 0.17207570114767803, 'RCL': 0.5869879151782321, 'REG': 0.5765752071140837, 'REGN': 
0.6037090522155507, 'RF': 0.47450026601808926, 'RHI': 0.6021129436801702, 'RJF': 0.6037850573839021, 'RL': 0.5327202249752984, 'RMD': 0.46378353728053506, 'ROK': 0.6034050315421449, 'ROL': 0.6044691038990652, 'ROP': 0.5194953256821464, 'ROST': 0.6037850573839021, 'RSG': 0.48772516531124116, 'RTX': 0.12898077069240707, 'RVTY': 0.012008816599528767, 'SBAC': 0.4693319145701908, 'SBUX': 0.6030250057003876, 'SCHW': 0.3449114539788706, 'SEDG': 0.16774340655164552, 'SEE': 0.6034050315421449, 'SHW': 0.6036330470471992, 'SJM': 0.7606597248612905, 'SLB': 0.6037850573839021, 'SNA': 0.6037850573839021, 'SNPS': 0.6036330470471992, 'SO': 0.6037850573839021, 'SPG': 0.5836436877707684, 'SPGI': 0.14684198525499734, 'SRE': 0.49828988371209243, 'STE': 0.5289959717260774, 'STLD': 0.5189632895036862, 'STT': 0.5790833776696815, 'STX': 0.4832408603785057, 'STZ': 0.8055027741886448, 'SWK': 0.6037850573839021, 'SWKS': 0.4114919814547389, 'SYF': 0.18020825416128297, 'SYK': 0.5056623850421829, 'SYY': 0.6034810367104964, 'T': 0.6031770160370905, 'TAP': 0.6661092954320894, 'TDG': 0.34050315421448657, 'TDY': 0.5410047883256062, 'TECH': 0.6037090522155507, 'TEL': 0.43376149578171314, 'TER': 0.6035570418788477, 'TFC': 0.3738694231207722, 'TFX': 0.6037090522155507, 'TGT': 0.5111347571634871, 'TJX': 0.6037090522155507, 'TMO': 0.6037090522155507, 'TMUS': 0.2042258873603405, 'TPR': 0.2955080945504294, 'TRGP': 0.3090370145169872, 'TRMB': 0.6036330470471992, 'TROW': 0.6034810367104964, 'TRV': 0.4701679714220567, 'TSCO': 0.5679106179220187, 'TSLA': 0.2583415672265714, 'TSN': 0.5011780801094474, 'TT': 0.3194497225811355, 'TTWO': 0.5093106331230524, 'TXN': 0.6036330470471992, 'TXT': 0.6037850573839021, 'TYL': 0.6024929695219275, 'UAL': 0.460211294368017, 'UDR': 0.6038610625522536, 'UHS': 0.6037090522155507, 'ULTA': 0.30956905069544727, 'UNH': 0.603253021205442, 'UNP': 0.6037850573839021, 'UPS': 0.4615793873983431, 'URI': 0.4978338527019837, 'USB': 0.4499505966405716, 'V': 0.510602720985027, 'VFC': 0.6037850573839021, 'VICI': 0.11309569050695448, 'VLO': 0.603253021205442, 'VLTO': 0.0047883256061412175, 'VMC': 0.6037850573839021, 'VRSK': 0.2722505130348864, 'VRSN': 0.4954016873147374, 'VRTX': 0.6034050315421449, 'VTR': 0.49106939271870487, 'VTRS': 0.059588051987535154, 'VZ': 0.4492665501254085, 'WAB': 0.5809075017101163, 'WAT': 0.5378125712548454, 'WBA': 0.1721517063160295, 'WBD': 0.21099034734361938, 'WDC': 0.6035570418788477, 'WEC': 0.6031010108687391, 'WELL': 0.09721061032150186, 'WFC': 0.6037850573839021, 'WHR': 0.6037850573839021, 'WM': 0.4629474804286691, 'WMB': 0.6037850573839021, 'WMT': 0.6036330470471992, 'WRB': 0.30158850801854525, 'WRK': 0.16257505510374706, 'WST': 0.6022649540168732, 'WTW': 0.4052595576499202, 'WY': 0.6038610625522536, 'WYNN': 0.40518355248156873, 'XEL': 0.5454130880899901, 'XOM': 0.4605153150414228, 'XRAY': 0.6036330470471992, 'XYL': 0.23257581515543058, 'YUM': 0.5017101162879076, 'ZBH': 0.1628030706088014, 'ZBRA': 0.6037090522155507, 'ZION': 0.603253021205442, 'ZTS': 0.20878619746142738}
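Should you indeed want z-transformed weights, as the comment above suggests for ML applications, scipy offers this directly; a small sketch (weights and z_weights are new, illustrative names):

weights = np.array(list(weights_per_ticker.values()))
z_weights = stats.zscore(weights)  # (w - mean(w)) / std(w)
print(f"mean ≈ {z_weights.mean():.1e}, std ≈ {z_weights.std():.2f}")  # ≈ 0 and 1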
Alternatively, we could define a start and end date ourselves and make sure to select only those stocks with a densely populated history.¶
Densely here means that the stocks should have the same number of data points. This ensures that stocks which were only recently added to the index, and hence do not contain enough data, are not selected.
selected_start_date = pd.Timestamp(2012, 1, 1)
selected_end_date = pd.Timestamp(2022, 12, 31)
df_filtered = df[
(df["date"] >= selected_start_date) & (df["date"] <= selected_end_date)
]
df_filtered.groupby(
"ticker"
).size().value_counts() # 374 out of the 500 stocks are of the desired duration
2768    374
2769     14
2767      4
2770      3
2771      3
       ...
2707      1
130       1
2772      1
2372      1
2497      1
Name: count, Length: 96, dtype: int64
If we wanted to just go for the mode directly, we could have achieved this by
mode_size = df_filtered.groupby("ticker").size().mode()[0]
mode_size
2768
And counted the number of stocks of that length using
df_common_size = (
df_filtered.groupby("ticker")
.filter(lambda x: len(x) == mode_size)
.reset_index(drop=True)
)
df_common_size["ticker"].unique().__len__()
374
If we selected only those stocks that have an equal number of days between our start and end date, we would have to reduce our universe from 500 stocks to 374. This is a significant reduction that one must be able to afford.
As an alternative, we accept the differing lengths of the histories and conduct the pattern analysis for each stock separately.
How does each stock evolve in time?¶
We limit ourselves to one year of data to see how each of the stocks in the portfolio performed relative to its starting price
def add_normalized_price(df: pd.DataFrame) -> pd.DataFrame:
    df["first_price_indicator"] = np.where(df.index == 0, 1, 0)
    df["first_price_value"] = df["first_price_indicator"] * df["close"]
    # replace zeros with NaN and forward-fill the first price
    # (replace(..., method="ffill") is deprecated in recent pandas versions)
    df["first_price_value"] = df["first_price_value"].replace(0, np.nan).ffill()
    df["normalized_price"] = df["close"] / df["first_price_value"]
    df.drop(columns=["first_price_indicator", "first_price_value"], inplace=True)
    return df
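Since the helper flags the first price via positional index 0, applying it per ticker requires resetting each group's index first. A usage sketch on the df_common_size frame from above (df_normalized is a new, illustrative name; the multi-indexed join below achieves the same result in vectorised form):

df_normalized = (
    df_common_size.groupby("ticker", group_keys=False)
    .apply(lambda g: add_normalized_price(g.reset_index(drop=True)))
    .reset_index(drop=True)
)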
# implementation using a multi-indexed data frame
result = df_common_size.set_index(["ticker", "date"]).join(
    df_common_size.groupby("ticker").first().add_prefix("first_")
)  # joining each row against its ticker's first row is the vectorised equivalent of the per-group helper above
result["normalized_price"] = result["close"] / result["first_close"]
# plotting the data. Note you can limit the number of stocks plotted via the variable "counter" as well.
selected_start_date = pd.Timestamp(2022, 1, 1)
selected_end_date = pd.Timestamp(2022, 12, 31)
df_filtered = df[
(df["date"] >= selected_start_date) & (df["date"] <= selected_end_date)
]
df_filtered
mode_size = df_filtered.groupby("ticker").size().mode()[0]
df_common_size = (
df_filtered.groupby("ticker")
.filter(lambda x: len(x) == mode_size)
.reset_index(drop=True)
)
result = df_common_size.set_index(["ticker", "date"]).join(
df_common_size.groupby("ticker").first().add_prefix("first_")
)  # joining each row against its ticker's first row, as above
result["normalized_price"] = result["close"] / result["first_close"]
plt.figure(figsize=(10, 6))
counter = 0
for ticker, data in result.groupby(level="ticker"):
plt.plot(
data.index.get_level_values("date"), data["normalized_price"], label=ticker
)
# counter += 1
# if counter == 20:
# break
# plt.legend(title='Ticker', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.title("Normalised price by date for the selected stocks")
plt.xlabel("date")
plt.ylabel("normalized price")
plt.tight_layout()
plt.show()
We see that the stock prices are not adjusted for stock splits. As we work with intraday returns, we get away without dealing with price adjustments, which, strictly speaking, is a topic in its own right. However, we do see that the investment universe evolved, seemingly randomly, and that there were constituents that performed positively, neutrally, and negatively.
Hence, selecting one stock in hindsight and evaluating its buy-and-hold performance is subject to bias. In the following, we test how the candlestick patterns perform across the investment universe, to see whether they allow a portfolio to be actively managed in a long-short fashion.
Candlestick analysis¶
We are now in a position to analyse the whole investment universe. Unfortunately, we now need to deal with a problem we have so far worked around: "big data analysis" comes with "big computational resources". Recall, we need to:
- load data into memory for 500 stocks, with a history of up to 40 years,
- make the pattern recognition logic act on them, where for every date we create up to 61 new rows, 61 being the number of candlestick patterns we are able to identify using talib.
Unfortunately, this exceeds the memory resources of a standard workstation or laptop.
In the following, we hence limit ourselves to four years of data (2019 to 2022) and outline how the analysis proceeds, but leave the consideration of a wider time interval to the interested reader who has a more powerful machine available.
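As a rough sanity check before settling on a date range, one can gauge the footprint of the currently loaded frame; the 61x multiplier below is a deliberately crude, back-of-envelope upper bound, not a precise estimate:

# in-memory size of the loaded frame (deep=True includes object columns)
mem_gib = df.memory_usage(deep=True).sum() / 1024**3
print(f"df occupies ≈ {mem_gib:.2f} GiB")
# every row can spawn up to 61 pattern rows downstream, so a crude
# worst-case bound for the pattern table is about 61 times that
print(f"worst-case pattern table ≈ {61 * mem_gib:.0f} GiB")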
# the path to the data file is the same as for the notebook discussing the single-stock case.
data_filename = "SP500_daily_data_1980_to_2023.csv.gz"
notebooks_dir = Path("./../notebooks")
data_file_path = notebooks_dir.parent / "data" / data_filename
df = process_data(
load_data(
data_file_path,
selected_start_date=pd.Timestamp(2019, 1, 1),
selected_end_date=pd.Timestamp(2022, 12, 31),
)
)
%%time
cs_signals_df = cs_pattern_recognition(df=df)
CPU times: user 4.65 s, sys: 5.84 s, total: 10.5 s Wall time: 10.5 s
%%time
performance_metrics = cs_performance(cs_signals_df)
CPU times: user 87.1 ms, sys: 0 ns, total: 87.1 ms Wall time: 86.9 ms
# plot all patterns, ranked by number of instances
plot_cs_performance(
df=performance_metrics,
criterion="total_instances",
title_suffix="across the whole data set.",
)
# plot the patterns, ranked by number of instances, whose lower confidence bound on precision exceeds 50%.
plot_cs_performance(
df=performance_metrics.query("ci_lower > 0.5").sort_values(
by="total_instances", ascending=False
),
criterion="total_instances",
title_suffix="with ci_lower > 50%.",
)
Notably, this time, with more data at hand, we indeed find instances where the lower bound of the confidence interval is greater than 50%. This tells us we are 95% confident that these patterns correctly predict the direction of the next day's intraday return. Conversely, we are now also in a position to identify counter-signals. These are instances where the upper bound of the confidence interval is below the 50% threshold. Hence, for these signals it is indicated to act in the opposite direction of what they suggest, i.e. to take them as contrarian signals.
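For reference, the ci_lower/ci_upper columns are consistent with a Wilson score interval on the precision; a minimal sketch (wilson_interval is an illustrative helper, not necessarily the exact cs_performance implementation) that reproduces the CDL3LINESTRIKE row of the table below:

def wilson_interval(tp: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (95% for z=1.96)."""
    p = tp / n  # observed precision
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - margin, center, center + margin

print(wilson_interval(tp=254, n=444))  # ≈ (0.5256, 0.5715, 0.6173)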
Let us visualise these results in the following.
performance_metrics.query("ci_lower > 0.5").sort_values(
by=["ci_lower"], ascending=False
)
candle | TP | FP | total_instances | precision | center | margin | ci_upper | ci_lower | TP_wilson
---|---|---|---|---|---|---|---|---|---
CDL3LINESTRIKE | 254 | 190 | 444 | 0.572072 | 0.571454 | 0.045829 | 0.617282 | 0.525625 | 0.571454 |
CDLINVERTEDHAMMER | 3030 | 2742 | 5772 | 0.524948 | 0.524931 | 0.012879 | 0.537810 | 0.512053 | 0.524931 |
CDLGRAVESTONEDOJI | 2958 | 2733 | 5691 | 0.519768 | 0.519755 | 0.012976 | 0.532731 | 0.506779 | 0.519755 |
CDLUNIQUE3RIVER | 409 | 348 | 757 | 0.540291 | 0.540087 | 0.035413 | 0.575500 | 0.504674 | 0.540087 |
CDLCOUNTERATTACK | 342 | 287 | 629 | 0.543720 | 0.543455 | 0.038807 | 0.582262 | 0.504647 | 0.543455 |
CDLLONGLEGGEDDOJI | 33980 | 32928 | 66908 | 0.507862 | 0.507861 | 0.003788 | 0.511649 | 0.504073 | 0.507861 |
CDLSEPARATINGLINES | 1184 | 1076 | 2260 | 0.523894 | 0.523853 | 0.020573 | 0.544426 | 0.503280 | 0.523853 |
CDLHAMMER | 5576 | 5303 | 10879 | 0.512547 | 0.512543 | 0.009391 | 0.521934 | 0.503152 | 0.512543 |
CDLRICKSHAWMAN | 25188 | 24453 | 49641 | 0.507403 | 0.507403 | 0.004398 | 0.511800 | 0.503005 | 0.507403 |
CDLMATCHINGLOW | 3900 | 3691 | 7591 | 0.513766 | 0.513759 | 0.011241 | 0.525000 | 0.502519 | 0.513759 |
CDLDOJI | 34238 | 33455 | 67693 | 0.505783 | 0.505783 | 0.003766 | 0.509549 | 0.502017 | 0.505783 |
CDLSTICKSANDWICH | 408 | 352 | 760 | 0.536842 | 0.536657 | 0.035362 | 0.572019 | 0.501295 | 0.536657 |
performance_metrics.query("ci_upper < 0.5").sort_values(
by=["ci_upper"], ascending=False
)
candle | TP | FP | total_instances | precision | center | margin | ci_upper | ci_lower | TP_wilson
---|---|---|---|---|---|---|---|---|---
CDLSPINNINGTOP | 53080 | 53849 | 106929 | 0.496404 | 0.496404 | 0.002997 | 0.499401 | 0.493408 | 0.496404 |
CDLMARUBOZU | 11173 | 11518 | 22691 | 0.492398 | 0.492399 | 0.006504 | 0.498904 | 0.485895 | 0.492399 |
CDLBELTHOLD | 36943 | 37701 | 74644 | 0.494923 | 0.494923 | 0.003587 | 0.498509 | 0.491336 | 0.494923 |
CDLSHOOTINGSTAR | 3697 | 3929 | 7626 | 0.484789 | 0.484797 | 0.011214 | 0.496011 | 0.473583 | 0.484797 |
CDLLONGLINE | 48672 | 50428 | 99100 | 0.491140 | 0.491141 | 0.003112 | 0.494253 | 0.488028 | 0.491141 |
CDLSHORTLINE | 28193 | 29360 | 57553 | 0.489862 | 0.489862 | 0.004084 | 0.493946 | 0.485778 | 0.489862 |
CDLTRISTAR | 320 | 383 | 703 | 0.455192 | 0.455436 | 0.036713 | 0.492148 | 0.418723 | 0.455436 |
CDLCLOSINGMARUBOZU | 35159 | 37383 | 72542 | 0.484671 | 0.484672 | 0.003637 | 0.488308 | 0.481035 | 0.484672 |
CDLDARKCLOUDCOVER | 1093 | 1259 | 2352 | 0.464711 | 0.464768 | 0.020140 | 0.484909 | 0.444628 | 0.464768 |
CDLPIERCING | 862 | 1008 | 1870 | 0.460963 | 0.461043 | 0.022570 | 0.483612 | 0.438473 | 0.461043 |
CDL2CROWS | 234 | 299 | 533 | 0.439024 | 0.439461 | 0.041982 | 0.481443 | 0.397479 | 0.439461 |
CDLENGULFING | 17525 | 19420 | 36945 | 0.474354 | 0.474356 | 0.005091 | 0.479448 | 0.469265 | 0.474356 |
CDLEVENINGDOJISTAR | 444 | 559 | 1003 | 0.442672 | 0.442891 | 0.030681 | 0.473572 | 0.412209 | 0.442891 |
CDL3OUTSIDE | 8330 | 9687 | 18017 | 0.462341 | 0.462349 | 0.007279 | 0.469629 | 0.455070 | 0.462349 |
CDLEVENINGSTAR | 1317 | 1601 | 2918 | 0.451337 | 0.451401 | 0.018044 | 0.469444 | 0.433357 | 0.451401 |
plot_cs_performance(
df=performance_metrics, criterion="TP_wilson", plot_performance=True
)
Based on this analysis, we can now name the signals and the contrarian signals:
performance_metrics.query("ci_lower > 0.5").index # signals
Index(['CDLDOJI', 'CDLLONGLEGGEDDOJI', 'CDLRICKSHAWMAN', 'CDLHAMMER', 'CDLMATCHINGLOW', 'CDLINVERTEDHAMMER', 'CDLGRAVESTONEDOJI', 'CDLSEPARATINGLINES', 'CDLSTICKSANDWICH', 'CDLUNIQUE3RIVER', 'CDLCOUNTERATTACK', 'CDL3LINESTRIKE'], dtype='object', name='candle')
performance_metrics.query("ci_upper < 0.5").index # anti signals
Index(['CDLSPINNINGTOP', 'CDLLONGLINE', 'CDLBELTHOLD', 'CDLCLOSINGMARUBOZU', 'CDLSHORTLINE', 'CDLENGULFING', 'CDLMARUBOZU', 'CDL3OUTSIDE', 'CDLSHOOTINGSTAR', 'CDLEVENINGSTAR', 'CDLDARKCLOUDCOVER', 'CDLPIERCING', 'CDLEVENINGDOJISTAR', 'CDLTRISTAR', 'CDL2CROWS'], dtype='object', name='candle')
If you want, you can test strategies that contain only those candlestick patterns that have proven profitable and/or those which manifested themselves as anti-signals.
You can also implement your own machine-learning logic to see whether you can improve on these results. Furthermore, you can run the analysis on a more powerful machine to see how the precision and confidence intervals change per candlestick pattern.
Applying candlestick analysis across the S&P 500 universe¶
Unfortunately, the data provider does not have OHLC data for the S&P 500 index itself. At the time of writing, an inquiry is still ongoing. In the following, we illustrate how we nonetheless obtain a synthetic performance reference, namely by computing the mean intraday return on each day across the whole universe, assuming equal weights. This should serve as an approximate solution that works with the data at hand.
# synthetic S&P 500 intraday performance
df_reference_strategy = (
df[["ticker", "date", "intraday_return"]]
.pivot_table(index="date", columns="ticker")
.mean(axis=1)
)
df_reference_strategy = df_reference_strategy.rename("intraday_return").reset_index()
df_reference_strategy["account_curve"] = (
1 + df_reference_strategy["intraday_return"]
).cumprod()
df_reference_strategy["cumsumret"] = df_reference_strategy["intraday_return"].cumsum()
df_reference_strategy.plot(x="date", y="account_curve", figsize=(8, 8))
plt.show()
compute_trading_strategy_performance(df=df_reference_strategy, verbose=True);
Annualised strategy return [%]: 0.0637
Annualised strategy standard deviation of returns [%]: 0.1624
Sharpe ratio of strategy: 0.3922
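The reported figures are consistent with the textbook annualisation of daily returns; a sketch of the presumed computation (the actual compute_trading_strategy_performance internals may differ):

ann_ret = df_reference_strategy["intraday_return"].mean() * 252  # ≈ 0.0637
ann_std = df_reference_strategy["intraday_return"].std() * np.sqrt(252)  # ≈ 0.1624
print(f"Sharpe ratio ≈ {ann_ret / ann_std:.4f}")  # ≈ 0.3922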
However, we can also opt for a method that compares the buy-and-hold approach against employing the naive candlestick strategy. Recall, the naive candlestick approach was to sum up the signals for each day across all candlestick patterns. We can then compare the Sharpe Ratios of both the buy-and-hold approach and the active candlestick approach to see which of them delivers higher risk-adjusted returns, if any. This is carried out below, where we slightly modify the single-stock method from notebook 2.
Analysing Sharpe Ratios for a passive and an active trading strategy for the S&P 500 universe¶
We now turn our focus to comparing the Sharpe Ratios of passive and active trading strategies, examining their distribution through histograms, empirical cumulative distribution functions, and box plots. Additionally, we employ specific functions for a detailed statistical analysis. Precisely, analyse_Sharpe_Ratios_for_active_and_passive_strategies() visualizes their distributions, analyze_sharpe_ratios() delves into their statistical characteristics, and compare_sharpe_ratios() statistically determines whether the active strategy's Sharpe Ratios significantly outperform those of a reference strategy.
def analyse_Sharpe_Ratios_for_active_and_passive_strategies(
    SR_buy_and_hold: np.ndarray, SR_naive_cs: np.ndarray
) -> None:
# determine the combined range of both Sharpe Ratios
all_ratios = np.concatenate((SR_buy_and_hold, SR_naive_cs))
min_edge = all_ratios.min()
max_edge = all_ratios.max()
bins = np.linspace(
min_edge, max_edge, 40
) # 40 equal-width bins across the full range
# compute empirical CDF for SR_buy_and_hold
sorted_SR_buy_and_hold = np.sort(SR_buy_and_hold)
yvals_buy_and_hold = np.arange(1, len(sorted_SR_buy_and_hold) + 1) / float(
len(sorted_SR_buy_and_hold)
)
# compute empirical CDF for SR_naive_cs
sorted_SR_naive_cs = np.sort(SR_naive_cs)
yvals_naive_cs = np.arange(1, len(sorted_SR_naive_cs) + 1) / float(
len(sorted_SR_naive_cs)
)
# plotting
fig, axs = plt.subplots(3, 1, figsize=(10, 18))
# histograms
axs[0].hist(SR_buy_and_hold, bins=bins, alpha=0.5, label="Buy and hold")
axs[0].hist(SR_naive_cs, bins=bins, alpha=0.5, label="Naive CS")
axs[0].set_title("Comparison of Sharpe Ratios: Buy and hold vs. Naive CS")
axs[0].set_xlabel("Sharpe Ratio")
axs[0].set_ylabel("Frequency")
axs[0].legend()
# empirical CDFs
axs[1].plot(
sorted_SR_buy_and_hold,
yvals_buy_and_hold,
label="Buy and hold",
marker=".",
linestyle="none",
)
axs[1].plot(
sorted_SR_naive_cs,
yvals_naive_cs,
label="Naive CS",
marker=".",
linestyle="none",
)
axs[1].set_title("Empirical CDF of Sharpe Ratios")
axs[1].set_xlabel("Sharpe Ratio")
axs[1].set_ylabel("CDF")
axs[1].legend()
axs[1].grid(True)
# box Plots
axs[2].boxplot([SR_buy_and_hold, SR_naive_cs], labels=["Buy and hold", "Naive CS"])
axs[2].set_title("Box Plot of Sharpe Ratios")
axs[2].set_ylabel("Sharpe Ratio")
plt.tight_layout()
plt.show()
return None
def analyze_sharpe_ratios(SR_buy_and_hold: np.ndarray, SR_naive_cs: np.ndarray) -> None:
print("Buy and Hold Strategy:")
print(f"Mean Sharpe Ratio: {np.mean(SR_buy_and_hold):.4f}")
print(f"Median Sharpe Ratio: {np.median(SR_buy_and_hold):.4f}")
print(f"Kurtosis: {stats.kurtosis(SR_buy_and_hold):.4f}")
print(f"Skewness: {stats.skew(SR_buy_and_hold):.4f}\n")
print("Naive Candlestick Strategy:")
print(f"Mean Sharpe Ratio: {np.mean(SR_naive_cs):.4f}")
print(f"Median Sharpe Ratio: {np.median(SR_naive_cs):.4f}")
print(f"Kurtosis: {stats.kurtosis(SR_naive_cs):.4f}")
print(f"Skewness: {stats.skew(SR_naive_cs):.4f}")
return None
def compare_sharpe_ratios(SR_buy_and_hold: np.ndarray, SR_naive_cs: np.ndarray) -> None:
t_stat, p_value = stats.ttest_ind(
SR_naive_cs, SR_buy_and_hold, alternative="greater"
)
print(f"t-statistic: {t_stat}")
print(f"p-value: {p_value}")
# Interpret the p-value
if p_value < 0.05:
print(
"The naive candlestick strategy has significantly greater Sharpe Ratios than the buy-and-hold strategy at the 5% significance level."
)
else:
print(
"There is no significant difference in Sharpe Ratios in favour of the naive candlestick strategy over the buy-and-hold strategy at the 5% significance level."
)
return None
First approach: Taking into account all candlestick patterns¶
Now, we compute the Sharpe Ratios for the active as well as the passive (reference) strategy.
%%time
# we loop through all of the tickers to create trading signals for each stock
StrategyPerformance = namedtuple(
"StrategyPerformance", ["SR_buy_and_hold", "SR_naive_cs"]
)
naive_cs_vs_buy_and_hold_performance = {}
for ticker in tickers:
df_single_stock = df[df["ticker"] == ticker]
cs_single_stock_signals_df = cs_signals_df[cs_signals_df["ticker"] == ticker]
trading_signal = (
cs_single_stock_signals_df.query("cs_pattern != 0")
.pivot_table(index="date", columns="candle", values="cs_pattern", aggfunc="sum")
.sum(axis=1)
.loc[lambda x: x != 0]
)
performance_trading_signals = (
df_single_stock[
df_single_stock["date"].isin(
[date + pd.DateOffset(days=1) for date in trading_signal.index]
)
][["date", "intraday_return"]]
.assign(account_curve=lambda x: (1 + x["intraday_return"]).cumprod())
.assign(cumsumret=lambda x: x["intraday_return"].cumsum())
.assign(time_between_signals=lambda x: x["date"].diff().dt.days)
)
(_, _, SR_buy_and_hold) = compute_trading_strategy_performance(df=df_single_stock)
(_, _, SR_naive_cs) = compute_trading_strategy_performance(
df=performance_trading_signals
)
naive_cs_vs_buy_and_hold_performance[ticker] = StrategyPerformance(
SR_buy_and_hold=SR_buy_and_hold, SR_naive_cs=SR_naive_cs
)
CPU times: user 50 s, sys: 755 µs, total: 50 s Wall time: 50 s
SR_buy_and_hold = np.array(
[
performance.SR_buy_and_hold
for performance in naive_cs_vs_buy_and_hold_performance.values()
if not np.isnan(performance.SR_buy_and_hold)
]
)
SR_naive_cs = np.array(
[
performance.SR_naive_cs
for performance in naive_cs_vs_buy_and_hold_performance.values()
if not np.isnan(performance.SR_naive_cs)
]
)
analyse_Sharpe_Ratios_for_active_and_passive_strategies(
SR_buy_and_hold=SR_buy_and_hold, SR_naive_cs=SR_naive_cs
)
analyze_sharpe_ratios(SR_buy_and_hold=SR_buy_and_hold, SR_naive_cs=SR_naive_cs)
Buy and Hold Strategy:
Mean Sharpe Ratio: 0.2293
Median Sharpe Ratio: 0.2298
Kurtosis: 1.6132
Skewness: -0.3861

Naive Candlestick Strategy:
Mean Sharpe Ratio: 0.2325
Median Sharpe Ratio: 0.2617
Kurtosis: 1.7315
Skewness: -0.2999
Notably, both the mean and the median Sharpe Ratio for the naive candlestick approach (the active trading strategy) are higher than for the passive (buy-and-hold) approach. Interestingly, the skewness is also slightly closer to zero for the active approach. Recall that skewness values close to zero suggest a symmetrical distribution of returns around the mean. For the buy-and-hold strategy, a skewness of -0.3861 indicates a skew to the left, i.e. a distribution with a fatter left tail and thus more frequent extreme negative returns than positive ones. The candlestick approach's skewness of -0.2999 also indicates a leftward skew, but to a lesser extent, suggesting a slightly more symmetric distribution of returns around the mean compared to the buy-and-hold strategy.
Unfortunately, the extra returns obtained from the candlestick approach do not come for free: they also carry more risk, as indicated by the larger kurtosis. Also note that we did not assume any transaction costs. For a trading company, a market maker, or a large bank, which are in a position to negotiate lower transaction costs than retail traders, these are less of an issue. However, an active approach involving daily transactions is disadvantageous for a retail trader, to say the least. The active candlestick strategy hence appears more suitable for risk-seeking speculators in pursuit of "high-risk-high-return" bets.
However, the outperformance in mean and median Sharpe Ratio, together with the smaller skew of the active candlestick approach, are indeed interesting observations. In order to determine whether the higher Sharpe Ratios of the candlestick approach are statistically significantly greater than those of the passive strategy, we perform a one-sided t-test.
compare_sharpe_ratios(SR_buy_and_hold, SR_naive_cs)
t-statistic: 0.1020279443073558
p-value: 0.4593775985116076
There is no significant difference in Sharpe Ratios in favour of the naive candlestick strategy over the buy-and-hold strategy at the 5% significance level.
Second approach: Filter only statistically significant candlestick patterns¶
positive_signals = performance_metrics.query("ci_lower > 0.5").index # signals
counter_signals = performance_metrics.query(
"ci_upper < 0.5"
).index # anti-signals/contrarians
%%time
StrategyPerformance = namedtuple(
"StrategyPerformance", ["SR_buy_and_hold", "SR_naive_cs"]
)
naive_cs_vs_buy_and_hold_performance = {}
for ticker in tickers:
df_single_stock = df[df["ticker"] == ticker]
cs_single_stock_signals_df = cs_signals_df[cs_signals_df["ticker"] == ticker]
# create a copy for modification
filtered_signals_df = cs_single_stock_signals_df.copy()
# apply the filter directly to this copy
filter_mask = filtered_signals_df.index.get_level_values("candle").isin(
positive_signals.union(counter_signals)
)
filtered_signals_df = filtered_signals_df.loc[filter_mask]
# adjust 'cs_pattern' by multiplying by -1 for counter signals
counter_signals_mask = filtered_signals_df.index.get_level_values("candle").isin(
counter_signals
)
filtered_signals_df.loc[counter_signals_mask, "cs_pattern"] *= -1
trading_signal = (
filtered_signals_df.query("cs_pattern != 0")
.pivot_table(index="date", columns="candle", values="cs_pattern", aggfunc="sum")
.sum(axis=1)
.loc[lambda x: x != 0]
)
performance_trading_signals = (
df_single_stock[
df_single_stock["date"].isin(
[date + pd.DateOffset(days=1) for date in trading_signal.index]
)
][["date", "intraday_return"]]
.assign(account_curve=lambda x: (1 + x["intraday_return"]).cumprod())
.assign(cumsumret=lambda x: x["intraday_return"].cumsum())
.assign(time_between_signals=lambda x: x["date"].diff().dt.days)
)
(_, _, SR_buy_and_hold) = compute_trading_strategy_performance(df=df_single_stock)
(_, _, SR_naive_cs) = compute_trading_strategy_performance(
df=performance_trading_signals
)
naive_cs_vs_buy_and_hold_performance[ticker] = StrategyPerformance(
SR_buy_and_hold=SR_buy_and_hold, SR_naive_cs=SR_naive_cs
)
CPU times: user 48.8 s, sys: 9.18 ms, total: 48.8 s Wall time: 48.9 s
SR_buy_and_hold = np.array(
[
performance.SR_buy_and_hold
for performance in naive_cs_vs_buy_and_hold_performance.values()
if not np.isnan(performance.SR_buy_and_hold)
]
)
SR_naive_cs = np.array(
[
performance.SR_naive_cs
for performance in naive_cs_vs_buy_and_hold_performance.values()
if not np.isnan(performance.SR_naive_cs)
]
)
analyze_sharpe_ratios(SR_buy_and_hold=SR_buy_and_hold, SR_naive_cs=SR_naive_cs)
Buy and Hold Strategy:
Mean Sharpe Ratio: 0.2293
Median Sharpe Ratio: 0.2298
Kurtosis: 1.6132
Skewness: -0.3861

Naive Candlestick Strategy:
Mean Sharpe Ratio: 0.2822
Median Sharpe Ratio: 0.3150
Kurtosis: 0.7447
Skewness: -0.2592
compare_sharpe_ratios(SR_buy_and_hold, SR_naive_cs)
t-statistic: 1.6637194409255192
p-value: 0.0482428136676386
The naive candlestick strategy has significantly greater Sharpe Ratios than the buy-and-hold strategy at the 5% significance level.
Recall from above that the naive candlestick strategy utilising all patterns was characterised by the following performance metrics:
Mean Sharpe Ratio: 0.2325
Median Sharpe Ratio: 0.2617
Kurtosis: 1.7315
Skewness: -0.2999
Conclusion¶
Upon filtering for the candlestick signals and contrarian signals that were found to indicate a price move on the next day at the 5% significance level, we could improve the Sharpe-Ratio-based performance statistics in all four categories examined. The full set consists of 61 candlestick patterns, whereas the filtered approach consists of 12 signals and 15 counter-signals.
The signals were identified at the 5% significance level to be:
Index(['CDLDOJI', 'CDLLONGLEGGEDDOJI', 'CDLRICKSHAWMAN', 'CDLHAMMER',
'CDLMATCHINGLOW', 'CDLINVERTEDHAMMER', 'CDLGRAVESTONEDOJI',
'CDLSEPARATINGLINES', 'CDLSTICKSANDWICH', 'CDLUNIQUE3RIVER',
'CDLCOUNTERATTACK', 'CDL3LINESTRIKE'],
dtype='object', name='candle')
The contrarian signals were identified at the 5% significance level to be:
Index(['CDLSPINNINGTOP', 'CDLLONGLINE', 'CDLBELTHOLD', 'CDLCLOSINGMARUBOZU',
'CDLSHORTLINE', 'CDLENGULFING', 'CDLMARUBOZU', 'CDL3OUTSIDE',
'CDLSHOOTINGSTAR', 'CDLEVENINGSTAR', 'CDLDARKCLOUDCOVER', 'CDLPIERCING',
'CDLEVENINGDOJISTAR', 'CDLTRISTAR', 'CDL2CROWS'],
dtype='object', name='candle')
Notably, the filtered candlestick approach outperforms the naive buy strategy in all four performance categories investigated. Moreover, a one-sided t-test revealed that the Sharpe Ratios obtained by the filtered candlestick approach are greater than those obtained by the naive buy approach at the 5% level.
In further research, one could run the very same code on a more powerful machine and simply select a longer date range when loading the data, to see whether the results reported here still hold. One could also attempt an expanding-window approach for the considered time frame to investigate how performance changes over time and whether there are stocks for which the candlestick approach works particularly well or poorly. The data considered for this analysis covered the years 2019 to 2022 for all S&P 500 components, although for some stocks there exists data dating back to the 1980s. Assuming densely populated data, this equates to an upper bound of roughly 20,000 stock-years of daily OHLC data.
It remains, however, that the level of analysis carried out and presented here required access to proprietary data and significant computing power. These could be, for example, the High Performance Computing (HPC) facilities at Imperial College, or those of a well-resourced private institution. Moreover, an active trading approach is predominantly aimed at players like large hedge funds and investment banks which still have proprietary trading teams and are in a position to negotiate low transaction costs. They should also be equipped to observe, and act on, data streams across the entire S&P 500 universe. An extension to any other index, such as the STOXX 600 or any Asian index, is easily doable using the existing code.
For fund managers with a more passive approach, the presented analysis can be interesting for optimising the entry points at which to accumulate or offload positions.
For brokers, the presented analysis is useful for crafting arrival strategies and adapting their execution logic based on an opinion of whether a stock will go up or down. In case of no signal for a particular stock, one would simply fall back to a default behaviour.
END¶
Appendix¶
ML approach¶
The author cannot run the ML approach, as there is not enough memory available to load the required history of the stocks. ML methods are inherently data-hungry, so loading just a few years of data per stock will not be enough for meaningful results. Also, we cannot mix the history of one stock with the history of another, as financial data is chronological in nature.
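Should a machine with sufficient memory be available, a leakage-free starting point would be a chronological train/test split per stock, never shuffling across time or tickers; a hypothetical sketch (chronological_split, train_df and test_df are illustrative names, and a real pipeline would build features per ticker first):

def chronological_split(g: pd.DataFrame, train_frac: float = 0.8):
    """Split one stock's history by time: earliest rows train, latest rows test."""
    g = g.sort_values("date")
    cut = int(len(g) * train_frac)
    return g.iloc[:cut], g.iloc[cut:]

train_parts, test_parts = zip(
    *(chronological_split(g) for _, g in df.groupby("ticker"))
)
train_df = pd.concat(train_parts, ignore_index=True)
test_df = pd.concat(test_parts, ignore_index=True)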