Candlestick patterns for the S&P 500¶
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from collections import Counter
from pathlib import Path
from collections import namedtuple
from scipy import stats
from BSquant import load_data
from BSquant import process_data
from BSquant import cs_pattern_recognition
from BSquant import cs_performance
from BSquant import plot_cs_performance
from BSquant import compute_trading_strategy_performance
pd.set_option("display.max_columns", None)
%load_ext autoreload
%autoreload 2
Loading data for the S&P500¶
# Define the path to your ticker file
ticker_file = "./../data/SP500_tickers_one_per_line.txt"
notebooks_dir = Path("./../notebooks")
ticker_file_path = notebooks_dir.parent / "data" / ticker_file
tickers = []
# Open the ticker file with a context manager and read each line, adding it to the list of tickers
with open(ticker_file_path, "r") as file:
for line in file:
ticker = line.strip() # strip newline characters and whitespace
tickers.append(ticker) # add the cleaned ticker to the list
print("Number of tickers (may include multiple tickers per stock) is", len(tickers))
print("Number of unique tickers is:", set(tickers).__len__())
Number of tickers (may include multiple tickers per stock) is 503
Number of unique tickers is: 503
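For reference, pathlib offers a more compact way to read such a one-ticker-per-line file; an equivalent alternative to the loop above, reusing the ticker_file_path defined earlier:

# read the whole file at once and split on line breaks, dropping empty lines
tickers = [line.strip() for line in ticker_file_path.read_text().splitlines() if line.strip()]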
# enumerated tickers in the ticker_file
for i, ticker in enumerate(tickers):
print(f"{i+1}:{ticker}")
1:MSFT 2:AAPL 3:AMZN 4:NVDA 5:GOOGL 6:META 7:GOOG 8:TSLA 9:BRK.B 10:UNH 11:LLY 12:JPM 13:XOM 14:V 15:AVGO 16:JNJ 17:PG 18:MA 19:HD 20:ADBE 21:COST 22:MRK 23:CVX 24:ABBV 25:CRM 26:PEP 27:KO 28:WMT 29:BAC 30:ACN 31:NFLX 32:MCD 33:LIN 34:CSCO 35:AMD 36:TMO 37:INTC 38:ORCL 39:ABT 40:CMCSA 41:PFE 42:DIS 43:WFC 44:VZ 45:INTU 46:DHR 47:PM 48:IBM 49:AMGN 50:QCOM 51:NOW 52:TXN 53:COP 54:UNP 55:SPGI 56:NKE 57:GE 58:BA 59:HON 60:CAT 61:AMAT 62:RTX 63:T 64:NEE 65:LOW 66:SBUX 67:ELV 68:GS 69:BKNG 70:UPS 71:ISRG 72:PLD 73:MDT 74:BLK 75:BMY 76:TJX 77:MS 78:LMT 79:SYK 80:DE 81:AXP 82:MMC 83:AMT 84:MDLZ 85:PGR 86:GILD 87:LRCX 88:ADP 89:CB 90:ADI 91:VRTX 92:SCHW 93:ETN 94:PANW 95:C 96:REGN 97:CVS 98:MU 99:SNPS 100:BSX 101:ZTS 102:BX 103:FI 104:CME 105:TMUS 106:CI 107:SO 108:EQIX 109:MO 110:KLAC 111:CDNS 112:SLB 113:EOG 114:DUK 115:BDX 116:NOC 117:ITW 118:AON 119:SHW 120:ICE 121:CL 122:CSX 123:MCK 124:PYPL 125:WM 126:TGT 127:CMG 128:APD 129:HUM 130:FDX 131:MPC 132:USB 133:ORLY 134:MCO 135:PSX 136:ROP 137:GD 138:PH 139:ANET 140:MMM 141:APH 142:PXD 143:MSI 144:ABNB 145:AJG 146:FCX 147:PNC 148:TDG 149:NXPI 150:LULU 151:TT 152:CCI 153:EMR 154:MAR 155:HCA 156:NSC 157:WELL 158:ECL 159:PCAR 160:CTAS 161:AZO 162:AIG 163:ADSK 164:NEM 165:SRE 166:MCHP 167:AFL 168:WMB 169:DXCM 170:ROST 171:VLO 172:HLT 173:CPRT 174:CARR 175:GM 176:TFC 177:COF 178:NUE 179:DLR 180:KMB 181:TRV 182:MSCI 183:EW 184:TEL 185:MNST 186:PSA 187:AEP 188:SPG 189:CHTR 190:F 191:MET 192:OKE 193:CNC 194:ADM 195:OXY 196:IQV 197:PAYX 198:CEG 199:DHI 200:HES 201:STZ 202:IDXX 203:EXC 204:O 205:D 206:A 207:BK 208:GIS 209:SYY 210:DOW 211:AMP 212:LHX 213:ALL 214:JCI 215:PCG 216:AME 217:CTSH 218:PRU 219:OTIS 220:KVUE 221:YUM 222:VRSK 223:GWW 224:ODFL 225:FIS 226:IT 227:FAST 228:FTNT 229:KMI 230:EA 231:COR 232:BIIB 233:BKR 234:CSGP 235:XEL 236:PPG 237:HAL 238:RSG 239:DD 240:URI 241:LEN 242:CTVA 243:KDP 244:CMI 245:ROK 246:PEG 247:ED 248:ACGL 249:ON 250:GPN 251:VICI 252:EL 253:KR 254:IR 255:DVN 256:DG 257:MLM 258:VMC 259:CDW 260:HSY 261:KHC 262:FANG 263:EXR 264:PWR 265:CAH 266:FICO 267:GEHC 268:SBAC 269:EFX 270:WEC 271:MPWR 272:WST 273:WTW 274:DLTR 275:MRNA 276:EIX 277:AWK 278:ANSS 279:HPQ 280:RCL 281:XYL 282:TTWO 283:CBRE 284:ZBH 285:FTV 286:KEYS 287:AVB 288:LYB 289:HIG 290:MTD 291:DAL 292:CHD 293:APTV 294:DFS 295:STT 296:WBD 297:RMD 298:WY 299:BR 300:TROW 301:TSCO 302:EBAY 303:HPE 304:GLW 305:DTE 306:ETR 307:MTB 308:ULTA 309:MOH 310:WAB 311:ES 312:HWM 313:TRGP 314:NVR 315:AEE 316:CTRA 317:STE 318:RJF 319:DOV 320:FITB 321:EQR 322:PHM 323:NTAP 324:LH 325:IFF 326:CBOE 327:INVH 328:PPL 329:FE 330:VRSN 331:TDY 332:DRI 333:NDAQ 334:EXPE 335:PTC 336:GRMN 337:IRM 338:GPC 339:STLD 340:VTR 341:BAX 342:CNP 343:EXPD 344:FLT 345:CLX 346:EG 347:BRO 348:AKAM 349:HOLX 350:BALL 351:FDS 352:TYL 353:ARE 354:LVS 355:VLTO 356:ATO 357:FSLR 358:COO 359:WAT 360:CMS 361:BG 362:PFG 363:NTRS 364:MKC 365:AXON 366:CINF 367:HBAN 368:ILMN 369:HUBB 370:J 371:OMC 372:AVY 373:RF 374:SWKS 375:WDC 376:MRO 377:DGX 378:ALGN 379:STX 380:LUV 381:TXT 382:PKG 383:IEX 384:JBHT 385:CCL 386:EPAM 387:WRB 388:LDOS 389:SNA 390:CF 391:EQT 392:MAA 393:LW 394:WBA 395:ALB 396:TER 397:SWK 398:AMCR 399:K 400:DPZ 401:CE 402:BBY 403:ENPH 404:ESS 405:MAS 406:POOL 407:SYF 408:CAG 409:TSN 410:PODD 411:L 412:UAL 413:CFG 414:IP 415:LNT 416:NDSN 417:HST 418:GEN 419:ZBRA 420:MOS 421:LYV 422:LKQ 423:IPG 424:KIM 425:EVRG 426:KEY 427:JKHY 428:TRMB 429:AES 430:TAP 431:ROL 432:SJM 433:MGM 434:APA 435:RVTY 436:VTRS 437:NRG 438:GL 439:TFX 440:BF.B 441:PNR 442:CDAY 443:WRK 444:NI 445:REG 446:FFIV 447:UDR 448:KMX 449:INCY 450:CRL 
451:EMN 452:TECH 453:CPT 454:CHRW 455:PEAK 456:CZR 457:QRVO 458:HII 459:AOS 460:ETSY 461:ALLE 462:JNPR 463:MTCH 464:MKTX 465:AIZ 466:PAYC 467:HRL 468:RHI 469:HSIC 470:UHS 471:PNW 472:NWSA 473:WYNN 474:AAL 475:BXP 476:BWA 477:CPB 478:FOXA 479:BBWI 480:TPR 481:GNRC 482:BEN 483:CTLT 484:FRT 485:PARA 486:FMC 487:XRAY 488:IVZ 489:BIO 490:NCLH 491:CMA 492:WHR 493:HAS 494:VFC 495:DVA 496:ZION 497:RL 498:SEE 499:ALK 500:MHK 501:SEDG 502:FOX 503:NWS
Remove rows with missing data¶
data_filename = "SP500_daily_data_1980_to_2023.csv.gz"
notebooks_dir = Path("./../notebooks")
data_file_path = notebooks_dir.parent / "data" / data_filename
print(data_file_path)
../data/SP500_daily_data_1980_to_2023.csv.gz
%time
# feed the output of load_data directly into process_data; load all the data we have
df = process_data(load_data(file_path=data_file_path, compression="gzip"))
CPU times: user 1 µs, sys: 1 µs, total: 2 µs Wall time: 4.53 µs
df.shape
(3169549, 13)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3169549 entries, 0 to 3169548
Data columns (total 13 columns):
 #   Column                Dtype
---  ------                -----
 0   ticker                object
 1   date                  datetime64[ns]
 2   prc                   float64
 3   vol                   int64
 4   close                 float64
 5   low                   float64
 6   high                  float64
 7   open                  float64
 8   price_vol             float64
 9   intraday_return       float64
 10  sign_intraday_return  int64
 11  next_intraday_return  float64
 12  sign_next_day_return  Int64
dtypes: Int64(1), datetime64[ns](1), float64(8), int64(2), object(1)
memory usage: 317.4+ MB
df.head()
  | ticker | date | prc | vol | close | low | high | open | price_vol | intraday_return | sign_intraday_return | next_intraday_return | sign_next_day_return
---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | AMT | 1992-06-15 | 7.375 | 26700 | 7.375 | 7.375 | 7.500 | 7.375 | 196912.5 | 0.000000 | 0 | 0.000000 | 0 |
1 | AMT | 1992-06-16 | 7.375 | 6300 | 7.375 | 7.375 | 7.500 | 7.375 | 46462.5 | 0.000000 | 0 | -0.016949 | -1 |
2 | AMT | 1992-06-17 | 7.250 | 16400 | 7.250 | 7.250 | 7.375 | 7.375 | 118900.0 | -0.016949 | -1 | 0.000000 | 0 |
3 | AMT | 1992-06-18 | 7.250 | 6900 | 7.250 | 7.250 | 7.375 | 7.250 | 50025.0 | 0.000000 | 0 | 0.016949 | 1 |
4 | AMT | 1992-06-19 | 7.500 | 13700 | 7.500 | 7.250 | 7.500 | 7.375 | 102750.0 | 0.016949 | 1 | 0.000000 | 0 |
df["ticker"].unique().__len__()
501
We have data for 501 stocks, although our ticker file contains 503 tickers. We obtained the tickers from a public source, and their ticker formatting might differ from CRSP's. This is not significant for our context; however, it is worthwhile making sure the metadata of tradable securities is updated regularly.
Let us now compute how many days of data we have per stock.¶
Companies can be added to and removed from the S&P 500 index. Some startups that were not previously listed might prosper and develop into companies large enough to be included in the index, while others may be outcompeted, cease to exist, be acquired, split up, or taken private, and hence be excluded and/or delisted. Let us investigate how many days of data we have for each stock. While doing so, we take an interesting detour related to Python performance:
Python is primarily considered an interpreted language. Python code is executed by an interpreter, which reads the code at runtime and executes it line by line. This process is different from compiled languages, where the source code is transformed into machine code or bytecode before execution, typically resulting in an executable file. However, at a more detailed level, Python code is indeed compiled under the hood. More precisely, when Python code is executed, it is compiled into bytecode, which is a lower-level, platform-independent representation of the source code. This bytecode is then interpreted by the Python Virtual Machine (PVM); unlike in a purely compiled language such as C or C++, however, it is not turned into a standalone executable file. This process is automatic and transparent to the user, making Python feel like a purely interpreted language. Tools and third-party packages do exist that can package Python programs along with an interpreter into a standalone executable, but this is an additional step beyond Python's standard behavior.
The important point is that executing bytecode through the PVM imposes an overhead which costs time. Hence, Python is considered "slow". However, you can use C and C++ code within Python to leverage performance benefits. This is a common practice for computationally heavy tasks where the execution speed of Python is a bottleneck. Integrating C or C++ code into Python can significantly improve the performance of certain operations, especially those that are CPU-bound, such as numerical computations, data processing, and more. This, however, requires more detailed knowledge of CPython's internals, is not straightforward, and is a topic for another repository.
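As a small taste of what such an integration looks like, the standard-library ctypes module can call into an already-compiled C library directly; a minimal sketch, assuming a typical Linux or macOS system where the C math library can be located:

import ctypes
import ctypes.util

# locate and load the C math library, then declare sqrt's C signature
libm = ctypes.CDLL(ctypes.util.find_library("m"))
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]

print(libm.sqrt(2.0))  # 1.4142135623730951, computed in compiled C code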
However, that does not mean we cannot speed up our code. In particular, we can make use of libraries that are written, at least partially, in C and are available in Python, such as numpy. As pandas makes use of numpy, it is often possible to enjoy better performance, especially when computing in-memory as we do with pandas. Thus, for the sake of performance, it is generally good advice to "write high-level code thinking low-level", and the following is meant to demonstrate this.

To compute the number of days of data we have for each of the S&P 500 members, a straightforward (but slow) method is to loop through each ticker, filter the data frame for that ticker, and count the number of rows. This is executed below.
%%time
days_per_ticker = {}
for ticker in tickers:
    days_per_ticker[ticker] = df.query("ticker == @ticker").shape[
        0
    ]  # takes about 23 seconds in total (see the timing below)
    # days_per_ticker[ticker] = df[df['ticker'] == ticker].shape[0]  # the boolean-mask variant takes about 59.6 seconds, i.e. roughly three times slower still.
CPU times: user 23 s, sys: 0 ns, total: 23 s Wall time: 23 s
As you can see, this step took approximately 23 seconds on the machine this code was executed on. Making use of the native pandas .groupby() method, which is implemented in C, and storing the results in a dictionary, achieves the same task in about 115 ms, i.e. the computation is roughly 200 times, or two orders of magnitude, faster.
%%time
days_per_ticker = df.groupby("ticker").size().to_dict();
CPU times: user 115 ms, sys: 0 ns, total: 115 ms Wall time: 115 ms
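For reference, an equivalent one-liner uses value_counts(), which additionally sorts the result by count in descending order:

# same dictionary as above, built from the sorted per-ticker counts
days_per_ticker = df["ticker"].value_counts().to_dict()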
We now investigate how the length of each stock's history (in days) is distributed¶
plt.figure(figsize=(7, 7))
plt.hist(list(days_per_ticker.values()), bins=30)
plt.show()
# Counter objects are a part of the collections module in Python's standard library.
# They are specialized dictionary subclasses designed to count hashable objects.
# A Counter is a collection where elements are stored as dictionary keys and their counts are stored
# as dictionary values.
Counter(list(days_per_ticker.values())).most_common(3)[
0
] # Counter(list(days_per_ticker.values())).most_common(3)[0][0] then extracts the history length that occurs most often.
# 80 stocks contain 7944 days of data
(7944, 80)
How are the stocks weighted with respect to the one with the longest history in the portfolio?¶
max_days_ticker = max(
days_per_ticker, key=days_per_ticker.get
) # find the ticker with the maximum number of days
max_days = days_per_ticker[
max_days_ticker
] # retrieve the value (number of days) for this ticker
print(
f"The ticker with the maximum number of days is: {max_days_ticker}, with {max_days} days."
)
max_days = max(days_per_ticker.values()) # find the maximum number of days
weights_per_ticker = {
ticker: days / max_days for ticker, days in days_per_ticker.items()
} # Calculate the weight for each ticker, z-transform the weights should you wish to use them in ML applications
weights_per_ticker
The ticker with the maximum number of days is: LEN, with 13157 days.
{'A': 0.5693547161206962, 'AAL': 0.2824352055939804, 'AAPL': 0.603253021205442, 'ABBV': 0.2103823059968078, 'ABNB': 0.05837196929391199, 'ABT': 0.6037850573839021, 'ACGL': 0.4512426845025462, 'ACN': 0.5907121684274531, 'ADBE': 0.6034810367104964, 'ADI': 0.6037850573839021, 'ADM': 0.6037850573839021, 'ADP': 0.5999087937979782, 'ADSK': 0.5292999923994831, 'AEE': 0.555749790985787, 'AEP': 0.6037850573839021, 'AES': 0.5204073877023637, 'AFL': 0.6037850573839021, 'AIG': 0.6037850573839021, 'AIZ': 0.5872159306832865, 'AJG': 0.6027969901953333, 'AKAM': 0.46218742874515467, 'ALB': 0.5702667781409135, 'ALGN': 0.4382458007144486, 'ALK': 0.6037850573839021, 'ALL': 0.5988447214410579, 'ALLE': 0.34057915938283806, 'AMAT': 0.6037090522155507, 'AMCR': 0.2427605077145246, 'AMD': 0.6036330470471992, 'AME': 0.6033290263737934, 'AMGN': 0.6036330470471992, 'AMP': 0.4795166071292848, 'AMT': 0.5665425248916927, 'AMZN': 0.5093106331230524, 'ANET': 0.29429201185680626, 'ANSS': 0.5261837804970738, 'AON': 0.26936231663753135, 'AOS': 0.5493653568442655, 'APA': 0.6037090522155507, 'APD': 0.6037850573839021, 'APH': 0.6034810367104964, 'APTV': 0.1912290035722429, 'ARE': 0.5086265866078893, 'ATO': 0.6036330470471992, 'AVB': 0.4891692635099187, 'AVGO': 0.3213498517899217, 'AVY': 0.6037090522155507, 'AWK': 0.5027741886448278, 'AXON': 0.1264726001368093, 'AXP': 0.6037090522155507, 'AZO': 0.6037850573839021, 'BA': 0.6037850573839021, 'BAC': 0.6037850573839021, 'BALL': 0.03139013452914798, 'BAX': 0.6037850573839021, 'BBWI': 0.04613513718932887, 'BBY': 0.6038610625522536, 'BDX': 0.6037850573839021, 'BEN': 0.6040890780573079, 'BG': 0.5619822147906057, 'BIIB': 0.38458615185832634, 'BIO': 0.7769248308885004, 'BK': 0.6038610625522536, 'BKNG': 0.30584479744622634, 'BKR': 0.487193129132781, 'BLK': 0.5967165767272175, 'BMY': 0.6037850573839021, 'BR': 0.5843277342859314, 'BRO': 0.4841529223987231, 'BSX': 0.6037850573839021, 'BWA': 0.5812875275518735, 'BX': 0.31678954168883483, 'BXP': 0.5078665349243748, 'C': 0.6027209850269818, 'CAG': 0.6037850573839021, 'CAH': 0.560842137265334, 'CARR': 0.07159686858706392, 'CAT': 0.6037850573839021, 'CB': 0.6036330470471992, 'CBOE': 0.2591776240784373, 'CBRE': 0.11066352511970814, 'CCI': 0.5576499201945733, 'CCL': 0.6037090522155507, 'CDAY': 0.1086873907425705, 'CDNS': 0.3474956297028198, 'CDW': 0.20110967545793115, 'CE': 0.5166071292847914, 'CEG': 0.2825112107623318, 'CF': 0.5511894808847002, 'CFG': 0.17732005776392795, 'CHD': 0.6037090522155507, 'CHRW': 0.501102074941096, 'CHTR': 0.4561830204453903, 'CI': 0.6037090522155507, 'CINF': 0.6035570418788477, 'CL': 0.6037850573839021, 'CLX': 0.6037850573839021, 'CMA': 0.6037090522155507, 'CMCSA': 0.6016569126700616, 'CME': 0.5359124420460591, 'CMG': 0.40404347495629706, 'CMI': 0.5437409743862582, 'CMS': 0.6037850573839021, 'CNC': 0.5809835068784678, 'CNP': 0.7176407995743711, 'COF': 0.5570418788477617, 'COO': 0.6033290263737934, 'COP': 0.5590940183932508, 'COR': 0.4849129740822376, 'COST': 0.5403967469787946, 'CPB': 0.6038610625522536, 'CPRT': 0.568974690278939, 'CPT': 0.5825036102454967, 'CRL': 0.5372805350763852, 'CRM': 0.5099186744698639, 'CSCO': 0.6034810367104964, 'CSGP': 0.466823744014593, 'CSX': 0.6037850573839021, 'CTAS': 0.6037090522155507, 'CTLT': 0.18020825416128297, 'CTRA': 0.27399863190696966, 'CTSH': 0.4882572014897013, 'CTVA': 0.08770996427757087, 'CVS': 0.5210154290491753, 'CVX': 0.42517291175799954, 'CZR': 0.2549973398191077, 'D': 0.6037850573839021, 'DAL': 0.5741430417268374, 'DD': 0.5704187884776165, 'DE': 0.6037850573839021, 'DFS': 
0.5542296876187581, 'DG': 0.5080185452610777, 'DGX': 0.515695067264574, 'DHI': 0.5365204833928707, 'DHR': 0.6037850573839021, 'DIS': 0.6037850573839021, 'DLR': 0.36672493729573613, 'DLTR': 0.5512654860530516, 'DOV': 0.6037850573839021, 'DOW': 0.5736870107167288, 'DPZ': 0.3725773352587976, 'DRI': 0.547085201793722, 'DTE': 0.6037850573839021, 'DUK': 0.6037090522155507, 'DVA': 0.44409819867751005, 'DVN': 0.6037850573839021, 'DXCM': 0.3579843429353196, 'EA': 0.3443794178004104, 'EBAY': 0.5080945504294292, 'ECL': 0.6037090522155507, 'ED': 0.6037850573839021, 'EFX': 0.6037850573839021, 'EG': 0.009272630538876643, 'EIX': 0.53386030250057, 'EL': 0.5378125712548454, 'ELV': 0.21631070912822073, 'EMN': 0.573915026221783, 'EMR': 0.6037850573839021, 'ENPH': 0.22474728281523143, 'EOG': 0.6037090522155507, 'EPAM': 0.22748346887588355, 'EQIX': 0.44706240024321653, 'EQR': 0.5814395378885764, 'EQT': 0.6037850573839021, 'ES': 0.2752907197689443, 'ESS': 0.5657064680398267, 'ETN': 0.6037850573839021, 'ETR': 0.6037090522155507, 'ETSY': 0.16667933419472525, 'EVRG': 0.13544121000228015, 'EW': 0.45405487573154973, 'EXC': 0.5735350003800258, 'EXPD': 0.6027969901953333, 'EXPE': 0.4234247928859162, 'EXR': 0.3709052215550657, 'F': 0.6041650832256593, 'FANG': 0.2148666109295432, 'FAST': 0.6036330470471992, 'FCX': 0.733753895264878, 'FDS': 0.5253477236452079, 'FDX': 0.6037850573839021, 'FE': 0.4998859922474728, 'FFIV': 0.4698639507486509, 'FI': 0.27460667325378124, 'FICO': 0.27491069392718703, 'FIS': 0.3427073040966786, 'FITB': 0.6034810367104964, 'FLT': 0.48042866914950216, 'FMC': 0.6037090522155507, 'FOX': 0.4114159762863875, 'FOXA': 0.20095766512122826, 'FRT': 0.6038610625522536, 'FSLR': 0.3273542600896861, 'FTNT': 0.26997035798434293, 'FTV': 0.14334574751083073, 'GD': 0.6038610625522536, 'GE': 0.6037090522155507, 'GEHC': 0.018925286919510526, 'GEN': 0.35775632743026525, 'GILD': 0.603253021205442, 'GIS': 0.6037850573839021, 'GL': 0.217374781485141, 'GLW': 0.6038610625522536, 'GM': 0.5752831192521092, 'GNRC': 0.2656380633883104, 'GOOG': 0.37037318537660563, 'GOOGL': 0.1864406779661017, 'GPC': 0.6037850573839021, 'GPN': 0.4380937903777457, 'GRMN': 0.48962529452002734, 'GS': 0.47168807478908564, 'GWW': 0.6037850573839021, 'HAL': 0.6037090522155507, 'HAS': 0.6037850573839021, 'HBAN': 0.6035570418788477, 'HCA': 0.4008512578855362, 'HD': 0.603253021205442, 'HES': 0.3376149578171316, 'HIG': 0.5362164627194649, 'HII': 0.3636087253933267, 'HLT': 0.48658508778596943, 'HOLX': 0.6035570418788477, 'HON': 0.6036330470471992, 'HPE': 0.15611461579387398, 'HPQ': 0.41438017785209397, 'HRL': 0.6037090522155507, 'HSIC': 0.5384966177700083, 'HST': 0.4100478832560614, 'HSY': 0.6037850573839021, 'HUBB': 0.15330242456487042, 'HUM': 0.6037850573839021, 'HWM': 0.12077221251045071, 'IBM': 0.6034050315421449, 'ICE': 0.35243596564566393, 'IDXX': 0.6037090522155507, 'IEX': 0.603253021205442, 'IFF': 0.6037090522155507, 'ILMN': 0.44782245192673104, 'INCY': 0.5349243748574903, 'INTC': 0.6034810367104964, 'INTU': 0.589344075397127, 'INVH': 0.13224899293151934, 'IP': 0.6036330470471992, 'IPG': 0.6037850573839021, 'IQV': 0.11704795926122977, 'IR': 0.6035570418788477, 'IRM': 0.47214410579919436, 'ISRG': 0.4503306224823288, 'IT': 0.6427757087481949, 'ITW': 0.6037850573839021, 'IVZ': 0.3177016037090522, 'J': 0.26700615641863645, 'JBHT': 0.6037090522155507, 'JCI': 0.6037850573839021, 'JKHY': 0.6037090522155507, 'JNJ': 0.6037850573839021, 'JNPR': 0.4687238732233792, 'JPM': 0.6037850573839021, 'K': 0.6037850573839021, 'KDP': 0.10481112715664666, 'KEY': 
0.41460819335714827, 'KEYS': 0.39097058599984796, 'KHC': 0.1624990499353956, 'KIM': 0.6038610625522536, 'KLAC': 0.6034810367104964, 'KMB': 0.6037090522155507, 'KMI': 0.39233867903017405, 'KMX': 0.5146309949076537, 'KO': 0.6037850573839021, 'KR': 0.6037850573839021, 'KVUE': 0.012616857946340352, 'L': 0.5079425400927263, 'LDOS': 0.19624534468343846, 'LEN': 1.0, 'LH': 0.5485292999923995, 'LHX': 0.08618986091054191, 'LIN': 0.3017405183552482, 'LKQ': 0.2220110967545793, 'LLY': 0.6037090522155507, 'LMT': 0.5511134757163487, 'LNT': 0.491601428897165, 'LOW': 0.6035570418788477, 'LRCX': 0.6036330470471992, 'LULU': 0.31428137113323706, 'LUV': 0.6033290263737934, 'LVS': 0.3642927719084898, 'LW': 0.136429277190849, 'LYB': 0.2556053811659193, 'LYV': 0.3446834384738162, 'MA': 0.41726837424944896, 'MAA': 0.5723949228547541, 'MAR': 0.5887360340503154, 'MAS': 0.6037850573839021, 'MCD': 0.6037850573839021, 'MCHP': 0.5887360340503154, 'MCK': 0.6021889488485217, 'MCO': 0.4636315269438322, 'MDLZ': 0.2150186212662461, 'MDT': 0.6037090522155507, 'MET': 0.4840009120620202, 'META': 0.2255073344987459, 'MGM': 0.4985178992171468, 'MHK': 0.49798586303868664, 'MKC': 0.8810519115299841, 'MKTX': 0.36634491145397885, 'MLM': 0.5704947936459679, 'MMC': 0.6037850573839021, 'MMM': 0.6037090522155507, 'MNST': 0.334802766588128, 'MO': 0.6035570418788477, 'MOH': 0.39218666869347113, 'MOS': 0.37174127840693166, 'MPC': 0.23903625446530363, 'MPWR': 0.40753971270046363, 'MRK': 0.6035570418788477, 'MRNA': 0.16667933419472525, 'MRO': 0.6036330470471992, 'MS': 0.583567682602417, 'MSCI': 0.25811355172151706, 'MSFT': 0.6034810367104964, 'MSI': 0.5175951964733602, 'MTB': 0.5007220490993387, 'MTCH': 0.2575055103747055, 'MTD': 0.49958197157406703, 'MU': 0.6037090522155507, 'NCLH': 0.20947024397659042, 'NDAQ': 0.3613285703427833, 'NDSN': 0.6037090522155507, 'NEE': 0.25872159306832865, 'NEM': 0.6037850573839021, 'NFLX': 0.41324010032682224, 'NI': 0.6037850573839021, 'NKE': 0.6037850573839021, 'NOC': 0.6037850573839021, 'NOW': 0.36345671505662386, 'NRG': 0.4765524055635783, 'NSC': 0.6037850573839021, 'NTAP': 0.5372805350763852, 'NTRS': 0.6036330470471992, 'NUE': 0.6049251349091739, 'NVDA': 0.4768564262369841, 'NVR': 0.6002128144713841, 'NWS': 0.681994375617542, 'NWSA': 0.2870715208634187, 'NXPI': 0.25636543284943375, 'O': 0.5998327886296269, 'ODFL': 0.5733829900433229, 'OKE': 0.6037850573839021, 'OMC': 0.6037850573839021, 'ON': 0.16728737554153683, 'ORCL': 0.6036330470471992, 'ORLY': 0.5869119100098806, 'OTIS': 0.07159686858706392, 'OXY': 0.6037850573839021, 'PANW': 0.21889488485216996, 'PARA': 0.10899141141597629, 'PAYC': 0.26746218742874517, 'PAYX': 0.6035570418788477, 'PCAR': 0.6038610625522536, 'PCG': 0.6037850573839021, 'PEAK': 0.3123052367560994, 'PEG': 0.6037850573839021, 'PEP': 0.6037850573839021, 'PFE': 0.6034050315421449, 'PFG': 0.5600820855818196, 'PG': 0.6037850573839021, 'PGR': 0.603937067720605, 'PH': 0.6037090522155507, 'PHM': 0.6037850573839021, 'PKG': 0.4574751083073649, 'PLD': 0.48772516531124116, 'PM': 0.4174963897545033, 'PNC': 0.603253021205442, 'PNR': 0.532492209470244, 'PNW': 0.6037850573839021, 'PODD': 0.31823363988751235, 'POOL': 0.5369765144029794, 'PPG': 0.6037850573839021, 'PPL': 0.6037850573839021, 'PRU': 0.44554229687618757, 'PSA': 0.599528767956221, 'PSX': 0.3340427149046135, 'PTC': 0.5418408451774721, 'PWR': 0.5365204833928707, 'PXD': 0.5050543436953713, 'PYPL': 0.17389982518811278, 'QCOM': 0.6035570418788477, 'QRVO': 0.17207570114767803, 'RCL': 0.5869879151782321, 'REG': 0.5765752071140837, 'REGN': 
0.6037090522155507, 'RF': 0.47450026601808926, 'RHI': 0.6021129436801702, 'RJF': 0.6037850573839021, 'RL': 0.5327202249752984, 'RMD': 0.46378353728053506, 'ROK': 0.6034050315421449, 'ROL': 0.6044691038990652, 'ROP': 0.5194953256821464, 'ROST': 0.6037850573839021, 'RSG': 0.48772516531124116, 'RTX': 0.12898077069240707, 'RVTY': 0.012008816599528767, 'SBAC': 0.4693319145701908, 'SBUX': 0.6030250057003876, 'SCHW': 0.3449114539788706, 'SEDG': 0.16774340655164552, 'SEE': 0.6034050315421449, 'SHW': 0.6036330470471992, 'SJM': 0.7606597248612905, 'SLB': 0.6037850573839021, 'SNA': 0.6037850573839021, 'SNPS': 0.6036330470471992, 'SO': 0.6037850573839021, 'SPG': 0.5836436877707684, 'SPGI': 0.14684198525499734, 'SRE': 0.49828988371209243, 'STE': 0.5289959717260774, 'STLD': 0.5189632895036862, 'STT': 0.5790833776696815, 'STX': 0.4832408603785057, 'STZ': 0.8055027741886448, 'SWK': 0.6037850573839021, 'SWKS': 0.4114919814547389, 'SYF': 0.18020825416128297, 'SYK': 0.5056623850421829, 'SYY': 0.6034810367104964, 'T': 0.6031770160370905, 'TAP': 0.6661092954320894, 'TDG': 0.34050315421448657, 'TDY': 0.5410047883256062, 'TECH': 0.6037090522155507, 'TEL': 0.43376149578171314, 'TER': 0.6035570418788477, 'TFC': 0.3738694231207722, 'TFX': 0.6037090522155507, 'TGT': 0.5111347571634871, 'TJX': 0.6037090522155507, 'TMO': 0.6037090522155507, 'TMUS': 0.2042258873603405, 'TPR': 0.2955080945504294, 'TRGP': 0.3090370145169872, 'TRMB': 0.6036330470471992, 'TROW': 0.6034810367104964, 'TRV': 0.4701679714220567, 'TSCO': 0.5679106179220187, 'TSLA': 0.2583415672265714, 'TSN': 0.5011780801094474, 'TT': 0.3194497225811355, 'TTWO': 0.5093106331230524, 'TXN': 0.6036330470471992, 'TXT': 0.6037850573839021, 'TYL': 0.6024929695219275, 'UAL': 0.460211294368017, 'UDR': 0.6038610625522536, 'UHS': 0.6037090522155507, 'ULTA': 0.30956905069544727, 'UNH': 0.603253021205442, 'UNP': 0.6037850573839021, 'UPS': 0.4615793873983431, 'URI': 0.4978338527019837, 'USB': 0.4499505966405716, 'V': 0.510602720985027, 'VFC': 0.6037850573839021, 'VICI': 0.11309569050695448, 'VLO': 0.603253021205442, 'VLTO': 0.0047883256061412175, 'VMC': 0.6037850573839021, 'VRSK': 0.2722505130348864, 'VRSN': 0.4954016873147374, 'VRTX': 0.6034050315421449, 'VTR': 0.49106939271870487, 'VTRS': 0.059588051987535154, 'VZ': 0.4492665501254085, 'WAB': 0.5809075017101163, 'WAT': 0.5378125712548454, 'WBA': 0.1721517063160295, 'WBD': 0.21099034734361938, 'WDC': 0.6035570418788477, 'WEC': 0.6031010108687391, 'WELL': 0.09721061032150186, 'WFC': 0.6037850573839021, 'WHR': 0.6037850573839021, 'WM': 0.4629474804286691, 'WMB': 0.6037850573839021, 'WMT': 0.6036330470471992, 'WRB': 0.30158850801854525, 'WRK': 0.16257505510374706, 'WST': 0.6022649540168732, 'WTW': 0.4052595576499202, 'WY': 0.6038610625522536, 'WYNN': 0.40518355248156873, 'XEL': 0.5454130880899901, 'XOM': 0.4605153150414228, 'XRAY': 0.6036330470471992, 'XYL': 0.23257581515543058, 'YUM': 0.5017101162879076, 'ZBH': 0.1628030706088014, 'ZBRA': 0.6037090522155507, 'ZION': 0.603253021205442, 'ZTS': 0.20878619746142738}
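Should you indeed want z-transformed weights, as the comment above suggests for ML applications, scipy offers this directly; a small sketch (weights and z_weights are new, illustrative names):

weights = np.array(list(weights_per_ticker.values()))
z_weights = stats.zscore(weights)  # (w - mean(w)) / std(w)
print(f"mean ≈ {z_weights.mean():.1e}, std ≈ {z_weights.std():.2f}")  # ≈ 0 and 1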
Alternatively, we could define a start and end date ourselves and make sure to select only those stocks with a densely populated history.¶
Densely here means that the stocks should have the same number of data points. This ensures that stocks which were only recently added to the index, and hence do not contain enough data, are not selected.
selected_start_date = pd.Timestamp(2012, 1, 1)
selected_end_date = pd.Timestamp(2022, 12, 31)
df_filtered = df[
(df["date"] >= selected_start_date) & (df["date"] <= selected_end_date)
]
df_filtered.groupby(
"ticker"
).size().value_counts() # 374 out of the 500 stocks are of the desired duration
2768    374
2769     14
2767      4
2770      3
2771      3
       ...
2707      1
130       1
2772      1
2372      1
2497      1
Name: count, Length: 96, dtype: int64
If we wanted to just go for the mode directly, we could have achieved this by
mode_size = df_filtered.groupby("ticker").size().mode()[0]
mode_size
2768
And counted the number of stocks of that length using
df_common_size = (
df_filtered.groupby("ticker")
.filter(lambda x: len(x) == mode_size)
.reset_index(drop=True)
)
df_common_size["ticker"].unique().__len__()
374
If we selected only those stocks that have an equal number of days between our start and end date, we would have to reduce our universe from 500 stocks to 374. This is a significant reduction that one must be able to afford.
As an alternative, we accept the differing lengths of the histories and conduct the pattern analysis for each stock separately.
How does each stock evolve in time?¶
We limit ourselves to one year of data to see how each of the stocks in the portfolio performed relative to its starting price
def add_normalized_price(df: pd.DataFrame) -> pd.DataFrame:
    df["first_price_indicator"] = np.where(df.index == 0, 1, 0)
    df["first_price_value"] = df["first_price_indicator"] * df["close"]
    # replace zeros with NaN and forward-fill the first price
    # (replace(..., method="ffill") is deprecated in recent pandas versions)
    df["first_price_value"] = df["first_price_value"].replace(0, np.nan).ffill()
    df["normalized_price"] = df["close"] / df["first_price_value"]
    df.drop(columns=["first_price_indicator", "first_price_value"], inplace=True)
    return df
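Since the helper flags the first price via positional index 0, applying it per ticker requires resetting each group's index first. A usage sketch on the df_common_size frame from above (df_normalized is a new, illustrative name; the multi-indexed join below achieves the same result in vectorised form):

df_normalized = (
    df_common_size.groupby("ticker", group_keys=False)
    .apply(lambda g: add_normalized_price(g.reset_index(drop=True)))
    .reset_index(drop=True)
)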
# implementation using a multi-indexed data frame
result = df_common_size.set_index(["ticker", "date"]).join(
    df_common_size.groupby("ticker").first().add_prefix("first_")
)  # joining each row against its ticker's first row is the vectorised equivalent of the per-group helper above
result["normalized_price"] = result["close"] / result["first_close"]
# plotting the data. Note you can limit the number of stocks plotted via the variable "counter" as well.
selected_start_date = pd.Timestamp(2022, 1, 1)
selected_end_date = pd.Timestamp(2022, 12, 31)
df_filtered = df[
(df["date"] >= selected_start_date) & (df["date"] <= selected_end_date)
]
df_filtered
mode_size = df_filtered.groupby("ticker").size().mode()[0]
df_common_size = (
df_filtered.groupby("ticker")
.filter(lambda x: len(x) == mode_size)
.reset_index(drop=True)
)
result = df_common_size.set_index(["ticker", "date"]).join(
df_common_size.groupby("ticker").first().add_prefix("first_")
)  # joining each row against its ticker's first row, as above
result["normalized_price"] = result["close"] / result["first_close"]
plt.figure(figsize=(10, 6))
counter = 0
for ticker, data in result.groupby(level="ticker"):
plt.plot(
data.index.get_level_values("date"), data["normalized_price"], label=ticker
)
# counter += 1
# if counter == 20:
# break
# plt.legend(title='Ticker', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.title("Normalised price by date for the selected stocks")
plt.xlabel("date")
plt.ylabel("normalized price")
plt.tight_layout()
plt.show()
We see that the stock prices are not adjusted for stock splits. As we work with intraday returns, we get away without dealing with price adjustments, which, strictly speaking, is a topic in its own right. However, we do see that the investment universe evolved, seemingly randomly, and that there were constituents that performed positively, neutrally, and negatively.
Hence, selecting one stock in hindsight and evaluating its buy-and-hold performance is subject to bias. In the following, we test how the candlestick patterns perform across the investment universe, to see whether they allow a portfolio to be actively managed in a long-short fashion.
Candlestick analysis¶
We are now in a position to analyse the whole investment universe. Unfortunately, we now need to deal with a problem we have so far worked around: "big data analysis" comes with "big computational resources". Recall, we need to:
- load data into memory for 500 stocks, with a history of up to 40 years,
- make the pattern recognition logic act on them, where for every date we create up to 61 new rows, 61 being the number of candlestick patterns we are able to identify using talib.
Unfortunately, this exceeds the memory resources of a standard workstation or laptop.
In the following, we hence limit ourselves to four years of data (2019 to 2022) and outline how the analysis proceeds, but leave the consideration of a wider time interval to the interested reader who has a more powerful machine available.
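As a rough sanity check before settling on a date range, one can gauge the footprint of the currently loaded frame; the 61x multiplier below is a deliberately crude, back-of-envelope upper bound, not a precise estimate:

# in-memory size of the loaded frame (deep=True includes object columns)
mem_gib = df.memory_usage(deep=True).sum() / 1024**3
print(f"df occupies ≈ {mem_gib:.2f} GiB")
# every row can spawn up to 61 pattern rows downstream, so a crude
# worst-case bound for the pattern table is about 61 times that
print(f"worst-case pattern table ≈ {61 * mem_gib:.0f} GiB")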
# the path to the data file is the same as for the notebook discussing the single-stock case.
data_filename = "SP500_daily_data_1980_to_2023.csv.gz"
notebooks_dir = Path("./../notebooks")
data_file_path = notebooks_dir.parent / "data" / data_filename
df = process_data(
load_data(
data_file_path,
selected_start_date=pd.Timestamp(2019, 1, 1),
selected_end_date=pd.Timestamp(2022, 12, 31),
)
)
%%time
cs_signals_df = cs_pattern_recognition(df=df)
CPU times: user 4.65 s, sys: 5.84 s, total: 10.5 s Wall time: 10.5 s
%%time
performance_metrics = cs_performance(cs_signals_df)
CPU times: user 87.1 ms, sys: 0 ns, total: 87.1 ms Wall time: 86.9 ms
# plot all patterns, ranked by number of instances
plot_cs_performance(
df=performance_metrics,
criterion="total_instances",
title_suffix="across the whole data set.",
)
# plot the patterns, ranked by number of instances, whose lower confidence bound on precision exceeds 50%.
plot_cs_performance(
df=performance_metrics.query("ci_lower > 0.5").sort_values(
by="total_instances", ascending=False
),
criterion="total_instances",
title_suffix="with ci_lower > 50%.",
)
Notably, this time, with more data at hand, we indeed find instances where the lower bound of the confidence interval is greater than 50%. This tells us we are 95% confident that these patterns correctly predict the direction of the next day's intraday return. Conversely, we are now also in a position to identify counter-signals. These are instances where the upper bound of the confidence interval is below the 50% threshold. Hence, for these signals it is indicated to act in the opposite direction of what they suggest, i.e. to take them as contrarian signals.
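For reference, the ci_lower/ci_upper columns are consistent with a Wilson score interval on the precision; a minimal sketch (wilson_interval is an illustrative helper, not necessarily the exact cs_performance implementation) that reproduces the CDL3LINESTRIKE row of the table below:

def wilson_interval(tp: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (95% for z=1.96)."""
    p = tp / n  # observed precision
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - margin, center, center + margin

print(wilson_interval(tp=254, n=444))  # ≈ (0.5256, 0.5715, 0.6173)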
Let us visualise these results in the following.
performance_metrics.query("ci_lower > 0.5").sort_values(
by=["ci_lower"], ascending=False
)
candle | TP | FP | total_instances | precision | center | margin | ci_upper | ci_lower | TP_wilson
---|---|---|---|---|---|---|---|---|---
CDL3LINESTRIKE | 254 | 190 | 444 | 0.572072 | 0.571454 | 0.045829 | 0.617282 | 0.525625 | 0.571454 |
CDLINVERTEDHAMMER | 3030 | 2742 | 5772 | 0.524948 | 0.524931 | 0.012879 | 0.537810 | 0.512053 | 0.524931 |
CDLGRAVESTONEDOJI | 2958 | 2733 | 5691 | 0.519768 | 0.519755 | 0.012976 | 0.532731 | 0.506779 | 0.519755 |
CDLUNIQUE3RIVER | 409 | 348 | 757 | 0.540291 | 0.540087 | 0.035413 | 0.575500 | 0.504674 | 0.540087 |
CDLCOUNTERATTACK | 342 | 287 | 629 | 0.543720 | 0.543455 | 0.038807 | 0.582262 | 0.504647 | 0.543455 |
CDLLONGLEGGEDDOJI | 33980 | 32928 | 66908 | 0.507862 | 0.507861 | 0.003788 | 0.511649 | 0.504073 | 0.507861 |
CDLSEPARATINGLINES | 1184 | 1076 | 2260 | 0.523894 | 0.523853 | 0.020573 | 0.544426 | 0.503280 | 0.523853 |
CDLHAMMER | 5576 | 5303 | 10879 | 0.512547 | 0.512543 | 0.009391 | 0.521934 | 0.503152 | 0.512543 |
CDLRICKSHAWMAN | 25188 | 24453 | 49641 | 0.507403 | 0.507403 | 0.004398 | 0.511800 | 0.503005 | 0.507403 |
CDLMATCHINGLOW | 3900 | 3691 | 7591 | 0.513766 | 0.513759 | 0.011241 | 0.525000 | 0.502519 | 0.513759 |
CDLDOJI | 34238 | 33455 | 67693 | 0.505783 | 0.505783 | 0.003766 | 0.509549 | 0.502017 | 0.505783 |
CDLSTICKSANDWICH | 408 | 352 | 760 | 0.536842 | 0.536657 | 0.035362 | 0.572019 | 0.501295 | 0.536657 |
performance_metrics.query("ci_upper < 0.5").sort_values(
by=["ci_upper"], ascending=False
)
candle | TP | FP | total_instances | precision | center | margin | ci_upper | ci_lower | TP_wilson
---|---|---|---|---|---|---|---|---|---
CDLSPINNINGTOP | 53080 | 53849 | 106929 | 0.496404 | 0.496404 | 0.002997 | 0.499401 | 0.493408 | 0.496404 |
CDLMARUBOZU | 11173 | 11518 | 22691 | 0.492398 | 0.492399 | 0.006504 | 0.498904 | 0.485895 | 0.492399 |
CDLBELTHOLD | 36943 | 37701 | 74644 | 0.494923 | 0.494923 | 0.003587 | 0.498509 | 0.491336 | 0.494923 |
CDLSHOOTINGSTAR | 3697 | 3929 | 7626 | 0.484789 | 0.484797 | 0.011214 | 0.496011 | 0.473583 | 0.484797 |
CDLLONGLINE | 48672 | 50428 | 99100 | 0.491140 | 0.491141 | 0.003112 | 0.494253 | 0.488028 | 0.491141 |
CDLSHORTLINE | 28193 | 29360 | 57553 | 0.489862 | 0.489862 | 0.004084 | 0.493946 | 0.485778 | 0.489862 |
CDLTRISTAR | 320 | 383 | 703 | 0.455192 | 0.455436 | 0.036713 | 0.492148 | 0.418723 | 0.455436 |
CDLCLOSINGMARUBOZU | 35159 | 37383 | 72542 | 0.484671 | 0.484672 | 0.003637 | 0.488308 | 0.481035 | 0.484672 |
CDLDARKCLOUDCOVER | 1093 | 1259 | 2352 | 0.464711 | 0.464768 | 0.020140 | 0.484909 | 0.444628 | 0.464768 |
CDLPIERCING | 862 | 1008 | 1870 | 0.460963 | 0.461043 | 0.022570 | 0.483612 | 0.438473 | 0.461043 |
CDL2CROWS | 234 | 299 | 533 | 0.439024 | 0.439461 | 0.041982 | 0.481443 | 0.397479 | 0.439461 |
CDLENGULFING | 17525 | 19420 | 36945 | 0.474354 | 0.474356 | 0.005091 | 0.479448 | 0.469265 | 0.474356 |
CDLEVENINGDOJISTAR | 444 | 559 | 1003 | 0.442672 | 0.442891 | 0.030681 | 0.473572 | 0.412209 | 0.442891 |
CDL3OUTSIDE | 8330 | 9687 | 18017 | 0.462341 | 0.462349 | 0.007279 | 0.469629 | 0.455070 | 0.462349 |
CDLEVENINGSTAR | 1317 | 1601 | 2918 | 0.451337 | 0.451401 | 0.018044 | 0.469444 | 0.433357 | 0.451401 |
plot_cs_performance(
df=performance_metrics, criterion="TP_wilson", plot_performance=True
)
Based on this analysis, we can now name the signals and the contrarian signals:
performance_metrics.query("ci_lower > 0.5").index # signals
Index(['CDLDOJI', 'CDLLONGLEGGEDDOJI', 'CDLRICKSHAWMAN', 'CDLHAMMER', 'CDLMATCHINGLOW', 'CDLINVERTEDHAMMER', 'CDLGRAVESTONEDOJI', 'CDLSEPARATINGLINES', 'CDLSTICKSANDWICH', 'CDLUNIQUE3RIVER', 'CDLCOUNTERATTACK', 'CDL3LINESTRIKE'], dtype='object', name='candle')
performance_metrics.query("ci_upper < 0.5").index # anti signals
Index(['CDLSPINNINGTOP', 'CDLLONGLINE', 'CDLBELTHOLD', 'CDLCLOSINGMARUBOZU', 'CDLSHORTLINE', 'CDLENGULFING', 'CDLMARUBOZU', 'CDL3OUTSIDE', 'CDLSHOOTINGSTAR', 'CDLEVENINGSTAR', 'CDLDARKCLOUDCOVER', 'CDLPIERCING', 'CDLEVENINGDOJISTAR', 'CDLTRISTAR', 'CDL2CROWS'], dtype='object', name='candle')
If you want, you can test strategies that contain only those candlestick patterns that have proven profitable and/or those which manifested themselves as anti-signals.
You can also implement your own machine-learning logic to see whether you can improve on these results. Furthermore, you can run the analysis on a more powerful machine to see how the precision and confidence intervals change per candlestick pattern.
Applying candlestick analysis across the S&P 500 universe¶
Unfortunately, the data provider does not have OHLC data for the S&P 500 index itself. At the time of writing, an inquiry is still ongoing. In the following, we illustrate how we nonetheless obtain a synthetic performance reference, namely by computing the mean intraday return on each day across the whole universe, assuming equal weights. This should serve as an approximate solution that works with the data at hand.
# synthetic S&P 500 intraday performance
df_reference_strategy = (
df[["ticker", "date", "intraday_return"]]
.pivot_table(index="date", columns="ticker")
.mean(axis=1)
)
df_reference_strategy = df_reference_strategy.rename("intraday_return").reset_index()
df_reference_strategy["account_curve"] = (
1 + df_reference_strategy["intraday_return"]
).cumprod()
df_reference_strategy["cumsumret"] = df_reference_strategy["intraday_return"].cumsum()
df_reference_strategy.plot(x="date", y="account_curve", figsize=(8, 8))
plt.show()
compute_trading_strategy_performance(df=df_reference_strategy, verbose=True);
Annualised strategy return [%]: 0.0637
Annualised strategy standard deviation of returns [%]: 0.1624
Sharpe ratio of strategy: 0.3922
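The reported figures are consistent with the textbook annualisation of daily returns; a sketch of the presumed computation (the actual compute_trading_strategy_performance internals may differ):

ann_ret = df_reference_strategy["intraday_return"].mean() * 252  # ≈ 0.0637
ann_std = df_reference_strategy["intraday_return"].std() * np.sqrt(252)  # ≈ 0.1624
print(f"Sharpe ratio ≈ {ann_ret / ann_std:.4f}")  # ≈ 0.3922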
However, we can also opt for a method that compares the buy-and-hold approach against employing the naive candlestick strategy. Recall, the naive candlestick approach was to sum up the signals for each day across all candlestick patterns. We can then compare the Sharpe Ratios of both the buy-and-hold approach and the active candlestick approach to see which of them delivers higher risk-adjusted returns, if any. This is carried out below, where we slightly modify the single-stock method from notebook 2.
Analysing Sharpe Ratios for a passive and an active trading strategy for the S&P 500 universe¶
We now turn our focus to comparing the Sharpe Ratios of passive and active trading strategies, examining their distribution through histograms, empirical cumulative distribution functions, and box plots. Additionally, we employ specific functions for a detailed statistical analysis. Precisely, analyse_Sharpe_Ratios_for_active_and_passive_strategies() visualizes their distributions, analyze_sharpe_ratios() delves into their statistical characteristics, and compare_sharpe_ratios() statistically determines whether the active strategy's Sharpe Ratios significantly outperform those of a reference strategy.
def analyse_Sharpe_Ratios_for_active_and_passive_strategies(
    SR_buy_and_hold: np.ndarray, SR_naive_cs: np.ndarray
) -> None:
# determine the combined range of both Sharpe Ratios
all_ratios = np.concatenate((SR_buy_and_hold, SR_naive_cs))
min_edge = all_ratios.min()
max_edge = all_ratios.max()
bins = np.linspace(
min_edge, max_edge, 40
) # 40 equal-width bins across the full range
# compute empirical CDF for SR_buy_and_hold
sorted_SR_buy_and_hold = np.sort(SR_buy_and_hold)
yvals_buy_and_hold = np.arange(1, len(sorted_SR_buy_and_hold) + 1) / float(
len(sorted_SR_buy_and_hold)
)
# compute empirical CDF for SR_naive_cs
sorted_SR_naive_cs = np.sort(SR_naive_cs)
yvals_naive_cs = np.arange(1, len(sorted_SR_naive_cs) + 1) / float(
len(sorted_SR_naive_cs)
)
# plotting
fig, axs = plt.subplots(3, 1, figsize=(10, 18))
# histograms
axs[0].hist(SR_buy_and_hold, bins=bins, alpha=0.5, label="Buy and hold")
axs[0].hist(SR_naive_cs, bins=bins, alpha=0.5, label="Naive CS")
axs[0].set_title("Comparison of Sharpe Ratios: Buy and hold vs. Naive CS")
axs[0].set_xlabel("Sharpe Ratio")
axs[0].set_ylabel("Frequency")
axs[0].legend()
# empirical CDFs
axs[1].plot(
sorted_SR_buy_and_hold,
yvals_buy_and_hold,
label="Buy and hold",
marker=".",
linestyle="none",
)
axs[1].plot(
sorted_SR_naive_cs,
yvals_naive_cs,
label="Naive CS",
marker=".",
linestyle="none",
)
axs[1].set_title("Empirical CDF of Sharpe Ratios")
axs[1].set_xlabel("Sharpe Ratio")
axs[1].set_ylabel("CDF")
axs[1].legend()
axs[1].grid(True)
# box Plots
axs[2].boxplot([SR_buy_and_hold, SR_naive_cs], labels=["Buy and hold", "Naive CS"])
axs[2].set_title("Box Plot of Sharpe Ratios")
axs[2].set_ylabel("Sharpe Ratio")
plt.tight_layout()
plt.show()
return None
def analyze_sharpe_ratios(SR_buy_and_hold: np.ndarray, SR_naive_cs: np.ndarray) -> None:
print("Buy and Hold Strategy:")
print(f"Mean Sharpe Ratio: {np.mean(SR_buy_and_hold):.4f}")
print(f"Median Sharpe Ratio: {np.median(SR_buy_and_hold):.4f}")
print(f"Kurtosis: {stats.kurtosis(SR_buy_and_hold):.4f}")
print(f"Skewness: {stats.skew(SR_buy_and_hold):.4f}\n")
print("Naive Candlestick Strategy:")
print(f"Mean Sharpe Ratio: {np.mean(SR_naive_cs):.4f}")
print(f"Median Sharpe Ratio: {np.median(SR_naive_cs):.4f}")
print(f"Kurtosis: {stats.kurtosis(SR_naive_cs):.4f}")
print(f"Skewness: {stats.skew(SR_naive_cs):.4f}")
return None
def compare_sharpe_ratios(SR_buy_and_hold: np.ndarray, SR_naive_cs: np.ndarray) -> None:
t_stat, p_value = stats.ttest_ind(
SR_naive_cs, SR_buy_and_hold, alternative="greater"
)
print(f"t-statistic: {t_stat}")
print(f"p-value: {p_value}")
# Interpret the p-value
if p_value < 0.05:
print(
"The naive candlestick strategy has significantly greater Sharpe Ratios than the buy-and-hold strategy at the 5% significance level."
)
else:
print(
"There is no significant difference in Sharpe Ratios in favour of the naive candlestick strategy over the buy-and-hold strategy at the 5% significance level."
)
return None
First approach: Taking into account all candlestick patterns¶
Now, we compute the Sharpe Ratios for the active as well as the passive (reference) strategy.
%%time
# we loop through all of the tickers to create trading signals for each stock
StrategyPerformance = namedtuple(
"StrategyPerformance", ["SR_buy_and_hold", "SR_naive_cs"]
)
naive_cs_vs_buy_and_hold_performance = {}
for ticker in tickers:
df_single_stock = df[df["ticker"] == ticker]
cs_single_stock_signals_df = cs_signals_df[cs_signals_df["ticker"] == ticker]
trading_signal = (
cs_single_stock_signals_df.query("cs_pattern != 0")
.pivot_table(index="date", columns="candle", values="cs_pattern", aggfunc="sum")
.sum(axis=1)
.loc[lambda x: x != 0]
)
performance_trading_signals = (
df_single_stock[
df_single_stock["date"].isin(
[date + pd.DateOffset(days=1) for date in trading_signal.index]
)
][["date", "intraday_return"]]
.assign(account_curve=lambda x: (1 + x["intraday_return"]).cumprod())
.assign(cumsumret=lambda x: x["intraday_return"].cumsum())
.assign(time_between_signals=lambda x: x["date"].diff().dt.days)
)
(_, _, SR_buy_and_hold) = compute_trading_strategy_performance(df=df_single_stock)
(_, _, SR_naive_cs) = compute_trading_strategy_performance(
df=performance_trading_signals
)
naive_cs_vs_buy_and_hold_performance[ticker] = StrategyPerformance(
SR_buy_and_hold=SR_buy_and_hold, SR_naive_cs=SR_naive_cs
)
CPU times: user 50 s, sys: 755 µs, total: 50 s Wall time: 50 s
SR_buy_and_hold = np.array(
[
performance.SR_buy_and_hold
for performance in naive_cs_vs_buy_and_hold_performance.values()
if not np.isnan(performance.SR_buy_and_hold)
]
)
SR_naive_cs = np.array(
[
performance.SR_naive_cs
for performance in naive_cs_vs_buy_and_hold_performance.values()
if not np.isnan(performance.SR_naive_cs)
]
)
analyse_Sharpe_Ratios_for_active_and_passive_strategies(
SR_buy_and_hold=SR_buy_and_hold, SR_naive_cs=SR_naive_cs
)
analyze_sharpe_ratios(SR_buy_and_hold=SR_buy_and_hold, SR_naive_cs=SR_naive_cs)
Buy and Hold Strategy:
Mean Sharpe Ratio: 0.2293
Median Sharpe Ratio: 0.2298
Kurtosis: 1.6132
Skewness: -0.3861

Naive Candlestick Strategy:
Mean Sharpe Ratio: 0.2325
Median Sharpe Ratio: 0.2617
Kurtosis: 1.7315
Skewness: -0.2999
Notably, both the mean and the median Sharpe Ratio for the naive candlestick approach (the active trading strategy) are higher than for the passive (buy-and-hold) approach. Interestingly, the skewness is also slightly closer to zero for the active approach. Recall that skewness values close to zero suggest a symmetrical distribution of returns around the mean. For the buy-and-hold strategy, a skewness of -0.3861 indicates a skew to the left, i.e. a distribution with a fatter left tail and thus more frequent extreme negative returns than positive ones. The candlestick approach's skewness of -0.2999 also indicates a leftward skew, but to a lesser extent, suggesting a slightly more symmetric distribution of returns around the mean compared to the buy-and-hold strategy.
Unfortunately, the extra returns obtained from the candlestick approach do not come for free: they also carry more risk, as indicated by the larger kurtosis. Also note that we did not assume any transaction costs. For a trading company, a market maker, or a large bank, which are in a position to negotiate lower transaction costs than retail traders, these are less of an issue. However, an active approach involving daily transactions is disadvantageous for a retail trader, to say the least. The active candlestick strategy hence appears more suitable for risk-seeking speculators in pursuit of "high-risk-high-return" bets.
However, the outperformance in mean and median Sharpe Ratio, together with the smaller skew of the active candlestick approach, are indeed interesting observations. In order to determine whether the higher Sharpe Ratios of the candlestick approach are statistically significantly greater than those of the passive strategy, we perform a one-sided t-test.
compare_sharpe_ratios(SR_buy_and_hold, SR_naive_cs)
t-statistic: 0.1020279443073558
p-value: 0.4593775985116076
There is no significant difference in Sharpe Ratios in favour of the naive candlestick strategy over the buy-and-hold strategy at the 5% significance level.
Second approach: Filter only statistically significant candlestick patterns¶
positive_signals = performance_metrics.query("ci_lower > 0.5").index # signals
counter_signals = performance_metrics.query(
"ci_upper < 0.5"
).index # anti-signals/contrarians
%%time
StrategyPerformance = namedtuple(
"StrategyPerformance", ["SR_buy_and_hold", "SR_naive_cs"]
)
naive_cs_vs_buy_and_hold_performance = {}
for ticker in tickers:
df_single_stock = df[df["ticker"] == ticker]
cs_single_stock_signals_df = cs_signals_df[cs_signals_df["ticker"] == ticker]
# create a copy for modification
filtered_signals_df = cs_single_stock_signals_df.copy()
# apply the filter directly to this copy
filter_mask = filtered_signals_df.index.get_level_values("candle").isin(
positive_signals.union(counter_signals)
)
filtered_signals_df = filtered_signals_df.loc[filter_mask]
# adjust 'cs_pattern' by multiplying by -1 for counter signals
counter_signals_mask = filtered_signals_df.index.get_level_values("candle").isin(
counter_signals
)
filtered_signals_df.loc[counter_signals_mask, "cs_pattern"] *= -1
trading_signal = (
filtered_signals_df.query("cs_pattern != 0")
.pivot_table(index="date", columns="candle", values="cs_pattern", aggfunc="sum")
.sum(axis=1)
.loc[lambda x: x != 0]
)
performance_trading_signals = (
df_single_stock[
df_single_stock["date"].isin(
[date + pd.DateOffset(days=1) for date in trading_signal.index]
)
][["date", "intraday_return"]]
.assign(account_curve=lambda x: (1 + x["intraday_return"]).cumprod())
.assign(cumsumret=lambda x: x["intraday_return"].cumsum())
.assign(time_between_signals=lambda x: x["date"].diff().dt.days)
)
(_, _, SR_buy_and_hold) = compute_trading_strategy_performance(df=df_single_stock)
(_, _, SR_naive_cs) = compute_trading_strategy_performance(
df=performance_trading_signals
)
naive_cs_vs_buy_and_hold_performance[ticker] = StrategyPerformance(
SR_buy_and_hold=SR_buy_and_hold, SR_naive_cs=SR_naive_cs
)
CPU times: user 48.8 s, sys: 9.18 ms, total: 48.8 s Wall time: 48.9 s
SR_buy_and_hold = np.array(
[
performance.SR_buy_and_hold
for performance in naive_cs_vs_buy_and_hold_performance.values()
if not np.isnan(performance.SR_buy_and_hold)
]
)
SR_naive_cs = np.array(
[
performance.SR_naive_cs
for performance in naive_cs_vs_buy_and_hold_performance.values()
if not np.isnan(performance.SR_naive_cs)
]
)
analyze_sharpe_ratios(SR_buy_and_hold=SR_buy_and_hold, SR_naive_cs=SR_naive_cs)
Buy and Hold Strategy:
Mean Sharpe Ratio: 0.2293
Median Sharpe Ratio: 0.2298
Kurtosis: 1.6132
Skewness: -0.3861

Naive Candlestick Strategy:
Mean Sharpe Ratio: 0.2822
Median Sharpe Ratio: 0.3150
Kurtosis: 0.7447
Skewness: -0.2592
compare_sharpe_ratios(SR_buy_and_hold, SR_naive_cs)
t-statistic: 1.6637194409255192
p-value: 0.0482428136676386
The naive candlestick strategy has significantly greater Sharpe Ratios than the buy-and-hold strategy at the 5% significance level.
Recall from above that the naive candlestick strategy utilising all patterns was characterised by the following performance metrics:
Mean Sharpe Ratio: 0.2325
Median Sharpe Ratio: 0.2617
Kurtosis: 1.7315
Skewness: -0.2999
Conclusion¶
Upon filtering for the candlestick signals and contrarian signals that were found to indicate a price move on the next day at the 5% significance level, we could improve the Sharpe-Ratio-based performance statistics in all four categories examined. The full set consists of 61 candlestick patterns, whereas the filtered approach consists of 12 signals and 15 counter-signals.
The signals were identified at the 5% significance level to be:
Index(['CDLDOJI', 'CDLLONGLEGGEDDOJI', 'CDLRICKSHAWMAN', 'CDLHAMMER',
'CDLMATCHINGLOW', 'CDLINVERTEDHAMMER', 'CDLGRAVESTONEDOJI',
'CDLSEPARATINGLINES', 'CDLSTICKSANDWICH', 'CDLUNIQUE3RIVER',
'CDLCOUNTERATTACK', 'CDL3LINESTRIKE'],
dtype='object', name='candle')
The contrarian signals were identified at the 5% significance level to be:
Index(['CDLSPINNINGTOP', 'CDLLONGLINE', 'CDLBELTHOLD', 'CDLCLOSINGMARUBOZU',
'CDLSHORTLINE', 'CDLENGULFING', 'CDLMARUBOZU', 'CDL3OUTSIDE',
'CDLSHOOTINGSTAR', 'CDLEVENINGSTAR', 'CDLDARKCLOUDCOVER', 'CDLPIERCING',
'CDLEVENINGDOJISTAR', 'CDLTRISTAR', 'CDL2CROWS'],
dtype='object', name='candle')
Notably, the filtered candlestick approach outperforms the naive buy strategy in all four performance categories investigated. Moreover, a one-sided t-test revealed that the Sharpe Ratios obtained by the filtered candlestick approach are greater than those obtained by the naive buy approach at the 5% level.
In further research, one could run the very same code on a more powerful machine and simply select a longer date range when loading the data, to see whether the results reported here still hold. One could also attempt an expanding-window approach for the considered time frame to investigate how performance changes over time and whether there are stocks for which the candlestick approach works particularly well or poorly. The data considered for this analysis covered the years 2019 to 2022 for all S&P 500 components, although for some stocks there exists data dating back to the 1980s. Assuming densely populated data, this equates to an upper bound of roughly 20,000 stock-years of daily OHLC data.
It remains, however, that the level of analysis carried out and presented here required access to proprietary data and significant computing power. These could be, for example, the High Performance Computing (HPC) facilities at Imperial College, or those of a well-resourced private institution. Moreover, an active trading approach is predominantly aimed at players like large hedge funds and investment banks which still have proprietary trading teams and are in a position to negotiate low transaction costs. They should also be equipped to observe, and act on, data streams across the entire S&P 500 universe. An extension to any other index, such as the STOXX 600 or any Asian index, is easily doable using the existing code.
For fund managers with a more passive approach, the presented analysis can be interesting for optimising the entry points at which to accumulate or offload positions.
For brokers, the presented analysis is useful for crafting arrival strategies and adapting their execution logic based on an opinion of whether a stock will go up or down. In case of no signal for a particular stock, one would simply fall back to a default behaviour.
END¶
Appendix¶
ML approach¶
The author cannot run the ML approach, as there is not enough memory available to load the required history of the stocks. ML methods are inherently data-hungry, so loading just a few years of data per stock will not be enough for meaningful results. Also, we cannot mix the history of one stock with the history of another, as financial data is chronological in nature.
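Should a machine with sufficient memory be available, a leakage-free starting point would be a chronological train/test split per stock, never shuffling across time or tickers; a hypothetical sketch (chronological_split, train_df and test_df are illustrative names, and a real pipeline would build features per ticker first):

def chronological_split(g: pd.DataFrame, train_frac: float = 0.8):
    """Split one stock's history by time: earliest rows train, latest rows test."""
    g = g.sort_values("date")
    cut = int(len(g) * train_frac)
    return g.iloc[:cut], g.iloc[cut:]

train_parts, test_parts = zip(
    *(chronological_split(g) for _, g in df.groupby("ticker"))
)
train_df = pd.concat(train_parts, ignore_index=True)
test_df = pd.concat(test_parts, ignore_index=True)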