Author: Carlos Salas Najera and Chris Kantos
This article examines the importance of Natural Language Processing (NLP) and its significant potential to enhance investment strategies, with particular focus on comparing classifiers applied to conference call transcripts using multiple NLP approaches: the Loughran-McDonald sentiment dictionary, the FinBERT model and Alexandria Technology's ML Ensemble.
The NLP Revolution within the Investments Industry
NLP has shaken the foundations of the investment industry, especially over the last five years. NLP can summarise and normalise structured and unstructured data from varied sources, helping analysts efficiently evaluate investment ideas ex-ante (signal generation) and ex-post (performance attribution). As a result, investment decision-makers can focus on the value-added parts of the analytical process, and stop misallocating time on non-productive tasks such as collecting data or attending long conference calls without a clear purpose.
NLP adoption has been slow compared to other industries such as e-commerce. Even so, prospects for NLP vendors are bullish, with the global NLP market expected to grow to $35.1 billion by 2026 at a +20.3% CAGR (source: MarketsAndMarkets). Despite being late adopters, investment management firms have started to react over the last decade, with NLP initiatives gaining acceptance and more than half of investment managers planning to increase their NLP capabilities in the short term (source: Deloitte).
The organisational challenges of integrating an NLP strategy should not be underestimated. Perhaps the most important parts of the puzzle are skilled in-house talent and an adequate corporate culture. An organisation must not only be able to hire good talent with state-of-the-art NLP skills, but also ensure regular training is available so that its NLP employees are fully aware of the latest breakthroughs. More importantly, the corporate commitment to an NLP strategy must be governed by a technologically-driven mindset from the very top of the management chain to the newest employee of the firm. Organisations whose leaders do not exhibit conviction in their technological assets during their decision-making process are destined to fail in the implementation of their NLP and ML (Machine Learning) strategies.
Past Research on NLP applied to Corporate Documents: pre-BERT Era
The body of research on NLP applied to investments has grown dramatically over the last five years with the ascent of Machine Learning, greater computing power and game-changing models such as BERT. By contrast, NLP-related papers from the early part of the century are scarce and suffer from a myriad of issues such as small samples and short-period bias. Having said this, a number of interesting papers have been published since the turn of the century, including:
- NLP signals are more statistically significant and efficient for stocks with a remarkable growth-style bias, such as technology stocks. Moreover, NLP seems to be more effective at triggering alpha-generating signals for non-dividend-paying stocks. This suggests NLP signals could be more effective when dealing with "bad quality" stocks, i.e. companies with a weaker cash flow generation profile or no dividend discipline.
- The analysts’ Q&A session of a conference call runs longer for badly performing stocks. Moreover, firms with low text complexity (low Fog Index) have more persistent positive earnings, whereas “bad” companies use higher vocabulary complexity to hide poor performance. In fact, using NLP to identify a CFO’s use of deceptive language can generate significant negative alpha, although CEO sentiment signals seem to be more reliable when testing long and short signals.
- The annual report “MD&A” section contains the most useful forward-looking statements for NLP. Not only is this section relevant for predicting stock returns, but when used alongside conference call transcripts it can predict accruals, traditionally an indicator of good or bad corporate accounting quality.
- The tone of a conference call’s Q&A has significant predictive ability for post-earnings-announcement drift, with tone changes from the prior to the present call yielding significant alpha insights. It is well documented that a strong divergence of tone between analysts and the management team during the Q&A session is a bad omen for a stock, i.e. it conveys higher uncertainty due to the discrepancy of opinions. Lastly, a negative Q&A tone takes longer to be priced by the market due to behavioural biases inherent in investors (e.g. anchoring and conservatism), but it is eventually priced more significantly in magnitude than a positive tone.
- Analysts identified in a conference call exhibit superior forecasting ability. Better yet, closed conference calls seem to improve analysts’ ability to forecast future earnings accurately.
- NLP research also supports evidence of short-sellers reaping significant alpha and absolute returns by targeting firms that simultaneously exhibit a high earnings surprise, abnormally high management tone, short-term myopic vocabulary, and a management team blaming external factors for poor operating performance.
- Combining NLP signals from different documents and sources has proven predictive, with low correlation between signals. Authors have also found that significant wording changes in annual and quarterly report filings trigger a statistically significant short signal for a stock.
Evolution of NLP Word Embedding Methods
The number of breakthroughs over the last decade in the NLP space has been a gamechanger. The perfect storm of increasing available computing power, big data access, Machine Learning and researchers’ creativity has produced important advances in NLP research, particularly in the area of word embedding methods. Word embedding methods can be defined as processes whereby text is transformed into numerical data that can be more easily digested as features by machine learning models. The next figure is a neat summary of the natural evolution of word embedding methods, from BOW in the early days to BERT today.
Bag-of-words (BOW) is the simplest word embedding model. It simply counts the frequency of each word in each document and represents documents as word-count vectors. The output of this process is a document-term matrix (DTM) that can be used to compare documents or to calculate similarity scores from their token content to classify documents. That said, BOW suffers from several pitfalls, such as focusing exclusively on intra-document word frequency and, especially, an acute increase in the dimensionality of the dataset as the vocabulary grows, since each word becomes a new feature.
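As an illustration, a bag-of-words document-term matrix can be built with nothing more than the standard library (the toy corpus and whitespace tokenisation here are invented for the example; real pipelines also lowercase, strip punctuation and stem):

```python
# Bag-of-words sketch: build a document-term matrix (DTM) from a toy corpus.
from collections import Counter

docs = [
    "revenue growth was strong this quarter",
    "weak revenue and margin pressure this quarter",
]

# Vocabulary: every unique token across the corpus becomes a feature.
vocab = sorted({tok for doc in docs for tok in doc.split()})

# Each document is represented as a vector of raw token counts.
dtm = [[Counter(doc.split())[tok] for tok in vocab] for doc in docs]

for doc_vec in dtm:
    print(doc_vec)
```

Note how the matrix width equals the vocabulary size: adding documents with new words widens every row, which is the dimensionality problem described above.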
TF-IDF (term frequency-inverse document frequency) aims to tackle BOW’s weaknesses by focusing on the relevance of a token not only at a local level within a document, but also at a global level across many documents. TF-IDF, in a nutshell, is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. It is calculated as the product of the local importance, AKA term frequency (TF), and the global inverse document frequency (IDF). A high TF-IDF results from a high TF combined with a high IDF, which in turn results from a low document frequency of the term across the whole collection. This mechanism allows TF-IDF to filter out common words, also known as stop words, such as articles, prepositions, pronouns, or conjunctions.
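A minimal sketch of the TF-IDF calculation described above, using a common smoothed IDF variant (the corpus and tokenisation are hypothetical; libraries such as scikit-learn apply further normalisation on top of this):

```python
# TF-IDF sketch: weight each term by local frequency (TF) times global rarity (IDF).
import math

docs = [
    ["revenue", "growth", "revenue"],
    ["revenue", "decline"],
    ["margin", "growth"],
]
n_docs = len(docs)

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)              # local importance within the document
    df = sum(1 for d in docs if term in d)       # document frequency across the corpus
    idf = math.log(n_docs / df) + 1              # global rarity (smoothed variant)
    return tf * idf

# "revenue" appears in 2 of 3 documents, so its IDF is low;
# "decline" appears in only 1, so it is weighted more heavily.
print(tf_idf("revenue", docs[0]))
print(tf_idf("decline", docs[1]))
```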
Despite using several tricks to reduce dimensionality, TF-IDF remains prone to the curse of dimensionality mentioned above. Moreover, neither BOW nor TF-IDF can capture semantics, i.e. the actual meaning of the terms. Word2Vec addressed both issues, capturing text meaning and reducing dimensionality by using a shallow 2-layer ANN (Artificial Neural Network). Doc2Vec is a natural extension of Word2Vec that can be applied to entire pieces of text and considers the sequential order of the words within a paragraph.
Although Word2Vec was an amazing milestone for word embedding methods, it still suffered from fundamental weaknesses the NLP community was eager to tackle. Enter BERT (Bidirectional Encoder Representations from Transformers). BERT was created and published in 2018 by Jacob Devlin and his colleagues at Google, and has become a ubiquitous baseline transformer-based method in NLP research papers. The greatest feature of BERT compared to prior models is its ability to factor polysemy – the different meanings of a word or sentence – by producing context-specific word embeddings. Furthermore, BERT can start from pre-trained configurations and be fine-tuned for specific purposes with fewer resources. Even so, applying generic BERT directly to generate investment signals is not recommended due to the specialised language and lack of labelled data. Instead, a pre-trained language model that requires fewer labelled examples and is trained on investment-specific corpora is recommended: enter FinBERT, an extension of BERT specifically designed to tackle investment-related NLP challenges.
Hands-On NLP: From Text Data to NLP Alpha
The next figure shows the workflow of a standard investment process using NLP signals. The first step is to efficiently gather text data from multiple sources, including conference call transcripts, financial news, corporate filings and social media. After diligent pre-processing to clean and wrangle the text data, a second step generates features from word embeddings, transforming the input into a format that ML models can more easily deal with to produce sentiment analysis, entity recognition or theme tagging. Finally, this output can serve multiple purposes within an investment process, including the generation of investment signals, risk management, and portfolio optimisation insights.
This section will be mainly based on comparing multiple NLP classifiers with the aim of generating investment signals for equity strategies. The three different classifiers to be compared are Loughran-McDonald sentiment dictionary, FinBERT model and Alexandria Technology’s ML Ensemble with their key features pointed out in the next table. Empirical evidence has shown that these three approaches exhibit a significantly low correlation in their recommendations with FinBERT and Loughran McDonald being the most correlated pair with a coefficient of correlation of 0.41.
Loughran McDonald (LM), developed by Tim Loughran and Bill McDonald of Notre Dame in 2011, is arguably the most popular financial lexicon method available. Lexicon, often called the “bag of words” approach to NLP, uses a dictionary of words or phrases that are labelled with sentiment. The motivation for the LM dictionary was that in testing the popular and widely used Harvard Dictionary, negative sentiment was regularly mislabelled for words when applied in a financial context. The LM dictionary was built from a large sample of 10-Qs and 10-Ks from the years 1994 to 2008. The sentiment was trained on around 5,000 words from these documents and over 80,000 from the Harvard Dictionary.
FinBERT is a specialised version of the machine learning model BERT. The BERT model is pretrained on both language modelling and next sentence prediction, which provides a framework for contextual relationships among words. FinBERT was developed in 2019 to extend the BERT model for better understanding in financial contexts. FinBERT uses the Reuters TRC2-financial corpus as its pre-training dataset, a subset of the broader TRC2 dataset filtered for financial keywords and phrases. The resulting dataset is 46,143 documents consisting of 29 million words and 400 thousand sentences. For sentiment training, FinBERT uses Financial PhraseBank, developed in 2014 by Malo et al., consisting of 4,845 sentences from the LexisNexis database labelled by 16 financial professionals. Lastly, the FiQA Sentiment dataset, which consists of 1,174 financial news headlines, completes the FinBERT sentiment training.
Alexandria Technology uses an ensemble of Machine Learning techniques, including Support Vector Machines and Maximum Entropy. Developed in 2007, the model was based on technology originally developed for understanding the human genome and tailored for application in the institutional investment industry. With the goal of applying analyst behaviour to big data, Alexandria’s language model is trained on earnings transcripts provided by FactSet. Its sentiment training is based on over 200 thousand unique labels from a quorum of financial analysts reading earnings call transcripts.
NLP Strategies: A Comparison
The universe of companies used for the analysis is the S&P 500, using a dataset that adjusts for survivorship bias and special corporate events. A long-short strategy is implemented for each of the NLP approaches for the period January 2010 to September 2021 with monthly rebalancing. The study uses an aggregation window of six months, equivalent to two conference call transcripts. Each earnings call transcript is separated into two sections, the Management Discussion (MD) and the Question and Answer (QA), and each section’s classifications of positive (1), neutral (0) or negative (-1) are totalled, with a log transformation applied to achieve a mathematically well-behaved net sentiment metric:
NetSentiment = log10((CountPositive + 1) / (CountNegative + 1))
Finally, the average of the net sentiment of MD and QA is used as the aggregate sentiment score for each stock, which will be used to create quantile portfolios with the results shown in the next figure:
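The aggregation described above can be sketched as follows (the section-level counts here are hypothetical):

```python
# Net sentiment sketch for one transcript, following the log-ratio formula in the text.
import math

def net_sentiment(count_positive, count_negative):
    # +1 in numerator and denominator avoids log(0) and division by zero.
    return math.log10((count_positive + 1) / (count_negative + 1))

# Hypothetical section-level counts of positive/negative classifications.
md_score = net_sentiment(count_positive=12, count_negative=4)  # Management Discussion
qa_score = net_sentiment(count_positive=6, count_negative=9)   # Q&A

# Aggregate stock-level score: average of the two sections.
aggregate = (md_score + qa_score) / 2
print(round(aggregate, 4))
```

The log transformation keeps the metric symmetric: equal positive and negative counts give exactly zero, and heavily one-sided calls are dampened rather than unbounded.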
Alexandria performed the best in all bar the first two years of the simulation. As seen in the chart above, all three performed negatively in 2016. One hypothesis is that this coincided with the brief junk rally, when companies with positive earnings calls fell out of favour relative to those with neutral-to-negative earnings calls, resulting in the negative performance shown. One area of further research is to test whether the language classifiers’ training is the basis for some of the poor performance of LM and FinBERT, as they are both trained using news, whereas Alexandria’s classifier is trained solely on earnings call transcripts.
Finally, we looked at correlations between the models at the individual earnings call section level, and then at the aggregate security level. Among individual section classifications (i.e. topics, sentences), we found the highest correlation to be between LM and FinBERT at 0.38. Alexandria had the lowest correlations, with LM and FinBERT at 0.14 and 0.17 respectively. We then aggregated the individual sections into net sentiment values for each security in the sample. Unsurprisingly, the correlations rose when aggregated, with the highest being between LM and Alexandria at 0.45, the lowest being FinBERT and Alexandria at 0.30, and FinBERT and LM at 0.41.
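The pairwise correlation check can be sketched as follows, using synthetic classifier scores in place of the proprietary section-level classifications (the coefficients in the data-generating process below are invented for illustration):

```python
# Sketch of the pairwise correlation comparison between classifier outputs.
import numpy as np

rng = np.random.default_rng(0)
n = 500
lm = rng.normal(size=n)
finbert = 0.4 * lm + rng.normal(size=n)      # mildly correlated with LM
alexandria = 0.15 * lm + rng.normal(size=n)  # weakly correlated with LM

# Rows are variables (one per classifier), columns are observations.
scores = np.vstack([lm, finbert, alexandria])
corr = np.corrcoef(scores)  # 3x3 pairwise correlation matrix
print(np.round(corr, 2))
```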
One last study looked at the effect of the six-month lookback in comparison to a three-month lookback capturing only one conference call instead of two. The results showed that while the three-month lookback performed slightly worse over the entire sample, the more dynamic three-month window performed better during periods of market rotation, such as the rotation away from Technology and into Energy at the beginning of 2022, as illustrated below.
NLP Strategies: Generating Differentiated Alpha
The last section provided a glimpse of how Alexandria’s long-short portfolio outperformed the two other alternatives. However, the most important question is whether NLP strategies offer a differentiated source of alpha compared to traditional risk factors, which are easily investable nowadays. A preliminary correlation analysis of the returns between traditional risk factors and Alexandria’s strategy offers a good hint that NLP signals provide a different risk flavour, with the exception of a mild positive exposure to momentum.
For this purpose, the returns of Alexandria’s long-short strategy using a three-month lookback window are decomposed using four well-known multi-factor models (source: Kenneth R. French) in order to obtain a measurement of alpha:
- Fama–French three-factor model (FF3F) using the traditional market premium (Mkt-RF), size premium (SMB) and value premium (HML).
- Carhart Model (Carhart) using FF3F plus a momentum factor (WML).
- Fama–French five-factor model (FF5F) using FF3F plus two additional factors measuring profitability (RMW) and investment intensity (CMA).
- Six-factor model (6F) using same factors as FF5F plus the momentum factor (WML).
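The alpha measurement behind these models can be sketched with ordinary least squares, where the regression intercept is the alpha. The factor and strategy returns below are synthetic stand-ins for the Kenneth French series and Alexandria's returns, with invented loadings:

```python
# Sketch of alpha measurement: regress monthly long-short strategy returns
# on factor returns with an intercept (the alpha). Synthetic data only.
import numpy as np

rng = np.random.default_rng(42)
n_months = 144  # roughly the length of the 2010-2021 sample

# Hypothetical factor returns: Mkt-RF, SMB, HML, RMW, CMA, WML (6F model).
factors = rng.normal(0.0, 0.03, size=(n_months, 6))
true_alpha = 0.007  # 70 bps/month, in the range reported in the text
loadings = np.array([0.1, -0.2, -0.15, 0.05, 0.0, 0.25])
strategy = true_alpha + factors @ loadings + rng.normal(0.0, 0.02, size=n_months)

# OLS with an intercept column; the intercept estimate is the alpha.
X = np.column_stack([np.ones(n_months), factors])
coefs, *_ = np.linalg.lstsq(X, strategy, rcond=None)
alpha_hat = coefs[0]
print(f"estimated monthly alpha: {alpha_hat:.4f}")
```

In the article's tables this intercept corresponds to the "const_coef" row, with "const pval" testing its statistical significance.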
The results are displayed in the next figure, with the table reproducing different metrics for the four models: coefficient values, coefficient significance p-values, and overall model performance metrics such as adjusted R², AIC or BIC. Overall, the most important conclusion is that the NLP-driven Alexandria long-short strategy seems to deliver positive and statistically significant alpha, as measured by the “const_coef” and “const pval” rows.
In particular, Alexandria’s alpha delivers a gross return ranging from 62 bps to 74 bps per month, equivalent to annualised compounded returns in the 7.7% to 9.3% range. Hence, the strategy’s alpha seems to survive the acid test regardless of the multifactor model used to measure it.
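The compounding arithmetic behind that annualised range can be verified directly:

```python
# Compounding the reported monthly gross alpha into annualised returns.
def annualise(monthly_return):
    return (1 + monthly_return) ** 12 - 1

low = annualise(0.0062)   # 62 bps/month
high = annualise(0.0074)  # 74 bps/month
print(f"{low:.1%} to {high:.1%}")  # roughly 7.7% to 9.3%
```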
Other relevant insights from the previous table are the negative exposure of the strategy to the value and size factors, whereas it is positively exposed to the momentum factor (WML), as expected given the fundamental nature of the NLP signals underlying the strategy.
That said, the previous analysis is an aggregated cumulative analysis for the period 2010-2022. Therefore, there are two main questions researchers can ask at this stage. Firstly, whether alpha is consistent through time or suffers from acute regime-dependence fragility. Secondly, whether there are any signs of alpha decay, i.e. the loss of an investment strategy’s predictive power over time.
To answer these two intriguing questions, the next plot displays a 6F model rolling coefficient analysis using a lookback window of 18 months. The results are rather reassuring, with Alexandria’s long-short strategy alpha (blue line) delivering oscillating but consistently positive returns throughout the period. Furthermore, there seems to be no trace of alpha decay, based on the increasing importance of alpha during the post-Covid era. Yet NLP signals appear, as expected, less effective when macroeconomic or geopolitical risk factors are driving overall financial market sentiment.
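The rolling-window diagnostic can be sketched as follows, again on synthetic data (window length follows the text; factor loadings and return magnitudes are invented):

```python
# Sketch of the rolling-alpha diagnostic: re-estimate the 6F regression over a
# sliding 18-month window to check the stability of alpha through time.
import numpy as np

rng = np.random.default_rng(7)
n_months, window = 150, 18
factors = rng.normal(0.0, 0.03, size=(n_months, 6))
strategy = 0.007 + factors @ np.full(6, 0.1) + rng.normal(0.0, 0.02, size=n_months)

rolling_alpha = []
for start in range(n_months - window + 1):
    sl = slice(start, start + window)
    X = np.column_stack([np.ones(window), factors[sl]])
    coefs, *_ = np.linalg.lstsq(X, strategy[sl], rcond=None)
    rolling_alpha.append(coefs[0])  # intercept = alpha for this window

rolling_alpha = np.array(rolling_alpha)
print(rolling_alpha.mean())
```

A persistently positive rolling intercept, as in the article's plot, is the signature of alpha that neither regime changes nor decay have eroded.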
This article has provided a short introduction to NLP, word embedding methods and a preliminary analysis of how to apply NLP to generate investment signals from conference call transcripts. Although the real-world experiment showcased in this article has delivered a set of positive preliminary results, investors could develop an alternative ML model framework to enhance these results, and implement advanced cross-validation techniques such as Combinatorial Purged Cross-Validation (CPCV) to increase the robustness and reliability of their NLP strategies.
To sum up, there is evidence that NLP models applied to earnings call transcripts can generate alpha not explained by traditional risk and return factors. Furthermore, the analysis showed a low level of correlation in sentiment labelling between the three approaches discussed, which results in large performance differences due to the heterogeneous theoretical backgrounds, training processes, and NLP technologies of each approach.
References
- Overconfidence Bias in International Stock Prices. Scott, Stumpp, Xu (2003).
- Managerial Disclosure vs. Analyst Inquiry. Matsumoto, Pronk, Roelofsen (2006).
- Annual Report Readability, Current Earnings, and Earnings Persistence. Li (2006).
- Does Silence Speak? An Empirical Analysis of Disclosure Choices during Conference Calls. Hollander, Pronk, Roelofsen (2008).
- Predicting Risk from Financial Reports with Regression. Kogan, Routledge, Levin, Sagi, Smith (2009).
- Earnings Conference Call Content and Stock Price: The Case of REITs. Doran, Peterson, McKay Price (2010).
- Detecting Deceptive Discussions in Conference Calls. Larcker, Zakolyukina (2011).
- Using Earnings Conference Calls to Identify Analysts with Superior Private Information. Mayew, Sharp, Venkatachalam (2011).
- The Effect of Conference Calls on Analysts’ Forecasts – German Evidence. Bassemir, Farkas, Pachta (2011).
- The Power of Voice: Managerial Affective States and Future Firm Performance. Mayew, Venkatachalam (2011).
- Earnings Conference Calls and Stock Returns: The Incremental Informativeness of Textual Tone. McKay Price, Doran, Peterson, Bliss (2011).
- Measuring Readability in Financial Disclosures. Loughran, McDonald (2011).
- Do Sophisticated Investors Interpret Earnings Conference Call Tone Differently than Investors at Large? Evidence from Short Sales. Blau, DeLisle, McKay Price (2012).
- The Effect of Manager-specific Optimism on the Tone of Earnings Conference Calls. Davis, Ge, Matsumoto, Zhang (2012).
- Can Investors Detect Managers’ Lack of Spontaneity? Adherence to Pre-determined Scripts during Earnings Conference Calls. Lee (2014).
- The Blame Game. Zhou (2014).
- Differences in Conference Call Tones: Managers versus Analysts. Brockman, Li, McKay Price (2014).
- Wisdom of Crowds: The Value of Stock Opinions Transmitted Through Social Media. Chen, De, Hu, Hwang (2014).
- Using Unstructured and Qualitative Disclosures to Explain Accruals. Frankel, Jennings, Lee (2015).
- Finding Value in Earnings Transcripts Data with AlphaSense. Jha, Blaine, Montague (2015).
- Speaking of the Short-term: Disclosure Horizon and Managerial Myopia. Brochet, Loumioti, Serafeim (2015).
- Founders vs Professional CEOs. Lee, Hwang, Chen (2015).
- Reading Managerial Tone: How Analysts and the Market Respond to Conference Calls. Druz, Wagner, Zeckhauser (2016).
- Linguistic Complexity in Firm Disclosures: Obfuscation or Information? Bushee, Gow, Taylor (2016).
- Capital Market Consequences of Language Barriers in the Conference Calls of Non-U.S. Firms. Brochet, Naranjo, Yu (2016).
- Words versus Deeds: Evidence from Post-Call Manager Trades. Brockman, Cicon, Li, McKay Price (2017).
- The Effects of Conference Call Tone on Market Perceptions of Value Uncertainty. Borochin, Cicon, DeLisle, McKay Price (2017).
- Natural Language Processing – Part II: Stock Selection. S&P Global Market Intelligence (2018).
- Their Sentiments Exactly: Sentiment Signal Diversity Creates Alpha Opportunity. S&P Global Market Intelligence (2018).
- Lazy Prices. Cohen, Malloy, Nguyen (2019).
- FinBERT: Financial Sentiment Analysis with Pre-trained Language Models. Araci (2019).
Carlos Salas, CFA, Co-Founder, iLuminar AM
Chris Kantos, Managing Director, EMEA, Alexandria Technology