Author: Carlos Salas Najera
In a previous article on building a data science skill-set, I wrote about how to adopt a realistic learning path that builds a strong Data Science skill-set without significant knowledge gaps or time-consuming frustration. I also promised to cover finance-specific libraries in future posts that can help investment researchers improve their time management. This article is aimed at finance professionals of any age and level of experience who want to learn more about open-source third-party libraries. Readers who have not read the first article of this series can find it by following this link.
What are Finance-Specific Libraries?
The open-source nature of Python results in plenty of third-party library options. The main objective of this second article is to provide several recommendations on when and how to use these third-party Python libraries and deal with the inherent risks of using them within a corporate workflow. Another objective of this article is to provide a list of library choices to stream financial and generic data currently available and widely used by finance industry practitioners.
This list is a guide and an introduction to finance-specific libraries that are easy to grasp, especially for intermediate Python users. Future articles of this series will cover other library recommendations with a focus on specific finance topics, e.g. portfolio optimisation, performance analysis, risk management and derivatives valuation.
Clarifying Libraries, Packages, Modules, and Frameworks
Libraries written in Python play a critical role in machine learning and data science. The term library is often used interchangeably with similar terms such as module or package in popular Python community forums such as stackoverflow.com. Nevertheless, a distinction should be made between them to avoid confusion:
- Python modules: A module is a set of related code saved in a file with the extension “.py”. User-defined functions can be stored in “.py” files to be reused in new projects. In addition, built-in modules are available in base Python, such as random, datetime or re.
- Python packages: A package is simply a collection of modules organised using a hierarchical structure. Packages contain a file named “__init__.py” that initialises the code for the corresponding package. Some examples are NumPy or pandas.
- Python Libraries: A library is a collection of related modules and packages. The term “library” is used interchangeably with the word “package” since the latter can also contain modules and other packages. Some examples of libraries are Matplotlib, PyTorch and Requests, among many others.
- Python Frameworks: Frameworks are similar to libraries as both are collections of modules and packages. That said, libraries are used for specific operations (e.g. PyTorch for Deep Learning), whereas frameworks contain the basic building blocks and architecture of an application. For instance, popular frameworks in Python are Django or Flask.
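The difference between a module and a package can be checked directly from Python. The following is a minimal sketch using two standard-library names; the helper name describe is purely illustrative and not part of any library.

```python
import importlib


def describe(name):
    """Classify an importable name as a plain module or a package."""
    obj = importlib.import_module(name)
    # Packages expose a __path__ attribute pointing at the directory that
    # holds their sub-modules; plain single-file modules do not.
    return "package" if hasattr(obj, "__path__") else "module"


print("random:", describe("random"))  # a single random.py file
print("json:", describe("json"))      # a directory containing __init__.py
```

The same check can be run on any installed third-party name to see how it is organised.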
From now onwards this article will use the term “library” to refer to all the aforementioned terms. In this way, libraries can be built-in within the basic version of Python (e.g. datetime, os, platform) or developed by third-party users that can make them completely open-source or restricted to fee-paying subscribers. The majority of the libraries found in the Python ecosystem are freely accessible.
Main Recommendations when Adding Third Party Libraries
There are several dimensions to ponder when considering using Python libraries, especially when adopting them within a corporate environment. Data Scientists working for a bank, investment manager, hedge fund or pension fund face similar challenges when adopting open-source third-party libraries within their workflow.
Firstly, the data scientist team must consider the organisational advantage of using a third-party open-source library over a proprietary one. A good strategy used by large data science teams is to start using a reliable third-party library while the team internally develops a proprietary library that eventually replaces the former. This strategy will allow the team to thoroughly understand the details of the external library’s source code and implement changes, allowing them to reproduce a bespoke version of the library that can be used for the specific needs of the organisation. Regrettably, small teams might not be able to follow this “parallel” strategy due to lack of resources, which will force them to rely on third party open-source or subscription service libraries. In the latter case, the team must perform regular due diligence of the deployed third-party libraries to ensure that they meet the organisation’s operational and security standards.
Furthermore, security is a paramount concern for any kind of organisation, particularly private institutions concerned about their intangible properties or client-related data. The use of third-party open-source libraries adds another source of security risk as this article points out. The deployment of Python’s third-party libraries poses a significant risk to any kind of organisation, specifically during their installation and updating process. In this way, the best practice followed by IT teams in corporate organisations is to:
i) Set up an internal sandbox environment for the organisation’s users where specific versions of Python third-party libraries are allowed to exist.
ii) Limit and monitor the stream of the data downloads during the installation and update of libraries overseen by the IT team.
iii) Prohibit outbound data flows.
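As an illustration of the first point, a sandbox could verify that only approved library versions are present. The sketch below is one possible approach and not a standard IT procedure: the allow-list contents, the APPROVED name and the audit helper are hypothetical, while importlib.metadata is part of the Python standard library (version 3.8 onwards).

```python
from importlib import metadata

# Hypothetical allow-list maintained by the IT team: library -> approved version.
APPROVED = {"numpy": "1.26.4", "pandas": "2.2.2"}


def installed_version(name):
    """Return the installed version of a library, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def audit(approved, lookup=installed_version):
    """Return {library: (approved, found)} for every library that deviates."""
    deviations = {}
    for name, wanted in approved.items():
        found = lookup(name)
        if found != wanted:
            deviations[name] = (wanted, found)
    return deviations


for name, (wanted, found) in audit(APPROVED).items():
    print(f"{name}: approved {wanted}, found {found}")
```

An empty result means the environment matches the allow-list; anything else is a candidate for review.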
Operational risk is another important consideration when using third-party libraries. A library might become an important pillar of a company workflow. For instance, a trading execution team might have adopted a third-party library to optimise its trading execution process. What would happen if the library’s maintenance is abandoned by the open-source Python community? What if the latest update of the library contains a bug? Either event will set off alarms within the team, as they will have to find an alternative solution immediately. Good practices to minimise operational risk include having some team members develop a proprietary version of the library, or conducting a regular review of the library repository (e.g. GitHub) to ascertain whether the library has been abandoned by its maintainers and is at risk of becoming a legacy library.
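Part of this repository review can be automated. The sketch below assumes the library is hosted on GitHub and relies on the "pushed_at" and "archived" fields returned by GitHub's public REST API; the function names and the 15-month default (borrowed from the legacy-risk criterion used later in this article) are illustrative choices rather than an established rule.

```python
import json
from datetime import datetime, timezone
from urllib.request import urlopen


def fetch_repo(owner, repo):
    """Download public repository metadata from the GitHub REST API."""
    with urlopen(f"https://api.github.com/repos/{owner}/{repo}") as resp:
        return json.load(resp)


def months_since_last_push(repo_info, today=None):
    """Approximate number of months since the repository last received code."""
    today = today or datetime.now(timezone.utc)
    # GitHub reports timestamps such as "2023-05-01T12:00:00Z".
    pushed = datetime.strptime(repo_info["pushed_at"], "%Y-%m-%dT%H:%M:%SZ")
    pushed = pushed.replace(tzinfo=timezone.utc)
    return (today - pushed).days / 30.44  # average month length in days


def looks_abandoned(repo_info, max_months=15, today=None):
    """Flag archived repositories or those idle for longer than max_months."""
    if repo_info.get("archived"):
        return True
    return months_since_last_push(repo_info, today) > max_months
```

A reviewer would call looks_abandoned(fetch_repo(owner, repo)) with the owner and name of the repository under review. Unauthenticated requests to the API are rate-limited, so a scheduled weekly or monthly check is more realistic than continuous polling.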
Installing Third Party Financial Libraries
More than a decade ago, the installation of libraries during the initial Python set up was plagued with many challenges, particularly the intertwined dependency between libraries. As a result, many incompatibilities were created that could not be identified ex-ante. Therefore, the overall process of setting up a Python working framework with a stack of specific third-party libraries was an utterly nightmarish odyssey.
Fortunately, the present day is brighter, with library installation becoming a smooth and routine task due to many factors such as the introduction of distribution versions (e.g. Anaconda, Canopy, etc), the creation of tools that specialise in library management (e.g. pip) and the complementary offering of tools designed to create and manage stand-alone ecosystems (AKA environments). All of these factors have solved many of the inherent issues emerging during the installation of third-party libraries and, consequently, paved the way for the popular adoption of Python over the last decade.
Using a distribution version when setting up Python for the first time on a computer is the best way to create an underlying backbone of compatible libraries, ensure consistency/reproducibility, and minimise library dependency risk.
Anaconda is the most popular Python distribution version, which not only installs base Python but also comes with several essential data-science libraries such as numpy or pandas. Moreover, Anaconda also includes a shell terminal (Anaconda Prompt Terminal) specifically crafted for Python developers and its own environment manager commands, among many other extras.
Nevertheless, most of the third-party libraries discussed in this article are not included in an official distribution version like Anaconda and require manual installation. The most popular tool for manually installing Python packages is pip, which provides essential core features for finding, downloading, and installing packages from the Python Package Index (PyPI), the official third-party software repository for Python, and from other Python package indexes. pip can be incorporated into a wide range of development workflows via its command-line interface (CLI). In a nutshell, most Python libraries are published on PyPI and can be installed by typing “pip install” followed by the library name in a terminal.
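For example, typing pip install pandas-datareader in a terminal installs the latest release of that library. The same step can be scripted from Python, as in the minimal sketch below; the helper names are hypothetical, while "python -m pip install" is the standard way of invoking pip for a specific interpreter.

```python
import subprocess
import sys


def pip_install_command(package, version=None):
    """Build the pip command as a list of arguments."""
    # Pinning an exact version (package==version) helps reproducibility.
    requirement = f"{package}=={version}" if version else package
    # sys.executable ensures pip targets the interpreter running this script.
    return [sys.executable, "-m", "pip", "install", requirement]


def pip_install(package, version=None):
    """Run the installation; raises CalledProcessError if pip fails."""
    subprocess.check_call(pip_install_command(package, version))


print(" ".join(pip_install_command("pandas-datareader")))
```

Building the command separately from running it makes it easy to log or review what will be installed before anything is downloaded.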
Nevertheless, some third-party libraries are not available through this standard route. A developer should therefore check the library’s official website or GitHub page before executing any code. For instance, TA-lib (one of the best-known libraries for technical analysis indicators) follows a different process on Windows, as outlined in these steps:
- Check the Python version and chip architecture (32-bit or 64-bit) of your installation.
- Download a whl file as indicated on the library website or GitHub page. TA-lib requires downloading the whl from this link. The file you download should match the Python version and architecture found in the first step.
- Use your terminal (Anaconda Prompt Terminal recommended) to go to your "Downloads" folder, where the whl file should be located, and run “pip install” followed by the name of the new whl file.
- Finally, check that no errors are triggered when importing the library in a Python environment (for TA-lib, “import talib”).
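The first and last steps can be carried out from Python itself. The sketch below is one way to do so, with illustrative helper names; it assumes the standard CPython interpreter (hence the "cp" prefix in the wheel tag) and uses talib, the import name of TA-lib.

```python
import importlib
import platform
import struct
import sys


def interpreter_summary():
    """Describe the running interpreter: version, wheel tag and bit width."""
    major, minor = sys.version_info[:2]
    bits = struct.calcsize("P") * 8  # pointer size: 64-bit or 32-bit build
    return {
        "python": platform.python_version(),
        "wheel_tag": f"cp{major}{minor}",  # assumes CPython, e.g. cp310
        "bits": bits,
        "machine": platform.machine(),
    }


def imports_cleanly(module_name):
    """Return True if the module can be imported without an ImportError."""
    try:
        importlib.import_module(module_name)
        return True
    except ImportError:
        return False


print(interpreter_summary())
print("talib imports cleanly:", imports_cleanly("talib"))
```

The wheel tag and bit width are the two pieces of information to match against the name of the whl file before downloading it.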
Lastly, you can also use pip to install libraries from private repositories (e.g. GitHub) with a similar syntax and slight changes: the library name is replaced by the repository URL prefixed with “git+”.
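A generic illustration is sketched below. The organisation and repository names are placeholders rather than a real project, and the helper is hypothetical; the "git+URL@ref" form is pip's syntax for installing from a Git repository, optionally pinned to a branch, tag or commit.

```python
import sys


def pip_git_command(repo_url, ref=None):
    """Build a pip command that installs straight from a Git repository."""
    target = f"git+{repo_url}"
    if ref:
        target += f"@{ref}"  # pin a branch, tag or commit for reproducibility
    return [sys.executable, "-m", "pip", "install", target]


# Placeholder URL: replace with the repository you have access to.
cmd = pip_git_command("https://github.com/example-org/example-lib.git", ref="v1.0")
print(" ".join(cmd))
```

For a private repository, Git must already be able to authenticate (for instance through SSH keys or an access token) before the command will succeed.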
Conducting Due Diligence for Third Party Finance Libraries
The next sections provide a deep-dive into third-party open-source libraries selected according to these criteria:
- Legacy Risk: libraries selected must have a GitHub update within the last 15 months.
- Popularity: libraries selected must have been named in well-known research papers or forked on GitHub by at least 5,000 users.
- Safety: libraries selected must not have been involved in security breach events.
- Finance Focus: libraries selected have a specific finance-related purpose such as portfolio management, risk analysis, derivatives valuation, etc.
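These four criteria can be written down as a simple screening function. The sketch below is only an illustration of the logic: the LibraryProfile fields and the example values are hypothetical, and in practice each field would be filled in by hand or from the repository's metadata.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class LibraryProfile:
    name: str
    last_update: date          # date of the latest repository update
    forks: int                 # number of forks on GitHub
    cited_in_research: bool    # named in well-known research papers
    security_incidents: int    # known security breach events
    finance_focus: bool        # has a specific finance-related purpose


def screen(lib, today, max_idle_months=15, min_forks=5000):
    """Return the list of failed criteria (an empty list means a pass)."""
    failures = []
    idle_months = ((today.year - lib.last_update.year) * 12
                   + (today.month - lib.last_update.month))
    if idle_months > max_idle_months:
        failures.append("legacy risk")
    if not (lib.cited_in_research or lib.forks >= min_forks):
        failures.append("popularity")
    if lib.security_incidents > 0:
        failures.append("safety")
    if not lib.finance_focus:
        failures.append("finance focus")
    return failures


# Made-up profile, used only to show the output format.
example = LibraryProfile("example-lib", date(2022, 1, 15), 120, False, 0, True)
print(screen(example, today=date(2024, 1, 1)))
```

Returning the failed criteria, rather than a single yes/no answer, makes it easier to document why a library was rejected.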
As discussed in the introduction, the recommended libraries are just a small sample of the vast number of libraries widely available online with specific emphasis on those with a flatter learning curve and easy accessibility for Python intermediate users.
This article emphasises finance-oriented Python libraries and does not cover generalist data science libraries like those included in the Data Science stack: Numpy, SciPy, Matplotlib, Pandas and scikit-learn. A data scientist must be extremely confident with the aforementioned before jumping into third party libraries such as the ones I discuss in the following lines.
Library/Resource Focus: Data Streaming and Datasets
In this article the spotlight will be on highlighting some useful libraries, resources, and APIs to access data freely for finance professionals willing to practice with financial data or any other type of datasets. Future articles will cover third-party libraries and resources for specific finance tasks such as Portfolio Optimisation, Technical Analysis or Fundamental Analysis; among many others.
The most useful resource for Finance practitioners is pandas-datareader, which allows you to freely download a large amount of financial data from multiple sources with a few lines of code. Some examples of useful datasets that can be obtained using this library are:
- Stock market price data: Yahoo Finance, Alpha Vantage (Intraday), Stooq (Index data), and Tiingo (Mutual Funds and ETF NAVs).
- Macroeconomic indicators: econdb, World Bank, St. Louis FED, Eurostat, and the OECD.
- Equity Factors Performance: Fama-French.
- Trading Calendars (through separate companion libraries): Pandas Market Calendars, Python Bizdays, and Exchange Calendars.
An important warning is that pandas-datareader also comes with a few pitfalls due to its lack of scalability for corporate applications, i.e. a limited number of queries, low security standards and, most importantly, concerns in terms of data reliability (bugs, survivorship bias, etc). The last weak point is particularly concerning as it demands an extra effort from the in-house data engineering team to ensure a high degree of data integrity before the data is deployed within the production pipeline of the organisation. Therefore, pandas-datareader is a good option for exploratory work, prototype testing and self-learning purposes, but corporate usage is not advisable.
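A minimal sketch of a download with a basic integrity check is shown below. It assumes pandas-datareader is installed and uses its DataReader function; the download wrapper and its reader argument are illustrative additions (the latter makes it possible to try the logic without network access), not part of the library.

```python
from datetime import date


def download(name, source, start, end, reader=None):
    """Fetch a dataset and fail loudly if the source returns nothing."""
    if reader is None:
        # Imported lazily so the sketch loads even where the library is absent.
        from pandas_datareader import data as web
        reader = web.DataReader
    data = reader(name, source, start, end)
    # A first, very basic integrity check; real pipelines need many more.
    if len(data) == 0:
        raise ValueError(f"{source} returned no rows for {name!r}")
    return data
```

With network access, download("GDP", "fred", date(2015, 1, 1), date(2020, 12, 31)) would request the US GDP series from the St. Louis FED as a pandas DataFrame. An empty-result check is only a starting point; checks for gaps, duplicates and stale values would be natural next steps.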
TA-lib, mentioned earlier, makes it easy to compute technical indicators from historical price and volume data and includes more than 150 indicators such as ADX, MACD, or RSI. Fundamental analysis information is more difficult to obtain than pricing data, yet financialmodelingprep is a great choice as it comes with a free API and a premium subscription service with many add-ons. Both TA-lib and financialmodelingprep will be covered in detail in future articles of this Data Science skill-set series.
Other honourable mentions to access freely available datasets with a reasonable degree of quality are pointed out in the next lines:
- Google Dataset Search – generalist datasets.
- UCI Machine Learning – generalist datasets.
- Awesome Datasets – generalist datasets.
- Kaggle Datasets – generalist datasets.
- BIS (Bank for International Settlements) – global macroeconomic data.
- Data.gov – US Government official data.
- EU Open Data – EU data.
- IMF – global macroeconomic data.
- World Bank – global macroeconomic data.
- Kapsarc – Energy datasets.
- Eurostat Comext – European data on international trade and manufactured goods production.
- OECD – global macroeconomic data.
What do Third-Party Libraries Mean for Me?
The Python community has grown dramatically over the last decade, which has resulted in plenty of options in terms of Finance-focused third-party libraries. Regrettably, this is a double-edged sword, as some due diligence is required before adopting an external resource into our workflow in order to tackle operational, technological or security risks in advance.
This article has introduced third party libraries and some examples of libraries, APIs and other resources centred around data streaming and stand-alone datasets. Future articles of this Data Science Skill-set series will involve conducting a thorough comparison of libraries for a specific purpose (Portfolio Optimization, Backtesting, etc) or deep diving into specific libraries and resources for a more comprehensive description e.g. TA-lib for Technical Analysis.
Carlos Salas Najera: L/S Portfolio Manager | ML for Investments Lecturer & Consultant