What is Alternative Data & How to Use It
Table of Contents
- What is alternative data?
- What are the attributes of alternative data?
- What are the sources of alternative data?
- How is alternative data collected?
- How to assess the quality of alternative data?
- What are the use cases for alternative data?
- What are the best practices for alternative data use?
- How is alternative data priced?
If you’re looking for an extra edge in the stock market, alternative data is for you.
Alternative data (or non-traditional data) is collected from sources that are usually unavailable to the general public and that, at first glance, may not have a direct connection to a specified use case.
In most cases, alternative data aims to supplement an existing understanding of a certain economic development with correlated data that is rather exclusive, or at least only available to a limited number of parties. It is mainly used in financial analysis to gauge a company’s performance before the company itself makes public announcements.
Using alternative data can give an investor a competitive advantage over others, and with it a chance to beat average market returns. The difference between the investor’s return and the market return is called alpha.
What is alternative data?
Alternative data is information collected from non-traditional sources that can be used to drive key business decisions and stock market investment opportunities. The data can be collected from a range of sources like financial statements, credit and debit card history, satellite or geographical sources, weather patterns, and web data collection. Alternative data can be industry-specific and dependent on the sources from where it is extracted.
The purpose of alternative data is to provide unique insights that can be turned into actionable intelligence. It gives companies a competitive edge over competitors who rely exclusively on traditional data sources.
What are the attributes of alternative data?
The attributes of alternative data can be separated into three categories: those related to individuals, those related to business processes, and those produced by sensors.
Here are some of the key attributes for each category:
Individuals: Social/Sentiment, Web Traffic, App Usage, Survey
Social media is a rich mine of user data. Many types of data can be collected this way, including posts on social media sites, sentiment expressed on various platforms, and traffic on websites.
Business processes: Credit/Debit Card, Web Data, Public Data, Email/Consumer Receipts
Customer purchase data can indicate whether a sales period is strong or weak, which is directly tied to how the companies themselves are performing.
Similarly, supply chain insights can be gained from the fulfillment records of toll stations, airports that carry freight, and ports. These can reveal how much of a product is moving from a company out to retailers.
Car dealerships or other sales-oriented businesses prepare quarterly reports that can reveal sales numbers. Car insurance policies, through a partnered company, can predict these numbers very accurately. This allows financial institutions to buy or sell shares before any public announcement has been made of these results.
Sensors: Geo-location, Satellite, Weather
Crop yields depend largely on environmental factors such as bush fires, droughts, and rainfall.
Satellite images can be monitored closely to predict how a crop will perform in a given year. This enables a prediction of market supply, which in turn affects the prices of the various products made from these crops.
So, this is a relevant and vital source of information for companies that create financial products based on these crops. It is also crucial for the stocks of companies that depend on them as raw materials.
Business process data tends to have a more structured format than sensor or individual data. Credit or debit card data is extremely accurate when a consistent, large sample is considered.
Transaction data collected from email receipts is accurate even in much smaller batches. As a result, business process data sits at the more expensive end of data licensing and tends to cost more than data from individuals or sensors.
In order of accuracy or insightfulness, credit or debit card data provides the highest percentage of data accuracy at 28%, whereas web scraping stands at 27%, and email tracing at 7%. Geolocation data has the highest level of inaccuracy at 28%, followed by satellite data at 22%, and web traffic at 10%.
The most popular data set among investors is web traffic, which 43% of funds use as a source of alternative data, followed by credit or debit card data at 38%.
What are the sources of alternative data?
It can also be helpful to divide the different sources and collection methods for alternative data into the categories of individuals, business process, and sensors.
Alternative data from individuals is collected through search engines like Google, Bing, and Yahoo and is also collected from social media sites and e-commerce stores. The collection method for this tends to be scraping.
These portals hold a large collection of data related to the interests, contact details, names, and locations of users. However, this data is largely unstructured and can be difficult to process.
Credit or debit card transactions, sales records, insurance records, and data collected by government agencies are all business process data. This source tends to be highly accurate and can therefore support accurate business intelligence.
It is also known as exhaust data, as it is typically a by-product of business processes. Business process data is highly valued by companies, but it can be expensive depending on collection methods and licensing fees.
Sensor data is largely unstructured and needs extensive processing to become usable. Sensors are widespread in a world where the internet of things is progressing at a rapid rate. We have CCTV, POS systems, electronic toll gate transmissions, and various smart home devices that are constantly interacting with systems and providing data.
Satellite imaging is improving every day, and the accuracy and definition of geo-locator devices are also on the rise.
This type of data can be crucial for detecting weather patterns and thereby predicting market outcomes. It can also be important in detecting consumers’ shopping behavior and how products are shipped to certain places.
Useful alternative data is not easily collectible due to the very fact that it is outside of the norm. For this reason, acquiring it from a data provider is recommended: this saves time and money and is likely to increase the chance that the data will be of benefit.
Assessing the quality of alternative data is challenging due to three major factors: value, relevance, and capacity.
Understanding the value of the data collected from a certain source yields transparency and credibility. Some of the data sets that get collected have no research or track record behind them, so it is difficult to predict whether the data can be used effectively.
Moreover, much of the information collected at the most basic level, such as product or service sales, cannot be translated effectively. There is no predicting how valuable this data will be for tradable securities.
The relevance of data is important for investment decisions. Most of the data collected from an alternative data source needs a lot of processing through natural language processing or other AI-driven machine learning. These data sets are collected in an unstructured format and need extensive processing to become intelligible.
Another issue with this kind of data set is that its history is very limited. A lot of data must be collected to conduct proper backtesting, so vendors must wait and collect enough data to build a historical archive before the data set becomes relevant.
As this data is unstructured, a lot of formatting is also required to obtain content that has integrity. The format should be useful for the companies that are investing in the data. All these factors make data collection and processing difficult, so vendors must be able to meet all these demands to have credibility.
The value of the data also depends on capacity. Capacity determines how much you can trade on the data and whether the investment is worth the insight received from data providers. Niche data that covers only a small number of stocks, or only technology, healthcare, or retail, can be of limited value. The more users target these niche data sets, the less valuable they become because of arbitrage. Vendors therefore also need data models sophisticated enough to detect these trend changes.
How is alternative data collected?
Web scraping, or web harvesting, is the practice of collecting data from websites on the internet. Scrapers or bots visit web pages and download relevant information, which is then run through a collection of text processing functions.
This information can then be extracted and exported to a spreadsheet or transformed into a form that is easy to understand. Web scrapers can extract contacts and other details from a page and export them to Excel sheets or other formats.
Web scraping is widely used in lead generation, market analysis, price comparison, and competition monitoring, gathering data from multiple sources for analysis.
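The extract-and-export step described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a production scraper: the HTML snippet, the class names "name" and "price", and the helper names ProductParser and scrape_to_csv are all assumptions made for the example, and a real scraper would first download the page and respect the site's terms of use.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would first download this HTML.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Widget A</span><span class="price">19.99</span></li>
  <li class="product"><span class="name">Widget B</span><span class="price">24.50</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collects the text of <span class="name"> and <span class="price"> tags."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished records, one dict per product
        self._field = None   # field whose text is expected next
        self._current = {}   # record being assembled

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("name", "price"):
            self._field = css_class

    def handle_data(self, data):
        if self._field:
            self._current[self._field] = data.strip()
            self._field = None
            if len(self._current) == 2:  # both fields seen: record complete
                self.rows.append(self._current)
                self._current = {}


def scrape_to_csv(html):
    """Parse the HTML and return the extracted rows as CSV text."""
    parser = ProductParser()
    parser.feed(html)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(parser.rows)
    return buffer.getvalue()


csv_text = scrape_to_csv(SAMPLE_HTML)
print(csv_text)
```

The CSV text can be written to a file and opened in any spreadsheet tool, which mirrors the export step mentioned above.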
Acquisition of raw data
Raw data is data that is unintelligible in its original form and must be processed before it can be used. Sensors are a great source of raw information, which must be cleaned of noise, interference, and other contaminants before it can be used effectively to gather market intelligence.
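As a rough illustration of what cleaning sensor noise can look like, the sketch below applies a rolling median filter, one simple and common way to suppress isolated spikes. The readings and the function name rolling_median are made up for the example, and real pipelines typically use more sophisticated, sensor-specific methods.

```python
from statistics import median


def rolling_median(readings, window=3):
    """Replace each reading with the median of the readings around it.

    A median filter is used here purely as an illustration of noise
    removal; it is not the method of any particular data vendor.
    """
    half = window // 2
    cleaned = []
    for i in range(len(readings)):
        lo = max(0, i - half)
        hi = min(len(readings), i + half + 1)
        cleaned.append(median(readings[lo:hi]))
    return cleaned


# Hypothetical temperature readings with one spurious spike (95.0).
raw = [20.1, 20.3, 95.0, 20.4, 20.2]
cleaned = rolling_median(raw)
print(cleaned)
```

The spike disappears because a single outlier can never be the median of a three-value window.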
3rd party licensing
Some companies obtain licenses to collect exhaust data, which is data produced as a by-product of a business process. Different companies charge different rates for licensed exhaust data such as POS transactions or debit and credit card transaction details. This data is then processed into a structured format and sold to various companies. Major players in this field include organizations like Quandl, YipitData, and iResearch.
What are the challenges with alternative data?
As stated above, the collection of quality and valuable alternative data is perhaps the main challenge. There are various sources of alternative data extraction.
Non-traditional data sources – Collecting digital exhaust from web traffic and gathering logistics data that can quantify a company’s shipping activities are typical examples.
Unstructured data sources – This kind of data can be collected from web scraping, social media, surveys, etc. It is highly unstructured, and significant investment is needed to process it through machine learning or natural language processing.
Aggregated transactions – This is financial transaction data, which has high licensing fees.
And lastly, satellite imaging or geolocation data collected from various sensors.
The issue common to all these sources is collection. Not only are the collection methods expensive, they also require a lot of computing power. Each day, 2.5 exabytes of data are generated, which requires huge storage servers, processing capacity, computing power, and analytical resources. And this is not even a fixed amount. The amount of data that gets created doubles every 40 months, so collection will always remain the biggest problem with alternative data.
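To get a feel for what doubling every 40 months means, the growth can be written as volume = initial × 2^(months / 40). The short sketch below applies this to the 2.5 exabytes per day quoted above. It assumes the doubling rate stays constant, which is an extrapolation for illustration only, and the function name projected_volume is made up for the example.

```python
def projected_volume(initial, months, doubling_period=40):
    """Volume after `months` if it doubles every `doubling_period` months."""
    return initial * 2 ** (months / doubling_period)


daily_exabytes_now = 2.5  # figure quoted in the text
for years in (0, 5, 10):
    volume = projected_volume(daily_exabytes_now, years * 12)
    print(f"after {years:2d} years: {volume:.1f} exabytes per day")
```

Under this assumption, ten years (120 months) corresponds to three doublings, so storage and processing requirements grow eightfold.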
Alternative data is unstructured
Since the collection methods are varied and the volume is so large, this data is also very loosely structured. It cannot be received and integrated in its raw form, and it cannot be used until vendors provide a highly processed, high-quality version of the original data.
Some of this data is a little more structured than the rest. It may have patterns that make it easier to categorize.
However, the bulk of unstructured data has no patterns or labels that would make it easy to categorize. This could be audio, video, or social media data.
This unstructured data cannot be consumed without transformation through various analytical platforms. This could mean expensive proprietary algorithms, advanced technology, or a combination of multiple data sources that transforms the data into a structured format.
The main challenge is narrowing down the data, cutting the noise and interference, and strengthening the connections between data points. All of this also needs to be done transparently and dynamically to instill confidence in investors.
Incomplete or unverifiable data
The problem with unstructured data is that it also tends to be incomplete. For example, the slightest gap in a time-dependent series can make the complete dataset useless for conducting a historical analysis. The unstructured nature of data also makes it difficult to perform quantitative analysis with the data.
By contrast, structured data sets, such as those gathered from users’ website activity over time, are easier to build into a model, easier to test, and easier to turn into insights that could be valuable for trading. For unstructured data to become useful, a deep archive is needed that can support quantitative analysis.
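The time-series gap problem mentioned above is straightforward to check for before any historical analysis is attempted. The following is a minimal sketch that assumes a daily series identified by calendar dates; the sample dates and the function name find_gaps are hypothetical.

```python
from datetime import date, timedelta


def find_gaps(observed_days):
    """Return (start, end) ranges of calendar days missing from a daily series."""
    days = sorted(set(observed_days))
    gaps = []
    for previous, current in zip(days, days[1:]):
        if (current - previous).days > 1:
            gaps.append((previous + timedelta(days=1), current - timedelta(days=1)))
    return gaps


# Hypothetical daily series with 4 and 5 March missing.
series = [date(2021, 3, d) for d in (1, 2, 3, 6, 7)]
gaps = find_gaps(series)
print(gaps)
```

A series that trades only on business days would need a calendar-aware version of this check, since weekends would otherwise be reported as gaps.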
The data might be of limited use
Incomplete data has a very limited scope. In a recent project to build a scoring model outside the US, alternative data was found to contribute 8% of the ROC.
ROC (receiver operating characteristic) is a measure of how powerful a predictive model is. The higher the ROC of a predictive model, the better its accuracy. Since alternative data contributed 8% of it, the data cannot be considered completely useless.
However, that is small compared to the 92% of predictive power that came from data sourced by vendors through traditional means of collection. This severely limits the applicability of alternative data.
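The text does not say exactly how the ROC figure was computed. A common way to summarize a ROC curve in a single number is the area under it (AUC), and the sketch below assumes that is the measure meant. It computes the AUC by pairwise comparison for a handful of made-up labels and scores; the function name roc_auc and the data are illustrative only.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive case receives a
    higher score than a randomly chosen negative case (ties count as half).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical default labels (1 = defaulted) and model scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect ranking, so the contribution of a data source can be judged by how much the AUC changes when that source is added or removed.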
Privacy concerns can be crucial
Most of the alternative data sources and collection methods discussed here are a result of commercial activity by users. This raises serious concerns about user privacy and security.
Customers can have severe security concerns. Previously, there were only cursory restrictions on the use of third-party data, and originators operated with much more freedom over their internal data sets.
With the General Data Protection Regulation, which has applied since 2018, the EU greatly tightened the protection of consumers’ data. Many other countries have done the same. Violators face severe financial penalties.
Potential lack of transparency among data providers
Even after all this, if an investment firm has managed to procure a desirable data set, the problem of sourcing remains. The alternative data sector is still new, and the data firms or owners can be quite inexperienced.
They also process the data with proprietary algorithms that are not transparent, which is a concern for many investment firms.
Collecting alternative data also involves crossing many bureaucratic barriers at the companies that own the raw data.
How to assess the quality of alternative data?
Alternative data quality can be evaluated against the following attributes:
Uniqueness means that each item appears in the data set only once. This is the basic idea. Suppose you collect the health records of 100 patients from a hospital and find more than 100 list items. This indicates either that the data set contains irrelevant data or that the records of some patients have been duplicated.
Business decisions depend heavily on how well the data has been analyzed, and any duplication can skew the results and produce an inaccurate outcome. Ideally, data should be 100% unique; in real-world scenarios, however, that can be difficult to achieve with alternative data.
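A uniqueness check like the patient-record example can be sketched as follows. The records, the patient_id field, and the function name uniqueness_report are hypothetical, and the sketch assumes that one key field identifies each real-world item.

```python
from collections import Counter


def uniqueness_report(records, key):
    """Return (uniqueness ratio, duplicated keys) for a list of dict records."""
    counts = Counter(record[key] for record in records)
    duplicated = sorted(k for k, n in counts.items() if n > 1)
    ratio = len(counts) / len(records) if records else 1.0
    return ratio, duplicated


# Hypothetical patient records; patient_id "P002" was entered twice.
records = [
    {"patient_id": "P001", "age": 34},
    {"patient_id": "P002", "age": 51},
    {"patient_id": "P002", "age": 51},
    {"patient_id": "P003", "age": 47},
]
ratio, duplicated = uniqueness_report(records, "patient_id")
print(f"uniqueness: {ratio:.0%}, duplicated ids: {duplicated}")
```

A ratio below 100% signals that the data set should be de-duplicated, or at least investigated, before it is analyzed.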
Time has a huge impact on data. Past sales and product launches are examples where the collected data tends to be accurate only for a certain period. Driving real decisions with data relies not only on gathering the correct information but also on the timeliness of its collection, because the accuracy and value of data can fade over time.
The number of traffic accidents five years ago has little relevance today when a company is trying to decide what infrastructure is required now and in the future.
Data sets collected through traditional sources go through a lot of standardized testing. The validity of data establishes how the data items can be connected to the source, in case there is a need to understand how credible the source is. Data items must be connected with real-world contexts, or the data simply lacks integrity.
The accuracy of a data set is measured against an established version of the truth, which serves as a reference point; any deviation from it reduces the accuracy of the data items. This real-world reference can be based on various business requirements. A data item that faithfully reflects the characteristics of the real-world object it describes is considered accurate.
Data must follow a consistent, agreed pattern in the majority of scenarios. If we look at a collection of birth dates, for example, the date format followed in the US is MM/DD/YYYY, whereas in much of the rest of the world it is DD/MM/YYYY. Any break in this consistency calls the validity of the entire data set into question.
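The date-format example lends itself to a simple automated consistency check. The sketch below assumes the data set is expected to use MM/DD/YYYY and flags any value that cannot be read in that format; the sample dates and the function names matches_format and inconsistent_dates are made up for illustration.

```python
from datetime import datetime


def matches_format(value, fmt):
    """True if `value` parses under the strptime format `fmt`."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


def inconsistent_dates(values, expected="%m/%d/%Y"):
    """Return the values that cannot be read in the expected format."""
    return [v for v in values if not matches_format(v, expected)]


# Hypothetical birth dates; the data set is expected to use MM/DD/YYYY.
birth_dates = ["04/23/1985", "23/04/1985", "12/01/1990", "1990-12-01"]
bad = inconsistent_dates(birth_dates)
print(bad)
```

Note that a check like this cannot catch ambiguous values such as 04/05/1985, which are valid in both formats, so the expected format still has to be confirmed with the data source.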
What are the use cases for alternative data?
An explosion at a natural gas hub in Austria in 2017 affected the availability and supply of gas across the EU. This created huge uncertainty in the futures market, with prices skyrocketing.
Dataminr revealed that its clients already had insight into this event and had every chance to take action before the market moved. The company has received over half a billion dollars in funding across 9 rounds; its Series E round alone raised $391 million.
One of the biggest issues in regulated institutions is the possibility of insider trading, or dumping of shares before any major events. This is very hard to prevent because communications between traders happen over calls, and they are extremely hard to monitor.
Digital Monitoring is an AI-powered company. Its product processes language and delivers human-centric insights that can detect risk. It monitors a large amount of mostly unstructured data and uses AI to transform it into intelligible data points so that institutions can take informed action.
Traditional tools can be very unpredictable when it comes to determining how the financial market will react to certain current events. The amount of data that has to be taken into consideration is huge. This is where artificial intelligence has come to the rescue of data providers, and buyers.
Kensho is a company that combines artificial intelligence, natural language processing, and GUIs. It uses secure cloud computing to provide tools that can process this amount of data.
Kensho is capable of analyzing millions of data points by scanning hundreds and thousands of customized variables. These could be linked to economic reports, earnings releases, or company product launches. The product works much like a search engine such as Google, except that it is specialized for financial analysis, and it allows traders to ask questions in plain English.
Humans can see the difference between images taken before and after a natural calamity has ravaged a place, but it is simply impossible for them to process millions of such images and reach a conclusion promptly. Yet this kind of data is plentiful and readily available from satellites, drones, and other aircraft.
This is a lucrative source of information that financial institutions can use to forecast future events and turn trades into profits.
Orbital is one company that makes this possible. It sources, processes, and analyzes satellite and geospatial images and other data, so that government agencies and businesses can act on the processed information. The product can also be used in other sectors such as energy and agriculture.
Alternative credit scoring
Lenders often face a difficult situation where they do not have much credit information on small businesses or low-income households.
Without traditional credit scores and history, lenders put these cases on the rejection pile, or the applicants are charged a much higher interest rate than people with good credit ratings can obtain. However, a thin or poor credit history is no definite proof that a company or person will default on a loan or fall into arrears.
Aire, a start-up founded in London in 2014, created a credit score for such people with the help of alternative data. This can help small businesses and individuals qualify for credit. The company uses machine learning to generate a score based on the character and capacity of the candidate, imitating human judgment. Whenever there is not enough information on a candidate to generate a traditional credit score, the Aire API steps in.
It is integrated into various lending platforms and conducts virtual interviews that cover the applicant’s financial maturity, career, potential, and lifestyle.
What are the best practices for alternative data use?
Assessing the value of alternative data sets
Quality checks are important when procurement teams are collecting the data. This could be anything from verification of the backtesting models, to the underlying code.
Internal users should also be consulted on the potential value of the data. Although this is time-consuming, it should be treated as a high priority, and some tools can help with it.
Receiving alternative data in a standard format remains one of the highest priorities of investment managers. This can be accomplished with various industry-accepted integrations that send data from one platform to another. The same mechanisms can be used to get earnings estimates, rating information, and market prices.
The data teams at various institutions use analytics languages like Python or R. There are also tools like Excel, Tableau, or MATLAB that can be used to create a consumption experience and make an analyst’s life much easier. In the end, the goal of integration is to make data transmission and handling easier.
Ensuring data quality
The quality of a data set is a matter of how accurate, complete, and timely the data is. This determines whether it fits the requirements of an intelligence job. The data owner should be capable of collecting data from the source continuously, and sufficient measures should be taken to ensure the physical collection and delivery of the data.
The data collection practices of the source should also be monitored closely. Mechanisms such as timestamps can help validate the entire collection process. Selection bias should also be removed to ensure data quality.
Dealing with inexperienced data sources
Often, the issue with a data set is that it comes from an inexperienced source. People try to monetize various data sources without knowing how valuable the data is or how it can be used.
Many kits and guidelines can help in understanding the mechanics of commercializing data sets, and awareness is growing rapidly. Even five years ago there was not much help available in this sector, but that is changing.
Sell-side and other third-party integrators
Firms want to speak to the data sources directly. Being in direct contact with the data source gives them leverage over their competitors. Third-party integrators can do a great deal in identifying and delivering credible data sets.
However, the value of this data can be greatly reduced if everyone can access it. Those seeking to close distribution deals might want to lock in exclusive, one-on-one deals with the sources.
Securing reliable data is hard work
Processing the data into a format that everyone can consume is not a simple task. Additional sources are needed to confirm the validity of many data sets. Some data sets also need to be anonymized before they can be sent to investment managers, which makes the process hard to verify. Suppliers need to understand what quality of data they are receiving if they are to have a transparent relationship with the vendors.
How is alternative data priced?
There are two categories of buyers of alternative data.
The first category is portfolio managers who reap a tiny fraction of their alpha from each of many data sets. This lets them build diverse strategies that aim to make a lot of small bets.
The second category is investment managers who reap huge sections of their alpha from a small number of data sets. These funds tend to have concentrated portfolios, which means they are aiming to make large, high-conviction bets.
The price of alternative data depends largely on the source, the type of information, the size of the data set, and the buying firm itself.
The budget for alternative data is on the rise. In 2018, nearly two-thirds of the market had a zero or near-zero budget for alternative data. However, there are now more buyers in the market, and there has been an 8.8% surge in demand. A quarter of these buyers had a budget just above $1 million in 2018; in 2019, that share rose to 53%.
The potential value hidden within alternative data is practically incalculable, and with so much data changing every day, any alternative data source could turn into a goldmine at any time.
If you want to stay ahead of the game in the stock market, check out the top alternative data providers on Datarade to start your hunt for the right data for you.