Matei Zatreanu (How alternative data is changing the world of finance?)

By Judgment Call Podcast April 21, 2021 4:14 PM UTC


00:01:43 How System2 discovered a world of alternative data and how financial data companies in this space operate?

00:12:23 Can there be a profitable Open Source investment community?

00:25:53 How much time is spent developing algorithms or cleaning datasets in Matei's work?

00:34:11 What are surprising and low cost datasets one can acquire?

00:44:20 How good are predictions for actions of individuals today? Do Facebook and Google know everything about us?

00:58:20 Why antagonism is a great way to learn about your strength?

And much more!

You may watch this episode on YouTube - #68 Matei Zatreanu (How alternative data is changing the world of finance?).

Matei Zatreanu is a quant, investor and entrepreneur and now heads System2.

Welcome to the Judgment Call Podcast, a podcast where I bring together some of the most curious minds on the planet. Risk takers, adventurers, travelers, investors, entrepreneurs and simply mindbogglers. To find all episodes of this show, simply go to Spotify, iTunes or YouTube or go to our website judgmentcallpodcast.com. If you like this show, please consider leaving a review on iTunes or subscribe to us on YouTube. This episode of the Judgment Call Podcast is sponsored by Mighty Travels Premium. Full disclosure, this is my business. What we do at Mighty Travels Premium is find the airfare deals that you really want. Thousands of subscribers have saved up to 95% on their airfare. Those include $150 round trip tickets to Hawaii from many cities in the US, or $600 lie-flat tickets in business class from the US to Asia, or $100 business class lie-flat tickets from Africa round trip all the way to Asia. In case you didn't know, about half the world is open for business again and accepts travelers. Most of those countries are in South America, Africa and Eastern Europe. To try out Mighty Travels Premium, go to mightytravels.com slash MTP or, if that's too many letters for you, simply go to MTP, the number four and the letter U, dot com to sign up for your 30 day free trial. I like how you say that and now it's settled. Yeah, I love it. I love it. I love it. Thank you for coming on the podcast. I really appreciate that. Thanks for being here. Thanks for having me. Hey, absolutely. And I really appreciate your background. You've been a quant. You've worked in the financial industry and now you're an entrepreneur. You're running your own startup, System2. And from what I understand, it's all about alternative data and how this massive amount of data can be used to make companies more efficient, make companies better and also make more money. Maybe you can tell us a little bit about your company and your background. Yeah, I'd love to. 
So I start off in relatively traditional finance environment. I start off at an investment bank and after that, this was Lehman Brothers in 2008. They went under during the last crisis. It's kind of funny because now you have to specify which crisis it was. It used to be enough to just say like, oh, during the financial crisis and now that we've lived through multiple once in a lifetime crises, you have to kind of specify. But after that, I end up at a hedge fund where it was, I ended up spending almost a decade at this fund. And my role was, from an educational perspective, my background is in math and stats. But this fund was very traditional and had humans that were making these decisions about the companies to invest in. A lot of those decisions were made sort of qualitative. I mean, obviously, there's a lot of information, a lot of data that was consumed, but it was done in the more ad hoc manner by individual human beings. And I ultimately ended up starting a data team, data initiative for this fund that was responsible for answering. I figured out what kind of questions would we even want to understand, an ideal world? If you can have an answer to any question, like I was like, Google 2.0, and you can know anything about the companies that you're looking to invest in, what would you want to know? That started off this waterfall, this cascade of searching for different kinds of data sets that they can answer some of these questions that, beforehand, could not be answered. So, for example, and we can get into this later, but we started having access to things like credit card data. So, seeing consumers transactions on a daily basis, like some percentage of the consumers transactions, and that gives us insights into how companies are performing. We have data for everything from web traffic to location to all sorts of very niche data sets. For example, the mining industry or the travel industry that are really good to answer those kinds of problems. 
And about five years ago, I decided to leave the fund because I became convinced that the future of investing was going to be data driven, and I wanted to start a company that can really make that happen. So, we started System2, and the idea behind System2 is that we become the data team for all sorts of companies. It started off being mostly investors, for no better reason than that's just the world that we came from, who don't have an in house data team, but want to leverage some of these data sets and the expertise that is required to make sense of them in order to solve real business problems. We're not academics, we're not here to write some interesting hypothetical paper on some obscure insight that we've found. We have to find something that's actionable. So, being actionable is really critical. And over time we've been expanding, and we now work with corporates. I mean, the demand, not necessarily for data, because people don't really care about data, but for insights, the answers to these questions that up until now were impossible to answer, is just exploding, and having that kind of a skill set and ability to deliver these kinds of insights is becoming more and more popular. Yeah, you sound like the person who's up for the job. It's pretty amazing. And when I did some research about the company, indeed, I didn't find a whole host of white papers. There were a bunch of podcasts you were on. And that seems to be it. So, we've noticed that a lot of quant shops and hedge funds are extremely secretive. They're kind of like Apple. So, everyone who even goes out and mentions a little detail gets sued or changes employers. And that seems to be working for them, right? So, they seem to be in this particular niche of innovation that they have mastered. So, it's probably a very small secret. That's why they have to guard it so much. But they've been very profitable. Renaissance has been extremely profitable on a data driven investment mindset. 
How do you see the investment industry right now? There have been data driven funds, or at least we recognize them as such; I don't know how Renaissance actually works. There have been data driven hedge funds for quite some time. They haven't done so well in the last two or three years, unfortunately, or maybe fortunately, that depends on your perspective. How do you see this world evolving? Yeah, that's a great question. So, just to kind of unpack the questions a little bit. So, in terms of the secrecy angle, you're absolutely right. I mean, if you look at even our own website, it has very little information on it. And that's again, just a byproduct of the hedge fund legacy. If you go on the websites of pretty much every single hedge fund, all you're going to see is just a splash page saying like, yes, we are this hedge fund. And there's just a login. So, if you're an investor in the fund, you get to log in. But otherwise, there's absolutely nothing on these pages, except for maybe a disclaimer. Because you have to have all of them. It's good for them not to have to maintain any infrastructure to deal with their stakeholders and shareholders. Well, I mean, that is true. There's definitely no need to deal with it on a public level. But there is definitely a ton of stuff behind the scenes, because you have to do reporting. And there is actually a decent amount of infrastructure that's built to report to the end clients or the investors of these hedge funds. But that's partly driven by regulation. So, there are all sorts of rules around how hedge funds can market, or how investment firms in general can market to people. So, to be safe and to be conservative, they haven't done that. Having said that, it is actually more collaborative than it might seem. There are a lot of conversations that are happening, but they're happening behind more closed doors or, said another way, they're happening through old school mechanisms, like picking up the phone and calling. 
So, before analysts at a hedge fund, that again, it's very secretive. And right now I'm not talking about the quant investors, because there are two classes: there are the fundamental investors that still have people in the mix, you know, think of the Warren Buffetts of the world that are really deep, deep down in the fundamentals and thinking about these companies, and then you have quants on the other side that have machines and computers doing all the trading. So, I'm going to focus on the fundamental investors to start. Before they make these kinds of investments, they still pick up the phone and call up the sell side. The sell side is, you know, the banks that put out research papers. They'll see what is known out there, because it's not enough for you to have a view about a company, you have to have a differentiated view. If everyone else knows the same thing that you know, you have nothing; everything's already priced into the stock prices. And so, there is a lot of collaboration that is happening. But at the same time, yeah, it is just old school, like picking up the phone, calling your friends, you know, friends that are at hedge funds or friends on the sell side. And there are entire companies and ecosystems that have sprung up around this, right? There are the expert networks, whose whole business model is to put you in touch with an expert who knows something about a certain company. But yeah, I think overall you are right. It is much more secretive and people do feel like they have something proprietary. To be honest, that's never been my view. I mean, I think the reality is, especially in the finance world, people think that they're the smartest people in the room, like everybody thinks that. And it's also like everyone's above average, right? Everyone's above, of course, absolutely. 
And the reality is like there are a lot of smart people that are obviously working at other hedge funds and even in other industries. And so there is this benefit from being a little bit more collaborative. And what's been happening recently, with the rise in fintech companies and companies that are working on these alternative data sets, is that there definitely has been a lot more communication and there are more white papers. When I first started doing this, it was around like 2014, just knowing that a certain data set existed was like huge. Very few people knew that you can buy credit card data. And then if you happened to know who the vendor was and you had access to that data set, you could make some really interesting trades. Now, all the incentives point in the direction that the vendors want to sell to as many people as possible. And what ends up happening too is that people will just leave their jobs and it's like this constant rotating wheel of who's where. And that just leads to more information getting disseminated. So the reality is, and this is where we come down on these issues, that it's not so much about the information advantage. I mean, there is usually an information advantage, but it's usually for a very limited period of time. After that, it's much more about the analyses and just rolling up your sleeves and doing the hard work of analyzing these companies. And so that's sort of the secrecy angle. And you were talking about some of the quants like Renaissance and others, like their approach, like, yeah, they've been using data for a very long time. But to define our terms a little bit more sharply, the data that's been used in finance for many decades is the traditional data. Compared to alternative data, the traditional data tends to be more around prices. So what is the stock price, right? This is a very simple question. 
But if you actually break that down into what it how you arrive at a stock price, it actually is a very complicated question. Because there's a market out there, right, where people are constantly trading these securities and there are prices for every single, every single trade. But these trades can happen in very small increments. And so like just to give you a simple example, like, well, okay, do you take the average price over a certain hour, like if you break the day up into hours, and you say, okay, what's the average price of all the trades happening in that hour? Like that sounds like a very reasonable way to find out the average price of something. But you might have a situation where during lunchtime, like nobody's trading, or if there are very few trades, and there are certain hours where there's a lot of volume. So now let's say, okay, well, maybe I want to place a higher emphasis on the prices where a lot of stuff has transacted versus like some one off guy who has happened to really want this stock during lunchtime, and he paid a lot of money for it. That doesn't necessarily mean it's not reflective of what's what the real prices of these securities. And so these traditional like quant folks spent a ton of time analyzing very granular like price and volume data, and trying to find patterns within that price and volume data. And there's obviously there's some other information that gets layered in there, but that was the bread and butter of it is being really good and trying to find these patterns on various granular scales, but across like tons of, you know, like long periods of time across tons of different securities. Yeah, it seems to be, I looked into that some time ago, and it seems to be a spark of I'd say an open source hedge fund. I'd say movement out there and there is alpaca, which is which is a broker where you can run your algos. 
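A minimal sketch of the volume-weighting point made above, assuming a handful of made-up trades within one hour (the prices, share counts and function names are all hypothetical and not from the conversation). It compares a plain average of trade prices with a volume-weighted average price, so that a single small, expensive lunchtime trade does not distort the result.

```python
# Hypothetical trades within one hour: (price, shares traded).
# All numbers are invented for illustration.
trades = [
    (100.0, 5_000),
    (100.2, 8_000),
    (110.0, 10),     # one-off buyer who paid up for a few shares
    (100.1, 7_000),
]


def simple_average(trades):
    """Average of trade prices, treating every trade equally."""
    return sum(price for price, _ in trades) / len(trades)


def vwap(trades):
    """Volume-weighted average price: heavier trades count for more."""
    total_volume = sum(volume for _, volume in trades)
    if total_volume == 0:
        raise ValueError("no volume traded in this window")
    return sum(price * volume for price, volume in trades) / total_volume


simple = simple_average(trades)
weighted = vwap(trades)
print(f"simple average: {simple:.3f}")
print(f"volume-weighted: {weighted:.3f}")
```

The plain average is pulled up by the tiny 10-share trade, while the volume-weighted figure stays close to where most of the shares actually changed hands.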
There's been wealth fund, I think they started with something similar, but they moved away from that business model. And there's a lot of stuff on like code repositories like GitHub where you can basically download see what works for you. You can play with it, different AI strategies, and you can put in any data you want. But I fully agree. Typically, it's still the same kind of data, right? It's pricing data, it's time series from different pricing movements. It's volume data. And it's it's, it's kind of boring data. It's from what I've been told, there is really no music in that anymore, because so many people run their algos on it. And it's just unless you have some incredible insight, there's a very hard to make money with this, especially in markets. And you said that earlier, where we have, I think a financial term, where it's a one in a lifetime, I think there's a better term with a standard deviation. What is it? You probably know the the once in a lifetime crisis, which should only happen once in a lifetime happens every five years now, strangely, right? Because well, because we built these incentives, like we moved so much money on the other side of the scale, that's why we have the sudden crisis is the the open source community seems to be up to something. But obviously, you said that earlier is if once a certain amount of people have the same data, then your edge goes away and you can't make money. So the open source investing in that sense, if it is exploiting a certain data stream is probably it's an oxygen mode is probably never going to work. But I think yeah, I think it's it's a fair point. And it's, it's just a matter of timing, like it can work. But for some amount of time before other people pile on to it, I think that's one of the cool things about finance and what initially drew me to this is that what we're if we abstract away from our day jobs, what it is that we're ultimately doing is making predictions. 
And people make predictions in all areas of life. And finance is one of those few areas where you can actually figure out if you're right or wrong. And there's been a ton of work by my folks like like Tetlock, who have studied this phenomenon of like what makes a good forecaster. The problem is that you see this all the time on TV, where different pundits are making some kind of like things that are essentially forecast, but they're done in such a hazy way, that's really hard to pin them down to either they don't specify timing. So it's like, yes, I predicted whatever this is going to this is going to happen. But and it has 50% chance that the market goes up tomorrow. And it goes down the stunts. It's a, it takes a moment for the non expert to parse and realize it's all complete bogus, right? Do you absolutely right? Exactly. And there are other places like, you know, games like poker where people are making some kind of decisions and you're, you're trying to read your, your opponents and try to make essentially forecasts about what do you think is going to happen to the cards that are being dealt. And, but finance is a cool opportunity because you can make forecasts about so many things in the world. So many different companies, so many countries, so many different, like probably what have you, there's a way to express your belief and a way for you to find out whether you're right or wrong. So I think that component of it is really interesting. And so going back to like the open source movement, or if there are people that are trying to democratize this, it's, it's definitely possible, but it is one of those things where when somebody else, when a bunch of other people figure out the same thing as you, the, the, the entire alpha, like the edge that you might have had before goes away. 
And we can talk about this more as well, but what's been happening recently that's kind of interesting is the whole WallStreetBets phenomenon, like the Reddit forums where people are getting interested in investing and they're collaborating, but collaborating in a very different framework. It's no longer like collaborating with your coworkers, because they're independent traders now collaborating within these forums. And a bunch of this stuff has moved to other places. I mean, there's Discord, if you're familiar with Discord, it's like a Slack app that's for gamers, typically, but now they have all sorts of these closed servers that you have to apply to join, and they have to let you in or not. And once you're inside, people are discussing these kinds of ideas and they're sharing this kind of research. And so there are a lot of ways to collaborate and to share these kinds of insights, but some of the traditional ways of making money are still true, right? You have to know something that somebody else doesn't know. And the other side of it too is that you may be right about something, but like I mentioned earlier, the timing perspective is huge. And so, you know, the old joke is that someone's ability to be right doesn't matter, because it's ultimately outweighed by someone's ability to stay solvent, to stay in the market. So there's a lot that goes into it. And it's kind of cool because you are making these forecasts, but at the same time, you're also playing a game essentially against other people and trying to figure out what other people might not know. Yeah, the timing aspect is usually what gets me. I feel like my forecasts are usually correct, but my time period is completely off, by like five or 10 years. So my market timing is horrific. 
And I, I usually get out of an investment and then it goes that way sooner or later, but it's like five years later. So I'm, I'm, I'm terrible at this. And well, I was, I was chatting with Jim Rogers a few weeks ago. And he's been, he's been saying that's, that's still with all the knowledge he has accumulated. He's one of the best investors we know. And he definitely is one of the most public ones. And he said, well, market timing is impossible. You can't sometimes predict this catalyst, right? This is this moment when things actually change and they fall into place when you predict it all the time, but it might take 20 more years until this catalyst actually comes along. And you have to be extremely patient of it for these things. And by the time maybe your whole investment hypothesis, you might have forgotten about it because it's been 20 years. I don't really remember what I thought 20 years ago, about the next 20 years, right? So it's quite challenging. But you've been, and that's what you do with your company, you go into a slightly different approach, right? So alternative data basically means you use data that basically has nothing to do with the financial industry, like credit card data, or through this, I would count that still as the financial industry, but it could be hotel occupancy, or it could be, I don't know, traffic data, for instance, location data. And you take that data, you massage it, run it through algorithms, and then have the machines and the analyst make predictions in order to, and you use this knowledge to trade yourself or you sell it to hedge funds. How does this work? Yeah, so right now we are not, we're not trading ourselves, like we are answering questions that our clients come to us with. And so hedge funds or financial clients will have these kinds of questions about a certain company that they're looking to invest in. 
And really, I think that's the main differentiator between what the fundamental analysts do versus the quants. The quants, again, are just trying to find these high level patterns in data, usually pricing data. Fundamental analysts are trying to find a story. So why is a certain company doing what it's doing? Why is Peloton succeeding? And, you know, you can have all sorts of stories, like obviously with COVID, people are staying home, they're no longer going to gyms, they're going to buy a Peloton for their house. But then, once you start forming this story, there are so many different possible paths with that story. And now you're saying, what happens when gyms reopen and people get vaccinated? Are they going to go back to the gym or are they going to like their Peloton so much that they are changing their behavior and sticking with that? And so once you have these kinds of questions, you can then try to figure out, okay, what kind of data set could I use to help me answer these kinds of questions? Like you mentioned, these data sets can be anything. And the only way to think about it is that there are so many companies out there doing their just regular day to day business operations. But in that process, they're collecting information. And that information can be and usually is valuable to people like us who are trying to understand something about that industry. So whether it's, you know, a company that works with internet service providers, and they're providing some kind of, I don't know, some sort of service to those internet service providers and, in exchange, they're collecting data about traffic to websites. Or, you mentioned, you know, hotels: there are companies that are working with the hotels, the airlines, all these different components, that can decide to essentially sell their data in order to give folks like us access to it so we can answer these kinds of questions. 
So it's an ability to, first of all, to think much more deeply about the questions that we're trying to answer. And then almost, it's almost like we're detectives, right? Okay, this is where we're trying to figure out like, what is out there? What might be out there that can help us answer this question? And how do we go about getting that data? And there's an entire process that involves like, first of all, like, let's figure out what is we're trying to solve? Like, what are some hypotheses of ways we can, we can try to solve this? And then how do we find the data? It used to be the case that you had to pick up the phone and call. So for example, there are companies that provide expense services. So if you send in pictures of your receipts, and they help you with your accounting and expense management, and we realize, okay, some of those, those companies have data on on travel plans. So if you're an employee of a company, you have to submit your expenses, and it's usually run by some of these companies. So why not give them a call and see if they would be willing to provide us with, you know, anonymized access to this data, so that we can figure out like, which rental car companies are people choosing, which flights, which hotels, and, and that way we can answer these kinds of questions. But over time, what's happened is that more of these, these companies, like there's been an entire ecosystem, or marketplace where there are now brokers that can help you find data, they're more companies that are trying to actively sell their data. So it's no longer such a such a search that you have to undergo, it's much more of, okay, let's figure out which of these items, data sets from a menu we want to choose. And that's only that's only the beginning. So once you have the data, there's an entire like engineering analysis process, because all these data sets are massively flawed. And our job is to figure out like, how, how much do they suck? 
I mean, that's the one sort of like one terrible part of our job is that you're talking to anybody who's trying to sell data, and their incentivize to tell you like, this is the most fantastic data that you'll ever see. The reality is it's all terrible. It's just the question of how terrible is it? And can we identify where, where is it terrible? Because once we know that, then we can start to figure out, okay, how can we fix some of the issues that that the data has. And the main one being is that we don't, we never see all of the data. It's not like this is, you know, those are even like the pricing data that we discussed, where you have a market that actively trades stocks, you can see all the information that's happening on that market, even that, by the way, took some time until they got the kind of disclosures and transparencies that investors really wanted. But here it's like, okay, I'm on this, you know, this company that handles receipts, and I only see maybe 1% of all the bookings for this airline or this hotel chain. And you can imagine that 1%, by the way, 1% is an average percentage that we see in terms of some of these, some of these data sets. And so we have to make do with just like 1% of all the customers. And how do we extrapolate from that, try to figure out what's happening to the rest of them. And by the way, that 1% is that's another really terrible thing is that it's never static, like these, these companies, these data companies are constantly adding new sources to their panels, they're losing access to some of their sources. So it keeps churning. And that churn causes a lot of a lot of issues. So there's a, we spend the vast majority of our time just trying to deal with all this like statistical nightmare of what's happening in the real world of data. And it's super messy. 
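A minimal sketch of the panel problem described above, assuming a vendor panel that sees roughly 1% of customers and grows when the vendor adds a new source. The figures, the `POPULATION` constant and the function names are hypothetical. It only illustrates why raw panel totals mislead when the panel churns; it assumes the panel is representative, which real data rarely is.

```python
# Hypothetical monthly panel observations for one merchant.
# panel_users: active cardholders in the vendor's panel that month
# panel_spend: total spend by those cardholders at the merchant
panel = [
    {"month": "2020-01", "panel_users": 1_000_000, "panel_spend": 2_000_000.0},
    {"month": "2020-02", "panel_users": 1_000_000, "panel_spend": 2_100_000.0},
    # The vendor adds a new source: the panel grows 50%,
    # but per-customer behavior at the merchant is unchanged.
    {"month": "2020-03", "panel_users": 1_500_000, "panel_spend": 3_150_000.0},
]

POPULATION = 100_000_000  # assumed universe of customers (panel is ~1%)


def estimate_total_spend(row, population=POPULATION):
    """Scale spend per panelist up to the assumed full population."""
    spend_per_user = row["panel_spend"] / row["panel_users"]
    return spend_per_user * population


def growth(values):
    """Period-over-period growth rates for a list of values."""
    return [later / earlier - 1 for earlier, later in zip(values, values[1:])]


raw_growth = growth([row["panel_spend"] for row in panel])
est_growth = growth([estimate_total_spend(row) for row in panel])

for row, raw, est in zip(panel[1:], raw_growth, est_growth):
    print(f'{row["month"]}: raw panel growth {raw:+.1%}, '
          f'panel-adjusted growth {est:+.1%}')
```

In the third month the raw panel total jumps by half only because the panel grew, while the per-panelist estimate shows flat spending. Handling a panel that is also biased toward certain customers would need further corrections that this sketch leaves out.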
And there's, there's tons like business negotiation terms, like things that fall south, like gossip, all these, you know, we heard that these guys are about to lose one of their main suppliers, like what are we doing that case, how do we reposition entire engineering efforts in the chance that yes, we're about to lose one of these like major components of our data sets. And then ultimately not losing track of the fact that we need to come up with good forecasts for, for the questions we're trying to answer. This sounds fascinating. And one example that, that I personally, or personally from the media, I'm aware of is there were a bunch of Chinese companies who were listed at the NASDAQ and they claimed they have a certain factory output. And they had a certain amount of customers and there are a certain amount of business basically. And then based on satellite imagery, there were basically nobody ever went in or out the factory. And that was a clear sign that this business is not doing so well and might hide some facts. And I, I found that a genius idea. Maybe, maybe you guys would have thought of that immediately. I thought, well, this is a quite a difficult connection to make that you apply some satellite custom satellite imagery, because you need to change the amount over a couple of hours. When we look at this, I've heard this from, from other people who deal with big data that used to be the, the, the name before now it's moved more into AI, right? A lot of people have claimed that this is often the biggest problem is to, to clean up the data sets. And then there's a bunch of algorithms that basically you just download from GitHub or wherever you download them from, right? This is almost like TensorFlow gives you a huge library of potential algorithms that could show you patterns in the data that you see. 
When you, when you look at the challenges, and I think you just outlined that, is the algorithms that you employ in order to find patterns, is that a big deal or is 90% of the work is really massaging the data? Yeah, that's a, that's a great question. And it's, the answer is, is more complex when you, when you just go into the weeds of it. So algorithms, I mean, that's been the, the sexy thing, right? Everyone's talking about AI and machine learning. And there have been tons of companies that have raised a lot of VC money because they claim that they do all this amazing machine learning stuff. And you're absolutely right. A lot of the actual math behind it is publicly available because it tends to come out or has been coming out of academic institutions that are just putting up the, the papers online for people to, to implement. And a lot of code is already written, like you mentioned, that you can download these libraries and Python from, from, you know, your favorite repository and you can implement this into your code. So that has been more commoditized. I mean, when you go to the edges, you will find examples like, yeah, I guess the companies that have an advantage on the algorithmic side can do things like process data faster, maybe come up with an algorithm that's more efficient than the code that you can run there. And yeah, maybe that makes, makes sense if you're trying to, if you're a massive company and you're trying to analyze tons of images, for example, and you, any improvement in the speed of that processing power, it makes a, makes a huge difference. But for most of the things that we're talking about, like I mentioned earlier, like, most of the time we're only seeing a couple of percentage points of the transactions out there in a certain domain. And even those can be significant. But the challenge isn't really speed. Like we have plenty of time. Like we never butt into processing power or speed issues when we're running into challenges. 
It's much more the, the, the formal point that you're making is around the quality of the data. And honestly, the only way to solve that is, is not, in my opinion, it's not through better modeling. So I'll give you a perfect example. We're actually talking about this in our company this, this week, where you can have and try to make a forecast on something. And one of the, one of, one of us shows a result saying, okay, like this model works so much better than this other model. And your instinct might be, okay, great, fine, let's just like show this to the client. But what you really want to do is just say, okay, let's take a step back and try to understand why is that one model working better than the other one. And this goes back to just my personal bias, which I came from this, this hedge fund that was very fundamentally driven or thinking about companies in a fundamental way. It's the same thing with data. Like we think fundamentally about the data. So if we don't understand why something is happening, we won't do it. And this is honestly, this is a big contrasting point to how the quant funds operate. Like for them, they don't need to find real world causal mechanisms between the cause and effect. They just found a pattern. And they've seen that in the past, that pattern tends to hold true in these instances. And because they're working on such like high frequency time scales and across like so many different securities, they just need to be right 51% of time. They're making like 1000 millions of these kinds of transactions on a regular basis. And so if you're a little bit better than 50%, you can make a ton of money. Like we are, we are much more concentrated, right? Like when we think about these problems, like it's also an analogous to like the shotgun approach, we're just like spraying out these pellets everywhere and they're hoping someone we're going to hit your target versus like we're taking rifle shots. 
And our rifle shots have to be really accurate because we only get one shot on the target. And so when you have that, you have to have some version of this causal mechanism of why something works. So to give you a concrete example, look at COVID, right? Take the models that worked pre-COVID in whatever industry. I know you're in the travel space, so in the travel world, we had models that were forecasting hotel occupancy and revenues a year in advance. That was based off of historical data: we knew what was happening over the past five or ten years that we had data for, and we were making forecasts for the next year. And those forecasts were awesome. They were really, really good. But then COVID happens. And now, obviously, something has changed in the real world. So one of the things you can do is say, okay, I can choose to continue using my old models, but those relationships are clearly no longer working. Our accuracies have gone down. Or I can change the model, but how do I change the model to account for this? You can create models that take into account what's called a regime shift. So we had some version of history where this pattern was true, and we are now in a different regime where some other patterns might be true. But the choice to switch to a regime shift model is driven by something that happened in the real world. The problem with that is, go back a year. If I did not know COVID was going to happen, or didn't know the extent of COVID, I would not have known to make that change in my models back then. So in some ways, by using this recent history, you're cheating, because you would not have had access to this information before.
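The "cheating" described here is usually called look-ahead bias. A minimal sketch follows, using made-up occupancy numbers and two hypothetical forecasters, neither of which is the model discussed in the conversation. `honest_forecast` only uses what was known at each point in time. `hindsight_forecast` is told where the regime break is, which in real time nobody knew.

```python
from statistics import mean


def honest_forecast(history):
    """Forecast the next value from everything observed so far."""
    return mean(history)


def hindsight_forecast(history, break_index):
    """Forecast using only data after a regime break.

    The break date is supplied from outside, which is information
    that was only available after the fact.
    """
    current_regime = history[break_index:] or history
    return mean(current_regime)


# Hypothetical monthly hotel occupancy: 12 stable months, then an abrupt shift.
occupancy = [0.70] * 12 + [0.30] * 6
break_index = 12

honest_errors, hindsight_errors = [], []
for t in range(break_index + 1, len(occupancy)):
    past, actual = occupancy[:t], occupancy[t]
    honest_errors.append(abs(honest_forecast(past) - actual))
    hindsight_errors.append(abs(hindsight_forecast(past, break_index) - actual))

print("honest mean error:   ", round(mean(honest_errors), 3))
print("hindsight mean error:", round(mean(hindsight_errors), 3))
```

In a backtest the hindsight version looks far better, but only because the break date was handed to it. A fair evaluation would require the model to detect the shift from data available at the time.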
And another example that's really relevant here is that, like I mentioned, we're using credit card data to calculate the revenues for all sorts of companies. What's happened post-COVID is that the data has become much more representative of what's actually happening in these companies. So in some ways, our models are becoming less useful. I mean, our models are making all sorts of adjustments, and they're still doing a bunch of that, but the data itself has become more representative of what's happening in the real world. And the question is, okay, do I need to change my model again? What's happening here? If you go back to causal mechanisms, you can start hypothesizing about things that are playing a role in this. One major one is that people are much less likely to use cash in transactions. Cash was a big blind spot that we just didn't see: this was a group of customers whose behavior was not represented in our dataset. And depending on the company, that group of customers could make a huge difference. And now, as those customers are moving to credit cards and to online spending, we're seeing a more complete, more representative picture of all the customers of this company. And that sounds like a plausible story, right? It makes sense to me, it makes sense to the people we're telling it to. But in our world, it's still just a story. It's good to have this kind of a story, but I don't want to change my model just on that. The next step is to say, okay, can we actually validate this? And we do actually see this in the data: we can see how many withdrawals are taken from an ATM. So we don't know where they're spending the cash, but we can see that some people used to be much bigger cash users than others. So do we see those people now moving on to credit cards, and so on?
So the idea is that once you have these causal mechanisms of what's happening in the real world, you can inform model decisions. And then you test those model decisions. And if that ends up playing out, and it's like, yes, cash has an impact, we now have a proxy. It's not perfect, but it's a proxy of cash usage. So we can now account for that in our models, and our models have improved, versus just saying, oh, our models are machine learning, it's a black box, we don't know why the models changed, but we'll just accept the fact that they've changed. So that's been our bias: we want to understand what's happening. And when we communicate with our clients, the majority of our clients are humans, right? So it's not enough for me to tell you, hey, my model just did this, trust my model. You have to understand the story. And the way to do that is through stories and through these explanations that make causal sense. And you're like, okay, yes, this makes sense. This adjustment makes sense. I'm now going to trust this new model.
Yeah, I find that's a real big problem in selling models, right? Because they're, as you said, essentially a black box, and it is heresy in the machine learning community to ask the questions you just explored, right? There is no hypothesis, there is no causal relationship, there is never a question of why this model works. It works or it doesn't, right? And that's the loss function. It's all inside of it. On one hand, obviously, this creates enormous scalability because you never have to worry about these difficult questions. On the other hand, we are very often predicting the past. And, as you said, when major assumptions change, the models just go awry, and there is not much you can do about it. You just have to wait until you have enough of a data stream, you cut off the data, and then you start from scratch again.
And then you basically ignore anything from before that data stream started. When you look at really cool data sets that we don't know about, like, I didn't know you can order custom satellite imagery just like that. I thought you need to be, you know, the CIA or the Russians. What are really cool data sets where, when you tell your family or friends about them, they say, well, really, that's possible? Ones that are relatively low cost, say under $10,000.
Yeah, that's a great question. I will say that people used to be much more surprised when we first started this, you know, five or ten years ago. And exactly to your point, we've worked with satellite imagery, and it's exactly as you say. You can take the images from the satellite, because the satellites are constantly flying over the planet. You can take what's given to you, but you can also pay extra and reposition the satellite to get special images for you. And it is very science-fictiony, this ability to go in there and try to collect all this kind of information. So just one point on the satellites before I move on to some of the other categories that I think are interesting. Satellites are definitely cool. It's really cool to be able to pull up that image and to have that ability at your fingertips, on your laptop, from a cafe in some random country in the world, to just say, I'm now going to reposition a satellite. It's not quite like that, because you basically have to pay an invoice and someone else does it for you, but that's effectively what you're doing. The reality, though, once you get past this coolness factor, is that there are some major limitations with satellite data. One of the main ones, and it depends on the use case, right? We mentioned the use case of seeing whether a factory is operational or not. That's kind of a binary signal.
Like, yes, there are people there, or there's nobody there. And that's a very easy signal to detect from a static image. The problem with satellite imagery is that, from a temporal perspective, the images are very sparse. The satellite might image the same area of the globe once every few months. One of the other use cases that people were trying to use satellites for from the beginning was to monitor the number of cars in the parking lot at a store. And okay, yes, theoretically that's a good proxy for the kind of volume that the store is doing. The problem is that, in reality, you're seeing an image from, like, a Tuesday in January, and the next one is from a Sunday in April, perhaps at different times of the day. There's not much you can do with that, because obviously weekends play a much bigger role in the traffic to a mall than the image that you have would suggest. And so that's one of the main problems. We don't have regular, daily, same-time-of-day images of the same parking lots over time. So the satellite data has not been as useful for those kinds of questions. It can be useful to do things like monitor construction progress. So if there's some big power plant, or some kind of project that I don't have easy access to because it's in the middle of nowhere, there are barely any roads that lead up to it, and it's gated off, and I want to know, are these guys on track to build their facility or their power plant, then yeah, using satellite imagery is awesome. But then, putting yourself in our shoes and playing the detective here, it's cool because you kind of feel like a secret agent, but at the same time you don't have any of the resources of the government, right? It's not like we can just do anything that we want.
And so we're like, okay, what's the poor man's version of this? Satellites would be super expensive, and it's physically impossible to get this data on a daily basis. What else can we do? So then you start thinking, okay, what's the next best thing if I want imagery? Well, can I pay some guy to fly an aircraft over this area and take pictures for me? Okay, that sounds plausible. How much does that cost? And by the way, this is my favorite part about our job. It's unlike the textbook examples of data analysis, where you're just given a data set and it's like, okay, implement some cool models on it, where the emphasis is just on the models and, you know, the data is super clean. For us, the data can be anything. And we tell the folks that join our company, don't be bottlenecked by just what's handed to you. Use your creativity and figure this stuff out. So it's like, yeah, airplanes fly, we know that you can take pictures from airplanes, great. Okay, how much does that cost? I have no idea. How do you even start approaching that problem, right? Just go to a small airport and call them up. There are people that fly small aircraft, and you're like, dude, you're flying, how much would I have to pay you?
Now you're talking real, like, James Bond type, right? Secret agent.
Exactly. Exactly. James Bond when he's in trouble and no one wants to help him, you know, and you're just like, I don't know, I've got to go make friends, and I obviously need to pay some money for this, but I don't have a lot of money. So how do we solve it? And yeah, we can definitely do that. You can pay people, and it turns out that for three grand you can get somebody to fly an aircraft for you and take pictures. You can also pay people with drones, right? There are now drones everywhere.
So you can pay a drone operator who lives in that area. That's even cheaper. And you can get videos, you can get better resolution stuff. You can also just pay people to drive by and take pictures from their car. If they're on public property, on public roads or sidewalks, and they can just take pictures of these things, that's great. Why not do that? So there are all these different ways of collecting information once you understand what the real problem is. But in terms of other data sets out there, I mean, there's a lot. There's a lot of information that's being collected. One of my other favorite ones was antivirus companies. Let me take a step back: anything that's free. And this has always been a joke, like, if you're not paying for something, you're the product, right? So there were antivirus companies that were providing the software for free, but in exchange, they were basically collecting all this data and selling it to people like me, and we were using it to identify people's browsing behavior and understand the websites that they're going to. You have data on everything from mining operations to railroad companies. Location data is another category that we mentioned. We're all carrying tracking devices in our pockets, right? Our cell phones are constantly tracking our location, and there are a bunch of apps that have access to that. And that information is increasingly becoming available for sale. So that's another example, going back to the mall example: why track the cars if I can track the people? And now I see them much more. I see them on a regular basis. I can see exactly where they're going.
And there are tons. Honestly, any company that you can think of that does something, they're probably collecting data, and that data might be useful. And there are similar things across the world. There are companies that are doing this across the world. So if you wanted location data in Brazil, it exists, and in Asia, all these things exist. And there are also some more bespoke things that are relevant to some esoteric investment that you care about, like understanding data center consumption. That's one of the trends where we're trying to figure out how many clients a data center has. And you're like, hey, how do I solve that problem? And you brainstorm a bunch of ideas. What's really interesting is that you're trying to think about what's correlated with the thing that you're trying to study. So maybe I can get access to, you know, the number of people coming into that data center, but that's not going to be representative of how many customers they have. One of the things that is representative is electricity usage. If I know how much electricity a facility is taking in, I can figure out how many servers they have in that data center. So then you ask yourself, okay, how do I measure electricity? And you come up with, like, five different ways that you might be able to do this. Some of them are really crazy. Some of them are less crazy. One of the things that we did two years ago, we were like, okay, we should be able to measure electricity. Maybe we can get to the meters, but if the meter is on the property of the data center, we won't be allowed in, and we're not trying to break any laws. But electricity travels through wires. And we can see the wires going into this facility from a public road. Can we figure out how much electricity is going through a wire? And we're like, how do we do that?
And we're like, okay, well, and this is where it helps to be a jack of all trades, to know a little bit about a lot of stuff, but definitely not enough that you can do any of these things on your own. We're like, okay, electricity has something to do with magnetism. Can we put a bunch of magnets around the electricity pole and figure out how much electricity is going through those wires? And it turns out we actually found some academics that had done this kind of research and had built some tools to do something similar for very different purposes. We're like, okay, maybe we can repurpose some of that to solve this kind of a problem. And so there are tons of examples. And the reason I'm sharing these is that there's just a broad range. The only limitation is what you can imagine and, obviously, the law, because we don't want to do anything illegal.
I think it's super fascinating. I didn't know you guys are such agents. You're like a domestic Mossad. I'm mesmerized. I didn't expect that. And in terms of data collection, when I look at big data sets, I'm really astonished. People that track airplanes, right, FlightStats and FlightAware, I always wondered how they did it. And it seems this data is basically public, because airplanes have these transponders, and all you need is a certain coverage of antennas. So they send out antennas to people for free, with an agreement that they have to share the data they collect from that antenna. And it isn't actually that expensive. Maybe it's a million dollars, maybe half a million dollars, but global coverage of any airplane that flies, which is very useful data, was really cheap with the model that they went for. And I think Cloudflare, you know, a big service provider for a content management system, they did something similar.
They literally went to data centers, talked them into a sweet deal, and just sent them a box. They never actually traveled or did any installation themselves. That was all the installation they had to do. And now they have all the data about web traffic, and they give it away for free. So they must make a lot of money from selling that data, I hope. They're still producing about two or three hundred million dollars of losses every year. But someone, maybe with a certain kickback scheme in the future, is really interested in that data. When you think especially of web data, we all have hopefully by now seen The Social Dilemma, that documentary on Netflix, which gives the impression that we are all just puppets, marionettes, and that Facebook and Google know so much about us that they can literally predict our next action better than we do. And I had Blaze on and he runs the Ripper and he says, well, we do have this data, but it's all siloed. There are so many dimensions to these forecasts. So yes, we are working on this, and, you know, it makes a lot of money. But generally, the models that we use are like five dimensions, 10 dimensions. We can't model you so well, because we don't have a thousand dimensions or two thousand dimensions. What do you think is actually going on with Google and Facebook? Are they so well integrated now, with all the data they get from all the different sources, and I know they've been one of the biggest buyers of data, that they can make really good predictions of what we're going to do next? And we all have these scary moments, right? When we said something to Alexa, and then the next day we got these really creepy ads.
Yeah, yeah. So I have a couple of comments on that. On one hand, these models, you know, they're clearly spending tons of money. There are a lot of smart people that are trying to figure this out.
On the other hand, it's such a hard problem. And to actually do these things and have that kind of control and predictive power on an individual level, we're nowhere near that at the moment. And just from the things that we're doing ourselves on this, it's like, I wish I could do that. With all the scary things being portrayed in the media, I was like, oh, you can predict my next action? I can't even predict my own next action, right? There's so much noise. And it's such a difficult problem that I would hire anybody who can come to me and say, oh yeah, these companies are doing it, it's so easy. I would pay you a lot of money to come replicate that for us. So that's one thing. These things are constantly making mistakes. They're constantly not optimized enough to be able to do these things on their own. The only thing I'll say is that the problems that have been solved and are really impressive looking are very constrained. So you think about the algos that are playing chess, that are playing Go. And even with Go, the evolution from chess to Go, they made a big deal of it in the media, rightfully so, because Go is a much harder game than chess. The number of possible moves is orders of magnitude larger than the moves in chess. But it still is a very, very constrained problem. You have these constraints: you can only move in a couple of directions, and there are certain rules that govern the game. And then it's just a matter of running a ton of iterations through that, which computers are really good at. It's the same thing with all these other fields, right? To talk about Google and Facebook, they're just advertising companies. They're glorified advertising companies.
Yes, they've gotten really, really good at figuring out what you're going to buy and getting you some kind of an ad that you're then going to convert on. And by the way, that is their only business, and these are the massive tech companies. It's the only thing they do, the only thing they really care about, right? That's how they make money. And even then they kind of suck at it, right? There are all these stories from COVID, I think it was Uber, when these companies shut down their advertising on social media because, you know, they weren't operating. And it made no difference that they were cutting hundreds of millions of dollars of spend on these platforms. It was making no difference. And obviously Facebook and Google have these attribution algorithms, but they're super incentivized to tell you, yes, this one ad generated a bazillion dollars for you. And I know this from some of the corporates that we're working with, by the way. We ran Facebook campaigns, and this was a fantastic example of a claim that could not be true. There was a promotion, a Black Friday sale. The day before the sale started, there was an email that was sent out to an email list. Only people on that email list got access to the coupon code and were able to buy. So we know for a fact that on that day, it was all email marketing. And then, you know, we turned on the social media promotions and so on. And Facebook was taking credit for 100%, I'm not joking, 100% of the sales that were happening on the day before we even showed someone a Facebook ad. Why? Well, because the way that they do their algorithms is that there's a window of time that they take credit for. So if you saw an ad now and you bought something later in the future, they're going to say, oh, it was because of my ad.
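The window-of-time rule can be sketched in a few lines. This is a deliberately simplified, hypothetical lookback-window rule, not any ad platform's actual attribution algorithm, and the customers, dates, and the `credited_orders` helper are all invented for illustration. It shows how orders that came entirely from an email coupon can still be credited to ads if the buyer saw any ad inside the window.

```python
from datetime import datetime, timedelta


def credited_orders(orders, impressions, window):
    """Return the orders a lookback-window rule would credit to ads.

    An order is credited if the customer saw any ad at most `window`
    before buying, regardless of what actually drove the purchase.
    """
    credited = []
    for order in orders:
        seen_at = impressions.get(order["customer"], [])
        if any(timedelta(0) <= order["time"] - t <= window for t in seen_at):
            credited.append(order)
    return credited


# Hypothetical presale day: every order used the email-only coupon code.
presale_day = datetime(2020, 11, 26)
orders = [
    {"customer": "a", "time": presale_day, "channel": "email_coupon"},
    {"customer": "b", "time": presale_day, "channel": "email_coupon"},
]
# Older ad impressions from before the sale campaign was switched on.
impressions = {
    "a": [presale_day - timedelta(days=5)],
    "b": [presale_day - timedelta(days=3)],
}

credited = credited_orders(orders, impressions, window=timedelta(days=7))
share = len(credited) / len(orders)
print(f"share of email-driven orders credited to ads: {share:.0%}")
```

Shrinking the window changes the answer, which is the point: the credited share depends on the rule chosen, not on what caused the purchase.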
And there's some truth to that, right? But in this case, it was so obvious, because we weren't even showing them ads the day before. There may have been some legacy things, where before the sale started they were seeing some of these ads, but it was so clear to me that that was not true. So the point is that even in that example, where it's a super tight problem, like, I'm just going to try to figure out what ad correlates to what product, there's still a lot of money that gets wasted. Because if you look at the conversion rates and how much companies are paying for their ads to convert, it's not one to one. It's not like they've gotten it so good that you're printing money with every ad that you're taking out. And if that's not a compelling argument as to why machine learning and AI kind of still suck, I don't know what is, right? They're spending all this money. Yes, they've gotten a lot better at it, but they're still pretty terrible in the grand scheme of things. It's better than TV advertising and so on, let alone when we try to extrapolate and say, oh, these companies are now going to know X, Y and Z about me that has very little to do with advertising. So it's a really, really hard problem. These are the kinds of problems that are fun to work on, but we're definitely not there yet. This isn't the kind of dystopia where these big tech firms know everything about you and they're doing all these terrible things with it.
It seems that way looking at it from the outside because they reach such a wide swath of the population. So imagine an Uber ad. I've seen it probably 50 times by now, and I've been a user for so long, I don't know why they still show me the ad, right? I don't get it. What else can I do? I already give them my money.
But people feel that they are in the business of accumulating the data, and they just don't know what to do with it yet, right? So it's kind of the opposite problem from the one you had with the data center, where you knew the problem that you wanted to solve and you just didn't have the data. They get this massive amount of data streams, and they don't know how to make use of them in a proper way. I agree with you. Certainly our fears are ahead of what's real right now, more in the science fiction realm. But it's easy to see, if you can see every single email, if you can analyze the email in a cognitive way, the way another human would do it, right? If you are the detective and you say you want data on a specific person, and I understand that's rarely the case, you would want access to the email, you would want access to the browsing logs, you would want access to when they do what, at what geolocation. All this data that we offer them for free, they already have it in their hands now. They might not make perfect use of it, but if I want to abduct someone, this is the kind of information I want. I want to gobble up all this information, know all the routines, so I can surgically pinpoint my abduction plan, and it will go to plan because, you know, people have habits. They don't behave randomly. And maybe they're not there yet, but maybe it's only 10 years from now, 15 years from now. It seems to take only better computers and bigger models and more algorithms and a new GPT-5. And then you're there, because the data is being given to them on a silver platter.
I think, yeah, that's all true. I will say that the other component to making all this stuff possible, for better or worse, is that we're still humans, and human organization is hard.
I've seen this, and I've learned much more about it now that I've started my own company. And, you know, it's kind of funny, because even though I started the company and I run the company, on a daily basis the joke is still, I wish I was the CEO of this company, because people don't do what you want them to. It depends on your style, but you're not this authoritarian leader that can impose their will on their people through fear. You still have to collaborate, you have to incentivize people, you have to design all these kinds of mechanisms. And over time, what happens with big companies is that there's this inevitable slide towards bureaucracy, and there's more complexity, more silos, and different groups are finding it hard to communicate with each other. I'll give you a fun example of this. I was actually just talking to someone recently who was telling me this. So I mentioned the credit card data that we're using, right? This is a panel of about, let's say, one or two percent of the transactions out there. It comes from a couple of banks throughout the country. And the story I got was that one of my friends was talking to someone at one of those banks who was working in a data science capacity. The person at the bank was like, oh, we're trying to answer these kinds of questions. And my friend was like, why don't you just use the credit card data that you guys have access to? And the guy's like, oh, I can't access that. There's all this legal stuff. We just don't have access to it. And my friend's like, oh, it's funny, because I do. And it was true. We literally had access to his bank's credit card data through this panel that we're purchasing completely legally. Everything was permissioned. Everything was totally kosher. But the people internally had no idea this even existed.
And they were just blown away that this kind of stuff happens. I think the point is that, yes, the technology can get there, the data can get there, but you will still have these issues of how do you structure a corporation? How do you structure the right kinds of incentives? Why does someone need access to this data? What are they going to do with it? What are the questions they're looking to answer? And to take a step back here, I think it's important to talk about some of the privacy and legal concerns around all this. You can paint a very scary picture about all the things that can happen. And one of the things, from our standpoint, is that for the first time, probably ever, you can say that the finance industry is one of the more righteous players in this space. And this is going back to our, you know, stereotypical sociopathic tendencies. We don't care about people. We don't care about you as an individual. I'm not using this data to spy on you. I'm using the data to spy on a company. So individuals are just annoyances that we then have to aggregate in our ultimate source to try to figure out what's happening with these companies. We're not trying to get you to buy our product. We're not trying to get you to vote for our candidate. We're just trying to figure out what is actually happening. We are observers in the real world trying to figure out what's going on.
The common denominator between Wall Street and communism.
Exactly. Exactly. Yeah.
Yes. I'm joking. Go. Sorry about that.
No, no, no. There's definitely, yeah, there's a lot of funny things about that. I've actually been thinking a lot about this. I was originally born in Romania when it was communist. And so I grew up under the Soviet, you know, Iron Curtain.
And going back, I mean, ther
