Benjamin Franklin said that only two things are certain in life: death and taxes. That explains why my colleagues at STATWORX were less than excited when they told me about their plans for the weekend a few weeks back: doing their income tax declaration. Man, I thought, that sucks; I'd rather spend this time outdoors. What could taxes and the outdoors possibly have in common? Well, I asked myself: can we predict tax revenue using random forest? (Wildly creative, I know.)

When dealing with tax revenue, we enter the realm of time series, ruled by fantastic beasts like ARIMA, VAR, STLM, and others. These are tried and proven methods, so why use random forests? Well, you and I may both agree that random forest is one of the most awesome algorithms around: it's simple, flexible, and powerful. It has even been called the 'off-the-shelf' tool for most data science applications (2015). Long story short, it's one of those algorithms that just works (if you want to know exactly how, check out this excellent post by my colleague Andre).

## Random forest is a hammer, but is time series data a nail?

You probably used random forest for regression and classification before, but time series forecasting? Hold up, you're going to say: time series data is special! And you're right. When it comes to data that has a time dimension, applying machine learning (ML) methods becomes a little tricky. How come? Well, random forests, like most ML methods, have no awareness of time. On the contrary, they take observations to be independent and identically distributed. This assumption is obviously violated in time series data, which is characterized by serial dependence.

What's more, random forests, or decision-tree-based methods in general, are unable to predict a trend, i.e., they do not extrapolate. To understand why, recall that trees operate by if-then rules that recursively split the input space. Thus, they're unable to predict values that fall outside the range of the target values in the training set.

So, should we go back to ARIMA? Not just yet! With a few tricks, we can do time series forecasting with random forests. This blog post will show you how you can harness random forests for forecasting! All it takes is a little pre- and post-processing. Let it be said that there are different ways to go about this. Here's how we are going to pull it off: we'll raid the time series econometrics toolbox for some old but gold techniques: differencing and statistical transformations. These are cornerstones of ARIMA modeling, but who says we can't use them for random forests as well?

To stick with the topic, we'll use a time series from the German Statistical Office on German wage and income tax revenue from 1999 to 2018 (after tax redistribution). Let's do it!

## Getting ready for machine learning, or what's in a time series anyway?

Essentially, a (univariate) time series is a vector of values indexed by time. In order to make it 'learnable' we need to do some pre-processing. This can include some or all of the following:

- Statistical transformations (Box-Cox transform, log transform, etc.)
- Detrending (differencing, STL, SEATS, etc.)
- Time Delay Embedding (more on this below)
- Feature engineering (lags, rolling statistics, Fourier terms, time dummies, etc.)

For brevity and clarity, we'll focus on steps one to three in this post. OK, let's structure this a bit: in order to use random forest for time series data, we do TDE: transform, difference and embed. Let's fire up R and load the required packages plus our data.

```r
suppressPackageStartupMessages(require(tidyverse))
```
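To make the TDE idea concrete, here is a minimal sketch of the three steps in base R. The toy series, the log transform, and the choice of three lags are illustrative assumptions for this sketch, not the actual tax data or the post's final setup:

```r
# Toy monthly series (illustrative values, not the tax revenue data)
y <- c(112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118)

# 1. Statistical transformation: log to stabilise the variance
y_log <- log(y)

# 2. Detrending: first differences to remove the trend
y_diff <- diff(y_log, differences = 1)

# 3. Time delay embedding: stats::embed() turns the series into a lag
#    matrix; column 1 is the current value (the target), the remaining
#    columns are its lagged values (the features)
lags <- 3
X <- embed(y_diff, lags + 1)
colnames(X) <- c("y", paste0("y_lag_", 1:lags))
head(X)  # each row: current differenced value plus its 3 most recent lags
```

The resulting matrix is exactly the tabular target-plus-features shape that a random forest expects; forecasts made on the differenced log scale can later be undone by cumulative summation and exponentiation, which is the post-processing alluded to above.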