Part 3: Predictive Modelling and Leveraging the Results
By Andrei Popescu, 4Cast Product Owner
Welcome back to the 3rd installment of this series on data analytics as a key to the future success of the Energy sector. Hopefully you’ve found it informative thus far, and if you haven’t already, I’d invite you to check out the previous 2 sections:
Part 1: What is Predictive Analytics
Part 2: Data Mining and EDA
Just to recap quickly, we discussed predictive analytics broadly as a variety of statistical techniques that help us to analyze current and historical facts in order to make predictions about future events. We then outlined the data set we had to work with and some specific steps which we would tackle to increase our understanding of the drivers behind production:
Last week we took care of Steps 1 through 3 (and to some extent 4), and came up with the following variables to move forward with:
Feature Variables: Proppant Concentration, Fluid Concentration, Stage Spacing, Total Number of Stages, and Average Porosity
Target Variable: 12-month Production (normalized for lateral length)
So, let’s pick up right where we left off and move to pre-processing our variables. Depending on your definition of data preprocessing, many aspects of it can certainly be considered a part of the initial data mining process which we discussed last week. For our purposes we will generalize data preprocessing as referring to one or more of the following tasks:
Data cleaning and organization
Imputation of missing values
Transformation - categorization (binning) or continuization
Feature scaling (normalization/standardization)
The above is by no means an exhaustive list, simply a representation of some of the techniques I’ve found most useful in my projects.
We discussed some aspects of data cleaning and organization in the last article (Part 2: Data Mining and EDA) when we reviewed the data structure of 4Cast. If you’re creating your workflows and preparing data for modeling in Python or R, you will need to spend whatever time is necessary compiling your data into some form of spreadsheet (usually one or more CSV, txt, or JSON files) and using custom scripting to combine the data into an organized format where your feature variables can be directly related to your target variable. None of this sounds fun or exciting, but without getting on my soapbox again (see last week’s article), this is the most critical step for ensuring success throughout the rest of the workflow, with few opportunities for shortcuts.
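For readers doing this outside of 4Cast, here is a minimal sketch of what that custom scripting might look like in pandas. All file-free data, table names, and column names here are hypothetical stand-ins - the point is simply joining per-well feature data to a production table on a shared well identifier:

```python
import pandas as pd

# Hypothetical per-well feature table (in practice, loaded from CSV/JSON exports)
features = pd.DataFrame({
    "well_id": ["W1", "W2", "W3"],
    "prop_conc": [1.2, 0.9, 1.5],
    "stage_spacing": [54.0, 61.0, 48.0],
})

# Hypothetical target table keyed on the same well identifier
production = pd.DataFrame({
    "well_id": ["W1", "W2", "W3"],
    "norm_prod_12mo": [2899.8, 2410.5, 3120.0],
})

# Join features to the target so each row relates inputs directly to the output
model_table = features.merge(production, on="well_id", how="inner")
print(model_table)
```

An inner join keeps only wells present in both tables, which is usually what you want for a modeling set - a well with features but no production (or vice versa) can't be used for training.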
One incredible advantage of working in 4Cast is that simply by using it as our analytical platform, we are working with a data set that is properly cleaned and organized on Solo Cloud, and very easy to QC in 3D (see below). In addition, simply by executing regular operational workflows (drilling/geosteering wells, recording log data, etc.) we continually add more data to our cloud database to support future modeling efforts, without any duplication of work or interruption to operational execution.
Fig. 1 - Input data structure for predictive modeling. Feature variables outlined in Green, Target variable outlined in Red.
Fig. 2 - Input data visualized in 3D. Wells are coloured based on their calculated proppant concentration (calculated last week). Cooler colors are lower proppant concentration, while hotter colors are higher.
So given that data cleaning is done, let’s look at some of the other operations. Imputation, or replacing missing portions of data with substituted values, can be a very useful tool for filling gaps in a data set. A common approach is to substitute the mean or mode of the available data wherever values are missing. For example, if we had proppant volumes available for 95% of our wells but not the last 5%, we could calculate either the mean or the mode of the data we do have and substitute it in for the Wells where it’s missing. In our case, the largest gap is actually in the production data itself - every well with production data has all of the other variables (features) fully defined. We’re certainly not going to substitute production values for our missing Wells, since the whole purpose of this exercise is to predict production in the first place.
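Mean imputation takes only a couple of lines. A quick pandas sketch with made-up proppant volumes (the values here are purely illustrative):

```python
import pandas as pd

# Hypothetical proppant volumes; the last two wells are missing values
prop_volume = pd.Series([850.0, 920.0, 880.0, 910.0, None, None])

# Replace the gaps with the mean of the observed data
filled = prop_volume.fillna(prop_volume.mean())
print(filled.tolist())  # missing entries become 890.0, the mean of the rest
```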
Data transformation can be very useful as well. Some common methods for transformation include categorization (binning) and continuization. The two processes are essentially inverses of each other, but let’s outline an example of categorization, as I find it the more intuitive of the two. Imagine we had a data field such as “Facies” available, with the different facies types numbered as Facies 1, 2, 3, etc. This could present a number of problems when it comes to training a machine learning model. For one thing, the model won’t know to treat this value as discrete in the first place, so it may output a result that says the optimal well should be drilled in facies 2.5, which isn’t a reasonable or useful output. Furthermore, the model could go even further rogue and assume that the order and magnitude of the numbers actually matter. Categorization eliminates these issues by allowing the model to treat these types of variables as discrete values, without assigning any importance to their order or magnitude. With our data set, we are again in an easy position, as all of our variables are continuous - though if we had a Facies Log in either our Typewells or Lateral I would certainly use the operations in 4Cast to include it as an input (much like we did with Porosity).
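One common way to make a numeric label categorical is one-hot encoding: each facies becomes its own 0/1 flag, so the model can no longer treat facies 3 as "greater than" facies 1. A minimal pandas sketch with hypothetical facies labels:

```python
import pandas as pd

# Hypothetical facies labels stored as numbers - the order/magnitude is meaningless
wells = pd.DataFrame({"facies": [1, 2, 3, 2]})

# One-hot encode so each facies becomes its own indicator column;
# the model now sees three independent flags, not an ordered number
encoded = pd.get_dummies(wells["facies"], prefix="facies")
print(list(encoded.columns))  # ['facies_1', 'facies_2', 'facies_3']
```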
The last step I mentioned above is feature scaling, also sometimes referred to as “normalization” or “standardization”, and we will go ahead and apply this type of preprocessing to our data. Feature scaling can take a number of different forms. For example, normalization rescales all of our training set data so that the values fall between 0 and 1 (or sometimes -1 and 1). Standardization is similar, but rescales the data so that it has a mean of 0 and a standard deviation of 1. We will be applying normalization to our data set, and I’ll cover exactly why in a moment, but first I want to be very clear about the order of operations. When training a machine learning model, we will be using a “Training” data set and a “Testing” data set. It is important to split the data into these two sets prior to applying feature scaling, because we do not want the data used to validate our model to influence the scaling algorithm - the idea is that this is brand new data the model has never seen before. So we will apply feature scaling (in our case normalization) to the training data set, and then when we run the testing data set to validate our model, we will apply the exact same normalization algorithm to the testing set as we did to the training set - regardless of what values are actually in the testing set.
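That order of operations can be sketched with scikit-learn (assuming it's available; the array values below are made up): split first, fit the scaler on the training split only, then reuse the already-fitted scaler on the test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrix (rows = wells; columns = stage spacing, porosity)
X = np.array([[54.0, 0.06], [61.0, 0.08], [48.0, 0.05], [57.0, 0.07]])
y = np.array([2899.8, 2410.5, 3120.0, 2750.0])

# Split BEFORE scaling so the test wells never influence the scaler
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

scaler = MinMaxScaler()                         # rescales each feature to [0, 1]
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the SAME fitted scaler
```

Note that because the scaler was fitted on the training data, a test value outside the training range can scale to slightly below 0 or above 1 - that is expected and correct.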
So why are we bothering with feature scaling? As you’ll see in the upcoming steps, we’re going to evaluate 2 types of models simultaneously - a Random Forest model, and a Multiple Linear Regression model. The reason we need to apply normalization to our features is very clear if we take a closer look at the equation upon which our multiple linear regression model will rely. The equation has the following general form:
y = b0 + b1 x1 + b2 x2 + … + bn xn
So in our case, y is Normalized Production, and x1, x2, etc. are stage spacing, porosity, proppant concentration, and so on. What does this look like if we take one of our data points at random and plug it into this equation?
2899.8 (Norm prod) = b0 + b1(0 prop conc.) + b2(0 fluid conc.) + b3(54 stage spacing) + b4(57 stages) + b5( porosity)
Our input/output values range in magnitude from 10^-2 for porosity to 10^3 for Normalized Production. With this spread of magnitudes, it will be incredibly difficult to accurately determine which factor plays the greatest role in predicting the Target (i.e. which bn matters most). By applying feature scaling, we put everything on the same magnitude, as all of our values will lie between 0 and 1. We can see below the results of applying a simple normalization algorithm to rescale our variables to a range of 0 to 1:
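The normalization itself is just the min-max formula x' = (x - min) / (max - min), applied per feature. A tiny sketch with made-up numbers shows both variables landing on the same 0-to-1 range despite starting five orders of magnitude apart:

```python
import numpy as np

# Hypothetical values spanning very different magnitudes
porosity = np.array([0.05, 0.06, 0.08])         # ~10^-2
norm_prod = np.array([2410.5, 2899.8, 3120.0])  # ~10^3

def min_max(x):
    # Rescale so the smallest value maps to 0 and the largest to 1
    return (x - x.min()) / (x.max() - x.min())

print(min_max(porosity))   # approximately [0.0, 0.333, 1.0]
print(min_max(norm_prod))  # both now live on the same 0-to-1 scale
```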
Fig. 3 - data prior to feature scaling (normalization)
Fig. 4 - data post feature scaling
Awesome. So now we have all of our data cleaned up, organized, and preprocessed. Let’s build some machine learning models! As I’ve been doing all of my work in 4Cast, I’m using Orange as my platform for applying these algorithms and building the models. 4Cast and Orange have a seamless connection which allows for all of this to be done incredibly quickly, and in a visual format - no complicated code required! Below is a snapshot of my workflow.
The first few steps are fairly straightforward - we are essentially defining the feature and target variables as we have outlined previously. Next we use the Data Sampler to split the data set into a training set (75% of the data) and a testing set (25% of the data). From there, we normalize the training set only - again, we don’t want our testing data to have any influence on the normalization process. Instead, we preserve the normalization function applied to the training set and pass it to the model for application to future data.
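For readers working in Python rather than Orange, the same pipeline could be sketched with scikit-learn. The data below is a synthetic stand-in (our real well data stays in 4Cast), and a Pipeline plays the role of the preserved normalization function - it fits the scaler on training data only and carries it along with the model for any future inputs:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 5 features (proppant conc., fluid conc., stage
# spacing, number of stages, porosity) and a synthetic production target
X = rng.random((100, 5))
y = X @ np.array([3.0, 1.0, -2.0, 4.0, 0.5]) + rng.normal(0.0, 0.1, 100)

# 75/25 split BEFORE any scaling, mirroring the Data Sampler step
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Each pipeline fits the scaler on the training data only and keeps the
# fitted scaler bound to the model for application to future data
models = {
    "Random Forest": make_pipeline(MinMaxScaler(), RandomForestRegressor(random_state=0)),
    "Linear Regression": make_pipeline(MinMaxScaler(), LinearRegression()),
}
for name, model in models.items():
    cv_r2 = cross_val_score(model, X_train, y_train, cv=10).mean()  # 10-fold CV
    model.fit(X_train, y_train)
    print(name, "CV R2:", round(cv_r2, 3),
          "Test R2:", round(model.score(X_test, y_test), 3))
```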
Fig. 5 - model building pipeline. Notice that we split the data into a training and testing set prior to applying feature scaling. The scaling methods are then also passed to each model for use with future data inputs. Each model is evaluated using 10-fold cross-validation, then the model predictions for both the training set and the testing can be compared to the historic data in cross plots (below).
As you can see above, we have several tools for evaluating our models. The 10-fold cross-validation gives us some fast, quantifiable summaries of the results. Both models show reasonable correlation coefficients, with the Random Forest model edging out our Multiple Linear Regression by a bit.
Fig. 6 - Results of model cross-validation. R2 (R-Squared) shows the Random Forest model to be a more reliable model for predicting our target. On the right is a rank order of the importance of each variable in the models - overall, the main drivers are the total number of stages and the proppant concentration.
The last thing we want to do in order to test our models is to present them with brand new data that they haven’t “seen” yet - this is where our testing data set comes in. If you refer back to Fig. 5, you’ll see two outputs coming from the Data Sampler. The lower one represents our training set which was used to train these models, the upper one is the remaining 25% of our data which we will now pass to our models to generate new predictions. If our models are robust, they should be able to take this brand new data and accurately predict the normalized production based on the input variables. I’ve summarized the results in the scatter plots below:
Fig. 7 - Results of Test data set. In both graphs, the X-Axis shows the actual (historic) Normalized Production values. The Y-Axes show the Random Forest predicted production (top) and the Linear Regression predicted production (bottom). Both models do a reasonable job of predicting the production based on the inputs, with the Random Forest model showing a slightly better correlation.
There you have it! We’ve now generated two perfectly viable machine learning models which can be used to predict Length Normalized Production based on the following inputs: Proppant Concentration, Fluid Concentration, Stage Spacing, Total Number of Stages, and Average Porosity.
It seems a little too easy, right? The reality is that 4Cast and Solo Cloud did most of the heavy lifting on the hardest part of this whole process: data management and organization. I would also be remiss if I didn’t mention that there is much more work that can be done to further refine these models, not least expanding our data set to include more data points and more refined data types. Given our data set, the main drivers are overwhelmingly the Total Number of Stages and the Proppant Concentration, but there are myriad other potential variables we can (and should) include in this type of analysis. For example, if we had seismic attribute maps, we could sample these values to our Wells to see what effect that property has on our outcome. Perhaps we should include the completion timing or the order in which the wells were drilled as a variable. We haven’t even considered parent-child relationships, or whether any of these Wells were knocked down by offset frac hits. Needless to say, the possibilities for taking this simple workflow to the next level are nearly endless, but now we have a platform within which we can explore all of them ad nauseam.
I’ll leave it there for now as I think I’ve gone on for long enough. I’ll be back next week to preview an incredible tool that can help us make efficient use of our new models in future decision making. As usual, please feel free to reach out to us at firstname.lastname@example.org or contact me directly if you have any questions, comments, or simply want to discuss. Thanks again for your time, and looking forward to keeping the discussion going!