Dec 2, 2020

**Part 4: Applying Predictive Models to Enhance Decision Making**

**By Andrei Popescu, 4Cast Product Owner**

Welcome back to the 4th and final edition of this series on data analytics as a key to the future success of the Energy sector. The series has run a little longer than I expected when I set out to write these articles, but in truth, we have barely scratched the surface of this deep and complex topic. If you haven’t already, I’d invite you to check out the previous 3 sections:

Part 1: What is Predictive Analytics

Part 2: Data Mining and EDA

Part 3: Data Modelling

Just to recap quickly, we discussed predictive analytics broadly as **a variety of statistical techniques that help us to analyze current and historical facts in order to make predictions about future events**. We then outlined the data set we had to work with and some specific steps which we would tackle to increase our understanding of the drivers behind production:

- Define the **target** variable that we want to predict which can help inform our strategic decisions - in this case, our target will be **Length Normalized 12-month Production**
- Compile and visualize the available data in a consistent format, and one that can be directly compared to our target - since 12-month Production is measured at the Well level, our data should also be organized and reported at the Well level
- Identify existing trends/patterns within our data, and define dependent relationships
- Select and preprocess input variables - these input variables should be quantities that can be known with a high degree of confidence prior to drilling new wells
- Build, test, and refine data models until we have one which can accurately predict historic results based on the defined input variables
- Simulate a large number of potential future development options, and use the data model to predict the results
- Identify the simulated option which is predicted to achieve the optimal result

Last week we took care of Steps 4 and 5, and we trained both a Random Forest and a Multiple Linear Regression model to predict **length normalized 12-month production** using the following variables as inputs:

**Feature Variables**: Proppant Concentration, Fluid Concentration, Stage Spacing, Total Number of Stages, and Average Porosity

So where do we go from here? We now have a model that can predict our future results with a reasonable amount of confidence, so how do we best utilize this model in our future decision making? We could of course manually run a handful of different potential development scenarios through our model, and see how they are expected to perform. If the scope of our future development is fairly limited, this may prove to be good enough, as it allows us to run each specific scenario and see which of them is predicted to achieve the best results. What happens, however, if the question we are faced with is much broader? What if we need to recommend the location and design for the most cost-effective wells to Management as opposed to choosing between a few different pre-planned options?

In this case, there is a myriad of different possibilities we would want to consider in order to ensure that we provide a robust and rigorously evaluated recommendation. As you can imagine, even with the help of our predictive model, manually designing and testing each individual scenario would be incredibly time-consuming, and would undoubtedly leave a large number of potential options on the table un-tested.

Fortunately, 4Cast can come to our rescue once again in this situation. We have at our disposal a tool that allows us to simulate an immense number of different scenarios (upwards of 2 million) very quickly and easily. This means that we can model the results of millions of potential scenarios in a matter of minutes, and then spend our time where it really matters - identifying which of those scenarios is optimal given the known constraints of our upcoming development cycle.

Before we move along in our workflow and use 4Cast to simulate our potential future development, let’s discuss the theory behind this algorithm a bit. I want to be perfectly clear that we aren’t going to be generating random or arbitrary well parameters to run through our model as this wouldn’t be effective or useful. There are 2 main reasons that we can’t employ a simple randomization algorithm to generate our simulated scenarios:

- Our model would be ineffective at generating predictions using random inputs since the inputs would deviate wildly from the data we used to train our model.
- The simulated scenarios wouldn’t be at all realistic, so evaluating these possibilities is a waste of our time. For example, if we simulated potential well parameters using strict randomization, we could easily end up with the following inputs:
  - 0.1 Tonnes/m proppant
  - 1.5 Tonnes of proppant/m3 fluid
  - 200m Stage spacing
  - 85 Stages
  - 3% porosity

  This scenario is complete nonsense. All other variables aside, if we look just at 85 Stages with 200m Stage spacing that comes out to a 17,000m Well!
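To make the second point concrete, here is a minimal sketch of what strict randomization does to just two of our inputs. The parameter ranges and the `MAX_LATERAL_M` limit are assumptions I picked for illustration, not values from our data set or from 4Cast.

```python
import random

# Hypothetical, illustrative ranges (not taken from the article's data set).
RANGES = {
    "stages": (10, 90),            # total number of stages
    "stage_spacing_m": (40, 200),  # metres between stages
}
MAX_LATERAL_M = 4000  # assumed practical upper limit on lateral length


def random_scenario(rng):
    """Draw each parameter independently and uniformly (the naive approach)."""
    lo_s, hi_s = RANGES["stages"]
    lo_d, hi_d = RANGES["stage_spacing_m"]
    return {
        "stages": rng.randint(lo_s, hi_s),
        "stage_spacing_m": rng.uniform(lo_d, hi_d),
    }


def implied_lateral_m(scenario):
    """Lateral length implied by the stage count and stage spacing."""
    return scenario["stages"] * scenario["stage_spacing_m"]


rng = random.Random(0)
scenarios = [random_scenario(rng) for _ in range(10_000)]
unrealistic = [s for s in scenarios if implied_lateral_m(s) > MAX_LATERAL_M]
share_unrealistic = len(unrealistic) / len(scenarios)

# The nonsense example from the list above: 85 stages at 200 m spacing.
print(implied_lateral_m({"stages": 85, "stage_spacing_m": 200}))  # 17000
print(f"{share_unrealistic:.0%} of random scenarios exceed {MAX_LATERAL_M} m")
```

Because each parameter is drawn independently, nothing ties the stage count to the stage spacing, so a large share of the draws imply wells we would never drill.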

To make the best use of both our model and our time, what we want to do is evaluate scenarios that are both realistic (i.e. we could actually see ourselves drilling them), and exist reasonably within the bounds of our training data. Don’t get me wrong, we definitely want our simulated scenarios to deviate from what we’ve already done, otherwise why simulate new development in the first place? But we want to make sure that the range of scenarios we generate is at least somewhat within the range of our training data. After all, how can we expect our model to be accurate if the inputs we present it with are completely unprecedented for it? We created a machine learning model, not a self-aware AI :)

The best way to illustrate how 4Cast can help us achieve this is to simulate a small number of development scenarios first, say 150 of them, and compare those to our existing Wells (see below). The inputs for this function are all of our historic data points (same data we used to train the model), and the output is a set of feature variables (proppant concentration, fluid concentration, stage spacing, # of stages, porosity) for 150 potential future Wells. Below are a series of scatter plots showing the relationships of those variables within our historic data, and those same relationships in our simulated data.

*Fig. 1 - Scatter plots comparing the total # of stages (x-axis) vs. proppant concentration (y-axis). The top plot shows the trend for our existing development (wells in our project area), while the bottom plot shows the data we simulated using the multivariate interpolation algorithm in 4Cast. Notice that the overall trend from our real data is preserved in the simulated data, but we have many more points which fill in the gaps of our actual data.*

As we can see from the plots above, the multivariate interpolation algorithm does an excellent job of preserving the underlying trends in our existing data, while also filling in the gaps in our data set, which will help us produce a wider and more useful range of predictions. If we plot the remaining variable pairs in a similar fashion, we will continue to see this same pattern where we get clusters of simulated data that have similar parameters to our existing ones but differ just enough to give us a full range of realistic development possibilities. This is an incredibly powerful tool as it allows us to hypothetically execute any “design tweaks” we are thinking of applying, and see what the result of those changes is predicted to be by utilizing our predictive model. As we first discussed way back in the very first part of this article series: **if we have a reasonable expectation of what the impact of our proposed design changes will be, we can make better and more informed decisions with respect to what design changes we actually want to commit capital to and execute**.
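I can't show the internals of the 4Cast algorithm here, but the general idea of generating new points that stay close to existing trends can be sketched with a simple stand-in: pick a historic well, pick one of its nearest neighbours, and place a new point somewhere on the line between them. To be clear, this nearest-neighbour interpolation is my own simplified assumption for illustration, not the 4Cast implementation, and the `historic` values below are made up.

```python
import math
import random

# Toy historic wells: (total stages, stage spacing in m, proppant conc. in tonnes/m).
# These values are invented for illustration only.
historic = [
    (20, 100.0, 1.0),
    (25, 90.0, 1.2),
    (30, 80.0, 1.5),
    (40, 60.0, 2.0),
    (45, 55.0, 2.2),
    (50, 50.0, 2.5),
]


def scaled(points):
    """Min-max scale each column so no single unit dominates the distances."""
    cols = list(zip(*points))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return [tuple((v - lo) / sp for v, lo, sp in zip(p, lows, spans)) for p in points]


def simulate(points, n, k=2, seed=0):
    """Create n synthetic points, each lying on the segment between a
    randomly chosen well and one of its k nearest neighbours."""
    rng = random.Random(seed)
    norm = scaled(points)
    out = []
    for _ in range(n):
        i = rng.randrange(len(points))
        # Rank the other wells by scaled Euclidean distance to well i.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(norm[i], norm[j]),
        )
        j = rng.choice(others[:k])
        t = rng.random()  # interpolation weight in [0, 1)
        out.append(tuple(a + t * (b - a) for a, b in zip(points[i], points[j])))
    return out


simulated = simulate(historic, n=150)
print(len(simulated), simulated[0])
```

Since every synthetic point sits between two similar real wells, the relationships between variables carry over and the new points never leave the range of the training data. A real workflow would also round the stage count to a whole number before using it.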

So now that we have this incredible tool at our disposal, we have the freedom of simulating and evaluating virtually limitless different development options in order to identify which one(s) are optimal. For this evaluation, I’ll use the multivariate interpolation algorithm to simulate one million potential new Wells. When I say that we’re “simulating new Wells”, we are of course simulating new sets of Feature variables. Once we employ this algorithm, we will have one million new unique sets of: Total Number of Stages, Stage Spacing, Proppant Concentration, Fluid Concentration, and Average Porosity. If we had built our model to consider different inputs when generating predictions, we would of course want to generate those inputs instead.

We can now take these unique data points and run them through the predictive model we built in last week’s article. This will give us a broad range of possible outcomes to weigh against each other. As you can imagine, with 1MM different scenarios and outcomes, evaluation can be a bit tricky. Fortunately, 4Cast has us covered once again! We can use a heat map to help consolidate all of the information we have into a form that is more manageable and useful to us in terms of making decisions. One of the strengths of the heat map is that **it allows us to compare the variables over which we have control against each other, while simultaneously setting constraints for the variables which we do not have direct control over.** In our case, we actually have control over most of our input variables (proppant/fluid concentration, stage spacing, # of stages), with the porosity being the only constraint that we can’t directly influence. Let’s say in our potential development areas, the porosity ranges from ~2.5% - 4%, and we want to evaluate what the best options are for setting our other parameters up to maximize length normalized production while minimizing cost. Below we can see the optimal design for maximizing length normalized production is predicted to have 55 stages, 0 tonnes/m proppant concentration, 0.2 tonnes/m^3 fluid concentration, 54 meter long stages, and should be drilled in 4% porosity:

*Fig. 2 - Heat Map showing results of one million different predictions from our Random Forest model. Each bin’s color represents the length normalized production. In this case, the x-axis shows the total number of stages and the y-axis shows proppant concentration. The Filters on the right can be used to set up any constraints that are present in the other variables (in this case, we are seeing only results for porosity between 2.5% and 4%). Hovering over any of the bins shows the average parameters of wells that fall into that range of normalized production.*

The Heat Map in 4Cast allows us to swap between the various input variables we have defined our model to use on the X and Y axes, while also actively filtering the remaining variables based on any other constraints that may exist with regards to our development. If, for example, we wanted to investigate optimal ratios of proppant loading per meter with fluid concentration, we can simply change the axes of the map display and see how these parameters look when cross plotted.

*Fig. 3 - Same plot as above but looking at fluid concentration (x-axis) vs. proppant concentration (y-axis). The highest production per meter is predicted with a proppant concentration of 0 tonnes/m and a fluid concentration of 0 tonnes/m3 of fluid.*
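For readers who want to see the mechanics behind a heat map like the ones above, here is a rough sketch of the binning and filtering logic. Everything in it is an assumption for illustration: `placeholder_model` is an arbitrary formula standing in for the trained model, the scenarios are invented, and `heat_map` is a hypothetical helper rather than anything exposed by 4Cast.

```python
from collections import defaultdict


def placeholder_model(s):
    """Stand-in for the trained model; the formula is arbitrary."""
    return 10.0 * s["proppant_t_per_m"] + 0.5 * s["stages"] + 100.0 * s["porosity"]


def heat_map(scenarios, predict, x_key, y_key, x_width, y_width, filters):
    """Average the prediction in each (x, y) bin, keeping only the
    scenarios that fall inside every filter range."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for s in scenarios:
        if not all(lo <= s[key] <= hi for key, (lo, hi) in filters.items()):
            continue
        cell = (int(s[x_key] // x_width), int(s[y_key] // y_width))
        sums[cell] += predict(s)
        counts[cell] += 1
    return {cell: sums[cell] / counts[cell] for cell in sums}


# Invented scenarios for illustration only.
scenarios = [
    {"stages": 22, "proppant_t_per_m": 1.1, "porosity": 0.030},
    {"stages": 28, "proppant_t_per_m": 1.4, "porosity": 0.035},
    {"stages": 41, "proppant_t_per_m": 2.3, "porosity": 0.040},
    {"stages": 44, "proppant_t_per_m": 2.1, "porosity": 0.060},  # removed by the filter
]

grid = heat_map(
    scenarios,
    placeholder_model,
    x_key="stages",
    y_key="proppant_t_per_m",
    x_width=10,
    y_width=1.0,
    filters={"porosity": (0.025, 0.04)},  # the constraint we cannot control
)
best_cell = max(grid, key=grid.get)
print(best_cell, round(grid[best_cell], 2))
```

Swapping the axes is just a matter of passing different `x_key` and `y_key` values, and any variable not on an axis can be constrained through `filters`, which mirrors how we used the porosity range above.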

So there you have it, we’ve gone through and executed each of the steps we outlined back in week 1. I assure you that while this article series was prepared and published over the course of a month, the process itself was relatively quick and streamlined by utilizing the power of Solo and 4Cast. Depending on the initial data set you have available, an analysis like this could be reasonably carried out in a matter of days or even hours! I realize that this analysis wasn’t by any means exhaustive, and realistically there are many more variables and inputs we would likely want to consider. However, the concepts we discussed and the steps we carried out are more or less the same even with a more complex and varied data set. Off the top of my head, here are some additional inputs which would likely be very useful to include in our analysis if we had access to them, and may warrant further investigation:

- Geomechanical parameters and/or geophysical attributes for better reservoir characterization
- Regional pressure trends
- Observed drilling characteristics (mechanical specific energy, etc.)
- Completion method (plug and perf vs. sliding sleeves, cluster design, etc.)
- Observed frac hits/production interference
- Cost data to help better identify optimal ROI (optimizing for the net return instead of production)

The list could go on for pages from here as there are so many different variables that can affect the outcome of our Wells. The idea is that by utilizing the approach and methods described in this article series, we can start to compare these various parameters (across disciplines) against each other and start to identify and rank the relative importance of each. By doing this, we will not only gain a better understanding of our reservoir in a broad sense, but we will also be able to take a more methodical approach to “engineering” our completion design in order to achieve optimal results.

I’ll end the article series here for now, with the caveat mentioned above: there are certainly many more variables at play which we could consider and use to provide a more in-depth and robust analysis. Please feel free to reach out to me directly or leave some comments below if you’d like to continue the discussion! If any of what we’ve discussed has been of interest to you and you’d like to learn more about 4Cast and how you and your Team can apply it to your own data, please contact us at northamerica@rogii.com to set up a demo and free trial. Thanks again for joining me throughout this series!