Part 2 – Data Mining and Exploratory Data Analysis (EDA)
By Andrei Popescu, 4Cast Product Owner
Let’s start by recalling the broad definition of predictive analytics from last week’s article: a variety of statistical techniques that help us analyze current and historical facts in order to make predictions about future events. Now let’s take that definition a step further by pointing to the specific statistical techniques I’m referring to. I believe we can break the process down into four key components:
1. Data mining (collection) and clean-up
2. Exploratory Data Analysis (EDA)
3. Training/refining data models
4. Generating and evaluating predictions
Something I’ve seen commonly throughout my career is a heavy focus on getting to steps 2, 3, and 4 (the fun stuff), with as little time and resources spent on step 1 as possible – after all, we need to do more with less, and we need to do it quickly, right? I’ll take a hard stance here: if we don’t change our mentality, step 2 will be very difficult, and the results of steps 3 and 4 will be mostly useless. Everybody’s heard the adage “garbage in, garbage out”, but the problem is that nobody believes that their data is garbage. The harsh reality most don’t consider is that even excellent data, if it’s not compiled and organized in the appropriate manner, is as good as garbage. A Forbes study on how data scientists actually spend their time bears this out: the graph below shows that roughly 80% of the process (and the time/effort spent) is taken up by collecting, cleaning, and organizing data. Only then do we get to start having fun!
So let’s discuss the first component - data mining and clean-up. What this means is that our data not only needs to be compiled in one place, but it also needs to be organized in a consistent format. In last week’s article, we defined the following Steps we would be tackling:
1. Define the target variable that we want to predict which can help inform our strategic decisions - in this case, our target will be 12-month Production
2. Compile and visualize the available data in a consistent format, and one that can be directly compared to our target - since 12-month Production is measured at the Well level, our data should be organized and reported at the Well level also
3. Identify existing trends/patterns within our data and define dependent relationships
4. Select and preprocess input variables - these input variables should be quantities that can be known with a high degree of confidence prior to drilling new wells
5. Build, test, and refine data models until we have one which can accurately predict historic results based on the defined input variables
6. Simulate a large number of potential future development options, and use the data model to predict the results
7. Identify the simulated option which is predicted to achieve the optimal result
We also completed Step 1 on the spot: we decided that our target variable would be 12-month Production. Great. This week we’ll focus on Steps 2 and 3. Our next course of action, then, is to compile the rest of our data in a consistent format, one that can be directly compared to our target – meaning we need to organize our data on a well-level basis. This is where 80% of the project time is usually spent!
Fortunately for us, this is exactly where 4Cast and Solo Cloud make this process both easy and visual. Most companies in North America are already using StarSteer to drill and geosteer their wells (and if you aren’t, perhaps you should be 😊). Since 4Cast shares a common database with StarSteer through Solo Cloud, the first step in building a project is already complete: we can simply connect to an existing project in Solo. By doing this, we instantly have access to all of the trajectory, log, and geosteering data which has been created, compiled, and QC’d as part of the standard operational workflow.
Of course, we want to bring in and calculate additional data to help with our analysis. In our example, we’ll start with the following high-level data which is usually publicly available:
· Stage depths (start and end MD of each stage)
· Fluid/Proppant volumes per stage
· 12-month Production per Well
Fig. 1 - 4Cast data structure accommodates variables to be defined and stored either at the Stage or Well Level. There is no limit to the number of columns we can define/import/calculate.
As you can see above, the data structure in 4Cast allows us to store any data we have at either the Stage or Well level, and it will automatically use certain Stage-level attributes to calculate new attributes at the Well level. In this case, we used the Stage depths to calculate Completed Length, and the fluid/proppant volumes per Stage to get a total value for each Well – remember, since our Target variable (production) is measured at the Well level, we need our Feature variables to be organized in the same way. It’s also worth noting that these column definitions aren’t set in stone – we can import/define custom columns depending on the data we have available, and calculate new ones using our existing variables, our Log data, and our Geosteering Interpretations.
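To make the Stage-to-Well roll-up concrete, here is a minimal sketch of the idea in pandas. This is not the actual 4Cast schema or API – the column names and values are purely illustrative – but it shows how Stage-level depths and volumes aggregate up to the Well level that our Target variable lives at:

```python
# Hypothetical sketch of a Stage -> Well roll-up; column names and values
# are illustrative, not the actual 4Cast data structure.
import pandas as pd

# Stage-level records: start/end measured depth and pumped volumes per stage.
stages = pd.DataFrame({
    "well":       ["A", "A", "A", "B", "B"],
    "start_md_m": [3000, 3100, 3200, 2800, 2950],
    "end_md_m":   [3100, 3200, 3300, 2950, 3100],
    "proppant_t": [120, 115, 130, 140, 135],
    "fluid_m3":   [900, 880, 950, 1000, 970],
})

# Completed Length per stage comes straight from the stage depths.
stages["stage_length_m"] = stages["end_md_m"] - stages["start_md_m"]

# Roll the Stage-level attributes up to the Well level, matching the
# granularity of our Target variable (12-month Production per Well).
wells = stages.groupby("well").agg(
    completed_length_m=("stage_length_m", "sum"),
    n_stages=("well", "size"),
    total_proppant_t=("proppant_t", "sum"),
    total_fluid_m3=("fluid_m3", "sum"),
)
print(wells)
```

The key point is the change of granularity: every Feature ends up as one row per Well, directly comparable to 12-month Production.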
Given that we have access to our Geosteering Interpretations, as well as the Typewells and Logs that were used to steer these Laterals, a logical next step would be to calculate some new variables using this information. In this data set, I had the benefit of Bulk Density and Density Porosity Logs in my Typewells. 4Cast allows us to use these Logs along with our Geosteering Interpretations to model these properties along all of the Wells in the project very quickly and easily. The results can be seen here:
Fig. 2 - Average Density Porosity calculated on a per Well basis for all Laterals. Results range from 0.5% (Blue) to 4.5% (Red). All calculations are automatically output to the Stage/Well spreadsheets, the 3D Colour View allows us to quickly and easily visualize and QC any of the data from the spreadsheets.
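Under the hood, this kind of property modeling boils down to sampling the Typewell log at the lateral’s stratigraphic position, as given by the geosteering interpretation. The toy sketch below illustrates the idea with numpy – it is emphatically not the 4Cast implementation, and all depths and porosity values are made up:

```python
# Toy illustration of projecting a Typewell log property along a lateral
# using a geosteering interpretation. Not the 4Cast API; all values invented.
import numpy as np

# Typewell log: density porosity (fraction) sampled vs stratigraphic depth
# (depth below a correlation marker, in metres).
log_depth = np.array([0.0, 5.0, 10.0, 15.0, 20.0])
log_phi   = np.array([0.010, 0.025, 0.045, 0.030, 0.015])

# Geosteering interpretation: the lateral's stratigraphic position
# (depth below the same marker) at each measured-depth station.
lateral_strat_pos = np.array([8.0, 9.5, 11.0, 12.5, 10.0, 9.0])

# Sample the log at each station, then average over the whole lateral
# to get a single Well-level Average Porosity value.
phi_along_lateral = np.interp(lateral_strat_pos, log_depth, log_phi)
avg_phi = phi_along_lateral.mean()
print(f"Average Density Porosity: {avg_phi:.2%}")
```

The resulting per-well average is exactly the kind of number plotted in Fig. 2, and it lands at the Well level alongside our other Features.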
As you can see, we are already well on our way through Step 2 - we have compiled our available data in an organized and consistent format, and we can quickly QC it to identify broad trends or outliers. In the case of Density Porosity (shown above) our Wells are modeled to have an average of anywhere from 0.5% to 4.5% porosity, which is in line with expectations for this formation. To recap, the variables we now have available for our Wells are as follows:
Feature Variables: Completed Length, Number of Stages, Total Proppant, Total Fluid, Average Porosity, Average Bulk Density
Target Variable: 12-month Production
With a total of six relevant feature variables, many of which we can meaningfully control in our future development plans (how long we drill our Wells, how many stages we pump, etc.), we could reasonably try to start training a data model to predict production right away. But before we get too ahead of ourselves, it would be a good idea to do some simple Exploratory Data Analysis (EDA) in order to familiarize ourselves with the data set. EDA is a broad term that could refer to a number of different techniques, but for our purposes, we’ll keep it straightforward - we want to investigate and summarize the existing trends in our data using visual methods. The goal is to potentially refine our input variables further, prior to moving on to training and refining a statistical model. There are a number of ways we could go about this next step, but my personal favorite is simply to start creating scatter plots of our various variables:
Fig. 3 - Plot of Completed Length vs. Total Proppant. Points are sized based on the Number of Stages and colored based on 12-month Production. A strong correlation here is not unexpected but indicates multicollinearity in our data set, which could cause problems for our future model.
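If you want to build a cross-plot like Fig. 3 yourself, a few lines of matplotlib will do it. The data below is synthetic (generated to mimic the length/proppant correlation in our set), but the plot recipe – size by Number of Stages, color by 12-month Production – is the same:

```python
# Minimal matplotlib version of a Fig. 3-style cross-plot, using
# synthetic well data (NOT real project data).
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs anywhere
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
n = 60
length_m   = rng.uniform(1500, 3500, n)
proppant_t = length_m * rng.normal(1.2, 0.1, n)   # deliberately correlated
stages     = np.round(length_m / rng.normal(60, 5, n))
production = 0.8 * length_m + 0.5 * proppant_t + rng.normal(0, 300, n)

fig, ax = plt.subplots()
sc = ax.scatter(length_m, proppant_t, s=stages * 2, c=production,
                cmap="viridis")
ax.set_xlabel("Completed Length (m)")
ax.set_ylabel("Total Proppant (t)")
fig.colorbar(sc, label="12-month Production")
fig.savefig("length_vs_proppant.png")
```

Even on synthetic data, the tight diagonal band of points is the visual signature of the multicollinearity discussed below.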
I’ll spare you the summary of all the cross-plots I created (suffice it to say there were many) and we can focus on the important learnings from this exercise. By doing this simple analysis, we uncover that we have numerous variables that are actually quite closely correlated. This is an issue called multicollinearity - where changes in one of our independent variables are associated with shifts in another. Below is a summary of the dependencies which stand out:
Fig. 4 - Completed Length is closely tied to Total Proppant, Total Fluid, and Total # of Stages. Additionally, Average Porosity and Average Bulk Density have a nearly 1-to-1 relationship.
Some of these dependencies should come as no surprise; however, it’s important to note them and think about the effect these co-dependencies will have on our statistical model. Given that a large part of our goal is to understand the effect each individual parameter has on 12-month Production, we should consider modifying our input variables: including too many co-dependent variables will make it difficult for the model to distinguish the individual effects of each variable, because co-dependent variables tend to change in unison.
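Rather than eyeballing dozens of cross-plots, a correlation matrix surfaces these dependencies in one shot. Here is a sketch on synthetic data whose columns mirror our feature list (the thresholds and values are illustrative, not from the real project):

```python
# Flag highly-correlated feature pairs via a correlation matrix.
# Synthetic data; column names mirror our feature list.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100
length = rng.uniform(1500, 3500, n)
df = pd.DataFrame({
    "completed_length": length,
    "total_proppant":   length * rng.normal(1.2, 0.05, n),
    "total_fluid":      length * rng.normal(9.0, 0.4, n),
    "n_stages":         np.round(length / 60 + rng.normal(0, 2, n)),
    "avg_porosity":     rng.uniform(0.005, 0.045, n),  # independent
})

corr = df.corr()
# Report every feature pair whose absolute correlation exceeds 0.8.
pairs = [(a, b, corr.loc[a, b])
         for i, a in enumerate(corr.columns)
         for b in corr.columns[i + 1:]
         if abs(corr.loc[a, b]) > 0.8]
for a, b, r in pairs:
    print(f"{a} vs {b}: r = {r:.2f}")
```

Any pair that trips the threshold is a candidate for the kind of variable surgery described next; the 0.8 cutoff is a common rule of thumb, not a hard rule.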
Looking at porosity and bulk density, we can see that these variables have essentially a perfectly dependent (or 1-to-1) relationship. In this case, we would actually be best served to simply remove one or the other altogether to avoid overfitting the model. With regards to Completed Length, it has a fairly close tie to the Total Proppant, Total Fluid, and Total # of Stages in our Wells. Again, this is to be expected but is also a little concerning since we can pretty easily extrapolate these co-dependencies to conclude that any model we build will assign a massive amount of importance to these variables as a group. This will result in us learning that longer Wells with more Stages, Fluid and Proppant pumped will give us better Production, but won’t actually tell us which of these variables individually is most important. I’m not sure about you, but to me, that sounds like an altogether useless result. So instead of blindly plugging our data into a machine learning algorithm and blaming “useless AI” for not giving us relevant answers, let’s instead see if we can’t modify our data set somewhat.
A simple thing to try at first would be to normalize Completed Length entirely. So specifically, let’s calculate the following variables to replace the co-dependent ones we identified above:
Fluid Concentration = Total Proppant ÷ Total Fluid
Proppant Concentration = Total Proppant ÷ Completed Length
Stage Spacing = Completed Length ÷ Number of Stages
While we’re at it, since we want to normalize Completed Length out, we should also change our Target to be 12-Month Production on a per Length basis:
Normalized Production = (12-month Production ÷ Completed Length) * 100
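Applied to a Well table, the four formulas above are one line of pandas each. The sketch below uses a tiny invented two-well table just to show the mechanics; column names are my own shorthand, not 4Cast’s:

```python
# Length-normalization from the article, applied to a small synthetic
# Well table. Column names are illustrative shorthand.
import pandas as pd

wells = pd.DataFrame({
    "completed_length": [3000.0, 2400.0],
    "n_stages":         [50, 40],
    "total_proppant":   [3600.0, 2900.0],
    "total_fluid":      [27000.0, 21000.0],
    "avg_porosity":     [0.03, 0.02],
    "production_12m":   [90000.0, 68000.0],
})

wells["fluid_concentration"]    = wells["total_proppant"] / wells["total_fluid"]
wells["proppant_concentration"] = wells["total_proppant"] / wells["completed_length"]
wells["stage_spacing"]          = wells["completed_length"] / wells["n_stages"]
wells["normalized_production"]  = (wells["production_12m"]
                                   / wells["completed_length"]) * 100

# Completed Length can now be dropped: it is baked into the new
# Features and the new Target.
features = wells[["proppant_concentration", "fluid_concentration",
                  "stage_spacing", "n_stages", "avg_porosity"]]
target = wells["normalized_production"]
print(features)
```

Note that Completed Length no longer appears as a column of its own; it survives only inside the ratios.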
Now we can remove Completed Length from our input variables entirely (since it is represented within our new Features and our new Target), and if we look at a summary of our input variable dependencies, we see that our inputs are far more independent of each other.
Fig. 5 - We can see that the correlations between our new Feature Variables are much lower than those seen with the previous variables.
Our new data set is as follows:
Feature Variables: Proppant Concentration, Fluid Concentration, Stage Spacing, Total Number of Stages, and Average Porosity
Target Variable: Normalized Production
I think I’ll leave it there for this edition. To recap, we’ve compiled our data set, investigated some of the underlying relationships, and done some simple analysis and data manipulation to ensure that our input data set is well-suited to predictive modeling and promising in terms of the insights it can give us. We’ll save data preprocessing, predictive modeling, and evaluation of our results for next week. Thanks again for your time in reading this, and I look forward to hearing from you and continuing the discussion. As always, be sure to follow us to get the latest and greatest on everything Rogii related, and please don’t hesitate to reach out to our Team, or to me directly, if any of the content here has piqued your interest!