SAS Tutorial | Machine Learning: A Coding Example in SAS

December 15, 2019 | By Stanley Isaacs


Hi. I'm Christa, and I'm going to be showing you examples of machine learning in SAS. First I'll start off in SAS Studio; this may be more for people who are familiar with SAS programming. Second, I'll show you the same model in Model Studio, which might be more interesting for people who aren't as familiar with SAS programming and are looking for an easier way to do it, but still want the same control over the hyperparameters. This video is about showing how to do machine learning in SAS, so if you're interested in the fundamentals, or why we're doing certain tasks, we have a machine learning fundamentals video you can refer to.
All right. Now I have SAS Studio open, and we're going to go to the snippets. Under SAS Snippets you can find a variety of different snippets you can use; I'm going to be using the SAS Viya Machine Learning snippets, so I'll scroll down a bit. As you can see, there are a few different options: you can compare two machine learning algorithms, or even compare several machine learning algorithms. If you're looking to use something specific, all you have to do is open one of these examples, lift the model out, and use it with your own data. If you want to see how a technique is used, these examples are a good starting point for what you're trying to do.
Right now I'm going to use the Supervised Learning snippet. This snippet showcases a sample machine learning workflow using the HMEQ data set. The steps are: first, prepare and explore the data, which means loading it in, exploring it, seeing what missing values it has, partitioning it, imputing those missing values, and identifying the variables that explain variance. Next, we perform supervised learning using a random forest. And then we evaluate and score the model.
I'm scrolling down now. The first block of code defines the macro variables used later in the program: it sets the output directory, where we want our temporary files to be written; starts our CAS session; and specifies the data set names. This points to where the data set lives and assigns names for the SAS data, the CAS data, and the partitioned data. Next, we specify the data set inputs and the target variable. We have our different variables here: these are all the class inputs, and the ones below are the interval inputs, all listed here. And last, we have our target variable, which is BAD.
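The snippet's literal code is longer, but a minimal sketch of this setup looks something like the following. The caslib, the macro variable names, and the exact split of class versus interval inputs are my assumptions here, not the snippet's verbatim code:

    /* Start a CAS session and point a libname at it */
    cas mysess;
    libname mycas cas caslib=casuser;

    /* Macro variables for later use (names are illustrative) */
    %let sasdata  = sampsio.hmeq;      /* source SAS data set */
    %let casdata  = mycas.hmeq;        /* in-memory CAS copy  */
    %let partdata = mycas.hmeq_part;   /* partitioned version */

    /* Inputs and target */
    %let class_inputs    = reason job delinq derog ninq;
    %let interval_inputs = loan mortdue value yoj clage clno debtinc;
    %let target          = bad;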
All right. If we want a quick look at our data, we can go to Libraries and scroll down; it's under the SAMPSIO library. I'm going to scroll all the way down until we get to the HMEQ data, which I've already expanded here. You can see the different variables; there are 13 of them. If we click on the table, we can look at a sample of the data and see a few different observations. One thing you might notice is that there are quite a few of these little dots. A dot means the value is missing for that variable in that observation. This is going to be important, and it's something we look at later. So let me run this code. All right, it ran successfully and our CAS session has started.
Now I'm going to load the data set. We're loading it using the macro variable names we set earlier. I'll submit and run, and that was successful. We can see here that there are 5,960 observations and the 13 variables we mentioned earlier.
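If you're writing this yourself rather than using the snippet, a load step along these lines does the same job (the caslib and output table names here are assumptions):

    /* Copy the HMEQ data set into memory on the CAS server */
    proc casutil;
       load data=sampsio.hmeq casout="hmeq" outcaslib="casuser" replace;
    run;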
All right. Scrolling down a bit more, this section is all about exploring the data and looking for missing values. It's going to tell us the percentage of missing values and which variables have the most.
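The snippet's exact code may differ, but a quick missing-value check can be sketched with PROC CARDINALITY. The output column names (_VARNAME_, _NMISS_, _NOBS_) are written from memory, so verify them against the table the procedure actually produces:

    /* Summarize each variable, including how many values are missing */
    proc cardinality data=mycas.hmeq outcard=mycas.hmeq_card;
    run;

    /* Keep the variables with missing values and compute the percentage missing */
    data hmeq_missing;
       set mycas.hmeq_card(where=(_nmiss_ > 0));
       pct_missing = 100 * _nmiss_ / _nobs_;
       keep _varname_ _nmiss_ pct_missing;
    run;

    proc print data=hmeq_missing;
    run;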
I'm going to highlight and run this. OK. Here we have a few different sets of information. We have the data summary, which shows each of our variables and what level they are, whether interval or class. If we scroll to the right, we can also see the number of observations that are missing for each variable; that is, out of the total data set, how many observations had no value for that specific variable. We can also see the mean, the max, the standard deviation, and a few other statistics for our variables. If we scroll down, we can see the percentage of missing values for each variable. Here you can see that DEBTINC is missing for 21% of the observations, so there's a large amount of missing data for this particular variable, and a few others also have quite a few missing values. We're going to handle them in a minute.
But first, we're going to partition the data into training and validation. We want to have our training set defined before we impute, because we don't want to touch the validation set. So here we load the data we have and set the partition to be 70% of the data.
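A stratified 70/30 split along these lines can be sketched with PROC PARTITION; the table names are assumptions, and the exact options in the snippet may differ slightly:

    /* Stratified sampling: 70% of each level of BAD is flagged for training.
       The output table gets a _PartInd_ column (1 = training, 0 = validation). */
    proc partition data=mycas.hmeq partition samppct=70;
       by bad;
       output out=mycas.hmeq_part copyvars=(_ALL_);
    run;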
I'm going to run this, and that is complete. Here we can see the number of observations in each category of BAD, along with the number of samples drawn from each category; this sample represents the 70% training partition.
Now we're going to get into imputing the missing values we looked at earlier. We have our training set, and we're going to take the three variables seen here and impute them with the median. For the two variables listed below, we impute with the mean. Then we save all of this in a data set tagged "prepped," which is what we'll use when we create our model.
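In code, that imputation step can be sketched with PROC VARIMPUTE. Which variable gets which statistic mirrors what's shown later in the Model Studio half of the video (CLAGE and DEBTINC to the mean, YOJ and CLNO among the median group), so treat the variable lists below as assumptions:

    /* Replace missing values in the training data: mean for some inputs, median for others.
       Imputed columns typically come out with an IM_ prefix (for example IM_DEBTINC). */
    proc varimpute data=mycas.hmeq_part;
       input clage debtinc / ctech=mean;
       input yoj clno      / ctech=median;   /* plus the third median variable chosen in the video */
       output out=mycas.hmeq_prepped copyvars=(_ALL_);
    run;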
Right. What this generated shows us how many variables we imputed and which method we used. It also shows the variables, their mean and median, and the value that was substituted during imputation; each of those missing values now has that value instead. It's important when you're looking at data to make sure that what you're replacing is something you should be replacing. There can be cases where you have a missing value that you don't necessarily want to replace; it might just be another level in your data. So before you do this to your data set, make sure it's really what you want to do, and that it makes the data a more accurate representation of those observations.
All right. Now that we've done the imputation, we're going to identify the variables that explain variance in the target. This step identifies the variables that are predictive of our target variable: what explains the variability we see in the target? We want to find those variables and drop the ones that don't explain variance, so that our model runs on just the most predictive variables, without the noise that can come from less predictive ones. I'm going to select this bit of code, which also includes a plot we can look at.
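The snippet implements this as a supervised variable-reduction step. A rough sketch with PROC VARREDUCE follows; the IM_ variable names, the technique, and the maxeffects cutoff are assumptions worth checking against the documentation rather than the snippet's exact code:

    /* Select the inputs that explain the most variance in BAD */
    proc varreduce data=mycas.hmeq_prepped technique=discriminantanalysis;
       class bad reason job;
       reduce supervised bad = im_clage im_debtinc im_yoj im_clno
                               loan mortdue value derog delinq ninq
                               reason job / maxeffects=8;
       ods output selectionsummary=varselect;
    run;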
I'm going to run this. What we get is several pieces of summary information: the proportion of variance explained by each variable, the variables that were selected, and a plot of the variance explained at each iteration.
All right. Now that we've done this, we can finally get to building our predictive model with the random forest. Here we're using the FOREST procedure on the data set we prepped earlier, and we've specified a few different options: the number of trees is 50, the number of bins is 20, and the minimum leaf size is 5. We have to specify which of our variables are the interval inputs, which we named earlier in the program; we have our class inputs; and we have our target. We also point to the partition we created earlier, and then we output fit statistics for the model we create.
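The FOREST step looks roughly like this, using the macro variables sketched above. The seed, the output table names, and the exact input lists (which may need the IM_ names after imputation) are assumptions:

    /* Train a random forest: 50 trees, 20 bins, minimum leaf size of 5 */
    proc forest data=mycas.hmeq_prepped ntrees=50 numbin=20 minleafsize=5 seed=12345;
       input &interval_inputs. / level=interval;   /* after imputation these may be IM_ columns */
       input &class_inputs.    / level=nominal;
       target &target.         / level=nominal;
       partition rolevar=_partind_ (train='1' validate='0');
       output out=mycas.forest_scored copyvars=(_ALL_);
       ods output FitStatistics=forest_fitstats;
    run;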
So let's run this. OK. We can see some model information for the model we just created. It's the same information we specified earlier, plus some of the default values: the number of trees is 50, and we can see the number of bins and the maximum depth. We can also see the misclassification rate, which is currently 12%, and the split between training and validation. We can look at the variable importance generated by running the forest model, which shows which variables were determined to be most important. This next table is the fit statistics we generated: you can look at the training average squared error, the validation average squared error, and the same quantities for the misclassification rate. Here you can see that as we added more trees (the number of trees is shown on the left), the misclassification rate went down for both training and validation. The validation never quite reaches where the training is; that's expected. But we still have a relatively low misclassification rate of approximately 13%.
OK. Now what we're going to do is score the data using the generated model, which tells us how the model performs. I'm going to select this code, and it's going to give us a plot of the misclassification rate against the number of trees.
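To reproduce that plot outside the snippet, one option is to plot the FitStatistics table that PROC FOREST produced. The column names Trees, MiscTrain, and MiscValid are assumptions; check the actual table (for example with PROC CONTENTS) before running this:

    /* Misclassification rate versus number of trees, for training and validation */
    proc sgplot data=forest_fitstats;
       series x=trees y=misctrain / legendlabel="Training";
       series x=trees y=miscvalid / legendlabel="Validation";
       xaxis label="Number of trees";
       yaxis label="Misclassification rate";
    run;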
Here you can see that as we add trees, the misclassification rate gets lower and lower until around 18 trees or so, where it starts to plateau. There might still be some benefit to continuing; there are dips here and there where we get a slightly better rate, but overall the curve has flattened out, and adding more trees isn't really helping us build a better predictive model.
Next, we're going to assess the model's performance and analyze it using ROC and lift charts. I'm going to select both of these blocks of code; what they do is generate the information related to the lift and the ROC.
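A hedged sketch of that assessment step with PROC ASSESS is below. The predicted-probability column names (P_BAD1, P_BAD0) follow the usual naming convention for a nominal target called BAD, but they are assumptions, as are the output table names:

    /* Compute fit statistics, ROC, and lift information on the scored data, by partition */
    proc assess data=mycas.forest_scored;
       input P_BAD1;                        /* predicted probability that BAD = 1 */
       target bad / level=nominal event='1';
       fitstat pvar=P_BAD0 / pevent='0';    /* probability of the non-event level */
       by _partind_;
       ods output fitstat=assess_fitstats rocinfo=assess_roc liftinfo=assess_lift;
    run;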
Here you can see various lift information for each partition, and if we keep scrolling, we'll also see the fit statistics. What I'm scrolling down to is the plot that's generated: the ROC curve. This is a reasonably decent curve; it plots the true positive rate against the false positive rate, and, as you'd expect, the validation curve isn't quite as good as the training curve. If I scroll down a little further, you can see the lift chart, which also plots validation against training. That's it for the SAS Studio example. Now I'm going to move over to Model Studio to show you a similar model built with machine learning pipelines.
All right. I'm in SAS Drive, under Build Models, and I'm going to go to Create a New Project. I'll load the data set we used in SAS Studio; when you try this example yourself, you'll have access to the same data set and can download and import it the same way I just did. I'll name the project "example" and start with a blank template. There are other templates you can use, such as the basic template for a class target or the intermediate template, which start you off with a few different machine learning models. If you want to start from one of those, feel free; I'm going to start with a blank template that just gives me the Data node.
Now that we've opened the project and the data is loaded, Model Studio notifies us that we must assign a variable the role of target before we can run the pipeline. Our target variable is the one named BAD, so I'm going to switch its role to Target. Now that we've assigned BAD to the target role, we also need to correct a few of the other roles. Based on the way SAS reads in data, it scans the first few observations and assigns the measurement level it thinks each variable has. In this case it has assigned a few variables as nominal when they should be interval, so I'm going to select everything that is nominal except JOB and change it to interval. There's one more I need to change, so I'll deselect the previous ones first. OK, now our data is set up the way we want it.
is how we want it. However, like we did in
the previous example, we’re going to want to
impute those variables. So the way that we
do it in SAS Studio is different from Model Studio. Here I’m going to
select the variables that I want to impute. So I’m going to
select three variables that we’re going to change
the imputation to be median. So that is this
one, YOJ, and CLNO. So I’m going to
change this where it says Impute to be the median. So this is similar to
what we did in SAS Studio except all we’re doing here
is selecting the dropbox. All right. For our other two variables
that we set to the mean, that would be the
CLAGE and the DEBTINC. All right. So I’m going to change
those to be the mean. All right. So now that we have
Now that the data is the way we want it, I'm going to go over to the Pipelines tab. The pipeline initially starts off with just the Data node if you selected the blank template. I'll expand the nodes on the left, where we have Data Mining Preprocessing; if you expand that, there are a few options for transforming or exploring your data, such as variable clustering and variable selection. We're just going to drag over the Imputation node, because this is the node that actually imputes the data. When we set those options on the data earlier, we were specifying what should happen once an Imputation node is in the pipeline; without the node, no imputation takes place. With the node in place, you can scroll down and see the class and interval input settings. These both have default methods, which we're going to change to None, because we've already set which variables we want imputed and how. I'm also going to turn on Summary Statistics, which gives us an idea of what changed: it will tell you the number of missing observations and what they were replaced with. Now I'll run the pipeline.
All right, the pipeline has finished running. I'm going to open the Imputation node's results by right-clicking and selecting Results. Here we have a few different tables. The first is the Input Variable Statistics, where you can see the number of missing values for each variable; I'll expand it so we can see it a little better. You have the number missing, the percentage of the observations that represents, and the mean, standard deviation, and a few other statistics for that variable. I'm going to close out of this. Next you can see the imputed variables we selected: they now have a new name to indicate they were imputed, along with the method used for the imputation, the value that was substituted, and how many observations it replaced. There's some other potentially useful information here, but for now I'm going to close out so we can get to creating our model.
Now that we've imputed our variables, I'm going to expand Supervised Learning, where you can see a few different models available. I'm going to create a forest like we did in the SAS Studio example, so I'll drag it over and drop it. You also have the option of right-clicking a node, selecting Add Child Node, and then picking from the list of nodes that can go under it, so there are two ways of connecting new nodes to your pipeline. OK, here's our Forest node. By default it has 100 trees; we're going to change that to 50 to match what we used in the SAS Studio example. The pane also shows the other options: the tree-splitting options, where you can specify the class target criterion; the maximum depth; the minimum leaf size, which is 5, like we had in the SAS Studio example; and what to do with missing values, which here is simply to go ahead and use them in the model.
All right, now I'm going to run the model. I could have selected Run Pipeline, but I didn't in this case because we only have one model here, so the Model Comparison node wouldn't give us any information beyond the results of the forest itself. Now that the model has run, I want to show one additional thing before looking at the results. If you're interested in trying several different parameter settings, you can go to Perform Autotuning and simply turn it on. You'll see the same options, except most of them are now expressed as ranges; if you run with autotuning, it tests multiple hyperparameter combinations and presents you with the top 10 models, which you can then use to decide how you want your final model configured. For this case, we just ran it once with the default parameters, changing only the number of trees to 50. So I right-clicked the node and selected Results.
Here we can see a few different plots and tables. The first plot is the average squared error; I'll expand it, and to show you something similar to what we saw in SAS Studio, I'll click the drop-down and change it to the misclassification rate. It's a similar plot to the one we saw there, except it also includes the out-of-bag and test sets: here the data was split into three sets (training, validation, and test), and the misclassification rate is shown for each. If you look, these are approximately the same as what we saw in the SAS Studio example. The algorithms may differ slightly, but overall it didn't make much of a difference: we have around 8% misclassification for the training set and around 11% to 12% for the validation set.
This table is the variable importance, which was created when we ran the forest. It shows how important each variable is for predicting the target variable. We saw this in SAS Studio too, but here it's autogenerated in Model Studio when you run your model, so you have access to all of it without having to add or run any additional code; it's produced automatically when the model runs.
Now I'm going to switch over to the Assessment tab, which gives us our lift reports and our ROC reports. You can see the plots here, and if you scroll down you can also see the fit statistics. This is all information we created ourselves in SAS Studio, but it's given to you automatically when you run your nodes in Model Studio. So if you liked this video and want more tips like it, subscribe to our channel. If you want related information, check the links in the description below. And if you have any comments or questions, feel free to leave them. Thanks for watching.