Welcome to R: An Introduction. I'm Barton Poulson, and my goal in this course is to introduce you to R. This is R, but also, this is R. And then finally, this is R; it's arguably the language of data science. And just so you don't think I'm making stuff up off the top of my head, I have some actual data. This is a ranking from a survey of data mining experts on the software that they use most often in their work. Take a look here at the top: R is first. In fact, it's 50% more than Python, which is another major tool in data science. So both of them are important, but you can see why I personally am fond of R, and why it's the one that I want to start with in introducing you to data science. Now, there are a few reasons that R is especially important. Number one, it's free and it's open source, whereas other software
packages can cost thousands of dollars per year. Also, R is optimized for vector operations, which means you can go through an entire row, or an entire table of data, without having to explicitly write for loops. If you've ever had to do that, then you know it's a pain, and so this is a nice thing. Also, R has an amazing community behind it, where you can find supportive people, you can get examples of whatever it is you need to do, and you can get new developments all the time. Plus, R has over 9,000 contributed, or third-party, packages available that make it possible to do basically anything. Or, if you want to put it in the words of Yoda, you can say this: "This is R. There is no if. Only how." In this case, I'm quoting R user Simon Blomberg. So very briefly, in sum, here's why I want to introduce you to R: number one, because R is the language of data science,
because it's free and it's open source, and because, with the free packages that you can download and install, R makes it possible to do nearly anything when you're working with data. So I'm really glad you're here, and that I'll have this chance to show you how you can use R to do your own work with data in a more productive, more interesting, and more effective way. Thanks for joining me.

The first thing that we need to do for R: An Introduction is to get set up. More specifically, we need to talk about installing R. The way you do this is you download it; you just need to go to the home page for the R Project for Statistical Computing, and that's at r-project.org. When you get there, you can click on this link in the first paragraph that says "download R," and that'll bring you to this page that lists all the places that you can
download it. Now, I find the easiest is to simply go to this top one, the one that has "cloud," because that'll automatically direct you to whichever of the mirrors below is best for your location. When you click on that, you'll end up at this page, the Comprehensive R Archive Network, or CRAN, which we'll see again in this course. You need to come here and click on your operating system. If you're on a Mac, it'll take you to this page, and the version you're going to want to click on is just right here. It's a package file, that is, a zipped application installation file. Click on that, download it, and follow the standard installation directions. If you're on a Windows PC, then you're probably going to want this one, "base." Again, click on it, download it, and go through the standard installation procedure. And if you're on a Linux computer,
you're probably already familiar with what you need to do, so I'm not going to run through that. Now, before we take a look at what R is actually like when you open it, there's one other thing you need to do, and that is to get the files that we're going to be using in this course. On the page where you found this video, there's a link that says "Download Files." If you click on that, then you'll download a zipped folder called R01_Intro_Files. Download that, unzip it, and, if you want to, put it on your desktop. When you open it, you're going to see something like this: a single folder that's on your desktop. And if you click on it, then it opens up a collection of scripts. The .R extension is for an R source, or script, file. I also have a folder with a few data files that we'll be using in one of these videos. If
you simply double-click on this first file, whose full name is this, that'll open up in R, and let me show you what that looks like. When you open up the application R, you will probably get a setup of windows that looks like this. On the left is the source window, or the script window, where you actually do your programming. On the right is the console window that shows you the output, and right now it's got a bunch of boilerplate text. Now, coming over here again on the left, any line that begins with a pound sign, or hashtag, or octothorpe is a commented line. That's not run. The other lines are code that can be run. By the way, you may notice a red warning just popped up on the right side; that's just telling us about something that has to do with changes in R, and it doesn't affect us. What I'm going to do right here is I'm going to put the cursor in this line,
and then I'm going to hit Command or Control and then Enter, which will run that line. And you can see now that it's opened up over here. What I've done is I've made available to the program a collection of data sets. Now I'm going to pick one of those data sets. It's the iris data set, which is very well known; it's measurements of three species of the iris flower. And we're going to do head to see the first six lines. And there we have the sepal length, sepal width, petal length, and petal width; in this case, it's all setosa. But if you want to see a summary of the variables and get some quick descriptive statistics, we can run this next line over here. And now I get the quartiles and the mean, as well as the frequency of the three different species of iris. On the other hand, it's really nice to get things visually. So I'm going to run this basic plot command for the entire data set.
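For anyone following along without the course files, here is a minimal sketch of the commands this first walkthrough runs. The script itself isn't reproduced in the transcript, so the exact lines are an assumption based on the narration: load the datasets package, then look at iris three ways.

```r
# Sketch of the first session (assumed from the narration, not the course file itself)
library(datasets)   # make R's built-in example data sets available

head(iris)          # first six rows: sepal/petal length and width, plus species
summary(iris)       # quartiles and mean for each measurement, counts per species
plot(iris)          # scatterplot matrix of every variable against every other
```

Each line can be run on its own by putting the cursor in it and pressing Command or Control plus Enter, as described above.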
And it opens up a small window; I'm going to make it bigger. It's a scatterplot of the measurements for the three kinds of irises, as well as a funny one where it's including the three different categories. I'm going to close that window. And so that is basically what R looks like and how R works in its simplest possible version. Now, before we leave, I'm actually going to take a moment to clean up the application and the memory. I'm going to detach, or remove, the datasets package that I added. I already closed the plot, so I don't need to do this one separately. But what I can do is come over here to clear the console; I'm actually going to come up to Edit and come down to Clear Console. And that cleans it out. This is a very quick run-through of what R looks like in its native environment. But in the next movie, I'm going to
show you another application we can install, called RStudio, that lays on top of this and makes interacting with R a lot easier, a lot more organized, and really a lot more fun to work with.

The next step in R: An Introduction and setting up is about something called RStudio. Now, this is RStudio. What it is is a piece of software that you can download in addition to R, which you've already installed, and its purpose is really simple: it makes working with R easier. Now, there are a few different ways that it does this. Number one, it has consistent commands. What's funny is that the different operating systems have slightly different keyboard commands for the same operations in R, and RStudio fixes that. It makes it the same whether you're on Mac, Windows, or Linux. Also, there's a unified interface: instead of having two,
three, or 17 windows open, you have one window with the information organized. It also makes it really easy to navigate with the keyboard and to manage the information that you have in R. Let me show you how to do this. But first we have to install it. What you're going to need to do is go to RStudio's website, which is at rstudio.com. From there, click on "Download RStudio." That'll bring you to this page, or something like it, and you're going to want to choose the desktop version. Now, when you get there, you're going to want to download the free, sort of community version, as opposed to the $1,000-a-year version. And so click here on the left. Then you're going to come to the list of installers for supported platforms; it's down here on the left. This is where you get to choose your operating system. Click the top one if you have Windows, the
next one if you have a Mac, and then we have lots of different versions of Linux. Whichever one you get, click on it, download it, and go through the standard installation process, then open it up. And then let me show you what it's like working in RStudio. To do this, open up this file and we'll see what it's like in RStudio. When you open up RStudio, you get this one window that has several different panes in it. At the top, we have the script, or the source, window, and this is where you do your actual programming. You'll see that it looks really similar to what we saw when I opened up the R application. The color is a little different, but that's something that you can change in Preferences or Options. The console is down here at the bottom, and that's where you get the text output. Over here is the environment, which saves the variables if you're using any, and
then plots and other information show up here in the bottom right. Now, you have the option of rearranging things and changing what's there as much as you want. RStudio is a flexible environment, and you can resize things by simply dragging the divider between the areas. So let me show you a quick example, using the exact same code that I used in my previous example, so you can see how it works in RStudio as opposed to the regular R app that we used the first time. First, I'm going to load some data; that's by using the datasets package. I'm going to do Command or Ctrl and Enter to load that one, and you can see right here, it's run the command. Then I want to do the quick summary of data. I'm going to do head(iris), which shows the first six lines. And here it is down here; I can make that a little bit bigger if I want.
Then I can do a summary by just coming back here and pressing Command or Control Enter. And actually, I'm going to do a keyboard command to make the console bigger now. Then we can see all of that; I have the same basic descriptive statistics and the same frequencies there. I'll go back to how it was before, and bring this one down a little. And now we can do the plot. Now, this time, you see it shows up in this window here on the side, which is nice; it's not a standalone window. Let me make that one bigger; it takes a moment to adjust. And there we have the same information that we had in the R app. Here, it's more organized, in a cohesive environment. And you see that I'm using keyboard shortcuts to move around, and it makes life really easy for dealing with the information that I have in R. I'm going to do the same cleanup; I'm going
to detach the package that I had. This is actually a little command to clear the plots. And then here in RStudio, I can run a funny little command that'll do the same as doing Ctrl-L to clear the console for me. And that is a quick run-through of how you can do some very basic coding in RStudio, which, again, makes working with R more organized, more efficient, and easier to do overall.

In our very basic introduction to R and setting up, there's one more thing I want to mention that makes working with R really amazing, and that's the packages that you can download and install. Basically, you can think of them as giving you superpowers when you're doing your analysis, because you can basically do anything with the packages that are available. Specifically, packages are bundles of code, so it's more software that adds new functions to R and makes it
so we can do new things. Now, there are two kinds of packages, two general categories. There are base packages; these are packages that are installed with R, so they're already there, but they're not loaded by default. That way, R doesn't use as much memory as it might otherwise. But more significant than that are the contributed, or third-party, packages. These are packages that need to be downloaded, installed, and then loaded separately. And when you get those, it makes things extraordinary. And so you may ask yourself where to get these marvelous packages that make things so super-duper. Well, you have a few choices. Number one, you can go to CRAN. That's the Comprehensive R Archive Network; it's an official R site that has things listed with the official documentation. Two, you can go to a site called
Crantastic, which really is just a way of listing these things; when you click on the links, it redirects you back to CRAN. And then third, you can also get R packages from GitHub, which is an entirely different process. If you're familiar with GitHub, it's not a big deal. Otherwise, you don't usually need to deal with it. But let's start with this first one, the Comprehensive R Archive Network, or CRAN. Now, we saw this previously, when we were just downloading R. This time, we're going to cran.r-project.org, and we're specifically looking for this one, the CRAN packages. That's going to be right here on the left; click on Packages. When you open that, you're going to have an interesting option, and that's to go to Task Views, which breaks it down by topic. So we have here packages that
deal with Bayesian inference, packages that deal with chemometrics and computational physics, and so on and so forth. If you click on any one of those, it'll give you a short description of the packages that are available and what they're designed to do. Now, another place to get packages, as I said, is Crantastic, at crantastic.org. This is one that lists the most recently updated and the most popular packages, and it's a nice way of getting some sort of information about what people use most frequently, although it does redirect you back to CRAN to do the actual downloading. And then at github.com, if you go to /trending/r, you'll see the most common, or most frequently downloaded, packages on GitHub for use in R. Now, regardless of how you get them, let me show you the ones that I use most often. I find these make working with R really a lot
more effective and a lot easier. Now, they have kind of cryptic names. The first one is dplyr, which is for manipulating data frames. Then there's tidyr, for cleaning up information; stringr, for working with strings, or text information; lubridate, for manipulating date information; and httr, for working with website data. ggvis, where the GG stands for "grammar of graphics," is for interactive visualizations. ggplot2 is probably the most common package for creating graphics, or data visualizations, in R. shiny is another one; it allows you to create interactive applications that you can install on websites. rio, which stands for "R input output," is for importing and exporting data. And then rmarkdown allows you to create what are called interactive notebooks, or rich documents, for sharing your information. Now, there are others, but there's one in particular
that I think is useful. I call it the one package to load them all, and it's pacman, which, not surprisingly, stands for "package manager." I'm going to demonstrate all of these in another course that we have here, but let me show you very quickly how to get them working. Let's try it in R. If you open up this file from the course files, let me show you what it looks like. What we have here in RStudio is the file for this particular video. I say that I use pacman; if you don't have it installed already, then run this one installation line. This is the standard installation command in R. It'll add pacman, and then it will show up here in Packages. Now, I already have it installed, so you can see it right there, but it's not currently loaded. See, installing means making it available on your hard drive, but loading means actually
making it accessible to your current routines. So then I need to load it, or import it, and I can do it in one of two ways. I can use require, which gives a confirmation message; I can do it like this, and you see it's got that little sentence there. Or I can do library, which simply loads it without saying anything. You can see now, by the way, that it's checked off, so we know it's there. Now, if you have pacman installed, even if it's not loaded, then you can actually use pacman to install other packages. So what I actually do, because I have pacman installed, is just go straight to this one: you do pacman and then the two colons. That says, use this command even though this package isn't loaded. And then I load an entire collection, all the things that I showed you, starting with pacman itself. So now I'm going to run this command.
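As a rough sketch of the install-versus-load steps just described (the course script itself isn't shown in the transcript, so the exact lines are an assumption): the install line is left commented out because it needs internet access and only has to be run once, and the pacman call is guarded so the snippet still runs when pacman isn't installed.

```r
# One-time step (needs internet access): the standard installation command.
# install.packages("pacman")

# Two ways to load an installed package, shown here with datasets, which ships with R.
loaded <- require("datasets")   # returns TRUE if the package could be loaded
library(datasets)               # loads it too; stops with an error if it is missing

# `pacman::` calls a pacman function even though pacman itself isn't loaded.
# p_load() installs any packages that are missing, then loads all of them.
if (requireNamespace("pacman", quietly = TRUE)) {
  pacman::p_load(pacman, dplyr, tidyr, stringr, lubridate,
                 httr, ggvis, ggplot2, shiny, rio, rmarkdown)
}
```

The package list is the one named earlier in this video; trim it to whatever you actually need, since each missing package triggers a download.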
And what's nice about pacman is, if you don't have the package, it will actually install it, make it available, and load it. And I've got to tell you, this is a much easier way to do it than the standard R routine. Then, for base packages, meaning the ones that come with R natively, like the datasets package, you still want to do it this way: you load and unload them separately. So now I've got that one available, and then I can do the work that I want to do. Now, I'm actually not going to do it right now, because I'm going to show it to you in future videos. But now I have a whole collection of packages available; they're going to give me a lot more functionality and make my work more effective. I'm going to finish by simply unloading what I have here. Now, if you want to, with pacman you can unload specific packages, or the easiest way is to do p_unload(all). What that does is unload all of the add-on, or contributed third-party, packages, and you can see I've got the full list here of what is unloaded. However, for the base packages like datasets, you need to use the standard R command detach, which I'll use right here. And then I'll clear my console. That's a very quick run-through of how packages can be found online, installed into R, and loaded to make more code available to you. I'll demonstrate how those work in basically every video from here on out, so you'll be able to see how to exploit their functionality to make your work a lot faster and a lot easier.

Probably the best place to start when you're working with any statistics program is basic graphics, so you can get a quick visual impression of what you're dealing with. And the command in
R that's simplest of all is the default plot command, also known as basic X-Y plotting, for the x and y axes on a graph. What's neat about R's plot command is that it adapts to data types and to the number of variables that you're dealing with. Now, it's going to be a lot easier for me to simply show you how this works, so let's try it in R. Just open up the script file and we'll see how we can do some basic visualizations in R. The first thing that we're going to do is load some data sets from the datasets package that comes with R. We simply do library(datasets), and that loads it up. We're going to use the iris data, which I've shown you before and which you'll get to see many more times. Let's look at the first few lines; I'll zoom in on that. What this is is the measurements of the sepal and petal length and width for three species of
irises. It's a very famous data set, about 100 years old, and it's a great way of getting a quick feel for what we're able to do in R. I'll come back to the full window here. What we're going to do first is get a little information about the plot command. To get help on something in R, just do the question mark and the thing you want help for. Now, we're in RStudio, so this opens up right here in the help window. And you see we've got the whole set of information here: all the parameters, additional links you can click on, and then examples here at the bottom. I'm going to come over here and I'm going to use the command for a categorical variable first, since that's the most basic kind of data that we have. Species, which is three different species, is what I want to use right here. So I'm going to do plot, and then in the parentheses,
you put what it is you want to plot. What I'm doing here is I'm saying it's in the data set iris, that's our data frame, actually, and then the dollar sign says use this variable that's in that data. So that's how you specify the whole thing. And then we get an extremely simple three-bar chart; I'll zoom in on it. What it tells you is that we have three species of iris, setosa, versicolor, and virginica, and that we have 50 of each. It's nice to know that we have balanced groups and that we have three groups, because that might affect some of the analyses that you do. And it's an extremely quick and easy way to begin looking at the data. I'll zoom back out. Now let's look at a quantitative variable, so one that's on an interval or ratio level of measurement. For this one, I'll do petal length. And you see I do the same thing: plot and then iris
and then Petal.Length. Please note I'm not telling R that this is now a quantitative variable; it's able to figure that one out by itself. Now, this one's a little bit funny, because it's a scatterplot; I'm going to zoom in on it. But the x axis is the index number, or the row number, in the data set, so that one's really not helpful. It's the variable that's going on the y, that's the petal length, where you get to see the distribution. On the other hand, you know that we have 50 of each species, so we have the setosa, and then we have the versicolor, and then we have the virginica. And so you can see that there are group differences on these three things. Now, what I'm going to do is ask for a specific kind of plot to break it down more explicitly between the categories. That is, I'm going to put in two variables now,
where I have my categorical species, and then a comma, and then the petal length, which is my quantitative measurement. I'm going to run that again; you just hit Ctrl or Command and Enter. And this is the one that I'm looking for here. Let's zoom in on that. Again, you see that it's adapted. It knows, for instance, that the first variable I gave it is categorical and the second was quantitative, and the most common chart for that is a box plot. So that's what it automatically chooses to do. And you can see it's a good plot here; we can see very strong separation between the groups on this particular measurement. I'll zoom back out. And then let's try a quantitative pair. So now I'll do petal length and petal width, so it's going to be a little bit different. I'll run that command. And now this one is a proper scatterplot,
where we have a measurement across the bottom and a measurement up the side. And you can see that there's a really strong positive association between these two. So, not surprisingly, as a petal gets longer, it generally also gets wider, so it just gets bigger overall. And then finally, if I want to run the plot command on the entire data set, the entire data frame, this is what happens: we do plot and then iris. Now, we've seen this one in previous examples, but let me zoom in on it. What it is is an entire matrix of scatterplots of the four quantitative variables. And then we have species, which is kind of funny because it's not labeling them, but it shows us a dot plot for the measurements of each species. This is a really nice way, if you don't have too many variables, of getting a very quick, holistic impression of what's going on in your data. And so
the point of this is that the default plot command is able to adapt to the number of variables I give it, and to the kind of variables I give it, and it makes life really easy. Now, I want you to know that it's possible to change the way that these look. I'm going to specify some options. I'm going to do the plot again, this scatterplot, where I say plot, and then in parentheses I give these two arguments, saying what I want in it: do the petal length, and do the petal width. And then I'm going to go to another line; I'm just separating with a comma. Now, if you want to, you can write this all as one really long line; I break it up because I think it makes it a little more readable. I'm going to specify the color with col, for color, and then I use a hex code. And that code is actually for the red that is used on the datalab homepage.
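To recap the calls so far and the customized one being built up here, a hedged sketch follows. The course script isn't shown in the transcript, so the exact lines are assumptions: in particular, the hex code is a placeholder red (the real value isn't given), and the title and axis labels are made-up examples. The pch, main, xlab, and ylab options are the ones described in the next few sentences.

```r
library(datasets)

# plot() picks a chart type from the kind of data it is given
plot(iris$Species)                         # one categorical variable: bar chart
plot(iris$Petal.Length)                    # one quantitative variable: values by row index
plot(iris$Species, iris$Petal.Length)      # categorical x, quantitative y: box plots
plot(iris$Petal.Length, iris$Petal.Width)  # two quantitative variables: scatterplot
plot(iris)                                 # whole data frame: scatterplot matrix

# The same scatterplot with options, one argument per line for readability
plot(iris$Petal.Length, iris$Petal.Width,
     col  = "#cc0000",                            # placeholder hex red (assumed value)
     pch  = 19,                                   # point character 19 = solid circle
     main = "Iris: Petal Length vs. Petal Width", # hypothetical title
     xlab = "Petal Length",                       # hypothetical axis labels
     ylab = "Petal Width")
```

Because the arguments are only separated by commas, the whole call could equally be written on one long line.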
And then pch is for "point character," and 19 is a solid circle. Now I'm going to put a main title on it, and then I'm going to put a label on the x axis and a label on the y axis. So I'm actually going to run those now by doing Command or Control Enter for each line, and you can see it builds up. When we finish, we've got the whole thing; I'll zoom in on it again. And this is the kind of plot that you could actually use in a presentation, or possibly in a publication. So even with the base command, we're able to get really good-looking, informative, and clean graphs. Now, what's interesting is that the plot command can do more than just show data; we can actually feed it formulas. If you want, for instance, to get a cosine, I do plot and then cos, which is for cosine. And then I give the limits; I go from zero to two times pi, because that's relevant for
cosine. I click on that, and you can see the graph there; it's doing our little cosine curve. I can do an exponential curve from one to five, and there it is, curving up. And I can do dnorm, which is for the density of a normal distribution, from minus three to plus three, and there's the good old bell curve there in the bottom right. And then we can use the same kind of options that we used earlier for our scatterplot. Here we say, do a plot of dnorm, so the bell curve, from minus three to plus three on the x axis. And now we're going to change the color to red; lwd is for line width, to make it thicker; and we give it a title on the top, a label on the x axis, and a label on the y axis. We'll zoom in on that. And so there is my new and improved, prettier, and presentation-ready bell curve that I got with the default plot command in R. And so this is a really flexible and
powerful command. Also, it's in the base package. You'll see that we have a lot of other commands that can do even more elaborate things, but this is a great way to start and get a quick impression of your data, see what you're dealing with, and shape the analyses that you do subsequently.

The next step in R: An Introduction, and our discussion of basic graphics, is bar charts. And the reason I like to talk about bar charts is this: because simple is good. Bar charts are the most basic graphic for the most basic data, and so they're a wonderful place to start in your analysis. Let me show you how this works. Let's try it in R. Open up this script, and let's run through and see how it works. When you open up the file in RStudio, the first thing we're going to want to do is come down here and open up the datasets package.
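As a preview of the steps this bar chart walkthrough goes through next, here is a minimal sketch. The course script isn't reproduced in the transcript, so the exact lines are an assumption; the object name cylinders is the one used in the narration.

```r
library(datasets)

head(mtcars)                    # first rows of the 1974 Motor Trend road tests

barplot(mtcars$cyl)             # raw column: one bar per car, not a summary

cylinders <- table(mtcars$cyl)  # count the cars with 4, 6, and 8 cylinders
barplot(cylinders)              # bar chart of those counts
plot(cylinders)                 # default plot of a table: vertical lines of the same heights
```

The key step is table(), which turns the raw column into the summary counts that a bar chart needs.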
And then we're going to scroll down a little bit, and we're going to use a data set called mtcars. Let's get a little bit of information about this; do the question mark and the name of the data set. This is Motor Trend, that's a magazine, car road tests from 1974. So, you know, they're 42 years old. Let's take a look at the first few rows of what's in mtcars by doing head. I'm going to zoom in on this. What you can see is that we have a list of cars, the Mazda RX4 and the wagon, the Datsun 710, the AMC Hornet, and I actually remember these cars. We have several variables on each of them: we have the MPG, the number of cylinders, the displacement in cubic inches, the horsepower, the final drive ratio, which has to do with the axle, and then we have the weight in thousands of pounds and the quarter-mile time in seconds. And these are a bunch of really, really
slow cars. vs is for whether the cylinders are in a V or whether they are straight, or inline. And then am is for automatic or manual. Then, going on to the next line, we have gear, which is the number of gears in the transmission, and carb, for how many carburetor barrels they have, and we don't even use carburetors anymore. Anyhow, that's what's in the data set. I'll zoom back out. Now, if we want to do a really basic bar chart, you might think that the most obvious thing to do would be to use R's barplot command, which is named for the bar chart, and then to specify the data set mtcars, then the dollar sign, and then the variable that we want, cyl. You'd think that would work, but unfortunately, it doesn't. Instead, what we get is this, which is just going through all the cases one by one, row by row, and telling
us how many cylinders are in that case. That's not a good one; that's not what we want. What we actually need to do is reformat the data a little bit. By the way, you would have to do the exact same thing if you wanted to make a bar chart in a spreadsheet, like Excel or Google Sheets: you can't do it with the raw data; you first need to create a summary table. So what we're going to do here is use the command table. We're going to say, take this variable from this data set, make a table of it, and feed it into an object, you know, a data thing, a data container, called cylinders. I'm going to run that one. And then you see that just showed up in the top left; let me zoom in on that one. So now I have in my environment a data object called cylinders. It's a table, it's got a length of three, it's got a size of
1,000 bytes, and it gives us a little bit more information. Let's go back to where we were. Now that I've saved that information into cylinders, which just has the number of cylinders, I can run the barplot command, and now I get the kind of plot I expected to see. From this, we see that we have a fair number of cars with four cylinders and a smaller number with six. And because this is 1974, we've got a lot of eight-cylinder cars in this particular data set. Now, we can also use the default plot command, which I showed you previously, on the same data. It's going to do something a little different: it's actually going to make a line chart where the lines are the same length as each of the bars. I'd probably use the bar plot instead, because it's easier to tell what's going on. But this is a way of making a default chart that gives you the information you need
for the categorical variables. Remember, simple is good, and that's a great way to start.

In our last video on basic graphics, we talked about bar charts. If you have a quantitative variable, then the most basic kind of chart is a histogram. This is for data that is quantitative, or scaled, or measured, or interval or ratio level; all of those are referring to basically the same thing. In all of those, you want to get an idea of what you have, and a histogram allows you to see what you have. Now, there are a few things you're going to be looking for with a histogram. Number one, you're going to be looking for the shape of the distribution: is it symmetrical, is it skewed, is it unimodal or bimodal? You're going to look for gaps, or big empty spaces, in the distribution. You're also going to look for outliers, unusual scores, because those can distort any of your subsequent
analyses. You'll look for symmetry, to see whether you have the same number of high and low scores or whether you have to do some sort of adjustment to the distribution. But this is going to be easier if we just try it in R. So open up this R script file, and let's take a look at how we can do histograms in R. When you open up the file, the first thing we need to do is come down here and load the data sets. We'll do this by running the library command; I just do Ctrl or Command Enter. And then we can use the iris data set again. We've looked at it before, but let's get a little bit of information on it by asking for help on iris. And there we have Edgar Anderson's Iris Data, also known as Fisher's Iris Data, because he published an article on it. And here's the full set of information available on it, from 1936. So it's 80 years old. Let's take a look at the first
few rows. Again, we've seen this before: sepal and petal length and width for three species of iris. We're gonna do a basic histogram on the four quantitative variables that are in here. And so I'm going to use just the hist command. So hist, and then the data set iris, and then the dollar sign to say which variable, and then Sepal.Length. I run that and I get my first histogram. Let's zoom in on it a little bit. And what happens here is, of course, it's a basic sort of black line on white background, which is fine for exploratory graphics. It gives us a default title that says "Histogram of" the variable, and it gives us the clunky name, which is also on the x axis on the bottom. It automatically adjusts the x axis and it chooses about seven or nine bars, which is usually the best choice for a histogram. And then on the left, it gives us the frequency, or the count of
how many observations are in that group. So for instance, we have only five irises whose sepal length is between four and four and a half centimeters, I think it is. Let's zoom back out. And let's do another one. Now, this time for sepal width, you can see that's almost a perfect bell curve. And when we do petal length, we get something different. Let me zoom in on that one. And this is where we see a big gap. We've got a really strong bar there at the low end; in fact, it goes above the frequency axis. And then we have a gap, and then sort of a bell curve. That lets us know that there's something interesting going on with the data that we're going to want to explore a little more fully. And then we'll do another one for petal width. I'll just run this command. And you can see the same kind of pattern here, where there's
a big clump at the low end, there's a gap, and then there's sort of a bell curve beyond that. Now, another way to do this is to do the histograms by groups. And that would be an obvious thing to do here, because we have three different species of iris. So what we're going to do here is put the graphs into three rows, one above another, in one column. I'm going to do this by changing a parameter; par is for parameter, and I'm giving it the number of rows that I want to have in my output. And I need to give it a combination of numbers. I do this with c, which is for concatenate; it means treat these two numbers as one unit, where three is the number of rows and one is the number of columns. So I run that, and it doesn't show anything just yet. And then I'm going to come down and do this more elaborate command.
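As a minimal sketch of the layout step just described, the snippet below stacks three plots in one column and then puts the settings back. The name `old_par`, the loop over species, and the color lookup are illustrative choices, not the course script; the choice of `Petal.Width` and the `xlim` and `breaks` values are assumptions picked only so that all three panels share one x scale. The square-bracket row selection is the one explained right after this point in the narration.

```r
library(datasets)  # iris ships with R's datasets package

# c(3, 1) concatenates two numbers into one vector: 3 rows, 1 column
layout_dims <- c(3, 1)
old_par <- par(mfrow = layout_dims)  # par() hands back the previous settings

# One color per species, following the colors mentioned in the narration
cols <- c(setosa = "red", versicolor = "purple", virginica = "blue")

# One histogram per species, all on the same x scale (assumed limits)
for (sp in levels(iris$Species)) {
  hist(iris$Petal.Width[iris$Species == sp],
       xlim = c(0, 3), breaks = 9,
       main = paste("Petal Width for", sp),
       xlab = "", col = cols[[sp]])
}

par(old_par)  # change the graphical parameters back to what they were
```

Saving the value returned by `par()` is one way to restore the layout; setting `par(mfrow = c(1, 1))` by hand does the same job for this one parameter.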
I'm going to do hist; that's the histogram that we've been doing. I'm going to do petal length, except this time in square brackets I'm going to put a selector. This means use only these rows. And the way I do this is by saying I want to do it for the setosa irises. So I say iris, that's the data set, and then dollar sign, and then Species is the variable. And then two equals signs, because in computers that means "is equivalent to", and then in quotes, and you have to spell it exactly the same with the same capitalization, I put setosa. So this is the variable and the row selection. I'm also going to put in some limits for the x axis, because I want to manually make sure that all three of the histograms I have use the same x scale. I'm also going to specify breaks, which is for how many bars I want in the histogram. And actually, what's funny about this is it's really
only a suggestion that you give to the computer. Then I'm going to put a title above that one, I'm going to have no x label, and I'm going to make it red. So we'll do all of that right now. I'll just run each line. And then you see I have a very skinny chart. Let's zoom in on it. So it's very short, but that's because I'm gonna have multiple charts; it's gonna make more sense when we look at them all together. But you can see, by the way, that the petal width for the setosa irises is on the low end. Now let's do the same thing for versicolor. I'm going to run through all that. It's all gonna be the same, except we're gonna make it purple. There's versicolor. And then let's do virginica last, and we'll make those blue. And now I can zoom in on that. And now we have our three histograms. It's the same variable, petal width, but now I'm doing
it separately for each of the three species. And it's really easy to see what's going on here now. Setosa is really low; versicolor and virginica overlap, but they're still distinct distributions. This approach, by the way, is referred to as small multiples: making many versions of the same chart on the same scale, so it's really easy to compare across groups or across conditions, which is what we're able to do right here. Now, by the way, anytime you change the graphical parameters, you want to make sure to change them back to what they were before. So here I'm doing par, and then going back to one column and one row. And that's a good way of doing histograms for examining quantitative variables, and even for exploring some of the complications that can arise when you have different categories with different scores on those variables.

In our two previous videos,
we looked at some basic graphics for one variable at a time: we looked at bar charts for categorical variables, and we looked at histograms for quantitative variables. While there's a lot more you can do with univariate distributions, you also might want to look at bivariate distributions. We're gonna look at scatterplots as the most common version of that. You do a scatterplot when what you want to do is visualize the association between two quantitative variables. Now, I know it's actually more flexible than that, but this is the canonical case for a scatterplot. And when you do that, what sorts of things do you want to look for in your scatterplot? I mean, there's a purpose in it. Well, number one, you want to see if the association between your two variables is linear, that is, if it can be described by a straight line, because most of the procedures that we do
assume linearity. You also want to check if you have consistent spread across the scores as you go from one end of the x axis to the other, because if things fan out considerably, then you have what's called heteroscedasticity, and it can really complicate some of the other analyses. As always, you want to look for outliers, because an unusual score, or especially an unusual combination of scores, can drastically throw off some of your other interpretations. And then you want to look for the correlation: is there an association between these two variables? So that's what we're looking for. Let's try it in R. Simply open up this file, and let's see how it works. The first thing we need to do in R is come down and open up the datasets package, just Command or Control and Enter, and we'll load the data sets. We're going to use mtcars; we looked at that before.
It's got a little bit of information: it's road test data from 1974. And let's look at the first few cases. I'll zoom in on that. Again, we have miles per gallon, cylinders, so on and so forth. Now, anytime you're going to do an association, it's a really good idea to look at the univariate, or one variable at a time, distributions as well. We're going to look at the association between weight and mpg, so let's look at the distribution for each of those separately. I'll do that with a histogram. I do hist, and then in parentheses I specify the data set, mtcars in this case, and then the dollar sign to say which variable in that data set. So there's the histogram for weight. And you know, it's not horrible; it looks like we've got a few on the high end there. And here's the histogram for miles per gallon. Again, mostly kind of normal, but a few on the high end.
But let's look at the plot of the two of them together. Now, what's interesting is I just use the generic plot command. I feed that in, and R is able to tell that I'm giving it two quantitative variables, and that a scatterplot is the best kind of plot for that. So we're gonna do weight and mpg. And then let me zoom in on that. And what you see here is one circle for each car at the joint position of its weight and its mpg, and it's a strong downhill pattern. Not surprisingly, the more a car weighs, and we have some in this data set that are five tons, the lower its miles per gallon; we get down to about 10 miles per gallon here. The smallest cars, which appear to weigh substantially under two tons, get about 30 miles per gallon. Now, this is probably adequate for most purposes. But there's a few other things that we can do.
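The steps up to this point can be sketched in a few lines: look at each variable on its own, then hand both to the generic plot function. This is a minimal version that assumes only the built-in mtcars data and its `wt` and `mpg` columns; it leaves out the titles and styling discussed next.

```r
library(datasets)  # provides mtcars

# Look at each variable on its own first
hist(mtcars$wt)
hist(mtcars$mpg)

# Given two quantitative vectors, the generic plot() draws a scatterplot
plot(mtcars$wt, mtcars$mpg)
```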
For instance, I'm going to add some colors here. I'm going to take the same plot and then add on additional arguments. So I say use a solid circle: pch is for point character, and 19 is a solid circle. cex has to do with the size of things, and 1.5 means making it 150% of the normal size. col is for color, and I'm specifying a particular red, the one for datalab, in hex code. I'm going to give a title, and I'm going to give an x label and a y label. And then we'll zoom in on that. And now we have a more polished chart that also, because of the solid red circles, makes it easier to see the pattern that's going on in there, where we've got some really heavy cars with really bad gas mileage, and then an almost perfect linear association up to the lighter cars with much better gas mileage. And so a scatterplot is the easiest way of looking at the association between two variables, especially
when those two variables are quantitative, so they're on a scaled or measured outcome. And that's something that you want to do anytime you're doing your analysis: first visualize it, and then use that as the introduction to any numerical or statistical work you do after that.

As we go through our necessarily very short presentations on basic graphics, I want to finish by saying one more thing, and that is you have the possibility of overlaying plots. And that means putting one plot directly on top of, or superimposing it on, another. Now, you may ask yourself why you'd want to do this. Well, I can give you an artistic version of this. This, of course, is Pablo Picasso's Les Demoiselles d'Avignon. It's one of the early masterpieces in Cubism, and the idea of Cubism is that it gives you many views, or it gives you simultaneously several different
perspectives on the same thing. And we're gonna try to do a similar thing with data. And so we can say very quickly, thanks, Pablo. Now, why would you overlay plots? Really, if you want the technical explanation, it's because you get increased information density: you get more information, and hopefully more insight, in the same amount of space and hopefully the same amount of time. Now, there is a potential risk here. You might be saying to yourself at this point, well, you want dense? Guess what, I can do dense. And then we end up with something vaguely like this, The Garden of Earthly Delights, and it's completely overwhelming, and it just makes you kind of shut down cognitively. No, thank you, Hieronymus Bosch. Well, I like Hieronymus Bosch's work, but I have to tell you, when it comes to data graphics, use restraint. Just
because you can do something doesn't mean that you should do that thing. When it comes to graphics and overlaying plots, the general rule is this: use views that complement and support one another, that don't compete, but that give greater information in a coherent and consistent way. This is going to make a lot more sense if we just take a look at how it works in R. So open up this script, and we'll see how we can overlay plots for greater information density and greater insight. The first thing that we're going to need to do is open up the datasets package. And we're going to be using a data set we haven't used before, about lynxes, that's the animal. This is about Canadian lynx trappings from 1821 to 1934. If you want the actual information on the data set, there it is. Now let's take a look at the first few lines of data. This one is a time
series. And so what's unusual about it is this is just one line of numbers, and you have to know that it starts at 1821 and goes on from there. So let's make a default chart with a histogram, as a way of seeing: were lynx trappings consistent, or how much variability was there? We'll do hist, which is the default histogram, and we'll simply put lynx in. We don't have to specify variables, because there's only one variable in it. And when we do that, I'll zoom in on that, we get a really skewed distribution: most of the observations are down at the low end, and then it tapers off; it's actually measured in thousands. So we can tell that there is a very common value, and it's at the low end. On the other hand, we don't know what years those were. So we're ignoring that for just a moment and taking a look at the overall distribution of trappings, regardless of year.
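A short sketch of this first look at the lynx data, assuming only the built-in `lynx` time series. The calls to `start()` and `end()` are additions for illustration; they show where the series keeps the years that the plain row of numbers does not display.

```r
library(datasets)  # provides the lynx time series

head(lynx)    # first few values: just a run of numbers
start(lynx)   # the series itself records that it begins in 1821
end(lynx)     # and that it ends in 1934

# Only one variable, so there is nothing to select with a dollar sign
hist(lynx)
```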
Let me zoom back out. And we can use some options on this one to make it a little more intricate. We can do a histogram, and then in parentheses I specify the data. I can also tell it how many bins I want. And again, that's sort of a suggestion, because R is going to do what it wants anyhow. I can say make it a density instead of a frequency, so it'll give proportions of the total distribution. We'll change the color to thistle1, because you can use color names in R. We'll give it a title here. By the way, I'm using the paste command because it's a long title and I want it to show up on one line, but I need to spread my command across two lines. You can go longer; I have to use a short command line so you can actually see what we're doing when we're zoomed in here. So there's that one, and then we're going to give it a label.
This says "Number of Lynx Trapped". And now we have a more elaborate chart. I'll zoom in on it, and it's a kind of little thistle purple, lilac color. And we have divided the number of bins differently: previously it was one bar for every 1000, now it's one bar for every 500. But that's just one chart. We're here to see how we can overlay charts, and a really good one anytime you're dealing with a histogram is a normal distribution. So you want to see, are the data distributed normally? Now, we can tell they're skewed here, but let's get an idea of how far they are from normal. To do this, we use the command curve, and then dnorm is for density of the normal distribution. And then here I tell it x, that's just a generic variable name, but I tell it to use the mean of the lynx data and use the standard deviation of the lynx data. We'll make it a slightly different thistle color, number four.
We'll make it two pixels wide; the line width is two pixels. And then add says stick it on the previous graph. And so now I'll zoom in on that. And you can see, if we had a normal distribution with the same mean and standard deviation as this data, it would look like that. Obviously, that's not what we have, because we have this great big spike here on the low end. Then I can do a couple of other things. I can put in what are called kernel density estimators. Those are sort of like a bell curve, except they're not parametric; instead, they follow the distribution of the data. That means they can have a lot more curves in them, but they still add up to one, like a normal distribution. So let's see what those would look like here. We're gonna do lines; that's what we use for this one. And then we say density.
That's going to be the standard kernel density estimator, and we'll make it blue. And there it is, on top. I'm going to do one more, and then we'll zoom in. I can change a parameter of the kernel density estimator; here I'm using adjust to say average across, it's sort of like a moving average, average across a little more. And now let me zoom in on that. And you can see, for instance, the blue line follows the spike at the low end a lot more closely, and then it dips down. On the other hand, the purple line is a lot slower to change, because of the way I gave it its instructions with adjust equals three. And then I'm going to add one more thing, something called a rug plot. It's little vertical lines underneath the plot for each individual data point. And I do that with rug, and I say just use lynx.
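The overlay sequence described here can be sketched as one block. This is a reconstruction under assumptions rather than the course script: the `breaks = 14` value, the wording of the title, and the object names `d_default` and `d_smooth` are illustrative, while the colors, `adjust = 3`, and the line widths follow the narration. The rug settings are the ones given in the next couple of sentences.

```r
library(datasets)  # provides the lynx time series

# Base layer: a density-scaled histogram (bin count is only a suggestion)
hist(lynx,
     breaks = 14,
     freq   = FALSE,
     col    = "thistle1",
     main   = paste("Histogram of Annual Canadian Lynx",
                    "Trappings, 1821-1934"),
     xlab   = "Number of Lynx Trapped")

# Normal curve with the same mean and standard deviation as the data;
# add = TRUE draws it on top of the existing plot
curve(dnorm(x, mean = mean(lynx), sd = sd(lynx)),
      col = "thistle4", lwd = 2, add = TRUE)

# Kernel density estimates: default bandwidth, then a smoother one
d_default <- density(lynx)
d_smooth  <- density(lynx, adjust = 3)
lines(d_default, col = "blue",   lwd = 2)
lines(d_smooth,  col = "purple", lwd = 2)

# Rug plot: one small tick under the x axis per observation
rug(lynx, lwd = 2, col = "gray")
```

Each call after `hist()` adds to the plot that is already open, which is why the order matters: the histogram has to come first.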
And then we're gonna make it a line width, or pixel width, of two, and then we'll make it gray. And that, when I zoom in, is our final plot. You can see now that we have the individual observations marked, and you can see why each bar is as tall as it is and why the kernel density estimator follows the distribution that it does. This is our final histogram, with several different views of the same data. It's not Cubism, but it's a great way of getting a richer view of even a single variable that can then inform the subsequent analyses you do to get more meaning and more utility out of your data.

Continuing in R: An Introduction, the next thing we need to talk about is basic statistics. And we'll begin by discussing the basic summary function in R. The idea here is that once you have done the pictures, once you've done the basic visualizations, then you're going
to want to get some precision by getting numerical or statistical information. Depending on the kinds of variables you have, you're going to want different things. So for instance, you're going to want counts or frequencies for categories, and you're going to want things like quartiles and the mean for quantitative variables. We can try this in R, and you'll see that it's a very, very simple thing to do. Just open up this script and follow along. What we're going to do is load the datasets package, Control or Command and then Enter. And we're actually going to look at some data and do an analysis that we've seen several times already: we're going to load the iris data. And let's take a look at the first few lines. And again, this is four quantitative measurements, on the sepal and petal length and width, for three species of iris flowers.
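As a minimal sketch of the idea that summary gives counts for categories and quartiles plus the mean for quantitative variables, the three calls below use only the built-in iris data. They match the three kinds of summary walked through next.

```r
library(datasets)  # provides iris

head(iris)                  # first few rows of the data

summary(iris$Species)       # categorical: a count for each category
summary(iris$Sepal.Length)  # quantitative: min, quartiles, median, mean, max
summary(iris)               # whole data frame, one column at a time
```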
And what we're going to do is get a summary in three different ways. First, we're going to do a summary for a categorical variable. The way we do this is we use the summary function, and then we say iris, because that's the data set, and then the dollar sign, and then the name of the variable that we want. So in this case, it's Species. We'll run that command. And you can see it just has setosa 50, versicolor 50, and virginica 50. Those are the frequencies, or the counts, for each of those three categories in the Species variable. Now we're going to get something more elaborate for a quantitative variable. We'll use sepal length for that one, and I'll just run that next line. And now you can see it lays it out horizontally: we have the minimum value of 4.3, then we have the first quartile of 5.1,
the median, then the mean, then the third quartile, and then the maximum score of 7.9. And so this is a really nice way of getting a quick impression of the spread of scores. And also, by comparing the median and the mean, sometimes you can tell whether it's symmetrical or there's skewness going on. And then you have one more option, and that is getting a summary for the entire data frame or data set at once. What I do is I simply do summary, and then in the parentheses, for the argument, I just give the name of the data set, iris. And this one I need to zoom in on a little bit, because now it arranges it vertically. Here we have sepal length, so that's our first variable, and we get the quartiles and we get the median. And we have sepal width, petal length, petal width, and then it switches over at the last one, Species, where it gives us the counts or frequencies of
each of those three categories. So that's the most basic version of what you're able to do with the default summary function in R. It gives you quick descriptives, gives you the precision to follow up on some of the graphics that we did previously, and it gets you ready for your further analyses.

As you're starting to work with R and you're getting basic statistics, you may find you want a little more information than the base summary function gives you. In that case, you can use something called describe, and its purpose is really easy: it gets more detail. Now, this is not included in R's basic functionality. Instead, it comes from a contributed package, the psych package. And when you run describe from psych, this is what you're going to get: you'll get n, that's the sample size, the mean, the standard deviation, the median, the 10%
trimmed mean, the median absolute deviation, the minimum and maximum values, the range, skewness and kurtosis, and standard errors. Now, don't forget, you still want to do this after you do your graphical summaries: pictures first, numbers later. But let's see how this works in R. Simply open up this script, and we'll run through it step by step. When you open up R, the first thing we're going to need to do is install the package. Now, I'm actually going to go through my default installation of packages, because I'm going to use one of these, pacman, and this just makes things a little bit easier. So we're going to load all these packages. And this assumes, of course, you have pacman installed already. We're going to get the datasets package, and then we'll load our iris data. We've done that lots of times before: sepal and
petal length and width, and the species. But now we're going to do something a little different: we're going to load a package. I'm using p_load from the pacman package; that's why I loaded it already. And this will download it if you don't have it already; it might take a moment. And it downloads a few dependencies, generally other packages that need to come along with it. Now, if you want to get some help on it, you can do p_help; anytime you have p and an underscore, that's something from pacman. So, p_help psych. Now when you do that, it's going to open up a web browser and it's going to get the PDF help. I've got it open already, because it's really big. In fact, it's 367 pages here of documentation about the functions inside. Obviously, we're not going to do the whole thing here. What we are going to do is look at some of it in the R viewer. If you simply add
this argument here, web equals F for false, or you can spell out the word FALSE, as long as you do it in all caps, then it opens up here on the right. And this is actually a web browser; this is a web page we're looking at. And each of these you can click on and get information about the individual bits and pieces. Now, let's use describe, which comes from this package. It's for quantitative variables only, so you don't want to use it for categories. What we're going to do here is pick one quantitative variable right now, and that is iris and then Sepal.Length. When we run that one, here's what we get. Now I get a line here. The first number, the one, simply indicates the row number; we only have one row, so that's what we have anyhow. And it gives us the n of 150, the mean of 5.84, the standard deviation, the median, so on and so forth,
out to the standard error there at the end. Now, that's for one quantitative variable. If you want to do more than that, or especially if you want to do an entire data frame, just give the name of the data frame in describe. So here we go: describe iris. I'm going to zoom in on that one, because now we have a lot of stuff. Now it lists all the variables down the side, sepal length and so on, and it gives the variables numbers, 1, 2, 3, 4, 5. And it gives us the information for each one of them. Please note it's given us numerical information for Species, but it shouldn't be doing that, because that's a categorical variable. So you can ignore that last line; that's why it puts an asterisk right there. But otherwise, this gives you more detailed information, including things like the standard deviation and the skewness, that you might need to get a more complete picture of what you have
in your data. I use describe a lot; it's a great way to complement histograms and other charts like box plots, to give you a more precise image of your data and prepare you for your other analyses.

To finish up our section in R: An Introduction on basic statistics, let's take a short look at selecting cases. What this does is allow you to focus your analysis: choose particular cases and look at them more closely. Now in R, you can do this a couple of different ways. You can select by category, if you have the name of a category, or you can select by value on a scaled variable, or you can select by both. Let me show you how this works in R. Just open up this script and we'll take a look at how it works. As with most of our other examples, we'll begin by loading the datasets package by using library; just Ctrl or Command Enter to run that command. That's now
loaded, and we'll use the iris data set. So we'll look at the first few cases; head iris is how we do that. Zoom in on it for a second. There's the iris data; we've already seen it several times. We'll come down and make a histogram of the petal length for all of the irises in the data set. So it's iris, the name of the data set, and then Petal.Length. There's our histogram off to the right; I'll zoom in on it for a second. So you see, of course, that we've got this group stuck way at the left, and then we have a gap right here, and then we have a pretty much normal distribution for the rest of it. I'll zoom back out. We can also get some summary statistics; I'll do that right here for petal length. There we have the minimum value, the quartiles, and the mean. Now let's do one more thing, and let's get the name of the species. That's going to be
our categorical variable, and the number of cases of each species. So I do summary, and it knows that this is a categorical variable. So we run it through and we have 50 of each; that's good. The first thing we're going to do is select cases by their category, in this case by the species of iris. We'll do this three times. We'll do it once for versicolor. So I'm going to do a histogram where I say use the iris data, and then the dollar sign means use this variable, Petal.Length. And then in square brackets I put this to indicate select these rows, or select these cases. And I say select when this variable, Species, is equal to, and you've got to use the two equals signs, versicolor. Make sure you spell it and capitalize it exactly as it appears in the data. Then we'll put a title on it that says petal length, versicolor. So here we go. And there
are our selected cases. This is just 50 cases going into the histogram. Now, on the bottom right, we'll do a similar thing for virginica, where we simply change our selection criterion from versicolor to virginica, and we get a new title there. And then finally, we can do setosa also. So great, that's three different histograms by selecting values on a categorical variable, where you just type them in quotes exactly as they appear in the data. Now, another way to do this is to select by value on a quantitative or scaled variable. If you want to do that, what you do is, in the square brackets to indicate you're selecting rows, you put the variable, and I'm specifying that it's in the iris data set, and then say what value you're selecting. I'm looking for values less than two, and I have the title changed to reflect that. Now, what's interesting is this selects
the setosas. This is the exact same group, and so the diagram doesn't change, but the title and the method of selecting the cases did. Probably a more interesting one is when you want to use multiple selectors. Let's look for virginica; that will be our species. And we want short petals only. So this says what variable we're using, petal length. And this is how we select: with iris, dollar sign, Species, so that tells us which variable, is equal to, with the two equals signs, virginica. And then I just put an ampersand, and then say iris petal length is less than 5.5. Then I can run that. I get my new title, and I'll zoom in on it. And so what we have here are just virginica, but the shorter ones. And so this is a pair of selectors used simultaneously. Now, another way to do this, by the way, is if you know you're going to be using the same subsample
many times, you might as well create a new data set that has just those cases. The way you do that is you specify the data that you're selecting from, then in square brackets the rows and the columns, and then you use the assignment operator. That's the less-than and dash here, which you can read as "gets". So I'm going to create one called i.setosa, for iris setosa. And I'm going to do it by going to the iris data and, in Species, reading just setosa. I then put a comma, because this part selects the rows, and I need to tell it which columns. If you want all of them, you just leave it blank, so I'm going to do that. And now you see, up here in the top right, I'll zoom in on it, I now have a new object, a new data object, in the environment: a data frame called i.setosa. And we can look at that subsample that I've just created; we'll get the head of just those cases.
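The selection methods covered so far can be sketched together, assuming only the built-in iris data. The name `i.setosa` comes from the narration; `short_petals` and `virginica_short` are hypothetical names introduced here so the selected values can be inspected, and the title text is illustrative.

```r
library(datasets)  # provides iris

# Select by category: rows where Species is exactly "versicolor"
hist(iris$Petal.Length[iris$Species == "versicolor"],
     main = "Petal Length: Versicolor")

# Select by value on a quantitative variable
short_petals <- iris$Petal.Length[iris$Petal.Length < 2]

# Two selectors at once, joined with an ampersand
virginica_short <- iris$Petal.Length[iris$Species == "virginica" &
                                       iris$Petal.Length < 5.5]

# Save a subsample: rows go before the comma, and leaving the part
# after the comma blank keeps all of the columns
i.setosa <- iris[iris$Species == "setosa", ]
head(i.setosa)
```

Saving the subsample once avoids repeating the same square-bracket selector in every later command.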
Now you see, it looks just the same as the other ones, except it only has 50 cases, as opposed to 150. I can get a summary for those cases, and this time I'm doing just the petal length. And I can also get a histogram for the petal length, and it's going to be just the setosas. And so that's several ways of dealing with subsamples. And again, saving the selection, if you're going to be using it multiple times, allows you to drill down on the data and get a more focused picture of what's going on, and it helps inform the analyses that you carry on from this point.

The next step in R: An Introduction is to talk about accessing data. And to get that started, we need to say a little bit about data formats. The reason for that is sometimes, with your data, it's like talking about apples and oranges: you have fundamentally different kinds of things. Now,
there are two ways in particular that this can happen. The first is you can have data of different types, different data types. And then regardless of the type, you can have your data in different structures, and it's important to understand each of these. We'll start by talking about data types. This is like the level of measurement of a variable. You can have numeric variables, which usually come in integer (whole number) or single precision or double precision. You can have character variables, with text in them; we don't have string variables in R, they're all character. You can have logical, which are true/false, otherwise called Boolean. You can have complex numbers, and you can have a data type raw. But regardless of which kind you have, you can arrange them into different data structures.
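A quick way to see these data types is to ask R directly with `typeof`. The values below are arbitrary examples chosen for illustration, not taken from the course script.

```r
typeof(15)           # "double": numbers are double precision by default
typeof(15L)          # "integer": the L suffix asks for a whole number
typeof("some text")  # "character"
typeof(TRUE)         # "logical"
typeof(3 + 2i)       # "complex"
typeof(as.raw(255))  # "raw"
```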
The most common structures are vector, matrix or array, data frame, and list. We'll take a look at each of these. A vector is one or more numbers in a one-dimensional array; imagine them all in a straight line. Now, what's interesting here is that in other situations, if it's a single number, it would be called a scalar. But in R, it's still a vector; it's just a vector of length one. The important thing about vectors is that the data are all of the same data type, so for instance, all character or all integer. And you can think of this as R's basic data object; most of the things in it are variations on the vector. Going one step up from this is a matrix. A matrix has rows and columns; it's two-dimensional data. On the other hand, the columns all need to be the same length, and all the data needs to be of the same class.
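Here is a small sketch of the vector and matrix ideas above. The object names `v`, `m`, and `mixed` and the values in them are made up for illustration.

```r
v <- c(1, 2, 3, 4, 5, 6)  # a vector: one dimension, one data type
length(7)                 # a single number is still a vector, of length 1
is.vector(7)

m <- matrix(v, nrow = 2)  # the same values in 2 rows and 3 columns
dim(m)
m[, 2]                    # a column addressed by its index number

# A vector holds one type only, so mixed input is converted to one type
mixed <- c(1, "a", TRUE)
typeof(mixed)
```

The matrix is filled column by column, so the second column of `m` holds the third and fourth values of `v`.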
Interestingly, the columns are not named; they're referred to by index numbers, which can make them a little weird to work with. And then you can step up from that into an array. This is identical to a matrix, but it's for three or more dimensions. On the other hand, probably the most common form is a data frame. This is a two-dimensional collection that can have vectors of multiple types: you can have character variables in one, you can have integer variables in another, and you can have logical in a third. The trick is, they all need to be the same length. And you can think of this as the closest thing that R has that's analogous to a spreadsheet. And in fact, if you import a spreadsheet, it's going to go into a data frame, typically. Now, the neat thing is that R has special functions for working with data frames, things that you can do with those you
can't do with others. And we'll see how those work as we go through this course and through others. And then finally, there's the list. This is R's most flexible data format. You can put basically anything in a list. It's an ordered collection of elements. And you can have any class, any length, any structure. And interestingly, lists can include lists that include lists, and so on and so forth. So it gets like the Russian nesting dolls: you have one inside the other, inside the other. Now the trick is that, while that may sound very flexible and very good, it's actually kind of hard to work with lists. And so a data frame is really sort of the optimal level of complexity for a data structure. And then let me talk about something else here, the idea of coercion. Now, in the world of ethics, coercion is a bad thing. In the world of data science, coercion is good. What it
means here is that coercion is changing data objects from one type to another. It's changing the level of measurement or the nature of the variable that you're dealing with. So, for example, you can change a character to a logical, you can change a matrix to a data frame, you can change double precision to integer. You can do any of these. It's going to be easiest to see how it works if we go to R and give it a whirl. So open up this script, and let's see how it works in RStudio. For this demonstration of data types, you don't need to load any packages; we're just going to run through things all on their own. We'll start with numeric data. And what I'm going to do is create a data object, a variable called n1, my first numeric variable, and then I use the assignment operator. That's this, the little left arrow, and it's read as n1 gets 15. Now,
R does double precision by default. Let me run this n1. And then you can see that it showed up here on the top right. If I call the name of that object, it'll show its contents in the console. So I just type n1 and run that. And there you can see in the console at the bottom left, it brought up a one in square brackets. That's an index number for the first object in an array. And this is an array of one number, but there it is, and we get the value of 15. Also, we can use the R command typeof to get a confirmation of what type of variable that is. And it's double precision by default. We can also do another one where we do 1.5. We can get its contents, 1.5. And then we see that it also is double precision. Then we want to come down and do a character. I'm calling that c1 for my first character variable. You see that
I do c1, the name of the object I want to create, then I put the assignment operator, the less-than and dash, which is read as gets. And then I have it in double quotes. In other languages, you would use single quotes for a single character and double quotes for strings. They're the same thing in R. And I put in double quotes the lowercase c; that's just something I chose. So I feed that in. You can see that it showed up in the global environment there on the right. We can call it forward, and you see it shows up with the double quotes on it. We get the typeof, and it's a character. That's good. If we want to do an entire string of text, I can feed that into c2, just by having it all in the double quotes. And we pull it out. And we see that it also is listed as a character, even though in other languages it would be called a string. We can do
logical. This is l1, for logical first. And then I'm feeding in TRUE. When you write TRUE or FALSE, they have to be all caps, or you can do just the capital T or the capital F. And then I call that one out. And it says TRUE. Notice, by the way, there are no quotes around it. That's one way you can tell it's a logical and not a character. If we put quotes around it, it would be a character variable. We get the typeof; there we go, it's logical. I said you can also use abbreviations, so for my second logical variable, l2, I'll just use F. I feed that in. And now you see that when I ask it to tell me what it is, it prints out the whole word FALSE. And then we get the typeof; again, also logical. Then we can come down to data structures. I'm going to create a vector, which is a one-dimensional collection. And I'm doing it by
creating v1 for vector one. And then I use the c here, which stands for concatenate. You can also think of it as combine or collect. And I'm going to put five numbers in there. You need to use a comma between the values. And then I call out the object. There are my five numbers. Notice it shows them without the commas, but I had to have the commas going in. And then I ask R, is it a vector? That's is, period, vector, and then I ask about it. And it's just going to say TRUE; yes, it is. I can also make a vector of characters. I do that right here, I get the characters, and it's also a vector. And I can make a vector of logical values, true and false. Call that. And it's a vector also. Now a matrix, you may remember, goes in more than one dimension. In this case, I'm going to call it m1, for matrix one. And I'm using the matrix function. So I'm saying
matrix, and then combine these T and F values. And then I'm saying how many rows I want in it, and it can figure out the number of columns by doing some math. So I'm going to put that into m1. And then I'll ask for it, and see: now it displays it in rows and columns, and it writes out the full TRUE or FALSE. Now I can do another one, where I'm going to do a second matrix, and this is where I explicitly shape it in rows and columns. Now, that's for my convenience; R doesn't care that I broke it up to make the rows and columns, but it's a way of working with it. And if I want to tell it to organize it to go by rows, I can specify that with the byrow equals T, or TRUE, argument. I do that. And now I have the a, b, c, d. And you see, by the way, that I have the index numbers. On the left are the row index numbers, that's row one and row two, and on the top are the column
index numbers, and they come second, which is why it's blank and then one for the first column, and then blank and then two for the second column. Then we can make an array. What I'm going to do here is create the data, and I'm going to use the colon operator, which says, give me the numbers 1 through 24. I still have to use the concatenate to combine them. And then I give the dimensions of my array, and it goes rows, columns, and then tables, because I'm using three dimensions here. I'm going to feed that into an object called array one. And there's my array right there; you can see that I have two tables. In fact, let me zoom in on that one. And so it starts at the last level, which is the table. And then we have the rows and the columns listed separately for each of them. A data frame allows me to combine vectors
of the same length but of different types. Now, what I'm doing here is creating a vector of numeric values, of character values, and of logical values. So these are three different vectors. But then what I'm going to do is use this function cbind, for column bind, to combine them into a single data frame and call it dfa, for data frame a, or all. Now, the trick here is that we had some unintentional coercion by just using cbind. What it did is it coerced it all to the most general format. I had numeric variables and character variables and logical, and the most general is character. And so it turned everything into a character variable. That's a problem; it's not what I wanted. I have to add another function to this. I have to tell it specifically to make it a data frame by using as.data.frame.
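The unintentional coercion described here can be reproduced with a small sketch. The vector names and values below are assumptions made for illustration. Note also that this sketch calls data.frame() directly on the vectors, which is a swapped-in alternative to the as.data.frame() wrapper mentioned in the talk; it is used here because it builds the data frame without passing through a single-type matrix first.

```r
v_numeric   <- c(1, 2, 3)
v_character <- c("a", "b", "c")
v_logical   <- c(TRUE, FALSE, TRUE)

# cbind() on plain vectors builds a matrix, and a matrix holds one type only,
# so every value is coerced to character.
m <- cbind(v_numeric, v_character, v_logical)
typeof(m)

# data.frame() stores each vector as its own column and keeps its type.
# stringsAsFactors = FALSE keeps text as character on older versions of R too.
df <- data.frame(v_numeric, v_character, v_logical, stringsAsFactors = FALSE)
col_types <- sapply(df, class)
print(col_types)
```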
When I do that, I can combine it. And now you see it's maintained the data types of each of the variables. That's the way I want it. And then finally, I can do a list. I'm going to create three objects here: object one, which is numeric with three values; object two, which is character with four; and object three, which is logical with five. And then I'm going to combine them into a list using the list function, put them into list one, and then we can see the contents of list one. And you can see it's kind of a funky structure, and it can be hard to read. But all the information is there. And then we're going to do something that's kind of hard to get your head around logically, because I'm going to create a new list that has list one in it. So I have the same three objects, plus I'm adding on to it list one. So list two, I'm going to zoom in on
that one. And you can see it's a lot longer, and we've got a lot of index numbers there in the brackets. There are the three integers, the four character values, and the five logical values. And then here they are repeated, but that's because they're all parts of list one, which I included in this list. And so those are some of the different ways that you can structure data of different types. But you want to know also that we can coerce them into different types to serve our different purposes. The next thing we need to talk about is coercing types. Now there's automatic coercion; we've seen a little bit of that, where the data automatically goes to the least restrictive data type. So, for instance, if we do this, where we have a 1, which is numeric, a b in quotes, which is character, and a logical value, and we feed them all into this coerce1. And
by the way, by putting parentheses around it, it automatically saves it and shows us the response. Now you can see that what it's done is taken all of them and made all of them character, because that's the least specific, most general format. And so that'll happen, but you have to kind of watch out, because you don't want things getting coerced when you're not paying attention. On the other hand, you can coerce things specifically, if you want to have them go in a particular way. So I can take this variable right here, coerce2. I'm going to put a 5 into that. And we can get its type, and we see that it's double. Okay, that's fine. What if I want to make it integer? Then what I do is use this command, as.integer. I run that and feed it into coerce3. And it looks the same when we see the output, but now it is an integer. That's how it's represented in memory. I can also take
a character variable, and here I have 1, 2, and 3 in quotes, which makes them characters. We get those, and you can see that they're all character. But now I can feed them in with this as.numeric, and it's able to see that there are numbers in there and coerce them to numeric. Now you see that it's lost the quotes, and it goes to the default double precision. Probably the one you'll do the most often is taking a matrix and making it a data frame. Let's take a look. I'll make a matrix of nine numbers in three rows and three columns. There they are. And what we're going to do is coerce it to a data frame. Now that doesn't change the way it looks; it's going to look the same. But there are a lot of functions you can only use with data frames that you can't use with matrices. This one, by the way, we'll ask, is it a matrix? And the
answer is TRUE. But now let's do this: we'll do the same thing and just add on as.data.frame. So now we've told it to make it a data frame. And you see, it basically looks the same. It's listed a little differently. This one had its index numbers here for the rows and the columns. This one has a row index, and then we have variable names across the top. And it's just automatically named them V1, V2, and V3. But the numbers in it look exactly the same. On the other hand, if we come back here and ask, is it a data frame, we get TRUE. So it's been a very long discussion here. But the point is, data comes in different types and in different structures, and you're able to manipulate those so you can get them in the format, the type, and the arrangement that you need for doing your analyses in R.
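The coercion steps walked through above can be collected into one short sketch. The object names follow the coerce1, coerce2 pattern mentioned in the talk, but the exact names and values here are assumptions, not a copy of the course script.

```r
# Automatic coercion: mixed values fall back to the most general type.
(coerce1 <- c(1, "b", TRUE))
typeof(coerce1)                    # character

# Explicit coercion from double to integer.
coerce2 <- 5
coerce3 <- as.integer(coerce2)
typeof(coerce2)                    # double
typeof(coerce3)                    # integer

# Character digits to numeric (double precision by default).
coerce4 <- c("1", "2", "3")
coerce5 <- as.numeric(coerce4)

# Matrix to data frame: same numbers, different structure.
coerce6 <- matrix(1:9, nrow = 3)
coerce7 <- as.data.frame(coerce6)
is.matrix(coerce6)
is.data.frame(coerce7)
names(coerce7)                     # automatic column names
```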
To continue our introduction and accessing data, we want to talk about factors. And depending on the kind of work that you do, this may be a really important topic. Factors have to do with categories and names of those categories. Specifically, a factor is an attribute of a vector that specifies the possible values and their order. It's going to be a lot easier to see if we just try it in R, so let me demonstrate some of the variations. Just open up this script, and we can run through it together. What we're going to do here is create a bunch of artificial data, and then we're going to see how it works. The first one I'm going to do is create a variable x1 with the numbers one through three. And by putting it in parentheses here, it'll both be stored in the environment and displayed in the console. So there
we have three numbers, one, two, and three. I'm going to create another variable, y, that's the numbers one through nine. So there that is. Now what I want to do is combine these two, and I'm going to use the cbind, or column bind, data frame function. So it's going to put them together, and it's going to make them a data frame. And it's going to save them into a new object I'm creating called df1, for data frame one. And we'll get to see the results of that. Let me zoom in on it a little bit. And there you can see, we have nine rows of data. We have one variable, x1, that's from the one that I created, and then we have y. And then we have the nine indexes, or the row IDs, there down the side. Please note that the first one, x1, only had three values. And so what it did is it repeated it. So you see it happening three different times: 1, 2, 3, 1, 2, 3.
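The recycling described here, where the three values of x1 are repeated to fill nine rows, can be sketched as follows. It assumes the column-bind function referred to in the talk is base R's cbind.data.frame().

```r
(x1 <- 1:3)                       # parentheses store the object and print it
(y  <- 1:9)

# Bind the two vectors as columns of a data frame. The shorter vector is
# recycled because its length divides the longer one evenly.
(df1 <- cbind.data.frame(x1, y))

n_rows   <- nrow(df1)             # nine rows
first_col <- df1[[1]]             # 1 2 3 repeated three times
```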
And what we want to find out now is, what kind of variable is x1 in this data frame? Well, it's an integer. And if we get the structure, it shows that it's still an integer, if we're looking at this line right here. Okay, but we can change it to a factor by using as.factor. And it's going to react differently then. So I'm going to create a new one called x2 that, again, is just the numbers one, two, and three. But now I'm telling R those specifically represent factors. Then I'll create a new data frame using this x2 that I saved as a factor and the one through nine that we had in y. Now, at this point, it looks the same. But if we come back to where we were, and we get the typeof, it's still an integer. That's fine. But we get the structure of df2. Now it tells us that x2, instead of being an integer, is a factor with three levels. And it
gives us the three levels in quotes, one, two, and three, and then it lists the data. Now, if we want to take an existing variable and define it as a factor, we can do that too. Here, I'll create yet another variable with three values in it. And then we'll bind it to y in a data frame. And then I'm going to use this one, factor, right here. And I'm going to tell it to reclassify this variable x3 as a factor and feed it into the same place, and that these are the levels of the factor. And because I put it in parentheses, it'll show it to us in the console. There we have it. Let's get the type. It's an integer, but the structure shows it again as a factor. So that's one way we could take an existing variable and turn it into a factor. If you want to do labels, we can do it this way. We'll do x4; again, that's the one through
three. And we'll bind it to y to make a data frame. And here, I'm going to take the existing variable, in df4, and the variable is x4. I'm going to tell it the labels. And then I'm going to give them text labels. I'm going to say that they are macOS, Windows, and Linux, three operating systems. And please note, I need to put those in the same order that I want them to line up to those numbers. So one will be macOS, two will be Windows, and three will be Linux. I run that through; we can pull it up here. And now you can see how it goes through, and it changes that factor to the text values, even though I entered it numerically. I run the typeof to see what it is. It still calls it integer, even though it's showing me words. And the structure: this is an important one, so let's zoom in on that just for a second.
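Here is a minimal sketch of the labeling step just described, assuming the same x4, y, and df4 names and base R's factor() with its levels and labels arguments. The exact spelling of the labels is an assumption.

```r
x4 <- 1:3
y  <- 1:9
df4 <- cbind.data.frame(x4, y)

# Redefine the existing column as a factor. Labels are matched to levels
# by position, so the order of the two vectors has to line up.
df4$x4 <- factor(df4$x4,
                 levels = c(1, 2, 3),
                 labels = c("macOS", "Windows", "Linux"))

df4
typeof(df4$x4)       # still stored as integer codes
levels(df4$x4)       # the text labels
as.integer(df4$x4)   # the codes underneath the labels
str(df4)
```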
The structure here at the bottom says it's a factor with three levels, and it starts giving me the labels. But then it shows us that those are actually numbers one, two, and three underneath. If you're used to working with a program like SPSS, where you can have values and then add value labels on top of them, it's the same kind of concept here. Then I want to show you how we can switch the order of things. And this gets a little confusing, so try it a couple of times and see if you can follow the logic here. We'll create another variable, x5, that's just the one, two, and three, and we'll bind it to y. And there's our data frame, just like we've had in the other examples. Now what I'm going to do is take that new variable x5 in the data
frame five, df5. And notice here, I'm listing the levels, but I'm listing them in a different order. I'm changing the order that I put them in there. And then I'm lining up these labels. When I run that through, now you can see the labels here: maybe, yes, no, maybe, yes, no. It is showing us the nine values. And then this is an interesting one: because they're ordered, it puts them with the less-than sign at each point to indicate which one comes first and which one comes later. We can take a look at the actual data frame that I made. I'll zoom in on that. And you can see, we know that the first one's a one, because when I created this, it was 1, 2, 3. And so the maybe is a one, you see, because it's the second thing here in each one. So one equals maybe. But by putting it in this order, it falls in the middle of this one. There may be situations in which you want
to do that. I just want you to know that you have this flexibility in creating your factor labels in R. And finally, we can check the typeof that. And it's still an integer, because it's still coded numerically underneath, but we can get the structure and see how that works. So factors give you the opportunity to assign labels to your variables and then use them as factors in various analyses. If you do experimental research, this sort of thing becomes really important. And so this gives you an additional possibility for your analyses in R, as you define your numerical variables as factors for use in your own analyses. Our next step in R: An Introduction and accessing data is entering data. So this is where you're typing it in manually. And I like to think of this as a version of ad hoc data,
because under most circumstances, you would import a data set. But there are situations in which you need just a small amount of data right away, and you can type it in this way. Now, there are many different methods available for this. There's something called the colon operator. There's seq, which is for sequence; there's c, which is short for concatenate; there's scan; and there's rep. And I'm going to show you how each of these works. I will also mention this little one, the less-than and a dash; that is the assignment operator in R. Let's take a look at it in R, and I'll explain how all of it works. Just open up this script, and we'll give it a whirl. What we're going to do here is begin with a little discussion of the assignment operator. The less-than dash is used to assign values to a variable,
so it's called an assignment operator. Now, a lot of other programs would use an equal sign, but we use this one that's like an arrow, and you read it as gets. So x gets 5. It can go in the other direction, pointing to the right; that would be very unusual. And you can use an equal sign; R knows what you mean. But those are generally considered poor form. And that's not just arbitrary: if you look at the Google style guide for R, it's specific about that. In RStudio, you have a shortcut for this. If you do Option-dash, it inserts the assignment operator and a space. So I'll come down here right now, do Option-dash, and there you see. So that's a nice little shortcut that you can use in RStudio when you're doing your ad hoc data entry. Let's start by looking at the colon operator. And most of this you would have seen already. And what
this means is you simply stick a colon between two numbers, and it goes through them sequentially. So I'm doing x1, which is a variable that I'm creating. And then I have the assignment operator, and it gets 0 colon 10. And that means it gets the numbers 0 through 10. And there they all are. I'm going to delete my colon operator that's waiting for me to do something here. Now if we want to go in descending order, just put the higher number first. So I'll put 10 colon 0; there it goes the other way. seq, or s-e-q, is short for sequence, and it's a way of being a little more specific about what you want. If you want to, we can call up the help on sequence. It's right over here, for sequence generation. There's the information. And we can do ascending values. So seq 10 gives one through 10; it doesn't start at zero, it starts at one. But you can also specify
how much you want things to jump by. So if you want to count down in threes, you do 30 to 0 by negative three, which means step down by threes. We'll run that one. And because it's in parentheses, it'll both save it to the environment and show it in the console right away. So those are ways of doing sequential numbers, and that can be really helpful. Now if you want to enter an arbitrary collection of numbers in a different order, you can use c, which stands for concatenate. You can also think of it as combine or collect. We can call up the help on that one. There it is. And let's just take these numbers and use c to combine them into the data object x5, and we can pull it up; there you see, it just went right through. An interesting one is scan, and this is for entering data live. So we'll do scan here, get some help on that one, and you
can see it says read data values. And this one takes a little bit of explanation. I'm going to create an object, x6. And then I'm feeding into it scan with opening and closing parentheses, because I'm running that command. So here's what happens: I run that one. And then down here in the console, you see that it now has a one and a colon. And I can just start typing numbers. After each one, I hit Enter, and I can type in however many I want. And then when you're done, just hit Enter twice, and it reads them all. And if you want to see what's in there, come back up here and just call the name of that object. There are the numbers that I entered. And so there may be situations in which that makes it a lot easier to enter data, especially if you're using a ten-key. Now, rep, you can guess, is for repetition. We'll call up the help on that one: replicate elements.
And here's what we're going to do: we're going to say x7, and we're going to repeat, or replicate, TRUE, and we're going to do it five times. So x7. And then if you want to see, there are our five TRUEs, all in a row. If you want to repeat more than one value, it depends on how you set things up a little bit. Here, I'm going to do replicate, or repeat, for TRUE and FALSE. But by doing it as a set, where I'm using the c, concatenate, to collect the set, what it's going to do is repeat that set in order five times. So TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, and so on. That's fine. But if you want to do the first one five times and then the second one five times, think of it as like collating on a photocopier. If you don't want it collated, you do each. And that's going to do TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE.
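The non-interactive entry methods covered above (the colon operator, seq, c, and rep) can be gathered into one sketch. The object names loosely follow the x1, x5, x7 pattern from the talk, and the values passed to c() are invented for illustration. scan() is left out because it waits for typing in the console.

```r
x1 <- 0:10                      # ascending: 0 through 10
x2 <- 10:0                      # descending: higher number first
x3 <- seq(10)                   # 1 through 10 (starts at one, not zero)
x4 <- seq(30, 0, by = -3)       # count down from 30 to 0 in steps of three

x5 <- c(5, 4, 1, 6, 7, 2, 2, 3, 2, 8)   # arbitrary values (made up here)

x7 <- rep(TRUE, 5)                      # one value, five times
x8 <- rep(c(TRUE, FALSE), 5)            # the whole set, five times in order
x9 <- rep(c(TRUE, FALSE), each = 5)     # each element five times in turn

print(x4)
print(x8)
print(x9)
```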
And so these are various ways that you can set up data and get it in, really for an ad hoc or as-needed analysis. And it's a way of checking how functions work, as I've used in a lot of examples here. And you can explore some of these possibilities and see how you can use them in your own work. The next step in R: An Introduction and accessing data is talking about importing data, which will probably be the most common way of getting data into R. Now the goal here is to make it easy: get the data in there, get a large amount, get it in quickly, and get processing as soon as you can. Now there are a few kinds of data files you might want to import. There are CSV files; that stands for comma-separated values, and it's sort of the plain text version of a spreadsheet. Any spreadsheet program can export data as a CSV, and nearly any data program at all can read them.
Open up this script, and we'll run through the examples all the way through. But there is one thing you're going to want to do first, and that is, you're going to want to go to the course files that we downloaded at the beginning of this course. These are the individual R scripts, and this folder right here, that's significant. This is a collection of three data sets. I'm going to click on that. And they're all called mbb. And the reason they're called that is because they contain Google Trends information about searches for Mozart, Beethoven, and Bach, three major classical music composers. And it's all about the relative popularity of these three search terms over a period of several years. And I have it here in CSV, or comma-separated value, format, and as a text file, .txt, and then even as an Excel spreadsheet. Now
let's go to R, and we'll open up each one of these. The first thing we're going to need to do is make sure that you have rio. Now, I've done this before; rio is one of the things I download every time. So I'm going to use pacman and do my standard importing, or loading, of packages. So rio's available now. I do want to tell you one thing significant about Excel files, and we're going to go to the official R documentation for this. If you click on this, it'll open up your web browser. And this is a shortcut web page to the R documentation. And here's what it says; I'm actually going to read this verbatim. Reading Excel spreadsheets. The most common R data import/export question seems to be, how do I read an Excel spreadsheet. This chapter collects together advice and options given earlier. Note that most of the advice is for
pre-Excel 2007 spreadsheets and not the later .xlsx format. The first piece of advice is to avoid doing so if possible. If you have access to Excel, export the data you want from Excel in tab-delimited or comma-separated form, and use read.delim or read.csv to import it into R. You may need to use read.delim2 or read.csv2 in a locale that uses comma as the decimal point. Exporting a DIF file and reading it using read.DIF is another possibility. Okay, so really what they're saying is, don't do it. Well, let's go back to R now. It's going to say right here, you have been warned. But let's make life easy by using rio. Now if you've saved these three files to your desktop, then it's really easy to import them this way. We'll start with the CSV. rio underscore csv is the name of the object that I'm going to be using to import stuff
into. And all we need is this command, import. We don't have to specify that it's a CSV, or say that it has headers or anything; we just use import. And then in parentheses and in quotes, we put the name and location of the file. So on a Mac, it shows up this way, to your desktop. I'm going to run that. And you can see that it just showed up in my environment on the top right. I'll expand that a little bit. I now have a data frame. I'll come back out. Let's take a look at the first few rows of that data frame. I'll zoom in. And you can see we have months listed, and then the relative popularity of searches for Mozart, Beethoven, and Bach during those months. Now, if I want to read the text file, what's really nice is I can use the exact same command, import, and I just give the location and the name of the file; I have to add the .txt. But I run
that and we look at the head, and you'll see it's exactly the same. No difference. Piece of cake. What's nice about rio is I can even do the .xlsx file. Now it helps that there's only one tab in that file, and that it's set up to look exactly the same as the others. When I do that, we run through, and you see that once again, it's the same thing. rio was able to read all of these automatically. Makes life very easy. Another neat thing is that R has something called a Data Viewer. Now we'll get a little bit of information on that through help: invoke a Data Viewer. Let's do this one. We do it with a capital V for View. And then we say what it is we want to see, and we'll do rio underscore csv. When we do that command, it opens up a new tab here. And it's like a spreadsheet right here. And in
fact, it's sortable: we can click on this, go from the lowest to the highest, and vice versa. And you see that Mozart actually is setting the range here. And that's one way to do it. You can also come over to here and just click on this little icon; it looks like a calendar, but it is, in fact, the same thing. We can click on that. And now you see we get a viewer of that file as well. I'm going to close both of those. And I'm just going to show you the built-in R commands for reading files. Now, these are ones that rio uses on its own, and we don't have to go through all this. But you may encounter these in a lot of existing code, because not everybody uses rio, and I want you to see how they work. If you have a text file, and it's saved in tab-delimited format, you need the complete address. And you might try to do something like this: read.table is normally
the command. And you need to say that you have a header, that there are variable names across the top. But when you run this, you're going to get an error message. And it's, you know, it's frustrating. That's because there are missing values in there, in the top left corner. And so what we need to do is just be a little more specific about what the separator is. And so I do the same thing: I say read.table, there's the name of the file and its location, we have a header. And this is where I say the separator is a tab; the backslash t indicates a tab. So if I run that one, then it shows up; it reads it properly. We can also do CSV. The nice thing here is you don't have to specify the delimiter, because CSV means that it's comma separated, so we know what it is. And I can read that one in the exact same way. And if I want to, I can come over here.
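The built-in readers described here can be tried without the course files. The sketch below writes a tiny table to temporary files and reads it back with read.table() and read.csv(). The column names echo the mbb data, but the numbers are made up for illustration and the temporary file paths stand in for the desktop paths used in the talk.

```r
# Made-up stand-in data (not the real Google Trends values).
trends <- data.frame(Month     = c("2004-01", "2004-02", "2004-03"),
                     Mozart    = c(12, 11, 13),
                     Beethoven = c(8, 9, 8),
                     Bach      = c(15, 14, 15))

txt_path <- tempfile(fileext = ".txt")
csv_path <- tempfile(fileext = ".csv")
write.table(trends, txt_path, sep = "\t", row.names = FALSE, quote = FALSE)
write.csv(trends, csv_path, row.names = FALSE)

# Tab-delimited text: state that there is a header row and that the
# separator is a tab ("\t").
r_txt <- read.table(txt_path, header = TRUE, sep = "\t")

# CSV: the comma delimiter is implied, so only the header needs stating.
r_csv <- read.csv(csv_path, header = TRUE)

head(r_txt)
head(r_csv)
```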
And I can just click on the viewer here, and I see the data that way also. And so it's really easy to import data, especially if you use the package rio, which is able to automatically read the format and get it in properly and get you started on your analyses as soon as possible. Now, the part of R: An Introduction that maybe most of you were waiting for is modeling data. On the other hand, because this is a very short introductory course, I'm really just giving a tiny little overview of a handful of common procedures. And in other courses here at datalab.cc, we'll have much more thorough investigations of common statistical modeling and machine learning algorithms. But right now, I just want to give you a flavor of what can be done in R. And we'll start by looking at a common procedure: hierarchical clustering. Clustering is a way of finding which
cases or observations in your data belong with each other. More specifically, you can think of it as the idea of like with like: which cases are like other ones. Now, the thing is, of course, this depends on your criteria, how you measure similarity, how you measure distance, and there are a few decisions you have to make. You can do, for instance, what's called a hierarchical approach, which is what we're going to do. Or you can do it where you're trying to get a set number of groups, what's called k, the number of groups. You also have many choices for measures of distance. And you also have a choice between what's called divisive clustering, where you start with everything in one group and then split them apart, or agglomerative, which is where they all start separately and you selectively put them together. But we're going to try to make
our life simple here. So we're going to do the single most common kind of clustering: we're going to use a measure of Euclidean distance, we're going to use hierarchical clustering, so we don't have to set the number of groups in advance, and we're going to use a divisive method, where we start with them all together and gradually split them. Let me show you how this works in R. And what you'll find is that even though this may sound like a very sophisticated technique, and a lot of the mathematics is sophisticated, it's really not hard to do in reality. So what we're going to do here is use a data set that we use frequently. I'm going to load my default packages to get some of this ready, and then I'll bring in the data sets. We're going to use mtcars, which, if you recall, is Motor Trend car road test data from 1974.
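The clustering walkthrough in this part of the video can be sketched in a few lines of base R. This is a hedged reconstruction from the narration, not the course script itself: the column selection follows the description of the cars data frame, the pipe from dplyr is replaced by ordinary assignments so that nothing beyond base R is needed, and the exact colour names are assumptions. One detail worth knowing: hclust() builds the tree agglomeratively (bottom-up merges), even though the dendrogram can be read from the top down as a series of splits; a divisive routine such as diana() in the cluster package is a different algorithm.

```r
# Keep mtcars columns 1-4, 6-7 and 9-11 (this drops drat and vs).
cars <- mtcars[, c(1:4, 6:7, 9:11)]

# Distance (dissimilarity) matrix; dist() uses Euclidean distance by default.
d <- dist(cars)

# Hierarchical clustering; hclust() uses complete linkage by default.
hc <- hclust(d)

# Dendrogram of the 32 cars.
plot(hc)

# Draw boxes around the 2-, 3-, 4- and 5-cluster solutions.
rect.hclust(hc, k = 2, border = "gray")
rect.hclust(hc, k = 3, border = "blue")
rect.hclust(hc, k = 4, border = "darkgreen")
rect.hclust(hc, k = 5, border = "darkred")

# Cluster membership for one chosen number of groups.
groups <- cutree(hc, k = 3)
table(groups)
```

Because the variables are on very different scales (displacement versus number of gears, for example), the unscaled Euclidean distance is dominated by the large-valued columns; running dist(scale(cars)) instead is a common alternative and can change the grouping.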
There are 32 cars in there, and we're going to see how they group up, what cars are similar to which other ones. Now let's take a look at the first few rows of data to see what variables we have in here. You see we have mpg, cylinders, displacement, and so on and so forth. Not all of these are going to be really influential or useful variables. And so I'm going to drop a few of them and create a new data set that includes just the ones I want. If you want to see how I do that, I'm going to come back here and create a new object, a new data frame called cars. And this says it gets the data from mtcars. Putting the blank in the space here means use all of the rows. But here I'm selecting the columns: c, for concatenate, means I want columns one through four, skip five, six and seven, skip eight, and then nine through eleven. That's a way of selecting my
variables. So I'm going to do that, and you see that cars is now showing up in my environment there at the top right. Let's take a look at the head of that data set. We'll zoom in on that one. And you can see it's a little bit smaller: we have mpg, cylinders, displacement, weight, horsepower, quarter-mile seconds, and so on. Now, we're going to do the cluster analysis, and what we're going to find is that if we're using the defaults, it's super, super easy. In fact, I'm going to be using something called pipes, which is from the package dplyr, which is why I loaded it; it's this thing right here. And what it allows you to do is take the results of one step and feed them directly in as the input data to the next step. Otherwise, this would be several different steps. But I can run it really quickly. I'm going to create an object called hc, for hierarchical
clusters. We're going to read the cars data that I just created, and we're going to get the distance or the dissimilarity matrix, which says how far each observation is in Euclidean space from each of the others. And then we feed that through the hierarchical cluster routine, hclust. So that saves it into an object, and now what we need to do is plot the results. We're going to do plot(hc), my hierarchical cluster object, and then we get this very busy chart over here. But if I zoom in on it and wait a second, you can see that it's this nice little, it's called a dendrogram, because it has branches like a tree, though it looks more like roots here. You can see they all start out together, and then they split, and then they split, and they split. Now, if you know your cars from 1974, you can see that some of these things make sense. So for instance, here we have the Honda Civic and
the Toyota Corolla, which are still in production, right next to each other. The Fiat 128 and the Fiat X1-9, well, they were both small Italian cars; they were different in many ways, but you can see that they're right next to each other. The Ferrari Dino, the Lotus Europa, they make sense to put next to each other. If we come over here, the Lincoln Continental and the Cadillac Fleetwood and the Chrysler Imperial, it's no surprise that they are next to each other. What is interesting is this one here, the Maserati Bora; it's totally separate from everything else, because it was a very unusual, different kind of car at the time. Now, one really important thing to remember is that the clustering is only valid for these data points, based on the data that I gave it. I only gave it a handful of variables, and so it has to use those ones to make the
clusters. If I gave it different variables or different observations, we could end up with a very different kind of clustering. But I want to show you one more thing we can do here with these clusters to make them even easier to read. Let me zoom back out. And what we're going to do is draw some boxes around the clusters. We're going to start by drawing two boxes that have gray borders. Now I'm going to run that one, and you can see that it showed up. And then we're going to make three blue ones, four green ones, and five dark red ones. And then let me come and zoom in on this again. And now it's easier to see what the groups are in this particular data set. So we have here, for instance, the Hornet 4 Drive, the Valiant, the Mercedes-Benz 450SLC, the Dodge Challenger, and the Javelin all clumping together in one general group. And then we have these other really big
V8 American cars. What's interesting, again, is that the Maserati Bora is off by itself almost immediately. It's kind of surprising, because the Ford Pantera has a lot in common with it. But this is a way of seeing, based on the information that I gave it, how things are clustered. And if you're doing market analysis, if you're trying to find out who's in your audience, if you're trying to find out what groups of people think in similar ways, this is an approach that you're probably going to use. And you can see that it's really simple to set up, at least using the defaults in R, as a way of seeing how you have regularities and consistencies in groupings in your data. As we go through our very brief introduction to modeling data in R, another common procedure that we might want to look at briefly is called principal components. And the idea here is that
in certain situations, less is more. That is, less noise and fewer unhelpful variables in your data can translate to more meaning, and that's what we're after in any case. Now, this approach is also known as dimensionality reduction. And I like to think of it by an analogy: you look at this photo, and what you see are these big black outlines of people. You can tell basically how tall they are, what they're wearing, where they're going. And it takes a moment to realize that you're actually looking at a photograph taken straight down, and you can see the people there at the bottom, and you're looking at their shadows. And we're trying to do a similar thing. Even though these are shadows, you can still tell a lot about the people. People are three dimensional, shadows are two dimensional, but we've retained almost all the important information. If you
want to do this with data, the most common method is called principal component analysis, or PCA. And let me give you an example of the steps, metaphorically, in PCA. You begin with two variables. And so here's a scatterplot; we've got x across the bottom, y up the side, and this is just artificial data. And you can see that there's a strong linear association between these two. Well, what we're going to do is draw a regression line through the data set, and, you know, it's there at about 45 degrees. And then we're going to measure the perpendicular distance of each data point to the regression line. Now, not the vertical distance, that's what we would do if we were looking for regression residuals, but the perpendicular distance. And that's what those red lines are. Then what we're going to do is collapse the data by sliding each point
down the red line to the regression line. And that's what we have there. And then finally, we have the option of rotating it, so it's not on the diagonal anymore, but it's flat. And that there is the PC, the principal component. Now, let's recap what we've accomplished here. We went from a two-dimensional data set to a one-dimensional data set, but maintained some of the information in the data. I like to think that we've maintained most of the information, and hopefully we maintained the most important information in our data set. And the reason we're doing this is that we've made the analysis and interpretation easier and more reliable. By going from something that was more complex, two-dimensional or higher dimensions, down to something that's simpler to deal with, fewer dimensions, it's easier to make sense of in general. Let me show you how
this works in R. Open up this script, and we'll go through an example in RStudio. To do this, we'll first need to load our packages, because I'm going to use a few of these. I'll load those, and I'll load the data sets. Now I'm going to use the mtcars data set; we've seen that a lot. And I'm going to create a little subset of variables. Let's look at the entire list of variables. I don't want all of those in my particular data set. So, the same way I did with hierarchical clustering, I'm going to create a subset by dropping a few of those variables. And we'll take a look at that subset. Let's zoom in on that. So there are the first six cases in my slightly reduced data set. And we're going to use that to see what dimensions we can get to where we have fewer than the one, two, three, four, five, six, seven, eight, nine variables we have here. Let's try to get to something a little less
and see if we still maintain some of the important information in this data set. Now what we're going to do is start by computing the PCA, the principal component analysis. We'll use the entire data frame here. I'm going to feed it into an object called pc, for principal components. And there's more than one way to do this in R, but I want to use prcomp. And this specifies the data set that I'm going to use. And I'm going to use two optional arguments. One is called centering the data, which means moving them so the means of all the variables are zero. And then the second one is scaling the data, which sort of compresses or expands the range of the data so it's unit variance, or a variance of one, for each of them. That puts all of them on the same scale, and it keeps any one variable from sort of overwhelming the analysis. So let me run through that.
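As a rough sketch of the step just described (base R only; the object names cars, z and pc follow the narration where they are mentioned and are otherwise my own), centering and scaling can be checked directly with scale() before handing the data to prcomp(). Note that the scaling argument of prcomp() is spelled scale. with a trailing dot.

```r
# Same subset of mtcars as in the clustering example: columns 1-4, 6-7, 9-11.
cars <- mtcars[, c(1:4, 6:7, 9:11)]

# What centering and scaling do: every column ends up with mean 0 and sd 1.
z <- scale(cars, center = TRUE, scale = TRUE)
round(colMeans(z), 10)
apply(z, 2, sd)

# Principal component analysis on the centered, scaled data.
pc <- prcomp(cars, center = TRUE, scale. = TRUE)

# One standard deviation per component; there are nine components.
pc$sdev
```

Without scale. = TRUE, variables measured in large units (displacement, horsepower) would dominate the first component simply because their variances are numerically larger.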
And now we have a new object that showed up on the right. And if you want to, you can also specify variables by specifically including them. The tilde here means that I'm making my prediction based on all the rest of these, and I can give the variable names all the way through. And then I say what data set it's coming from: I say data equals mtcars, and I can do the centering and the scaling there also. It produces exactly the same thing; it's just two different ways of saying the same command. To examine the results, we can come down and get a summary of the object pc that I created. So I'll click on that, and then we'll zoom in on this. And here's the summary. It talks about creating nine components, PC1 for principal component one to PC9 for principal component nine. You get the same number of components that you had as original variables.
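The two ways of calling prcomp() mentioned here can be compared side by side. This is a sketch under the assumption that the nine variables are the ones kept in the cars subset (mtcars columns 1-4, 6-7, 9-11); the names pc_df and pc_formula are mine.

```r
# Data-frame interface: pass the subset directly.
cars  <- mtcars[, c(1:4, 6:7, 9:11)]
pc_df <- prcomp(cars, center = TRUE, scale. = TRUE)

# Formula interface: nothing on the left of the tilde, variables on the right.
pc_formula <- prcomp(~ mpg + cyl + disp + hp + wt + qsec + am + gear + carb,
                     data = mtcars, center = TRUE, scale. = TRUE)

# Both calls describe the same analysis.
all.equal(pc_df$sdev, pc_formula$sdev)

# Standard deviation, proportion of variance and cumulative proportion
# for each of the nine components.
pc_summary <- summary(pc_df)
pc_summary
```

In the formula form there is no outcome variable, so the tilde is better read as "use these variables" than as a prediction in the regression sense.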
But the question is whether it divvies up the variation separately. Now, you can take a look here: principal component one has a standard deviation of 2.3391. What that means is, if each variable began with a standard deviation of one, this one has as much as 2.4 of the original variables. The second one has 1.59, and the others have less than one unit of standard deviation, which means they're probably not very important in the analysis. We can get a scree plot for the number of components and get an idea of how much of the original variance each one of them explains. And we see right here, I'll zoom in on that, that our first component seems to be really big and important. Our second one is smaller, but it still seems to be, you know, above zero, and then we kind of grind on down to that one. Now, there are several different criteria for choosing
how many components are important and what you want to do with them. Right now, we're just eyeballing it, and we see that number one is really big and number two is sort of a minor axis in our data. If you want to, you can get the standard deviations and something called the rotation. Here I'm just going to call pc, and then we'll zoom in on that in the console. I'll scroll back up a little bit. And it's a lot of numbers. The standard deviations here are the same as what we got from this first row right here, so that just repeats it. The first one's really big, the second one's smaller. And then what this right here does, what the rotation is, it says what's the association between each of the individual variables and the nine different components. So you can read these like correlations.
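Here is a small sketch of how the standard deviations, the scree plot and the rotation relate to one another, again assuming the cars subset used earlier. The rotation entries are eigenvector weights rather than correlations in the strict sense; for centered and scaled data, multiplying each column of the rotation by that component's standard deviation gives the actual variable-component correlations, which is why reading them "like correlations" works as a loose guide.

```r
cars <- mtcars[, c(1:4, 6:7, 9:11)]
pc   <- prcomp(cars, center = TRUE, scale. = TRUE)

# Variance of each component and its share of the total.
# With nine standardized variables the variances add up to 9.
variances  <- pc$sdev^2
proportion <- variances / sum(variances)
round(proportion, 3)

# Scree plot: component variances in decreasing order.
screeplot(pc, type = "lines")

# Rotation (loadings) for the first two components.
round(pc$rotation[, 1:2], 2)

# Actual correlations between the variables and the components ...
var_pc_cor <- cor(cars, pc$x)

# ... equal the loadings rescaled by each component's standard deviation.
rescaled <- sweep(pc$rotation, 2, pc$sdev, "*")
all.equal(var_pc_cor, rescaled, check.attributes = FALSE)
```

A common rule of thumb (one of the several criteria alluded to above) is to keep components whose variance is greater than one, that is, components that carry more than one original variable's worth of variance.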
I'm going to come back. And let's see how individual cases load on the PCs. The way I do that is I use predict, running it on pc, and then I feed those results through using the pipe, and I round them off so they're a little more readable. I'll zoom in on that. And here we've got nine components listed, and we've got all of our cars. But the first two are probably the ones that are most important. So we have here PC1 and PC2; you see we've got a giant value there, 2.49273354, and so on. But probably the easiest way to deal with all this is to make a plot. And what we're going to do is something with the funny name of biplot. What that means is a two-dimensional plot; really, all it says is that it's going to chart the first two components. But that's good, because based on
our analysis, it's really only the first two that seem to matter anyhow. So let's do the biplot, which is a very busy chart. But if we zoom in on it, we might be able to see a little better what's going on here. And what we have is the first principal component across the bottom, and the second one up the side. And then the red lines indicate approximately the direction of each individual variable's contribution to these. And then for each case, we show its name about where it would go. Now, if you remember from the hierarchical clustering, the Maserati Bora was really unusual, and you can see it's up there all by itself. And then really, what we seem to have here is displacement and weight and cylinders and horsepower; this appears to be big, heavy cars going in this direction. Then we have the Honda Civic, the Porsche 914, Lotus
Europa; these are small cars with smaller engines, more efficient. These are fast cars up here, and these are slow cars down here. And so it's pretty easy to see what's going on with each of these in terms of clustering the variables. With hierarchical clustering, we clustered cases; now we're looking at clusters of variables. And we see that it might work to talk about big versus small and slow versus fast as the important dimensions in our data, as a way of getting insight into what's happening and directing us in our subsequent analyses. Let's finish our very short introduction to modeling data in R with a brief discussion of regression, probably one of the most common and powerful methods for analyzing data. I like to think of it as the analytical version of E pluribus unum, that is, out of many, one; or, in the data science sense, out of many variables,
one variable; or, if you want to put it one more way, out of many scores, one score. The idea with regression is that you use many different variables simultaneously to predict scores on one particular outcome variable. And there's so much going on here, I like to think that there's something for everyone. There are many versions and many adaptations of regression that really make it flexible and powerful for almost no matter what you're trying to do. We'll take a look at some of these in R. So let's try it in R and just open up this script, and let's see how you can adapt regression to a number of different tasks and use different versions of it. When we come here to our script, we're going to scroll down here a little bit and install some packages. We're going to be using several packages in this one. I'll load those ones as well as the datasets
package, because we're going to use a data set from that called USJudgeRatings. Let's get some information on it. It is lawyers' ratings of state judges in the US Superior Court. And let's take a look at the first few cases with head. I'll zoom in on that. And what we have here are six judges listed by name, and we have scores on a number of different variables, like diligence and demeanor. And it finishes with whether they're worthy of retention; that's RTEN, retention. Let's scroll back out. And what we might want to do is use all these different judgments to predict whether lawyers think that these judges should be retained on the bench. Now, we're going to use a couple of shortcuts that can actually make working with regression situations kind of nice. First, we're going to take our data set, and we're going to feed it into an object called data. So that
shows up now in our environment on the top right. And then we're going to define variable groups. You don't have to do this, but it makes the code really, really easy to use. Plus, you'll find that if you do this, then you can actually just use the same code without having to redo it every time you do an analysis. So what we're going to do is create an object called x. It's actually going to be a matrix, and it's going to consist of all of our predictor variables simultaneously. And the way I'm going to do this is I'm going to use as.matrix, and then I'm going to say read data, which is what we defined right here, and read all of the columns except number 12. That's the one called retention; that's our outcome. So the minus means don't include that, but do all the others. So I do that, and now I have an object called x. And then the second one,
I say go to data, and then this blank means use all of the rows, but only read the 12th column; that's the one that has retention, our outcome. So, following standard methods, x, those are all our predictor variables, and y, that's our single outcome variable. Now, the easiest version of regression is called simultaneous entry: you use all of the x variables at once, throw them in one big equation to try to predict your single outcome. And in R we use lm, which is for linear model. And what we have here is y, that's our outcome variable, and then the tilde, which means "is predicted by" or "as a function of", and then x, which is all of our variables together being used as predictors. So this is the simplest possible version, and we'll save it into an object called reg1, for regression one. And now, if you want to be a little more explicit, you can give the individual variables; you can say that
RTEN, retention, is a function of, or is predicted by, all of these other variables. And then I say that they come from the data set USJudgeRatings; that way we don't have to do the data, and then dollar sign, for each of these. That'll give me the exact same thing, so I don't need to do that one explicitly. If you want to see the results, we just call on the object that we created from the linear model. And I'm going to zoom in on that. And what we have are the coefficients. This is the intercept; it starts with minus two. And then for each step up on this one, it's 0.01, 0.36, and so on and so forth. You'll see, by the way, that it's changed the name of each of the variables to add the x, because they're in the object x now; that's fine. We can do inferential tests on these individual coefficients by asking for a summary. We click on that, and we'll zoom in.
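A compact sketch of the simultaneous-entry regression described above, using only base R and the built-in USJudgeRatings data. The names data, x, y and reg1 follow the narration; reg2 and reg_summary are names I have added for the explicit-formula version and the stored summary.

```r
# Lawyers' ratings of state judges; column 12 (RTEN) is the outcome.
data <- USJudgeRatings

# Variable groups: all predictors as a matrix, the outcome as a vector.
x <- as.matrix(data[, -12])
y <- data[, 12]

# Simultaneous entry: every predictor in one equation.
reg1 <- lm(y ~ x)

# The same model with the variables named explicitly.
reg2 <- lm(RTEN ~ CONT + INTG + DMNR + DILG + CFMG + DECI +
             PREP + FAMI + ORAL + WRIT + PHYS,
           data = USJudgeRatings)

# Coefficients agree; only the labels differ (xCONT versus CONT, and so on).
all.equal(unname(coef(reg1)), unname(coef(reg2)))

# Coefficients, standard errors, t tests, p values and R-squared.
reg_summary <- summary(reg1)
reg_summary
```

Defining x and y once means the later calls (summary, confidence intervals, residuals) do not need to change when you swap in a different data set; only the two lines that build x and y do.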
And now you can see, there's the value that we had previously, but now there's a standard error. And then this is the t test. And then over here is the probability value. And the asterisks indicate values that are below the standard probability cutoff of .05. Now, we expect the intercept to be below that. I see, for instance, this one, integrity, has a lot to do with people's judgment of whether a person should be retained. And this one, physical: really, are they sick? And we have some others that are kind of on their way. And this is a nice one overall. And if you come down here, you can see the multiple R-squared. It's super high. And what it means is that these variables collectively predict very, very well whether the lawyers felt that the judge should be retained. Let's go back now to our script. You can get some more summary data here,
if you want. We can get the analysis of variance table, the ANOVA table; if we click on that and zoom in there, you can see that we have our residuals and the y. Come back out, and we do the coefficients. Here are the regression coefficients; we saw those previously, and this is just a different way of getting at the same information. We can get confidence intervals. Let's zoom in on that. And now we have a 95% confidence interval, so the two and a half percent on the low end and the 97.5 on the top end, in terms of what each of the coefficients would be. We can get the residuals on a case-by-case basis; let's do this one. And when we zoom in on that, now, this is a little hard to read in and of itself, because they're just numbers. But an easier way to deal with that is to get a histogram of the residuals from the model.
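The follow-up calls mentioned in this passage can be sketched as below. This assumes the same x, y and reg1 objects as in the regression narration and rebuilds them so the block runs on its own; ci and res are names I have added.

```r
# Rebuild the model so this block is self-contained.
data <- USJudgeRatings
x    <- as.matrix(data[, -12])
y    <- data[, 12]
reg1 <- lm(y ~ x)

# Analysis of variance table for the model.
anova(reg1)

# Regression coefficients (same values as in the printed model).
coef(reg1)

# Confidence intervals; 95% by default, with columns "2.5 %" and "97.5 %".
ci <- confint(reg1)
ci

# Residuals, one per judge, and their histogram.
res <- resid(reg1)
hist(res)
```

For a model with an intercept the residuals always average to zero, so what the histogram adds is the shape: skew and any unusually large misses.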
We just run this command, and then I'll zoom in on this. And you can see that it's a little bit skewed, mostly around zero; we've got one person out on the high end, but mostly these are pretty good predictions. Come back out. Now I want to show you something a little more complicated. We're going to do different kinds of regression. I'm going to use two additional libraries for this: one is called lars, which stands for least angle regression, and the other is caret, which stands for classification and regression training. We'll do that by loading those two. And then we're going to do a conventional stepwise regression, which a lot of people say there are problems with, but I'm just going to show it; I'm going to do it really fast. There's our stepwise regression. Then we're going to do something from lars called stagewise; it's similar to stepwise,
but it has better generalizability. We run that through. We can also do least angle regression. And then really, one of my favorites is the lasso; that's the least absolute shrinkage and selection operator. Now, I'm running through just the absolute bare minimum versions of these; there's a lot more that we would want to do to explore these. But what I'm going to do is compare the predictive ability of each of them. And I'm going to feed it into an object called r2comp, for comparison of the R-squared values. And here I specify where it is in each of them; I have to give a little index number. Then we're going to round off the values. And I'm going to give them names: say the first one is stepwise, and forward, then lar, then lasso. And we can see the values. And what this shows us here at the bottom is that all of them were able
to predict it super well. But we knew that, because when we did just the standard simultaneous entry, there was amazingly high predictive ability within this data set. But you will find situations in which each of these can vary a little bit; maybe sometimes they vary a lot. But the point here is that there are many different ways of doing regression, and R makes those available for whatever you want to do. So explore your possibilities and see what seems to fit. In other courses, we will talk much more about what each of these means, how they can be applied, and how they can be interpreted. But right now, I simply want you to note that these exist, and they can be done, at least in theory, in a very simple way in R. And so that brings us to the end of R: An Introduction. And I want to make a brief conclusion, primarily to give you some next steps,
other things that you can do as you learn to work more with R. Now, we have a lot of resources available here. Number one, we have additional courses on R at datalab.cc, and I encourage you to explore each of them. If you like R, you might also like working with Python, another very popular language for working in data science, which has the advantage of also being a general-purpose programming language. The things that we do in R, we can do almost all the same things in Python. And it's nice to do a compare and contrast between the two with the courses we have at datalab.cc. I'd also recommend you spend some time simply on the concepts and practice of data visualization. R has fabulous packages for data visualization, but understanding what you're trying to get and designing quality ones is sort of a separate issue. And so I encourage you to get
the design training from our other courses on visualization. And then finally, a major topic is machine learning, or methods for processing large amounts of data and getting predictions from one set of data that can be applied usefully to others. We do that for both R and Python and other mechanisms here at datalab. Take a look at all of them and see how well you think you can use them in your own work. Now, another thing that you can do is try looking at the annual R user conference, which is useR!, with a capital R and an exclamation point. There are also local R user groups, or RUGs. And I have to say, unfortunately, there is not yet an official R Day. But if you think about September 19, it's International Talk Like a Pirate Day. And we like to think, as pirates say, "Arr", and so that can be our unofficial day for celebrating the statistical programming
language R. In any case, I'd like to thank you for joining me for this, and I wish you happy computing.