Tuesday, May 24, 2022

What is Spark? RDDs, DataFrames, Spark vs Hadoop, Spark Architecture, and Lifecycle with a Simple Example

Welcome to Tech Primers. In this video we are going to see a little bit about Apache Spark: what Apache Spark is, what it does, how it differs from Apache Hadoop, how Spark works internally, what the lifecycle of a Spark application looks like, and the different terms involved in manipulating data using Apache Spark. Finally, we are going to see some live examples using Apache Spark. I have downloaded Apache Spark, and there is documentation on the Apache Spark website, so we are going to use that documentation to literally bring up a Spark shell and try out the Spark computing engine. Before going into what Apache Spark is: if you are familiar with Apache Hadoop, both are tools for querying big data, that is, tools for processing and transforming big data, mining it, and getting meaningful information out of the huge amount of data you have.

Previously Hadoop was the big thing, but Hadoop has performance issues, and that is what Spark overcomes. So what is Spark? Spark is an open-source cluster computing framework which was created by a group of people to overcome the problems in Hadoop. What Spark does internally is real-time data processing on huge amounts of data, which is where Hadoop lags. If you know Hadoop, Hadoop can do batch processing only; Apache Spark can do real-time data processing as well as batch processing at the same time, so it can be used for real-time data processing and for batch processing as well.

Initially, Spark was developed by people from the University of California, Berkeley, in a lab called AMPLab (the Algorithms, Machines and People Lab). Later on they moved it to the Apache Foundation, which is why it is now called Apache Spark; initially it was just called Spark. Now there is a very active community, with lots of people contributing to the project, and it is growing day by day; it is currently one of the most highly valued frameworks in the big data world.

So why Spark, and why not Hadoop? As I just discussed, MapReduce solves only the batch processing problem, whereas Spark does real-time data processing too. The other thing to mention is that Spark is almost 10 times faster than Hadoop, or at least that's what they claim. The same way, as I mentioned, Spark is a general-purpose computing system which does both real-time and batch processing, while Hadoop does only batch processing, so that is a major advantage of Spark. Also, the number of lines of code you write in Spark is less compared to what you wrote in Hadoop. Hadoop, if you know, was written in Java, but Spark is written in Scala.

Scala is another language built on the JVM around functional programming ideas, so Scala code is all functions. Java is now moving to a functional style as well, but Spark got there earlier. So, Apache Spark is written in Scala.

So what is Apache Spark composed of? First there is the Spark Core library, which controls everything; it is like the heart of Spark. On top of that we have Spark SQL, which is a SQL interface: using it you can query Spark with SQL-like syntax. The same way, the next one is Spark Streaming, with which you can work with streaming data in Spark; you basically push the data in and then get it back.

The next one is the machine learning library, MLlib; there are machine learning libraries built into Spark, so you can use them to do machine-learning-related operations. Finally there is GraphX: if you want graph-related storage, storing and retrieving data in the form of a graph representation, you can use that as well. So these are the different core components inside Spark.

And finally, how do we get the data? We get the data using DataFrames. DataFrames abstract all of these out, so using DataFrames we get the data from Spark. That is, at a high level, how Spark is built; that is what Spark's internals look like.

Now that we saw what is in Spark, what are the different components inside the Spark architecture, and how does Spark maintain resiliency and handle failures and things like that? Spark has something called a driver. The driver is like a master: it is the one which commands everyone, saying do this and do that. Inside the driver we have something called the SparkContext. This is similar to the application context if you have used Spring; it is going to hold all the state we store.

The SparkContext controls the worker nodes: inside the worker nodes we have executors, and these executors execute tasks. So there is a driver which acts like a master, and the driver instructs the workers to do tasks; that is all done through the SparkContext, which instructs the workers to execute tasks on each node. If a particular task fails, it gets rebuilt: the SparkContext rebuilds it and sends it to a worker again. That is how fault tolerance is handled; we are going to see what concepts and techniques they use for that.
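As a loose illustration of that driver-and-workers idea, here is a plain-Python sketch with a thread pool (not Spark's actual scheduler; the task function and the simulated failure are made up): the "driver" hands tasks to workers and simply resubmits a task when a worker fails.

```python
from concurrent.futures import ThreadPoolExecutor

def run_task(x, attempts=0):
    # Simulate a flaky worker: the task for x == 3 fails on its first attempt
    if x == 3 and attempts == 0:
        raise RuntimeError("worker lost")
    return x * x

results = {}
with ThreadPoolExecutor(max_workers=2) as pool:   # the "worker" threads
    for x in [1, 2, 3, 4]:                        # the "driver" schedules tasks
        try:
            results[x] = pool.submit(run_task, x).result()
        except RuntimeError:
            # Fault tolerance: the driver resubmits the failed task
            results[x] = pool.submit(run_task, x, attempts=1).result()

print(sorted(results.items()))  # [(1, 1), (2, 4), (3, 9), (4, 16)]
```

All four results arrive even though one task failed once, which is the essence of what the SparkContext does when it resends a failed task to a worker.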

So there are different abstractions in Spark, so that you don't have to know what is happening internally, which components do what, and how it is all done. We are going to see the different abstractions. The first one is the Resilient Distributed Dataset, called an RDD. This is the core component of Spark: whenever you are operating on any data, you will use an RDD. RDDs are nothing but datasets which can be reconstructed on the nodes; they are the data generated whenever you want to do something with your data. If there is a failure, this data gets reconstructed; the RDD is rebuilt. RDDs are immutable data, and you can then do transformations on them; let's cover that in depth in the coming slides. Let's move to the next abstraction: DataFrames and Datasets.

We heard about these in the previous slides: DataFrames are nothing but an abstraction which covers up Spark Core, Spark SQL, Streaming, and the machine learning library which Spark has; DataFrames are another level of abstraction provided by Spark. There is also something called DStreams: these belong to Spark's streaming API, another API which we use for stream processing. Now let's see what an RDD really is. As I said, an RDD is a Resilient Distributed Dataset, but what is it used for? RDDs are used for transforming data.

They are used to generate datasets on which you can transform data. If you see the definition, the transformations are tracked as a directed acyclic graph: an RDD uses the concept of a DAG to transform objects, or rather to transform datasets. The DAG can be recomputed on failure: whenever there is a failure in the worker, in the executor thread, or in the node, these DAGs are recomputed, basically recreated. And finally, there are transformations happening on this particular RDD.
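A rough analogy in plain Python (a conceptual sketch, not Spark's implementation): think of an RDD as remembering its lineage, the recipe of transformations that produced it, rather than only the materialized data, so a lost result can be recomputed by replaying that recipe against the source.

```python
# A toy "lineage" model: instead of storing results, record the chain of
# transformations so any result can be recomputed from the source on failure.
source = [1, 2, 3, 4, 5]
lineage = [
    lambda data: [x * 10 for x in data],       # step 1: a map
    lambda data: [x for x in data if x > 20],  # step 2: a filter
]

def recompute(source, lineage):
    """Replay the recorded transformations against the source data."""
    data = source
    for step in lineage:
        data = step(data)
    return data

print(recompute(source, lineage))  # [30, 40, 50]
```

If a node holding the filtered result dies, re-running `recompute` rebuilds it; nothing about the source or the recipe changed, which is why immutability makes this safe.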

So what are the different types of transformations? There is something called map, there is something called filter, there is flatMap, there is textFile, and so on. You would have heard all of these with Java streams: these are just transformation techniques to transform data from one type to another, general concepts of transformation, which is why you see the same names, map, filter, flatMap, and textFile. So that is the RDD: the RDD is going to be the key for us to transform data using Apache Spark, and internally we can use map, filter, flatMap, and the like. And as I said earlier, RDDs are immutable: once you create an RDD, you cannot change it.
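To make those names concrete, here is a plain-Python illustration of the three most common ones (just the concepts, not Spark's API; the sample lines are made up):

```python
lines = ["hello spark", "hello world", "spark streaming"]

# map: transform each element, one output per input
lengths = list(map(len, lines))

# filter: keep only the elements matching a predicate
spark_lines = [l for l in lines if "spark" in l]

# flatMap: map each element to several outputs, then flatten the result
words = [w for l in lines for w in l.split()]

print(lengths)      # [11, 11, 15]
print(spark_lines)  # ['hello spark', 'spark streaming']
print(words)        # ['hello', 'spark', 'hello', 'world', 'spark', 'streaming']
```

The Spark versions behave the same way element-wise; the difference is that Spark runs them distributed across the cluster and lazily.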

Let's say you are querying data in the cluster: when you create an RDD, it gets created and it is immutable, so when the data on the cluster gets updated, the RDD has to be reconstructed, or recomputed, rather than modified. That is what immutability means here: an RDD just creates a snapshot of the data at that instant, a dataset from that particular point in time. Now let's move to the lifecycle in Spark. What does a typical Spark lifecycle look like? How do I put data in, and how do I get data out? The initial part is loading the data, and that is where the different data sources come into the picture.

You can load streaming data, you can load data from a database, any database like Cassandra or a relational database, or even from the HDFS file system, or from an Amazon S3 system. All of these act as data sources for Spark, so it can accept any type of data source: whether you have a file system, a relational database, a key-value store, or anything else, if you have the data stored somewhere it can be loaded through the data sources.

The next step is the transformation: once the data is loaded, we need to transform it, and transformation is what plays the key role in reshaping that data. For the transformation we use map, filter, and the others you have just seen. Once the transformation is complete, we have to perform an action on the transformed data; that is what an action means. Let's say you are going to filter out some data: you have a collection, say of letters, and you want to keep only "a" or "e". Doing that operation is a transformation, and then finally you say, okay, I am done, just collect this, or group by, or reduce it; that is an action.

Finally, whatever data we have got can be pushed to a UI dashboard for real-time analysis, or it can be persisted for future use in the form of processed data, because these are results processed at that particular instant. So this is how a typical Spark lifecycle looks. If I convert this into points: load the data into the cluster and create an RDD; once you have created the RDD, do a transformation; once you have done the transformation, perform an action; once the action is done, you can create DataFrames out of it; and using the DataFrames you can query the data, or even run SQL on them. So that's it about what Spark is; now let's go ahead and try Spark. I have installed Spark already; it's not really an installation process, it's just a binary tarball. I downloaded Spark and unzipped it in my Downloads folder.
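The points above can be sketched in plain Python as well (a conceptual analogue of the lifecycle, not real Spark code; the sample lines are invented):

```python
from collections import Counter

# 1. Load: in real Spark this would come from HDFS, S3, a database, etc.
lines = ["spark is fast", "spark does streaming", "hadoop does batch"]

# 2. Transform: split lines into words (flatMap-style); lazy in real Spark
words = [w for line in lines for w in line.split()]

# 3. Action: aggregate the transformed data into a concrete result
counts = Counter(words)

print(counts["spark"], counts["does"])  # 2 2
```

The key point the lifecycle makes is the same here: nothing useful exists until the action step turns the transformed data into a result you can report or persist.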

This is the current version of Spark that is out there: Spark 2.1, built for Hadoop 2.7. I am not running any cluster; it's just a plain laptop, a MacBook Pro. So let's try what it does. I'm in the bin folder. If you look at the structure of the package, similar to a Hadoop package, you have jars, you have licenses, you have yarn, you have sbin, you have bin, the different folders which come pre-packaged for us, and there are some examples. We are going to follow the examples from the documentation which Apache has provided, and first go ahead and start Apache Spark.

In order to do that, we need to run a command called spark-shell, which I'm going to run from the bin folder. I need to provide the master where it needs to run, which is the local machine, and I am going to say that I need only one thread, since I don't want to overload my machine; it's a very old machine. So I'm just passing local[1]. Let this start up; meanwhile, let's go to the documentation.

This is the Spark website, spark.apache.org, and I am in the latest documentation under docs/latest. The diagram I drew earlier is based on what is presented here. It says you can also run Spark interactively through a modified version of the Scala shell, and that this is a great way to learn the framework. The --master option specifies the master URL for a distributed cluster, or local to run locally with one thread. That is how you run it: I have said that my local machine is the master and that I have only one thread, which is what I have done here. This instance is going to be a Scala shell; I started it as the Scala shell, but if you are an expert in Python you can use the Python shell as well. Okay, you can see here the process has come up.

This is the log of Spark; it throws some warnings, like failing to create a global database, but that's fine. It says that a web UI is available for Spark at a particular URL, and that there is something called a SparkContext available as sc. This is what we saw in the diagram: if you recollect, we said there will be a SparkContext inside the driver, and it will be managing the different executors. That is nothing but this SparkContext, and you can access it using sc.

If you see here, this is the Scala interactive shell: the spark-shell command internally gives us the Scala command-line interface. This basically comes from Scala; since Spark is written in Scala, they have integrated this shell into the Spark ecosystem so that we can do operations on Spark directly here, without even having to write a program. There is also something called a Spark session, available as spark, so you can use spark for session-based operations. Meanwhile, let's go to the web UI URL; I haven't checked what this URL shows. Okay, it's showing an event timeline of what gets executed.

It looks like this page should show the jobs that get executed; right now it's empty, and I'm not sure what everything here means. Maybe once we run some example programs, these jobs should show some event timing data. So what I have done is gone through the documentation.

I am going to the Quick Start guide, which gives some examples of what I can do. For example, they use the SparkContext, which is sc: using Spark, how can we do some operations on a file? First, as you know, we have to load the data and then create an RDD; that is what they do here. There is a file called README.md. I have opened one more tab in the same folder, where I had already created a file called readme.md with the content "hello YouTube". But there is also the README.md in the previous folder, which has more content, so let's read that one instead and see whether it works. What I have done is typed into the command line: val textFile = sc.textFile("README.md"). This is Scala syntax.

If you are not aware of Scala, that's okay; even I don't know Scala well. It's similar to Java but slightly different. For our case we just want to understand Spark, and to really understand Spark you do need to understand Scala, so I would suggest you go ahead and learn Scala, which I am also going to do. But before that, to understand what Spark is and how you can do things in Spark, some basic Scala code should be okay. The code is readable: we are using the SparkContext, and with its textFile method we are reading README.md and storing it in something called textFile. If you notice the log that came up, Spark has created an RDD out of README.md, a MapPartitionsRDD. So this textFile object is created now, and we can use it.

Next, remember the lifecycle we discussed: first we have to get an RDD, then you transform the data, then you do some action on the data. I am not going to do any transformation yet; I'll just ask for the count of the text file, which is going to return the size of textFile. If you see here, it says Long = 104. This textFile is nothing but the file in the previous directory (the spaces in my path were creating issues), and count returns the number of lines in that file. If you look at the file, it is huge; it is the README file shipped with Spark itself. That is what our action has resulted in: we just did an action, and it returned a result. The textFile RDD is still there, and it is immutable: it is not going to change even if you change the content of the file, unless we

Recreate the RDD. Okay, now let's do some transformations on this data. There are samples already given here in the documentation on the Apache Spark website. They say textFile.count() returns the number of items in the RDD, which may be different for you since the README will change over time, similar to the other outputs. That is what we did; we just saw the count of the file. Next, we can get the first line in the file; you can do that directly in Scala using textFile.first, so that is what we are going to do now. This should return the first line of the file. "Apache Spark" was the first line, right? If you look at the file, yes, "# Apache Spark" stands as the first line, and that is what we got here, as a String; that is the result. So what we have done here is used Spark to read the file, and then finally

We ran an action and got the output. Now let's do a slightly more complicated task, an actual transformation. We have a textFile RDD, and we are going to use filter to identify all the lines which have "Spark" in them. Let me type this: I am going to create an object, textFileLines, from the same textFile, and I am saying filter, filter every line where line.contains("Spark"). Wherever there is "Spark", the line gets kept. If you see, an RDD is created for this as well: there was already an RDD for textFile, and it is immutable; the same way, this is now a new RDD derived over it. Now if we just say textFileLines, it only returns the RDD, because we haven't run any action on it, correct? We created only the RDD, but we didn't do any action on it, so we need to perform an action.

Only then will we get the result. If you notice, count and first are actions, so we can run an action over this. Count is a type of action, so let's do that; it should return the number of lines which contain "Spark". If you see, the result is 20, so the number of lines inside textFileLines is 20. That is an action.

So now you know what an RDD is: an RDD is the immutable object which gets created from the data we extract; that is why it is called a Resilient Distributed Dataset, and you can create multiple RDDs from a single RDD. That is what we did here: from the textFile RDD we created the textFileLines RDD, and over that we ran an action to get a result, for example a count, or some grouping. Using this data you can either do reporting, or store it back into some other data source.

And that covers the whole lifecycle: first we loaded the file, then we had an RDD, using the RDD we transformed some data, and finally we ran an action and saw the data which got filtered. So that's the whole lifecycle. On their website there are other examples, including some complex operations using map and such, but you get the concept. This is how Spark works, and this is nothing but the interactive console of Spark; you can even load Java programs into it.
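For reference, the shell session above can be mirrored in a plain-Python sketch (reading an ordinary local file instead of going through a SparkContext; the file contents here are made up for illustration, not the real Spark README):

```python
import os, tempfile

# Stand-in for sc.textFile("README.md"): read a file into a list of lines.
path = os.path.join(tempfile.mkdtemp(), "README.md")
with open(path, "w") as f:
    f.write("# Apache Spark\nSpark is a fast engine\nIt supports batch jobs\n")

with open(path) as f:
    text_file = f.read().splitlines()

print(len(text_file))  # like textFile.count(): 3 lines in this toy file
print(text_file[0])    # like textFile.first(): '# Apache Spark'

# like textFile.filter(line => line.contains("Spark")) followed by count()
lines_with_spark = [l for l in text_file if "Spark" in l]
print(len(lines_with_spark))  # 2
```

The shape is the same as the Scala session: load, optionally transform, then run an action to get a concrete number or string back.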

But this is just the interactive shell which I am showing right now, so I will not cover the Java part here; I will cover that in the next video. This is what Spark is all about, and this is why Spark is great: you saw how little time it took for me to write a Spark program. It was very simple; I just started the Spark console and started writing RDDs. Pretty cool. Okay, that's it for this particular video. I hope you understood what Apache Spark is, its internals, what it is doing, and how it is doing it. See you in the next video; until then, thank you.

