Saturday, May 21, 2022

E00: Software Drag Racing: C++ vs C# vs Python – Which Will Win?

I would write it in all three languages as best i could and then drag race them in a no holds barred no quarter ass none given battle royale of prime number generating madness sunday sunday sunday you better be there and if you're not there you better be dead or in jail and if you're in jail break out .

Hey i'm dave welcome to my shop i'm dave plummer a retired operating systems engineer from microsoft going back to the ms-dos in windows 95 days and today i had what i think is a cool idea write the same primes program in python c sharp and c .

Plus then show you exactly how the code differs between the languages before i benchmark and measure their actual performance how does highly optimized c stack up against the just in time approach to c sharp and is an interpreted language like python even a viable choice if perf matters .

Get a comparison tour of the languages and a performance showdown all right here today in dave's garage i went to sort of a weird high school back in my hometown of regina saskatchewan yes that's what i said and about the time i was getting ready to go to high .

School they were just putting together a new pilot program for gifted kids parents of precocious keeners around the city had their little prodigies tested and the top 20 applicants were accepted somehow i wound up included amongst this weird menagerie of freaks and geeks and spent the next couple of years immersed .

In an advanced curriculum there were no spares and no electives just lots of history calculus french and even a computer science requirement which was not a common thing back in 1982. when i was in the ninth grade i was very fortunate to have been blessed with a great computer science teacher we'll .

Just call him mr bright he was the perfect combination of hipster math teacher and doofus computer enthusiasts and one of the clever things he did was to make some of the assignments competitive amongst our class it was genius in a way because the bar becomes not some arbitrary grade but .

Rather the peers that surround you can you write code that runs faster than the other kids code instead of just trying to make your code fast enough you're pressured into making it as fast as possible which is a much different proposition particularly if you're competitive and doubly so if you're passionate about .

Computers as i was the assignment this time around was to calculate a list of all the prime numbers under one thousand as quickly as possible mr bright equipped us with a rudimentary explanation of how the civil war saw things works more on that later before he turned us .

Loose in our little lab full of commodore super pet sp9000s to write our prime number finders many students completely skip the seven just undertook the naive approach hoping for the best test each and every number up to a thousand by testing every possible factor .

If you find any that candidate number is therefore not a prime that approach works of course but it takes most of the hour to run to completion on a pet it was however enough to get you to pass if you got it working my own implementation was significantly faster coming in at well under one .

Minute just as i was hoping somebody would walk by and notice my amazing achievement i looked over the kid's shoulder next to me it was one of my best friends rob and my jaw dropped when i saw his result 10 seconds what how is that even possible on a commodore pet i thought mine was already .

Close to optimal so i was really surprised but his results were correct so there was no denying it there was only one kid i was worried about even more than rob and that was mike i started to wonder if his could be even faster somehow so i glassed across the al to where he was setting to scope out .

His next run and when it came in at six seconds i knew i was in trouble my cad and presumably still has some enormous intellectual horsepower now i was really into coding but mike was a serious math head and he clearly either known or figured out some optimizations of the sieve that i .

Had not optimizing a bad algorithm is a terrible strategy so it was back to the drawing board to look for the big wins in my algorithm and not the little fractional ones in my code it turns out that a sieve while a fairly simple concept can be undertaken in a very efficient manner or in a more naive .

One in the naive case you could spend a lot of time simply excluding even numbers or searching the high end of the candidate list for more primes slowly one by one i discovered possible optimizations some minor some major when marking a factor through the sieve for example .

You can skip every even multiple of the working prime because it would have been eliminated previously anyway on the two pass in fact you don't even need to keep track of the even numbers in your sieve at all which saves space and time and you can stop searching for factors when you've reached the square root of .

The upper limit the result of all these optimizations and a few others was that in the end all three of us plateaued at around six seconds i think it was a pretty safe bet that we were doing the same amount in style of work as each other at that point and odds are that the algorithms were pretty .

Close to optimal or if nothing else just like the old space shuttle computers on a good day our computers were in agreement the overall lesson was clear however no amount of optimizing can fix a bad or broken algorithm a couple of years ago i started writing a project for rgb led control .

Long story short i would compose a visual image on a pc and then send it over wi-fi to be displayed on led panels since the visuals are all time-stamped you can even use it to broadcast and then arrange a big jumbotron if you wished wanting to learn something new at the time i opted to do it in python which i .

Had never used the first thing i write in a new language is almost always that same venerable prime sieve as i wrote back in grade 9. i know the programming pattern well enough now that i can focus on the specifics of implementing it within whatever new language or platform i'm .

Working on the civ itself covers enough programming topics that by the time i'm done i've covered some i o memory allocation loops control structures and bit manipulation so naturally that was the first thing i wrote in python as well i went at the python with a bit of a bad .

Attitude because i didn't think i would be able to adjust to a language that used indentation for flow control that just seemed very wrong but i got used to it quickly enough and soon developed quite an admiration for the language i was particularly impressed or so i thought at the time with the execution speed given that it's .

An interpreted language but then i had not composed the prime sieve on a modern piece of hardware in at least a decade or two so i didn't really have a lot of basis for comparison that's why when it came time to write this performance comparison between c c sharp and python i decided that the .

Prime civ was the obvious choice not only is it something i knew how to write but as i said it covers a good many of the language features and is truly one of the classic cpu performance benchmarks it's been almost 40 years but odds are i can still remember how to write a sieve and most of those important .

Optimizations before i jump on into the editor and start throwing code around however we should talk a bit about what a prime sieve is and how it works as i'm sure you know a prime number is simply a number that can only be divided evenly by itself or one there are no other factors to .

Find the numbers that qualify all we have to do is to write down all the numbers from 1 to some upper limits like 1000 in a long list we start with 2 the first prime number and go through our list crossing out every multiple of 2. when done we look at our list again and .

Find the next number that is still available and that's 3. thus we mark 3 as a prime and then go through our list marking off all multiples of 3 as not prime when done we go back to our list to discover that the number 4 was already ruled out when we crossed out our multiples of 2. .

That leaves the number 5 as the next factor to walk through the list we keep pulling the first available number out of the list as the active factor for each pass and then crossing off its multiples in the list until we get to the square root of our upper limit which is as big as we need to go the proof of that is left to the reader as .

An exercise when we're done any numbers that have been crossed off the list are your primes it's like magic but remarkably simple magic all you need to remember are the two main steps find the next unmarked factor at the beginning of the list then cross off all .

Its multiples all the way down the list that's really all there is to it let's jump on into the visual studio code editor and take a look at the python variant of our program and see how it works being perhaps the most readable of the languages it's the best place to start our main entry has a few main tasks .

First it records the start time by calling the system's default timer then as long as less than five seconds have elapsed on the test we run another primitive and count it as a pass when the five seconds are up in addition to validating that our primes are actually correct the code will display the number of .

Passes it achieved how long that took and the average pass time those two numbers how many passes it could complete and how long each individual pass took will be our important performance metric i made the program able to use any multiple of 10 as your upper limit i later found that the sweet spot for .

Working in python was to cut the primed up to about a million because that takes about a fifth of a second for fun i ran it with a limit of 1 000 to see how it compared to the old commodore patch since both basic and python are interpreted languages even with all the numerous .

Simplifications and limitations of the old pit it still took six seconds versus one ten thousandth of a second that's a performance factor of approximately fifty thousand to one versus the old hardware we've come a long way baby working from the bottom up we can see .

That the print results function does pretty much exactly what you expect except that it knows how to validate the results by checking a table with historical data for example the code knows that there are 78 498 primes under 1 million so if you find any more or less you've got a bug .

And print results calls validate results to find out and we'll let you know count primes is obviously the function that counts how many primes are still in the list without having been crossed off it does it in python by summing a query which runs over the list of bits to determine which ones are still set in fairness to .

Python this is cool but i should note that if the language provided a concise or highly expressive way to accomplish what i was doing like this i'd use that mechanism without too much concern for performance mostly because i'm not a python expert by any stretch and by the way if you happen to be one and stuff i'm .

Doing in my python makes you cringe by all means let me know nicely please in the comments i'm eager to learn what i've done wrong and i'm completely new to the language when i wrote this i was writing with an eye towards making the algorithm as efficient as possible but not necessarily my language use .

The run sieve method is where the actual work gets done there are two loops here the outer while loop is the one that finds the next unmarked factor at the beginning of our list and the main for loop is the one that runs through the list crossing off the multiples you can see that we actually start with .

Multiples of three and that's important my code is optimized to not even keep track of the multiples of two they're not even in the list at all so there's no need to run the pass for the factor 2 and hence we start with the factor 3 right off the top each time we enter the 4 num in range code we'll look for the next .

Number at the beginning of the list that hasn't had its bit set to false yet as soon as it finds one it assigns that as the working factor once it has a factor it runs from there on to the end of the list clearing the bits of subsequent multiples thereby crossing them off the list you'll notice we start with the third .

Multiple of the factor which is a small optimization because we know implicitly the second multiple is an even number and therefore not present in the list we stop from there by doubling the factor for the same reason and even multiples are obviously even numbers and would have been crossed out already .

The get bit and clear bit functions are what manipulate the bits in the bit array to keep track of which numbers are prime and which have been crossed out and they only have a little bit of magic they divide the index by two when it comes in which means it uses half as many bits because it knows it doesn't even have to check .

Even numbers if you do happen to call it with an even number getbit returns false as it should and clearbit does nothing both are still supported but will cause debug output to let you know that you're doing something boneheaded if you're monkeying with even multiples it's a sure tell that you've got a major .

Inefficiency somewhere in your algorithm finally up at the top of the file we have a hash table or dictionary of historical data where i've encoded the number of primes that should be found up to various limits all the way up to 100 million the validate results function takes your upper limit and compares how many primes .

You found in your past to how many actually exist and returns a boolean value to indicate whether or not your code is working correctly or not note that i'm being careful to launch the app outside the debugger in the normal console without debug information the release build is about 20 times faster than the debug version so that's .

Something you'd definitely want to check on any performance sensitive python projects that you might have five seconds later we get the output that confirms the code is operating correctly 78 498 primes were found up to our valid limit of one million the results are valid it managed to run .

25 passes in slightly over five seconds for an average pass time of two tenths of a second now that you've got a feel for the algorithm in a higher level language let's step down a bit into c sharp i actually prefer it myself but that's likely a function of my own familiarity .

Had i been using python for as long as i've been using c-sharp i'd probably feel differently if we start down at the bottom of the file at the main entry again we can see that the c-sharp and the net runtime provides some very plain and expressive concepts i'd argue that the timing code here is .

Simpler in c-sharp than in python and easier to read all you do is subtract the start time from now and look at the total seconds taken otherwise this is by and large the exact same logic as our python program so let's just move on print results is similarly similar so we .

Can bypass it and move right onto the sieve itself except for the syntactic sugar of how you call a math routine such as the square root the sieve also seems almost identical to what we saw in python i think that tells you a lot about what c sharp is it's like c .

And has the power of c but it's almost as expressive as many interpreted languages the implementation of get bit and clear bit are pretty trivial they simply call the indexer on the bit array object they use the same divide the index by two approach as all of our implementations to omit the even numbers .

And save memory in the constructor for primesave we see evidence what makes our overall implementation so simple it's the system's bitter a class we are completely oblivious as to how it's implemented or how much memory it consumes to do its actual work but the interface is perfect so i went with .

It it's an array of bits what could be simpler we're at the mercy of its implementation as to how performant it will be but that's why we're measuring this stuff so here we go our python implementation cranked out 25 iterations in the five seconds let's see what c-sharp can do .

2 651 passes put more simply c-sharp is about 100 times faster than python now i knew it'd be faster but i was not expecting a factor of 100. that led me to poke around in the disassembly to see exactly how it was fetching and clearing bits since that's a lot of the work .

This is the debug version i couldn't actually figure out how to make the retail version of a c-sharp app that i could then debug and see the actual intermediate assembly language but if you look at it it is reasonably accurate it is or reasonably performant to divide .

By two it's shifting right the register by one that kind of thing that you would expect it to do so what accounts for this huge difference a factor of 100. well if you think about it when the c-sharp app goes to actually set a bit or actually it doesn't ever sets a bit it goes to clear a bit so when it goes .

To clear a bit all it does is it figures out how many bytes in the memory buffer it needs to be which bit within that buffer it needs to monkey with and then it just does a bit mask into it that's really all it does looking at this code and so that's pretty easy for it to do .

On the other hand the python code has to for each line of code it has to read and then interpret and tark you know it may it's probably got a lot of optimization so it's not re-tokenizing and re-parsing every line each time but even so the lines have to be parsed and put into byte codes or however .

Python actually encodes that's a lot of overhead compared to just a couple assembly instructions that push the bits around when it's an interpreted language a lot of the work goes through doing the interpreting so as a gas i'd expect this c sharp to be about half as fast if that .

As the hand coded native x86 should be if that turns out to be true and it's half as good as hand coded c that speaks volume is about the c sharp compiler which doesn't compile your code until the first time it's run before that it's stored in byte codes that represents an intermediate form of assembly language known as il .

For intermediate language so the compiler on the developer's machine compiles the c sharp code into sort of an idealized assembly language that has opcodes for things like allocating and initializing objects the things that might be really platform specific the first time the program is run the .

Jet or the just in time compiler compiles that intermediate language down into real machine code native to your machine on whatever platform you're running on and it only happens the first time so that's a small perfect in doing so on the very first run but you can't even take that hit at .

Installation time if you wish making it all sort of a non-issue of sorts the other perfect you can see is in c-sharp garbage collection rather than having the programmer keep track of when memory needs to be freed and where and so on the operating system is left to do so on .

A multi-core machine the framework can almost do it for free because it can happen on a background thread that wouldn't otherwise be busy but it's not a big factor for us because we only allocate memory once per civ pass i consider tweaking an algorithm to feature a reset method rather than .

Reallocating that entire array but i didn't want to cater to any one language any garbage collection that does happen is counted within the stats because it happens while the timer is running if garbage collection is a problem for c sharp it will show up in the numbers and it performed pretty well but we've .

Got to compare it to native code yet so finally the c plus plus while c sharp might be my favorite language c is more like my native tongue i've been working in it literally since i was a teenager and suffice to say well that was a while ago it's easy for programmers to get stuck in a rut using what they know but i've tried to keep my .

Own c plus plus relatively up to date without adding too many of these shiny gigas that turn up in the language as it evolves if you're a c programmer and you've never seen it before you're going to thank me for this little bit of duration measurement hang on let me find the code .

In the standard chrono namespace you'll see i'm using a duration cast never used that before to convert the duration to seconds and then counting the number of seconds when it comes time to print the results i need much better than single second resolution so i do a duration cast two microseconds .

Then take the count and divide it by a million to get a fractional second if we jump up to the run civ method you might be surprised and perhaps somewhat relieved by just how much it looks like the c-sharp and the python implementations the logic is pretty much identical because all of the real work has been .

Pushed down into the get bit and clear bit methods and that's where c plus plus stumbles a bit because there's no bit array class that meets our needs there is a bit array class in the c library but it requires a static definition of how many elements are going to be in the array at compile time with the other two languages i accept .

The upper limit as an argument and while i could bake it in as a constant i don't think that was a fair concession to that library's limitations when i was a div manager at microsoft i occasionally would give the get clear set bit in an array problem as an interview question .

Over the years it's seen fewer and fewer candidates were really familiar and comfortable with bit manipulation so i stopped using it i should still be able to explain it myself however the get bit function divides the index by 8 to find out which byte in the array the bit in question is going to live .

Within it then builds a bit mask where the remainder of the division is used to shift a bit into place that mask gives us individual access to the anth bit in the array and we simply return whether that bit is set or not by using a mask and an and clear bit uses similar logic .

Except the mask is inverted the form one where all the bits accept the nth bitter set we then add that mask into the appropriate byte then it will clear the nth bit only leaving all the other bits around it intact you'll notice we use the tilde sign which is the complement operator to .

Build that mask the complement operator flips all the bits so we start with only one bit set and then complement it and that gives us that one bit only clear those two functions plus the mallet call to allocate the memory used to back the bit array give us all the functionality we need in .

Order to implement the sieve in almost the exact same manner that we did in every other language for performance i wasn't sure what to expect as after looking at the code generated for the c sharp code there wasn't all that much room for improvement my bit function should still compile .

Load a little more efficient than the c bit array class or so i hope 4 645 passes what if we took it one step further and went right to assembly language well from having looked at the generated code the way i wrote the c functions makes it pretty easy for the compiler to generate pretty much optimal machine .

Code and so there's really not any room that i could see for hand optimizations the machine code is very tight and very close to the assembly that i would hand write for the bit functions i do however have one more accidental trick up my sleeve and that's 64 bit i was using 32 bits because i assumed it .

Would generate smaller code tighter faster and since i didn't need any of the big pointers or address space i couldn't imagine a win from moving to 64 bit but you don't know until you look so i tested it instead of guessing and lo and behold here's what happens when you run the .

64-bit version seven thousand four hundred and thirty six passes with each pass taking less than seven ten thousandths of a second crazy fast and not at all what i was expecting good thing i tried it i also tried a few runs to 100 million to see how long that would take .

With python it found all 5 million plus primes under 100 million in about 16 minutes with 64 bit c plus plus it took one wait for it one-tenth of one second so clearly the results must be exponential in terms of the performance delta now this wouldn't be a youtube video .

Without a fancy chart so here's the chart i made of the relative performance of the languages of my machine which features the amd 3970x chip that's a 64 thread chip but this is a single threaded workload so there are significantly faster chips out there now in fact i do have an m1 mac here about the fastest single core chip out there .

If you're curious enough that you think it wants a comparison let me know in the comments the nice thing is i can still do all three languages thanks to support on the mac if you enjoyed the episode please let me know by giving it the old thumbs up and if you're not subscribed .

To the channel yet you have no idea the kinds of awesome sauce you're missing so click on that button right now ring the bell too and if i don't live up to your lofty expectations you can undo it later but you can't remember to not forget it later if you don't do it now i believe that's called pascal's wager .

Wherein he proved that your safest bet is to simply subscribe to dave's garage in the meantime in between time i hope to see you next time right here in dave's garage any performance separate why can't you say 39 70 it's not hard


Most Popular