Improving the bubblesort.

sao123

Lifer
May 27, 2002
12,653
205
106
Hey guys,
I've spent a few days seeing if I could improve the bubblesort, and here's what I came up with.

I'd like some comments/thoughts on the theory, and as soon as I get working code I'll post it for someone to test, to see first whether it actually works and second how much faster than bubblesort it really is.


Improved BubbleSort Theory
Edit: Here is the code.
Code File
 

Matthias99

Diamond Member
Oct 7, 2003
8,808
0
0
I mean, there's a lot of ways to optimize bubblesort (I gotta think about this one a little more; your description is a bit confusing, and your code is indecipherable -- single-letter variable names and no comments are not good). But why? A naive O(nlogn) sort (like quicksort or merge sort) will beat even a highly-tuned O(n^2) sort for any meaningfully large n.
 

sao123

Lifer
May 27, 2002
12,653
205
106
The reason behind the exercise was a question posed by blahblah99 a while ago.
The question was: if I had an unsorted array of 10 million items, what is the best/fastest way to pull out the 10 largest numbers in sorted order?
The answer was to just sort out the 10 largest numbers with a bubblesort, rather than sorting the entire array with a faster algorithm.


Now he has asked again whether it's possible to do it even faster.
This proposal is my idea for improving the bubblesort, and it will still take 10 passes.
However, instead of taking 10 full passes, I believe that due to void areas in the array it's possible to take just a few, or even just one, full passes and make the rest partial passes. That would reduce the number of compares and swaps performed per pass, thus speeding up the algorithm.

 

Jack31081

Member
Jan 20, 2005
121
0
0
Well, in one pass of the 10-million-item array you can pull the 10 largest items. Then it's just a matter of sorting them, say with a quicksort. Or did the problem prohibit pulling the 10 largest items from the larger array until you had them in order?

*edit*

Oh, that's what you just said. I didn't read that right the first time.

So you're trying to speed up the sorting of 10 items? Isn't that kind of a futile task?
 

Matthias99

Diamond Member
Oct 7, 2003
8,808
0
0
What he said. Do one (O(n)) pass to find the x largest elements of the n-element array. Then sort those x elements using an O(xlogx) sort. This is O(n + xlogx), which is almost always going to be lower than O(n*x) (which is what your repeated bubblesort is). For n = 100 and x = 10, it's ~130 versus ~1000 operations (at least with a naive sort used for both). The repeated bubblesort gets much worse as x gets bigger, whereas the other algorithm scales up a lot better. If x = 100 and n = 1000000, O(n + xlogx) ~= 1,000,000 (xlogx ~= 700 operations, which is insignificant), whereas O(n*x) = 100,000,000. That's a hundred-fold improvement.

You'd have to do something like you proposed to do it in-place, though, since otherwise you need O(x) extra storage to hold the indices of the x largest elements so you can sort them.

Edit: I thought about this a little more, and I'm not sure you can find the x largest elements of an n-element array in O(n) time. I'll have to look at this later when I have more time to go over it.
 

Jack31081

Member
Jan 20, 2005
121
0
0
Well, if the list of 10 million is already in order from largest to smallest, it would require 99,999,945 comparisons and no replacements in the list of 10. If the list of 10 million is already in order from smallest to largest, it would require 9,999,999 comparisons and 99,999,945 replacements.

If I remember my algorithms correctly, that means the first case is O(10n) -> O(n) and the second is O(n + 10n) -> O(11n) -> O(n). A randomized list should fall somewhere in the middle, no? Meaning for any list of 10 million items, the largest ten can be found in O(n) time.

Right? It's been a while, so I'm not certain.
 

sao123

Lifer
May 27, 2002
12,653
205
106
10 million is already in order from largest to smallest... that means the first case is O(10n) -> O(n)
This is true for an unmodified bubblesort.

It's not built into the code for the prototype (because I wanted to test my new optimization by itself), but one of the first classic modifications to bubblesort is this: in any given pass, if 0 swaps are made, then the sort is complete and no further passes need occur. That would give O(1n) -> O(n), and that makes sense, because if the array is already sorted from largest to smallest, the first 10 numbers are the ones you are looking for. Minimum one guaranteed pass.

At first look it doesn't appear any of my optimizations improve the worst-case scenario, which will be O(x*n), x = 10, which reduces to O(n).

The goal is, whenever possible, to reduce the length of a given pass:
After a given pass, there may be a region of the array that does not need to be searched, because the largest value in that region has already been moved to the smallest index in that region. So if the largest number N found in a given pass had index X, and N1 (the next-largest value in the array) had index > X, then N1 was moved to position X+1 during that same pass. So the index of N1 is guaranteed <= X+1, reducing the length of the next pass.
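The early-exit optimization described above can be sketched as follows (a minimal Python illustration, not the OP's actual code; it bubbles each pass's largest remaining element toward the front, matching the "largest value at the smallest index" convention used in this thread):

```python
def bubble_top_x(a, x):
    """Selection by bubblesort: x passes, each bubbling the next-largest
    element toward the front. Stops early if a pass makes no swaps."""
    n = len(a)
    for i in range(min(x, n - 1)):
        swapped = False
        for j in range(n - 1, i, -1):     # walk from the back toward slot i
            if a[j] > a[j - 1]:
                a[j], a[j - 1] = a[j - 1], a[j]
                swapped = True
        if not swapped:                   # no swaps: already in descending
            break                         # order, so one pass suffices
    return a[:x]
```

On an array already sorted largest-to-smallest this does exactly one pass of compares, the O(1n) best case described above.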
 

Jack31081

Member
Jan 20, 2005
121
0
0
Hey, wait a sec. The pass I just described in my previous post will give you the top 10 numbers in order in one pass through the 10-million-item list, meaning O(n) time. Are you trying to make the algorithm less than O(n), or just reduce the actual number of comparisons and swaps?
 

Matthias99

Diamond Member
Oct 7, 2003
8,808
0
0
Okay, you definitely can't find the x largest elements of an n-element array in O(n) time. You *can* do it in O(n*x) time, but that's not an asymptotic improvement.

What I was looking at was doing a single pass over the n-element array, building/maintaining a sorted list of the x highest elements encountered so far. If you use a linked list with insertion (and then remove the smallest element from the list of x elements when it grows too big), it's O(n + n*(x/2)), which is O(n*x). Maybe if you maintained it as a heap/priority queue you could get O(n + n*logx)? Or would you end up with O(n + n*(xlogx))?

I'm thinking maybe O(n*x) is as good as it's getting.
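For the record, the heap version does give O(n + n*logx): keep a min-heap of the x largest values seen so far, so each of the n elements costs at most one O(logx) heap operation. A Python sketch using the standard heapq module (the function name here is mine, not from the thread):

```python
import heapq

def top_x_heap(a, x):
    """Return the x largest values of a, largest first, in O(n log x).
    The heap holds the x largest seen so far; heap[0] is its minimum."""
    heap = []
    for v in a:
        if len(heap) < x:
            heapq.heappush(heap, v)
        elif v > heap[0]:               # beats the smallest kept value
            heapq.heapreplace(heap, v)  # pop the min and push v, O(log x)
    return sorted(heap, reverse=True)   # final sort is only O(x log x)
```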
 

sao123

Lifer
May 27, 2002
12,653
205
106
Since one pass through the array is O(n) and 10 passes through the loop are
O(10n) -> O(n), you can't do better than O(n).
Matthias99 is right: it's really O(x*n). If x is small it reduces to O(n), but if x is large (on the order of n), it reduces to O(n^2).


What I really am doing, though, is just reducing the actual number of comparisons and swaps performed in a given pass, hopefully speeding up actual completion times for average cases. No matter what, it will still be an O(n) algorithm for fixed x.


Code is broken. Will fix and reupload.
 

Jack31081

Member
Jan 20, 2005
121
0
0
I thought about it some more, and you can reduce the number of comparisons as follows:

As you're making your pass through the 10-million-item array and comparing each item to items in the 'topten' array, don't start at the top and compare against all ten items. Compare with the largest item, then with the smallest, then with item 5, then item 7 or 3, etc. This should cut down the number of needed comparisons a good deal, I imagine.
 

hoppa

Senior member
Apr 10, 2004
253
0
0
It's quite obvious just glancing at the code that your sort is still O(n^2). What's the point? It'll still perform terribly on a big array, and on a small array the performance difference is too small to matter. Plus the code is congested and hard to read. Nice try, though. Check out quicksort for all your sorting needs.

-andy
 

sao123

Lifer
May 27, 2002
12,653
205
106
Its quite obvious just glancing at the code that your sort is still O(n^2). Whats the point?

If you sort the entire array, obviously it's going to be O(n^2), because that's what bubblesort is. If you sort only the 10 largest elements, it's O(n).

The code is broken and needs to be fixed.
Edit: Found the bug and uploaded new code.
 

hoppa

Senior member
Apr 10, 2004
253
0
0
For the pulling-X-highest-out-of-N question...
How about a merge sort where:
1) Merges order from highest to lowest, rather than lowest to highest
2) During a merge, after the highest X elements have been merged into the array, all other elements are thrown out

I'm having trouble figuring out the complexity of that solution, but it is surely less than nlogn for any x < n. Or perhaps not. Obviously if x = n, then for ANY solution you will still be stuck with O(n log n), which kind of makes this whole problem useless. How about we rephrase the problem to:
Find the highest X numbers in an N-sized array, s.t. X <= log(n)
Even then, the solution caps out at O(n log log n).


Hmm, OK, I just realized everything I typed up there is wrong because I was misstating the problem, though I'll leave it since it still applies to that problem. If the numbers we need to pull out don't have to be sorted, then this problem is, I think, more difficult, and potentially has a faster solution. I feel like there must be a way to scan the array, analyze the data from the scan, and pick a cutoff point such that you keep every number higher than it as your solution. Any ideas on how to analyze? Some method with standard deviations, perhaps.

-andy
 

hoppa

Senior member
Apr 10, 2004
253
0
0
Originally posted by: sao123


If you sort the entire array obviously its going to be O(n^2), because thats what bubblesort is. If you sort only the 10 largest elements.... its O(n)

Sao, that's fine, but that still only applies to X = 10. If you want this for any X, which is a million times more useful, then it is still O(n^2).

 

sao123

Lifer
May 27, 2002
12,653
205
106
The code is finished, debugged and fully commented.
(Use Original link in OP)
 

Matthias99

Diamond Member
Oct 7, 2003
8,808
0
0
Hmm, OK, I just realized everything I typed up there is wrong because I was misstating the problem, though I'll leave it since it still applies to that problem. If the numbers we need to pull out don't have to be sorted, then this problem is, I think, more difficult, and potentially has a faster solution. I feel like there must be a way to scan the array, analyze the data from the scan, and pick a cutoff point such that you keep every number higher than it as your solution. Any ideas on how to analyze? Some method with standard deviations, perhaps.

-andy

I'd thought about this as well; this is a probabilistic approach to the problem -- these can get lower than O(x*n) runtime, but are nondeterministic (so it could run for longer than O(x*n) as well).

Pick some threshold value, and then do an O(n) pass where you choose all elements with value greater than that threshold (this is one compare each for n items, so it is O(n)). If the number of elements you get is less than X, lower the threshold and repeat. If the number of elements is WAY bigger than x, raise the threshold and repeat. If the number of elements is close to x (say, x <= num_of_elements <= 10*x), run an O(xlogx) sort over those elements and just strip out the x largest ones.

If you have some idea of what the threshold should be, this could be a big time-saver -- it's O(p*n + xlogx), where p is the average number of passes required to find the right threshold. It wins out over the deterministic algorithms if p < x.
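A rough sketch of that probabilistic filter, with the caveat that the threshold-adjustment rule below (move halfway toward the minimum to lower, halfway toward the maximum to raise) is my own guess at "lower/raise the threshold", not something specified in the post:

```python
def top_x_threshold(a, x, threshold):
    """Filter by a threshold, adjusting it until the survivor count is
    between x and 10*x, then sort the survivors: O(p*n + k log k).
    Convergence is quick for roughly uniform data, but it is not
    guaranteed for adversarial value distributions."""
    survivors = [v for v in a if v >= threshold]
    if len(survivors) < x:
        # Too few survivors: move the threshold halfway toward the minimum.
        return top_x_threshold(a, x, threshold - (threshold - min(a)) / 2)
    if len(survivors) > 10 * x:
        # Far too many survivors: move it halfway toward the maximum.
        return top_x_threshold(a, x, (threshold + max(a)) / 2)
    return sorted(survivors, reverse=True)[:x]
```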
 

Matthias99

Diamond Member
Oct 7, 2003
8,808
0
0
as you're making your pass through the 10million item array and comparing each item to items in the 'topten' array, don't start at the top and compare against all ten items. compare with the largest item, then compare with the smallest, then compare with item 5, then item 7 or 3, etc. This should cut down the number of needed comparisons a good deal i imagine.

You can also store what the largest and smallest items you have found so far are. If the new item is <= the smallest you have found yet, and you have looked at more than x items, it obviously can't be one of the x biggest ones. If you build a sorted list as you go, you can further optimize by checking whether it is larger than the biggest item you have found yet, and if so, just insert it at the 'big' end of the list and then remove the item at the 'small' end (which is O(1), rather than O(x)). This reduces to O(n) for presorted lists, while still maintaining O(n*x) for unsorted ones.

You normally cannot have random access to items in a linked-list format. If you store your x largest items in a sorted array, you can binary search to find where the new item should be inserted (O(logx)), but then actually inserting it is O(x), since you have to move the old items around.
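That array trade-off is easy to see with Python's bisect module: the binary search is O(logx), but each insert still shifts up to x elements (a sketch of the same sorted-array scheme, not code from the thread):

```python
import bisect

def top_x_sorted_array(a, x):
    """Keep the x largest values in an ascending array; top[0] is the
    current minimum. Each update: O(log x) search + O(x) shift."""
    top = []
    for v in a:
        if len(top) < x:
            bisect.insort(top, v)        # binary search, then shifting insert
        elif v > top[0]:                 # beats the smallest kept value
            top.pop(0)                   # drop the minimum (O(x) shift)
            bisect.insort(top, v)
    return top[::-1]                     # report largest first
```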
 

blahblah99

Platinum Member
Oct 10, 2000
2,689
0
0
Originally posted by: hoppa
Originally posted by: sao123


If you sort the entire array obviously its going to be O(n^2), because thats what bubblesort is. If you sort only the 10 largest elements.... its O(n)

Sao, that's fine, but that still only applies to X = 10. If you want this for any X, which is a million times more useful, then it is still O(n^2).

Not necessarily... if you read up on and play with the sorting algorithms, each one has its own advantages and disadvantages depending on the nature of the application and the randomness of the array (Gaussian, random, reverse sorted, etc.).

For my application, where I only wanted the ten largest elements in a huge array, the obvious non-recursive sort algorithm of choice was bubblesort, since by nature, after N passes through the array, the first N elements are going to be the largest.

Now I'm no CS major, but making 10 passes through a 1+ million element array using bubblesort to get the ten largest elements was expensive and unacceptable. So I modified the algorithm slightly so that it's roughly O(n); it's surprisingly simple, very fast, AND leaves even more room to optimize.

The pseudocode goes something like this:

1. Create 10 element sorted array with initial values being 0.
2. Step through each element in the array.
3. If big array element[n] > last element in sorted array, replace last element in sorted array with element[n].
4. Re-sort 10 element array.
5. Repeat #2 until you go through all elements in the big array.

Now, I know some of you are going to argue that this is only going to work for the N (10 in this case) largest elements... and that's EXACTLY all I need for this application. As N increases, this algorithm becomes grossly inefficient. For my application, on average there will be about 0.5*sizeof(big array) sorts of the 10-element array, and the 10-element array can be sorted very quickly with just about any algorithm.

The reason I say there's room for improvement is that in step #4, after the first 10 elements have been inserted into the sorted array, every element inserted thereafter only replaces the last element. That means the last element is the only unsorted element in that array, so using bubblesort, only one pass is required to re-sort it.
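The five steps above translate almost line-for-line (a Python rendering of the pseudocode, not blahblah99's original source; note the zero initialization in step 1 assumes the values are non-negative):

```python
def top_ten_one_pass(big, x=10):
    """One pass over big, maintaining a descending x-element array.
    Only the last slot is ever out of order, so step 4 is a single
    bubble pass that walks the new value into place."""
    top = [0] * x                        # step 1: assumes non-negative data
    for v in big:                        # steps 2 and 5: walk the big array
        if v > top[-1]:                  # step 3: beats the smallest kept
            top[-1] = v
            i = x - 1                    # step 4: bubble the new value up
            while i > 0 and top[i] > top[i - 1]:
                top[i], top[i - 1] = top[i - 1], top[i]
                i -= 1
    return top
```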

 

ghackmann

Member
Sep 11, 2002
39
0
0
That's not O(n). Worst case, somebody hands your algorithm an n-element array with the elements already sorted in ascending order. Your algorithm will then have to re-sort the x-element array at every step through the n-element array. That's O(nx log x) if you use a decent sorting algorithm, or O(nx^2) if you use bubble sort.
 

imported_java

Junior Member
Feb 12, 2005
23
0
0
To get the ten largest values, don't sort the whole array; just grab the ten largest values in one pass. Start at one end and put the first 10 values into a sorted array. Keep going through the big array, replacing entries in the child array when larger values are found. Don't sort 10 million values if you don't have to.
 

sao123

Lifer
May 27, 2002
12,653
205
106
I'd still like to know the timing of my algorithm on your array,
vs. the full bubblesort method for 10 passes, which gave ~17 s.
I think I cut that time by 1/3 to 1/2.
 

blahblah99

Platinum Member
Oct 10, 2000
2,689
0
0
Originally posted by: sao123
I'd still like to know the timing of my algorithm on your array,
vs. the full bubblesort method for 10 passes, which gave ~17 s.
I think I cut that time by 1/3 to 1/2.

Haven't tried your code, but I took a look at it, and it would require me to re-write it in the language that I'm using, and I'm too lazy to do that.

With the improved sort (shown above), my 1 million element array pops out the 10 largest elements in ~0.7s, and only makes 1 pass through the entire array.
 

her209

No Lifer
Oct 11, 2000
56,336
11
0
Originally posted by: blahblah99
The pseudocode goes something like this:

1. Create 10 element sorted array with initial values being 0.
2. Step through each element in the array.
3. If big array element[n] > last element in sorted array, replace last element in sorted array with element[n].
4. Re-sort 10 element array.
5. Repeat #2 until you go through all elements in the big array.
Actually, for the worst-case scenario (here, the 1-million-element array sorted in ascending order), your algorithm would be very inefficient, as every number would have to be added to and sorted into the 10-element array. I think I may have found an even faster algorithm. I'm working on the pseudocode and will post it later.
 

Matthias99

Diamond Member
Oct 7, 2003
8,808
0
0
Originally posted by: her209
Originally posted by: blahblah99
The pseudocode goes something like this:

1. Create 10 element sorted array with initial values being 0.
2. Step through each element in the array.
3. If big array element[n] > last element in sorted array, replace last element in sorted array with element[n].
4. Re-sort 10 element array.
5. Repeat #2 until you go through all elements in the big array.
Actually, for the worst-case scenario (here, the 1-million-element array sorted in ascending order), your algorithm would be very inefficient, as every number would have to be added to and sorted into the 10-element array. I think I may have found an even faster algorithm. I'm working on the pseudocode and will post it later.

I already posted pseudocode for getting around this problem above. Maintain a sorted list rather than an array, so that if the new element is larger than everything already in the sorted list, it's O(1) to add it.
 