I wrote loads of assembly programs, too, but only one stands out for crazy, stupid, and useless optimization.
I was trying to draw a spline on the screen. I took the algorithm from a scientific paper I had gotten hold of, and my C program was dead slow. The problem boiled down to one subroutine that solved an equation. I moved this subroutine from C to MC68K assembler and optimized the heck out of it, with no real change to the result: it was still dead slow. Whenever I changed a parameter, it redrew the line, and that took something like five or more seconds. For a single spline.
So I dove down into the algorithm. What were they doing with that time-eating equation? Turned out this was about measuring the distance between two coordinates - and they did not even use Pythagoras for that, but something oddly complicated. I remember 15 or 16 MULS per call. With up to 70 clock cycles per MULS instruction, this burned.
I replaced this with a simple "are delta X and delta Y both in the -1, 0, or +1 range" function, and suddenly the algorithm ran like a lightning bolt on steroids. I could move any definition point around with the mouse, and the spline followed smoothly.
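For illustration, here is a minimal C sketch of the two kinds of check. It is not the original code: the function names and the integer pixel coordinates are assumptions, and the paper's actual formula (the one with 15 or 16 multiplies) is not reproduced. The first function is the plain Pythagoras version, kept squared to avoid a square root; the second is the multiply-free "both deltas within -1..+1" test.

```c
#include <stdbool.h>

/* Hypothetical names; the original routine is not shown here. */

/* Squared Euclidean distance (Pythagoras without the square root).
 * Costs two multiplies per call. */
static long dist_squared(int x0, int y0, int x1, int y1)
{
    long dx = (long)x1 - x0;
    long dy = (long)y1 - y0;
    return dx * dx + dy * dy;
}

/* Multiply-free replacement: true when delta X and delta Y are both
 * -1, 0, or +1, i.e. the two points are the same or neighbouring pixels. */
static bool is_adjacent(int x0, int y0, int x1, int y1)
{
    int dx = x1 - x0;
    int dy = y1 - y0;
    return dx >= -1 && dx <= 1 && dy >= -1 && dy <= 1;
}
```

The second version needs only subtractions and comparisons, which on a 68000 are far cheaper than a MULS.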
So it is nice to be able to optimize assembler, but by choosing the right algorithm, you can do way better than that.