perspective rendering (<--As far as I know, Construct renders things with an isometric (no perspective) camera.)
Actually, the 3D box object does render in perspective. If it were orthographic (I believe that's the word you're looking for), you wouldn't see the sides of the box angling away towards a vanishing point. There would be no foreshortening and no depth cues other than Z layering.
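To make the difference concrete, here's a rough sketch in Python (not Construct code, and not how the 3D box object is actually implemented, just the textbook math). The function names and the `focal_length` parameter are made up for illustration. A perspective projection divides by depth, so farther points shrink towards the centre of the screen; an orthographic projection just drops the depth.

```python
def project_perspective(point, focal_length=1.0):
    """Project a 3D point onto the screen with a simple pinhole model.

    focal_length is an assumed camera parameter, not a Construct setting.
    """
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera (z > 0)")
    scale = focal_length / z  # farther away means a smaller scale
    return (x * scale, y * scale)


def project_orthographic(point):
    """Project a 3D point by discarding depth: no foreshortening at all."""
    x, y, _z = point
    return (x, y)


# Two corners of a box with the same x/y but at different depths.
near_corner = (1.0, 1.0, 2.0)
far_corner = (1.0, 1.0, 4.0)

# Perspective: the far corner lands closer to the centre of the screen,
# which is what makes the sides of a box angle towards a vanishing point.
print(project_perspective(near_corner), project_perspective(far_corner))

# Orthographic: both corners land on the same screen position.
print(project_orthographic(near_corner), project_orthographic(far_corner))
```

With orthographic projection the near and far corners overlap exactly, so you'd only ever see the front face of a box viewed head-on.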
And yes, I think you've underestimated the tasks you've listed. Loading and rendering animated models is a pretty tall order by itself, let alone 3D collisions and whatnot. Hell, the 3D box doesn't even do 3D collisions. dfyb has a more realistic take on the situation, and making 2.5D games the way he's suggesting is a much more achievable goal.
And even if you only had the ability to load an animated mesh, you could still fake real 3D with events anyway. Take a look at David's Wolfenstein demo... it's already on its way to becoming a real FPS. I'm sure some clever person (Glamthaus) will pop in there any second and show everyone how to make it mouse-look up and down, and after that how to move along the world Y axis. Arcticus already has a pretty clever system for vertical movement in his orthographic game; with some tweaking it could be adapted to render the world appropriately.