My solution would be to listen to the audio, possibly slowed down, and manually pick out the word boundaries. I would try to mark up the text with timestamps. Alternatively, put each word of the text in a 2D array, with one column holding the word and the next column holding that word's start time. However you get to this point, you can then check the audio playback time and use it to determine where in the text you are.
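To make the lookup step concrete, here is a rough sketch in Python rather than in your engine's own event system, so treat it as pseudocode to translate. The table contents, the names `timed_words` and `current_word_index`, and the timings are all made up for illustration. It assumes the rows are sorted by start time.

```python
from bisect import bisect_right

# Hypothetical timing table: one row per word, [word, start time in seconds].
# In practice you would fill this in by hand while listening to the audio.
timed_words = [
    ["Twinkle", 0.0],
    ["twinkle", 0.6],
    ["little", 1.2],
    ["star", 1.8],
]

# Pull the start times out once so they can be searched quickly.
start_times = [start for _, start in timed_words]


def current_word_index(audio_time):
    """Return the index of the word being spoken at audio_time.

    Returns -1 if audio_time is before the first word starts.
    """
    # bisect_right counts how many start times are <= audio_time;
    # the last of those belongs to the word currently playing.
    return bisect_right(start_times, audio_time) - 1
```

You would call `current_word_index` every tick with the current playback position and highlight whichever word it returns. A plain loop over the table works just as well for short texts; the binary search only matters if the text is long.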
Have you played with the Text object? Is it dynamic enough for your purposes? I don't think it supports any kind of markup, so I think the highlighted word needs to be a separate Text object. Have a look through the completed plugin list, though, as there may be a more full-featured add-on text object, and using some kind of markup would be much easier than juggling separate Text objects.