To expand on the answer here: I'd recommend using the internal judge.
It was created because it gives the most accurate results. There won't be any cached responses from the server, and you can be sure it will stay up and running while you're performing the test.
If you're having problems starting it up, the same engine, along with some extra debugging, has been incorporated into the standalone program Stan:
http://www.project2025.com/stan.php
There are very few cases where the internal judge can't be used; the majority of "failures" are down to incorrect configuration.
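
For anyone wondering what a judge does and why a local one avoids cached responses, here is a rough sketch in Python. It assumes "judge" means a proxy judge in the usual sense: a small HTTP server that echoes back the address and headers it sees, so the checker can tell what a proxy added or leaked. This is not the code of the internal judge or of Stan; the names `build_judge_body`, `JudgeHandler` and `serve`, the output format, and port 8080 are all made up for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def build_judge_body(client_ip, headers):
    """Render the connecting address and request headers as NAME=value lines.

    The CGI-style names (REMOTE_ADDR, HTTP_*) are an illustrative format,
    not the format of any particular judge.
    """
    lines = [f"REMOTE_ADDR={client_ip}"]
    for name, value in headers:
        key = "HTTP_" + name.upper().replace("-", "_")
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


class JudgeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Echo back exactly what this request looked like on arrival.
        body = build_judge_body(self.client_address[0], self.headers.items())
        data = body.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(data)))
        # Ask intermediaries not to store the reply, so each test is fresh.
        self.send_header("Cache-Control", "no-store")
        self.end_headers()
        self.wfile.write(data)


def serve(port=8080):
    """Listen on all interfaces; the port is an arbitrary example value."""
    HTTPServer(("", port), JudgeHandler).serve_forever()
```

Calling `serve()` starts the listener. As with any judge running on your own machine, the chosen port has to be reachable from outside (firewall and router forwarding), which is the sort of configuration problem mentioned above.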