Here are the results of Otter 3.1 and MACE 1.4 on TPTP v2.3.0.

These are the versions of Otter and MACE that will be entered in the CASC-17 ATP System Competition.

Otter (which searches for proofs) and MACE (which searches for models/counterexamples) are separate programs. Some of the TPTP problems are satisfiable, so the two programs were combined with a shell script (otter-mace, that runs Otter for up to 5 seconds, then (if no proof was found) runs MACE for up to 5 seconds, then (if no counterexample was found) runs Otter for up to for the rest of the allowed time.

TPTP v2.3.0 has 4229 problems. Most have proofs (are unsatisfiable) and some have models. Otter-MACE found proofs for 1902 problems, and models for 194 problems, giving about 49.5% success rate (given 5 minutes for each problem) over the whole TPTP.

These figures were obtained in May 2000 on a (400 MHz PII) linux box.

The table has entries of the following form.

Problem Result Reason Seconds Memory Generated Size --------------------------------------------------------------------------- ALG002-1 PROOF --- 1 223K 1851 10 ALG008-1 MODEL --- 6 - - 3 ALG001-1 fail time 300 ??K ?? - --------------------------------------------------------------------------- Problem: the TPTP problem name Result/Model: PROOF, MODEL, fail Reason: reason for failure Seconds: time used Memory: memory used (applies only to Otter searches) Generated: clauses generated (applies only to Otter searches) Size: proof length or model size

Here is the table (gzipped plain text).

- The inputs were built with
tptp2X -f otter:hypothesis:[set(tptp_eq),set(auto),clear(print_given)] -t rm_equality:stfp

- This uses "auto1", the original auto mode. For the past few years, I have been using "auto2" for these TPTP experiments and for CASC (14,15,16). Auto2 is tuned a bit to TPTP. (The differences are mostly on nonHorn problems with equality, on which auto2 is less complete, but a bit more effective. Also, with auto2, you can partition the input clauses into Usable and Sos.)
- I switched back to auto1, because that's what I use in real life. It has no tuning to TPTP.
- The preformance of Otter in its auto modes has changed very little since before the first CASC competition (CASC-13 in 1996).
- There are a bunch of "abend" (abnormal end) failures in the SYN group. These are caused by the huge input formulas. I compiled a special version of Otter to handle these formulas, but it couldn't refute any of them anyway.