OPT-1 Inline DECREF operations when the reference count is >1

Background

CPython’s reference counter uses two C macros, Py_INCREF and Py_DECREF. Py_INCREF is simple, because it increments the ob_refcnt field of PyObject* by 1. For Pyjion, any opcodes which use Py_INCREF, instead fetch the current refcnt by calculating the offset of the ob_refcnt field to the PyObject* address. This is then incremented by 1:

LD_FIELDA(PyObject, ob_refcnt);
m_il.dup();
m_il.ld_ind_i4();
m_il.ld_i4(1);
m_il.add();
m_il.st_ind_i4();

The machine-code for this function is efficient, as it requires a small operation on a qword (calculate offset), an increment on an I4 (add 1) and storing back to the address.

The INCREF emitter is used by both LOAD_CONST (load frame const value) and LOAD_FAST (load name).

Py_DECREF however, needs to call the object’s dealloc function (tp_dealloc slot) to free any fields, properties, etc. before free’ing the memory allocated for the object. For example, the dealloc for a tuple will decrement the refcounts for all items within the tuple (and potentially dealloc any of the items).

Decref is more significant, because it is used by many opcodes:

  • DELETE_FAST

  • IS_OP

  • POP_TOP

  • UNPACK_SEQUENCE

  • STORE_FAST

CPython optimizes this by inlining the decref method:

static inline void _Py_DECREF(
#ifdef Py_REF_DEBUG
    const char *filename, int lineno,
#endif
    PyObject *op)
{
#ifdef Py_REF_DEBUG
    _Py_RefTotal--;
#endif
    if (--op->ob_refcnt != 0) {
#ifdef Py_REF_DEBUG
        if (op->ob_refcnt < 0) {
            _Py_NegativeRefcount(filename, lineno, op);
        }
#endif
    }
    else {
        _Py_Dealloc(op);
    }
}

Without this optimization, Pyjion will call a method, PyJit_Decref() which calls the Py_XDECREF macro (same as Py_DECREF but checks null).

This operation is slow because:

  • It requires a global method call (requires a jump pointer on Windows and Linux as well)

  • x86_64 preamble/ABI adds overhead

Solution

The OPT-1 optimization will instead implement a simple CIL bytecode sequence, similar to the incref emitter, equivalent to this C code:

if (obj == NULL) return;
--op->ob_refcnt;
if (op->ob_refcnt <= 0)
{
    _Py_Dealloc(op);
}

By reusing the reference to obj and dup’ing the value, the entire operation can be handled by CIL evaluation stack (and therefore fit within registers):

Label done = emit_define_label();
Label popAndGo = emit_define_label();
                                // -- EE Stack Effect --
m_il.dup();                     // obj, obj
emit_null();                    // obj, obj, null
emit_branch(BranchEqual, popAndGo);

m_il.dup(); m_il.dup();         // obj, obj, obj
LD_FIELDA(PyObject, ob_refcnt); // obj, obj, refcnt
m_il.dup();                     // obj, obj, refcnt, refcnt
m_il.ld_ind_i4();               // obj, obj, refcnt, *refcnt
m_il.ld_i4(1);                  // obj, obj, refcnt,  *refcnt, 1
m_il.sub();                     // obj, obj, refcnt, (*refcnt - 1)
m_il.st_ind_i4();               // obj, obj

LD_FIELD(PyObject, ob_refcnt); // obj, refcnt
m_il.ld_i4(0);                 // obj, refcnt, 0
emit_branch(BranchGreaterThan, popAndGo);

m_il.emit_call(METHOD_DEALLOC_OBJECT); // _Py_Dealloc
emit_branch(BranchAlways, done);
emit_mark_label(popAndGo);
emit_pop();

emit_mark_label(done);

Gains

  • The POP_TOP opcode is now ultra-efficient. In particular this would speed up breaking from a block and an exception handling region.

  • The IS_OP opcode (a is b or a is not b) combined with OPT-2 Optimize “is” operations to pointer comparisons is much more efficient that the CPython equivalent.

Edge-cases

This optimization would remove support for Py_REF_DEBUG, the reference counting debug option. This option isn’t compiled into the release binaries of CPython and requires --with-pydebug support (only used for testing). In those scenarios, this optimization should be disabled.

Configuration

This optimization is enabled at level 1 by default. See Optimizations for help on changing runtime optimization settings.