OPT-1 Inline DECREF operations when the reference count is >1#
Background#
CPython’s reference counter uses two C macros, Py_INCREF
and Py_DECREF
. Py_INCREF
is simple, because it increments the ob_refcnt
field of PyObject*
by 1.
For Pyjion, any opcodes which use Py_INCREF, instead fetch the current refcnt by calculating the offset of the ob_refcnt
field to the PyObject*
address. This is then incremented by 1:
LD_FIELDA(PyObject, ob_refcnt);
m_il.dup();
m_il.ld_ind_i4();
m_il.ld_i4(1);
m_il.add();
m_il.st_ind_i4();
The machine-code for this function is efficient, as it requires a small operation on a qword (calculate offset), an increment on an I4 (add 1) and storing back to the address.
The INCREF emitter is used by both LOAD_CONST
(load frame const value) and LOAD_FAST
(load name).
Py_DECREF
however, needs to call the object’s dealloc function (tp_dealloc
slot) to free any fields, properties, etc. before free’ing the memory allocated for the object.
For example, the dealloc for a tuple will decrement the refcounts for all items within the tuple (and potentially dealloc any of the items).
Decref is more significant, because it is used by many opcodes:
DELETE_FAST
IS_OP
POP_TOP
UNPACK_SEQUENCE
STORE_FAST
CPython optimizes this by inlining the decref method:
static inline void _Py_DECREF(
#ifdef Py_REF_DEBUG
const char *filename, int lineno,
#endif
PyObject *op)
{
#ifdef Py_REF_DEBUG
_Py_RefTotal--;
#endif
if (--op->ob_refcnt != 0) {
#ifdef Py_REF_DEBUG
if (op->ob_refcnt < 0) {
_Py_NegativeRefcount(filename, lineno, op);
}
#endif
}
else {
_Py_Dealloc(op);
}
}
Without this optimization, Pyjion will call a method, PyJit_Decref()
which calls the Py_XDECREF
macro (same as Py_DECREF
but checks null
).
This operation is slow because:
It requires a global method call (requires a jump pointer on Windows and Linux as well)
x86_64 preamble/ABI adds overhead
Solution#
The OPT-1 optimization will instead implement a simple CIL bytecode sequence, similar to the incref emitter, equivalent to this C code:
if (obj == NULL) return;
--op->ob_refcnt;
if (op->ob_refcnt <= 0)
{
_Py_Dealloc(op);
}
By reusing the reference to obj
and dup’ing the value, the entire operation can be handled by CIL evaluation stack (and therefore fit within registers):
Label done = emit_define_label();
Label popAndGo = emit_define_label();
// -- EE Stack Effect --
m_il.dup(); // obj, obj
emit_null(); // obj, obj, null
emit_branch(BranchEqual, popAndGo);
m_il.dup(); m_il.dup(); // obj, obj, obj
LD_FIELDA(PyObject, ob_refcnt); // obj, obj, refcnt
m_il.dup(); // obj, obj, refcnt, refcnt
m_il.ld_ind_i4(); // obj, obj, refcnt, *refcnt
m_il.ld_i4(1); // obj, obj, refcnt, *refcnt, 1
m_il.sub(); // obj, obj, refcnt, (*refcnt - 1)
m_il.st_ind_i4(); // obj, obj
LD_FIELD(PyObject, ob_refcnt); // obj, refcnt
m_il.ld_i4(0); // obj, refcnt, 0
emit_branch(BranchGreaterThan, popAndGo);
m_il.emit_call(METHOD_DEALLOC_OBJECT); // _Py_Dealloc
emit_branch(BranchAlways, done);
emit_mark_label(popAndGo);
emit_pop();
emit_mark_label(done);
Gains#
The
POP_TOP
opcode is now ultra-efficient. In particular this would speed up breaking from a block and an exception handling region.The
IS_OP
opcode (a is b
ora is not b
) combined with OPT-2 Optimize “is” operations to pointer comparisons is much more efficient that the CPython equivalent.
Edge-cases#
This optimization would remove support for Py_REF_DEBUG
, the reference counting debug option. This option isn’t compiled into the release binaries of CPython and requires --with-pydebug
support (only used for testing).
In those scenarios, this optimization should be disabled.
Configuration#
This optimization is enabled at level 1 by default. See Optimizations for help on changing runtime optimization settings.