We need to make sure that we execute the code
corresponding to the actual Func*. This re-uses the
prolog array to hold the entry points for the cloned
Func*.
This adds a new profiling mode for vector instruction
shapes. I'm planning on using this to identify any common cases that
may be worth special casing, like we do for simple SetM, CGetM, and
IssetM instructions. Instead of adding Yet Another Hashtable Stats
Map, I created a generic version and changed TRACE=punt:1 to use it as
well.
I also changed emitInterpOneOrPunt to use a more specific name.
Fairly straightforward. The only interesting part is the
addition of ElemUX, which behaves similarly to ElemDX in terms of
side-effects (but not exactly the same).
TranslatorX64 interps cases that it expects to fail but I
didn't like that, so the hhir version never punts or interps. The fast
path code reuses the stuff Jordan added for InstanceOfD and should be
the same as tx64's, modulo some branch fusing that doesn't appear to
matter in perflab.
We are inconsistent with how many spaces happen after ##:## Errors have one, but warnings and notices have 2. I went with 1 because that is what Zend does.
Post-hphpc, declaring builtins as inline is useless, which obviates the
need for this layer.
This doesn't fully delete the file, because I wanted to keep mindless
parts and use-your-brain parts separate. This is the mindless part. The
remaining functions in noinline.cpp are no-inline wrappers around
(a) a function that isn't defined in an extension, call_user_func_array,
and (b) calling polymorphic is_* functions, which can probably mostly go
away now. But I wanted to exercise more care around those, so they'll
come in a followup diff.
The code was assuming that the result of the assignment
would be the value assigned, but its actually a new string
containing the first character of the value assigned. The result
was that the SetM had already decRef'd the rhs, and then the
jitted code decRef'd it again. Since that last decRef was a decRefNZ,
we would often get away with simply leaking both the original rhs and
the value of the SetM. But the new testcase crashes in a DEBUG build
without the fix.
Fixing various bugs all over the VM that make assumptions about RefData
and TypedValue layout. Here are the assumptions fixed by this diff:
offsetof(RefData, m_tv) == 0. Both JIT's assumed this in many subtle
ways, by punning RefData* as TypedValue* without adding an offset.
This assumption also causes RefData._count to overlap TypedValue.m_aux,
which constraints TypedValue layout.
offsetof(TypedValue, m_data) == 0. gen_ext_hhvm.php assumes you
can cast TypedValue* to Value*; the JITs often weren't using
offsetof(TypedValue, m_data) in their addressing calculations. HHIR
assumed return-by-value TV's have m_data/m_type in rax/rdx, which
can change when TV layout changes.
offsetof(TypedValue, m_type) > 8 is an assumption baked into the
pass-by-value register assignment logic in HHIR's codegen.cpp; if
the type is in the low word, register assignment is swapped.
sizeof(TypedValue::m_type) == 4. We used dword-sized operations
in both JIT's when accessing m_type. Now, we use helper functions
that are sensitive to sizeof(DataType)
Configuration:
DEBUG=: (opt) same layouts as trunk for RefData & TypedValue
DEBUG=1: (dbg) new RefData layout (m_tv doesn't overlap RefData::_count)
PACKED_TV=1, DEBUG=*: new RefData and TypedValue layout.
I've found several bugs recently due to the TypedValue plausible
check being too loose. Our DataType enum values are now sparse,
and we miss garbage that happens to fall inside the [-10, 127]
range, which is >50% of possible garbage values.
The vector translator produced a type that was
boxed-array-or-string, which we couldnt guard on.
Since applying a SetM to a string will almost always produce
a string, change it to report string instead (which will still
be guarded on).
But it was recently made thread local. If the value left
in it at the end of one request was smart allocated, the
next request would try to free it - but after its already
been swept.
We could make this RequestLocal, but that seems to add even
more overhead. We could probably make it a __thread TypedValue,
and just set it to uninit at the start of each request (and then
tvAsVariant() where its used). This just puts it back to how
it was, since global_variables() is already being dealt with
appropriately.
Provide a Static{Result,Exception}WaitHandle::Create() static method for
internal use, use it by GenArrayWaitHandle.
Avoids one inc/decref (Array -> Variant -> m_resultOrException) if
GenArray received an array of finished dependencies.
posix_getpwuid()
posix_getpwnam()
posix_getgrgid()
posix_getgrnam()
posix_ttyname()
were using non-threadsafe posix calls.
Most using AttachLiteral for shared space as well.
Bertrand's profiling data showed we're wasting the first 3
cache lines of a. Move defClsHelper and the (now rarely used)
non-generic decref stubs to astubs. Also adds some tracing I was
using to see how big and where things are.
Merge ContDone+ContExit into ContRetC with added support of passing results. This variable passing mechanism is not exposed to the PHP as the ReturnStatements in generators do not contain result expression. However, this is exposed by restored hphp_continuation_done() built-in to allow experimentation.
The idea is that once we introduce ContYield opcode (merge of all opcodes used by YieldExpression), we could change ContRetC and ContYield to leave result and done-status on the stack and leave it up to the caller (ContNext/ContSend/ContRaise) to fill in Continuation fields. This will make these opcodes more generic and useful for other things, while allowing us to move some properties to the VM and kill opcodes like ContCurrent.
Split up ConvToDbl into ConvArrToDbl, ConvBoolToDbl, ConvIntToDbl,
ConvObjToDbl, ConvStrToDbl and ConvGenToDbl, so that different flags
can be set for each instruction.
Continuation::send() uses m_received field to transfer the value inside
the continuation. This field remains set until the continuation is
iterated the next time.
Unset this field early by ContReceive so that the memory can be
reclaimed.
VectorTranslator is now good enough that all of this code can be
removed without causing a perf regression. Instructions, loads, and stores are
all slightly up, but CPU time looks like noise.
ZendPack::unpack() always returns a signed int32_t,
regardless of actual storage type being unpacked. For 32bit
unsigned types ('L', 'N', 'V', and 'I'(on I32 systems)) this
means overflowing in the helper and we have to explicitly
recast it to a uint64_t to get the data back out.
For LNV, this was already handled, it was missed for I.
Of the four horsemen of the SmartAllocator, ArrayData was the only virtual
call. This meant an extra layer of indirection when coming from the TC
to allow the c++ compiler to emit its virtual call, and slightly larger
callsites when using non-generic paths.
While we're moving in this direction, consolidate ArrayData introspection
on its type enum. isSharedMap() was previously implemented with a vtable
slot, and we had no way of asking if an ArrayData was a NameValueTable.
If a GenArrayWaitHandle is imported, the current implementation imports
only the active child. Try to import other children as well to increase
parallelism.
enterContext() can throw an exception if a cross-context dependency
cycle is found. It is safe to ignore such exception when importing
non-active children. The import will be attempted and fail again once
onBlocked() reaches the dependency that causes the cycle.
Tx64 only translates CGetS when you do it with the context
class the same as the class being looked up, to avoid the need for
accessibility checks. The IR can translate it more often, but was
using its fast path even when it was not safe to do so: if a previous
(safe) access had allocated a targetcache entry, later accesses would
be able to get to the property without an access check. For now just
limit the optimiation to safe cases.
Verify sanity of input array of dependencies in advance. Callers get the
error earlier and it also makes it easier to work with the array outside
of the main loop.
Our ext_hhvm generated code is casting TypedValue* to Value*
on the assumption that the offset of TypedValue::m_data is 0.
Fix this assumption, and also while in the same code, replace
some (t == KindOfString || t == KindOfStaticString) with
IS_STATIC_STRING(t), which does a single bit test instead of
two comparisons.
Many destructors were going through a C++ trampoline that did
nothing but turn a C call into a method call. Get rid of these where
possible. ArrayData still uses a virtual release, which is unfortunate
but cannot be helped at the moment.
change callers of determine_charset to check for nulls, instead of calling html_supported_charset (to pre-validate charset name) and then calling this (which has to run through a list of names anyway).
A few simple refactorings and optimizations for array_data. Perflab results (CPU time %) are on the desired side of noise: 22/23 show reduced times, 12/23 green, no red, best -2.2%, worst +0.3%.
In an unlikely situation a user of ext_asio may tamper with Continuation
before passing it to the ext_asio extension. Let's fail with an
exception if this happens. Previously, a user bug would stay unnoticed,
but would not harm the ext_asio code.
This check will be needed once we implement optimistic execution. With
optimistic execution, we iterate continuation and defer construction of
ContinuationWaitHandle until the first blocking event occurs. During
this phase, a standard dependency loop detection is skipped and the code
would try to iterate a continuation that is already being iterated.
I got to do a few optimizations:
* burn in the number of use vars
* not write out uninitialized use vars
* use the staticness of the method as a hint for the incRef
There is not a clear path forward to safely using MMX registers as
scratch storage. Doing so spoils the state of the legacy x87 FPU, and the
x64 ABI uses the x87 FPU to implement long double. Our options are to
either:
1. Prohibit use of long double in source code (or x87 instructions in
machine code). Given that we have a dynamically linked binary that
includes system libraries outside our control, open source that needs
to be patched, fbcode, etc., this could prove difficult.
2. Always execute an FPU resetting instruction before transitioning from
TC code to C++. For all I know, these instructions are cheap, but it
still seems like an unfortunate overhead to impose everywhere; since
we don't want to mark every helper with a "reset fpu" call, we'd
probably end up bloating callsites.
3. Admit this doesn't work.
In the absence of evidence this matters for perf, I lean towards 3.