363d1bb20f
This change is mostly for FB internal organizational reasons. Building is not effected beyond the fact that the target now lands in hphp/hhvm/hhvm rather than src/hhvm/hhvm.
335 linhas
16 KiB
Plaintext
335 linhas
16 KiB
Plaintext
+----------------------------------------------------------------------+
|
|
| HipHop for PHP |
|
|
+----------------------------------------------------------------------+
|
|
| Copyright (c) 2010 Facebook, Inc. (http://www.facebook.com) |
|
|
+----------------------------------------------------------------------+
|
|
| This source file is subject to version 3.01 of the PHP license, |
|
|
| that is bundled with this package in the file LICENSE, and is |
|
|
| available through the world-wide-web at the following url: |
|
|
| http://www.php.net/license/3_01.txt |
|
|
| If you did not receive a copy of the PHP license and are unable to |
|
|
| obtain it through the world-wide-web, please send a note to |
|
|
| license@php.net so we can mail you a copy immediately. |
|
|
+----------------------------------------------------------------------+
|
|
|
|
|
|
/******************************************
|
|
* |1| What is tainting / taint analysis? *
|
|
******************************************/
|
|
|
|
Tainting (wikipedia:Taint_checking) is a mechanism used for detecting data
|
|
that poses a security threat (i.e., any user-controlled data), tracking this
|
|
data across all codepaths, and ultimately preventing or flagging dangerous
|
|
uses (e.g., echos, queries, shell commands). Tainting is generally used to
|
|
preempt XSS injection attacks, either by failing to execute things like curl,
|
|
eval, or exec when run on strings controlled by the user or, more
|
|
realistically for an imperfect codebase and with more ubiquitous calls like
|
|
echo, by providing valuable debugging information dynamically. However, the
|
|
mechanism of tainting, which simply involves attaching small amounts of
|
|
metadata to target variables and propagating this metadata through code
|
|
execution, can be extended to other similar purposes.
|
|
|
|
In HPHP (with taint analysis enabled), only PHP strings are given taints and
|
|
thereafter tracked. This "taint" comes in the form of a small sequence of
|
|
bit flags, which can be multipurposed and used for various semantics.
|
|
|
|
Note that only tainting strings comes with some disadvantages. In particular,
|
|
malicious code designed to ignore taint could destroy it, for example, thusly
|
|
|
|
$s = '';
|
|
for ($i = 0; $i < strlen($tainted_string); $i++) {
|
|
$s .= chr(ord($tainted_string[$i]));
|
|
}
|
|
|
|
In general, it's hard to prevent malicious code from clearing taint, so the
|
|
utility of tainting is dependent on the lack of a desire in developers to
|
|
undermine security.
|
|
|
|
For more HPHP tainting details, see |3| Taint analysis in HPHP.
|
|
|
|
|
|
/***********************************************
|
|
* |2| How to build HPHP with string tainting. *
|
|
***********************************************/
|
|
|
|
Compiling string tainting into HPHP simply requires that you set a flag in
|
|
your local/$USER.mk file:
|
|
|
|
TAINTED = 1
|
|
|
|
Then from $HPHP_HOME, run
|
|
|
|
make clean
|
|
make -C src -j
|
|
|
|
This will build hphp and hhvm with taint enabled. Running an hhvm server
|
|
or compiling code and running in server mode is done the usual way.
|
|
|
|
|
|
/******************************
|
|
* |3| Taint analysis in HPHP *
|
|
******************************/
|
|
|
|
// 3.1: HPHP string types
|
|
|
|
In addition to the typical char * and std::string representations of strings,
|
|
HPHP has a complex type system which is used both internally and to parallel
|
|
PHP strings. The implementations of these strings are mostly restricted to
|
|
the following files (all pathnames relative to $HPHP_HOME/src):
|
|
|
|
runtime/base/type_string.{h,cpp}
|
|
runtime/base/string_data.{h,cpp}
|
|
runtime/base/util/string_buffer.{h,cpp}
|
|
runtime/base/type_variant.{h,cpp}
|
|
|
|
Unfortunately for implementors of taint semantics, there is no perfect mirror
|
|
for PHP strings. The String class is used to represent all PHP strings, but
|
|
is also used for various other internal purposes; moreover, String is merely
|
|
a wrapper for the StringData class which actually carries the raw char * as
|
|
well as the fundamental string operations, and StringData is used as the
|
|
data source for even more varied sorts of strings, such as StaticString.
|
|
|
|
For simplicity, taint bits are carried by only the StringData and StringBuffer
|
|
classes (the latter of which is a completely separate string implementation
|
|
used to optimize combinatory operations on large numbers of strings). However,
|
|
since these two types are used for every string type in HPHP (sometimes both
|
|
are used in the process of constructing a string), to ensure proper tainting
|
|
semantics, caution must be taken.
|
|
|
|
Taint, however, is a property that is generally associated with variables
|
|
rather than with data. A design more conducive to taint semantics might
|
|
include taint in the variable types (String and Variant) rather than in the
|
|
data types, but the perf impact of this has not been examined. Treating taint
|
|
as a write operation and performing copy-on-taint is another idea, but also
|
|
has not been investigated.
|
|
|
|
Instrumenting taint in StringData proves more fickle. For instance, the
|
|
following PHP code
|
|
|
|
<?php
|
|
$a = 'foo';
|
|
$b = $a;
|
|
fb_set_taint($b, TAINT_ALL);
|
|
echo fb_get_taint('foo', TAINT_ALL);
|
|
|
|
will echo value because String types share StringData until modified. In
|
|
practice, this particular situation is more or less irrelevant to actual
|
|
use of the taint system, but it does highlight one of the design quirks of
|
|
the system.
|
|
|
|
A brief overview of HPHP string representations:
|
|
|
|
<String>: The String class is HPHP's C++ parallel to PHP strings. However,
|
|
it is also used internally and mostly acts as a wrapper for StringData.
|
|
Relatively little work for taint is done (or should be done) from this
|
|
class.
|
|
|
|
<StringData>: The meat of HPHP's strings. The semantics of when taint is
|
|
set, read, and propagated is implemented here, and the taint bits
|
|
themselves are also embedded in these objects. The implementation in
|
|
StringData ensures that tainting happens and happens correctly. For
|
|
cases and string representations that should not involve taint, we
|
|
need to do our work from a higher level, such as the wrapper classes.
|
|
|
|
<StaticString>: Global strings that are only used internally. The entire
|
|
set of StaticStrings is generated before any request begins. Tainting
|
|
is inactive during this period, and none of these strings are user-
|
|
influenced, so we do no work in this class. Though they do wrap
|
|
StringData, these strings will never be tainted because they do not
|
|
mutate after their creation time. Of import is the fact that these
|
|
strings DO NOT represent static or literal PHP strings, which do not
|
|
have their own specialized representation.
|
|
|
|
<StringBuffer>: A completely separate implementation of strings. Like
|
|
StringData, it carries a raw char * and string ops, as well as taint
|
|
bits; however, it is not wrapped. It maintains semantics of taint much
|
|
like StringData.
|
|
|
|
<Variant>: A representation of a variable with unknown or readily changing
|
|
type. Like String, it wraps StringData (although only sometimes) and
|
|
hence involves little to no taint work.
|
|
|
|
// 3.2: The TaintObserver class
|
|
|
|
The TaintObserver class was the result of evaluating several possible
|
|
implementations of string tainting. It is essentially a thread-local,
|
|
singleton object with a notion of scope that absorbs, processes, and
|
|
propagates taint information at the level of string access, creation, and
|
|
mutation.
|
|
|
|
The critical advantage of this approach is that we don't need to convert the
|
|
entire HPHP codebase to be taint-aware. The TaintObserver allows us to keep
|
|
the mechanism of tainting restricted to the StringData class, while allowing
|
|
us to declare different policies and semantics for tainting in as many and
|
|
as specific parts of the codebase as we please. It also allows us to continue
|
|
using raw char * operations in the Zend codebase.
|
|
|
|
The TaintObserver has three core methods:
|
|
|
|
::TaintObserver: The class's constructor. Construction of TaintObserver
|
|
objects is restricted to the stack, and the TaintObservers contain
|
|
an m_previous pointer such that they form their own stack (this is
|
|
a reasonably common C++ paradigm) with an implicit pop operation
|
|
that happens on frame destruction. Meanwhile, instantiating a new
|
|
TaintObserver is an explicit push op, and this TaintObserver will
|
|
become the active instance until either another TaintObserver is
|
|
pushed or until it goes out of scope with the rest of its resident
|
|
frame, at which point the previous TaintObserver takes over. Note
|
|
that the default case is that no TaintObserver is in scope, which
|
|
means that no taints will be created or destroyed. This is surprisingly
|
|
important because of how TaintObservers propagate taint, discussed
|
|
below.
|
|
|
|
The TaintObserver stack is the mechanism that allows us to apply
|
|
specific policies to specific regions of HPHP. When we instantiate
|
|
a TaintObserver, it is often the first one on the stack and serves
|
|
the purposes of initializing taint tracking. For instance, when
|
|
concatenating two strings, we invoke a TaintObserver, which will
|
|
absorb the taints of the strings we concatenate, which will ultimately
|
|
propagate to the new StringData object due to calls made within the
|
|
StringData methods. This TaintObserver will die as soon as we leave
|
|
the scope in which we wanted the tainting to occur, ensuring that
|
|
we don't over-taint.
|
|
|
|
Meanwhile, the TaintObserver also contains set_mask and clear_mask
|
|
fields, which when set with certain bits, will cause those bits to be
|
|
forcibly set or omitted from all mutated or created strings while that
|
|
TaintObserver is active. If neither are set, propagation occurs
|
|
normally.
|
|
|
|
::RegisterAccessed: Called whenever a StringData or StringBuffer's raw
|
|
char * data is accessed. This causes the active TaintObserver to
|
|
absorb that object's taint.
|
|
|
|
It is important that we call this on every access of data which will
|
|
lead to the formation of a new String, which is the case with most
|
|
calls to String::data(). However, for operations like find() or same(),
|
|
we want to avoid tainting because the string will not propagate, and
|
|
hence neither should the string. External compare operations, however,
|
|
should be using the internal find(), same(), and related methods, and
|
|
so only the data() call should be called from outside the String class
|
|
itself.
|
|
|
|
::RegisterMutated: Called whenever a new StringData or StringBuffer
|
|
object is created or whenever an existing object's data is mutated.
|
|
All of the taints that the active TaintObserver has absorbed will
|
|
fall through to the new/mutated string.
|
|
|
|
Note that both Register methods rely on externally declared TaintObservers
|
|
in order to do anything at all. This allows us to restrict when we should
|
|
pay attention to taint (e.g., when following a PHP codepath) and when we
|
|
don't (e.g., internal operations). To do this, we modified idl/base.php
|
|
to allow for customization of taint semantics - from the idl files, we can
|
|
choose whether a function instantiates a TaintObserver, and what the set_mask
|
|
and clear_mask should be.
|
|
|
|
The versatility of the TaintObserver, while allowing us all the above boons,
|
|
comes with a painful price. One such price is that the scoping of the stack
|
|
of TaintObservers is sensitive to errors. If a TaintObserver is accidentally
|
|
instantiated, it will amass all taints of all strings it touches and propagate
|
|
it to all newly created strings. For instance, if a TaintObserver were
|
|
instantiated in call_user_func_array(), after falling through to the new PHP
|
|
codepath, after reading a tainted string, all subsequent strings, including
|
|
literals or even constant names, will become tainted with the read taint.
|
|
The StringData class unquestioningly makes TaintObserver calls, so we must be
|
|
careful about where we do and do not instantiate TaintObservers , which
|
|
will determine whether those StringData calls do any real work. Except when
|
|
we are on the user boundary (i.e., in extension functions or in concats),
|
|
we probably don't want TaintObservers around.
|
|
|
|
The issue of scoping also highlights the other major disadvantage of the
|
|
TaintObserver system, which is that TaintObservers have a fundamentally local
|
|
and restricted understanding of taint. A TaintObserver blindly absorbs all the
|
|
taints of strings that are accessed while it is active and will blindly
|
|
propagate the sum of all those taints to any created or mutated strings.
|
|
This leaves us no options for, say, ignoring the taint of the delimiter
|
|
string passed to an explode().
|
|
|
|
Fortunately, though, both of these issues only ever lead to false positives
|
|
at the time of taint detection, so neither are losses in security. This is
|
|
especially good because tainting is fundamentally a blackslist system; we
|
|
should err heavily on the side of false positives when possible.
|
|
|
|
// 3.3: Existing taints
|
|
|
|
The following taint semantics are implemented.
|
|
|
|
- TAINT_BIT_HTML: This bit represents whether or not a string is safe for
|
|
echoing to a page. Unlike the SQL and SHELL bits, this bit is not unset
|
|
in the PHP builtin sanitization functions (htmlspecialchars() and
|
|
htmlentities()) to allow for external control (e.g in XHP).
|
|
|
|
- TAINT_BIT_MUTATED: This bit allows us to determine dynamically whether
|
|
or not a string is a literal or a concatenation of literals. This is
|
|
done by having the mutation bit set by all extension functions and on
|
|
all strings that come from the user. It serves as a good example of
|
|
a taint which uses the taint mechanism but does not server the typical
|
|
purpose of taints.
|
|
|
|
- TAINT_BIT_SQL: This bit represents whether or not a string is safe for
|
|
MySQL queries. It is unset in mysql_real_escape_string().
|
|
|
|
- TAINT_BIT_SHELL: This bit represents whether or not a string is safe for
|
|
shell execs. It is unset in escapeshellarg() and escapeshellcmd().
|
|
|
|
Currently, strings are marked as tainted whenever they come out of user-
|
|
provided parameters, MySQL queries, or Memcache fetches.
|
|
|
|
// 3.4: Taint tracing
|
|
|
|
The tainting system also includes a mechanism for tracing the propagation of
|
|
taint. Taint already represents a negative impact on perf; performing trace
|
|
with any level of useful information creates an enormous footprint. For this
|
|
reason, the taint tracing system is heavily gated with exposed functions that
|
|
allow it to be used for very targeted debugging purposes.
|
|
|
|
Currently, we support two sorts of traces: a TRACE_HTML, which tracks the
|
|
propagation of HTML tainted strings, and TRACE_SELF, which is a method of
|
|
tracking specific strings as they move across the codebase. Both of these
|
|
traces share the same trace mechanism. We instrument trace by attaching trees
|
|
of stack traces and source strings to any strings under certain conditions.
|
|
These trace trees are aggressively shared between tainted strings and the
|
|
trace tree nodes do not mutate once they are created (we construct the tree
|
|
upwards from the leaves and create new nodes when we pass subtrees from
|
|
tainted strings to the TaintObserver, which collect traces along with
|
|
taints). The strings that represent the actual trace information are
|
|
meanwhile hashed in a RequestLocal set to enable more sharing.
|
|
|
|
For TRACE_HTML, the condition for generating a trace is that the current
|
|
TaintObserver has encountered both an HTML-tainted and an HTML-untainted
|
|
string while in scope. We then drop the TRACE_HTML bit so that our trace
|
|
has first-encounter-with-taint semantics. For TRACE_SELF, we propagate the
|
|
bit and generate a new trace for each time we propagate.
|
|
|
|
Semantics for traces are configurable beyond their base definitions; for
|
|
example, when tracing the HTML bit, we add new traces whenever a source
|
|
string first is exposed to an untainted string, but we can skip tracing or
|
|
delay the trace further upstream by configuring TAINT_OBSERVER() calls.
|
|
Similar functionality could be instrumented for TAINT_SELF but currently is
|
|
not.
|
|
|
|
|
|
/******************************
|
|
* |4| PHP tainting interface *
|
|
******************************/
|
|
|
|
In order to provide extensibility to the HPHP tainting implementation, the
|
|
taint constants, along with methods to get, set, and unset taint, namely
|
|
|
|
bool fb_get_taint(string s, int taint);
|
|
void fb_set_taint(string s, int taint);
|
|
void fb_unset_taint(string s, int taint);
|
|
|
|
are exposed as PHP extensions.
|
|
|
|
Meanwhile, the HTML trace can be activated using
|
|
|
|
void fb_enable_html_taint_trace()
|
|
|
|
A system for checking if suppressed warnings were generated also exists.
|
|
Currently, we simply tally warnings that were suppressed, but we could
|
|
additionally keep other data, such as the outputted string. This is
|
|
accessible as an array indexed by taint, retrieved using
|
|
|
|
Array fb_get_warning_counts()
|