   +----------------------------------------------------------------------+
   | HipHop for PHP                                                       |
   +----------------------------------------------------------------------+
   | Copyright (c) 2010 Facebook, Inc. (http://www.facebook.com)          |
   +----------------------------------------------------------------------+
   | This source file is subject to version 3.01 of the PHP license,      |
   | that is bundled with this package in the file LICENSE, and is        |
   | available through the world-wide-web at the following url:           |
   | http://www.php.net/license/3_01.txt                                  |
   | If you did not receive a copy of the PHP license and are unable to   |
   | obtain it through the world-wide-web, please send a note to          |
   | license@php.net so we can mail you a copy immediately.               |
   +----------------------------------------------------------------------+


/******************************************
 * |1| What is tainting / taint analysis? *
 ******************************************/

Tainting (wikipedia:Taint_checking) is a mechanism used for detecting data
that poses a security threat (i.e., any user-controlled data), tracking this
data across all codepaths, and ultimately preventing or flagging dangerous
uses (e.g., echos, queries, shell commands). Tainting is generally used to
preempt XSS injection attacks, either by failing to execute things like curl,
eval, or exec when run on strings controlled by the user or, more
realistically for an imperfect codebase and with more ubiquitous calls like
echo, by providing valuable debugging information dynamically. However, the
mechanism of tainting, which simply involves attaching small amounts of
metadata to target variables and propagating this metadata through code
execution, can be extended to other similar purposes.

In HPHP (with taint analysis enabled), only PHP strings are given taints and
thereafter tracked. This "taint" comes in the form of a small sequence of
bit flags, which can be multipurposed and used for various semantics.

Note that only tainting strings comes with some disadvantages. In particular,
malicious code designed to ignore taint could destroy it, for example, thusly

  $s = '';
  for ($i = 0; $i < strlen($tainted_string); $i++) {
    $s .= chr(ord($tainted_string[$i]));
  }

In general, it's hard to prevent malicious code from clearing taint, so the
utility of tainting is dependent on the lack of a desire in developers to
undermine security.

For more HPHP tainting details, see |3| Taint analysis in HPHP.


/***********************************************
 * |2| How to build HPHP with string tainting. *
 ***********************************************/

Compiling string tainting into HPHP simply requires that you set a flag in
your local/$USER.mk file:

  TAINTED = 1

Then from $HPHP_HOME, run

  make clean
  make -C src -j

This will build hphp and hhvm with taint enabled. Running an hhvm server
or compiling code and running in server mode is done the usual way.


/******************************
 * |3| Taint analysis in HPHP *
 ******************************/

// 3.1: HPHP string types

In addition to the typical char * and std::string representations of strings,
HPHP has a complex type system which is used both internally and to parallel
PHP strings. The implementations of these strings are mostly restricted to
the following files (all pathnames relative to $HPHP_HOME/src):

  runtime/base/type_string.{h,cpp}
  runtime/base/string_data.{h,cpp}
  runtime/base/util/string_buffer.{h,cpp}
  runtime/base/type_variant.{h,cpp}

Unfortunately for implementors of taint semantics, there is no perfect mirror
for PHP strings. The String class is used to represent all PHP strings, but
is also used for various other internal purposes; moreover, String is merely
a wrapper for the StringData class which actually carries the raw char * as
well as the fundamental string operations, and StringData is used as the
data source for even more varied sorts of strings, such as StaticString.

For simplicity, taint bits are carried by only the StringData and StringBuffer
classes (the latter of which is a completely separate string implementation
used to optimize combinatory operations on large numbers of strings). However,
since these two types are used for every string type in HPHP (sometimes both
are used in the process of constructing a string), to ensure proper tainting
semantics, caution must be taken.

Taint, however, is a property that is generally associated with variables
rather than with data. A design more conducive to taint semantics might
include taint in the variable types (String and Variant) rather than in the
data types, but the perf impact of this has not been examined. Treating taint
as a write operation and performing copy-on-taint is another idea, but also
has not been investigated.

Instrumenting taint in StringData proves more fickle. For instance, the
following PHP code

  <?php
  $a = 'foo';
  $b = $a;
  fb_set_taint($b, TAINT_ALL);
  echo fb_get_taint('foo', TAINT_ALL);

will echo value because String types share StringData until modified. In
practice, this particular situation is more or less irrelevant to actual
use of the taint system, but it does highlight one of the design quirks of
the system.

A brief overview of HPHP string representations:

  <String>: The String class is HPHP's C++ parallel to PHP strings. However,
      it is also used internally and mostly acts as a wrapper for StringData.
      Relatively little work for taint is done (or should be done) from this
      class.

  <StringData>: The meat of HPHP's strings. The semantics of when taint is
      set, read, and propagated is implemented here, and the taint bits
      themselves are also embedded in these objects. The implementation in
      StringData ensures that tainting happens and happens correctly. For
      cases and string representations that should not involve taint, we
      need to do our work from a higher level, such as the wrapper classes.

  <StaticString>: Global strings that are only used internally. The entire
      set of StaticStrings is generated before any request begins. Tainting
      is inactive during this period, and none of these strings are user-
      influenced, so we do no work in this class. Though they do wrap
      StringData, these strings will never be tainted because they do not
      mutate after their creation time. Of import is the fact that these
      strings DO NOT represent static or literal PHP strings, which do not
      have their own specialized representation.

  <StringBuffer>: A completely separate implementation of strings. Like
      StringData, it carries a raw char * and string ops, as well as taint
      bits; however, it is not wrapped. It maintains semantics of taint much
      like StringData.

  <Variant>: A representation of a variable with unknown or readily changing
      type. Like String, it wraps StringData (although only sometimes) and
      hence involves little to no taint work.

// 3.2: The TaintObserver class

The TaintObserver class was the result of evaluating several possible
implementations of string tainting. It is essentially a thread-local,
singleton object with a notion of scope that absorbs, processes, and
propagates taint information at the level of string access, creation, and
mutation.

The critical advantage of this approach is that we don't need to convert the
entire HPHP codebase to be taint-aware. The TaintObserver allows us to keep
the mechanism of tainting restricted to the StringData class, while allowing
us to declare different policies and semantics for tainting in as many and
as specific parts of the codebase as we please. It also allows us to continue
using raw char * operations in the Zend codebase.

The TaintObserver has three core methods:

  ::TaintObserver: The class's constructor. Construction of TaintObserver
      objects is restricted to the stack, and the TaintObservers contain
      an m_previous pointer such that they form their own stack (this is
      a reasonably common C++ paradigm) with an implicit pop operation
      that happens on frame destruction. Meanwhile, instantiating a new
      TaintObserver is an explicit push op, and this TaintObserver will
      become the active instance until either another TaintObserver is
      pushed or until it goes out of scope with the rest of its resident
      frame, at which point the previous TaintObserver takes over. Note
      that the default case is that no TaintObserver is in scope, which
      means that no taints will be created or destroyed. This is surprisingly
      important because of how TaintObservers propagate taint, discussed
      below.

      The TaintObserver stack is the mechanism that allows us to apply
      specific policies to specific regions of HPHP. When we instantiate
      a TaintObserver, it is often the first one on the stack and serves
      the purposes of initializing taint tracking. For instance, when
      concatenating two strings, we invoke a TaintObserver, which will
      absorb the taints of the strings we concatenate, which will ultimately
      propagate to the new StringData object due to calls made within the
      StringData methods. This TaintObserver will die as soon as we leave
      the scope in which we wanted the tainting to occur, ensuring that
      we don't over-taint.

      Meanwhile, the TaintObserver also contains set_mask and clear_mask
      fields, which when set with certain bits, will cause those bits to be
      forcibly set or omitted from all mutated or created strings while that
      TaintObserver is active. If neither are set, propagation occurs
      normally.

  ::RegisterAccessed: Called whenever a StringData or StringBuffer's raw
      char * data is accessed. This causes the active TaintObserver to
      absorb that object's taint.

      It is important that we call this on every access of data which will
      lead to the formation of a new String, which is the case with most
      calls to String::data(). However, for operations like find() or same(),
      we want to avoid tainting because the string will not propagate, and
      hence neither should the string. External compare operations, however,
      should be using the internal find(), same(), and related methods, and
      so only the data() call should be called from outside the String class
      itself.

  ::RegisterMutated: Called whenever a new StringData or StringBuffer
      object is created or whenever an existing object's data is mutated.
      All of the taints that the active TaintObserver has absorbed will
      fall through to the new/mutated string.

Note that both Register methods rely on externally declared TaintObservers
in order to do anything at all. This allows us to restrict when we should
pay attention to taint (e.g., when following a PHP codepath) and when we
don't (e.g., internal operations). To do this, we modified idl/base.php
to allow for customization of taint semantics - from the idl files, we can
choose whether a function instantiates a TaintObserver, and what the set_mask
and clear_mask should be.

The versatility of the TaintObserver, while allowing us all the above boons,
comes with a painful price. One such price is that the scoping of the stack
of TaintObservers is sensitive to errors. If a TaintObserver is accidentally
instantiated, it will amass all taints of all strings it touches and propagate
it to all newly created strings. For instance, if a TaintObserver were
instantiated in call_user_func_array(), after falling through to the new PHP
codepath, after reading a tainted string, all subsequent strings, including
literals or even constant names, will become tainted with the read taint.
The StringData class unquestioningly makes TaintObserver calls, so we must be
careful about where we do and do not instantiate TaintObservers , which
will determine whether those StringData calls do any real work. Except when
we are on the user boundary (i.e., in extension functions or in concats),
we probably don't want TaintObservers around.

The issue of scoping also highlights the other major disadvantage of the
TaintObserver system, which is that TaintObservers have a fundamentally local
and restricted understanding of taint. A TaintObserver blindly absorbs all the
taints of strings that are accessed while it is active and will blindly
propagate the sum of all those taints to any created or mutated strings.
This leaves us no options for, say, ignoring the taint of the delimiter
string passed to an explode().

Fortunately, though, both of these issues only ever lead to false positives
at the time of taint detection, so neither are losses in security. This is
especially good because tainting is fundamentally a blackslist system; we
should err heavily on the side of false positives when possible.

// 3.3: Existing taints

The following taint semantics are implemented.

  - TAINT_BIT_HTML: This bit represents whether or not a string is safe for
      echoing to a page. Unlike the SQL and SHELL bits, this bit is not unset
      in the PHP builtin sanitization functions (htmlspecialchars() and
      htmlentities()) to allow for external control (e.g in XHP).

  - TAINT_BIT_MUTATED: This bit allows us to determine dynamically whether
      or not a string is a literal or a concatenation of literals. This is
      done by having the mutation bit set by all extension functions and on
      all strings that come from the user. It serves as a good example of
      a taint which uses the taint mechanism but does not server the typical
      purpose of taints.

  - TAINT_BIT_SQL: This bit represents whether or not a string is safe for
      MySQL queries. It is unset in mysql_real_escape_string().

  - TAINT_BIT_SHELL: This bit represents whether or not a string is safe for
      shell execs. It is unset in escapeshellarg() and escapeshellcmd().

Currently, strings are marked as tainted whenever they come out of user-
provided parameters, MySQL queries, or Memcache fetches.

// 3.4: Taint tracing

The tainting system also includes a mechanism for tracing the propagation of
taint. Taint already represents a negative impact on perf; performing trace
with any level of useful information creates an enormous footprint. For this
reason, the taint tracing system is heavily gated with exposed functions that
allow it to be used for very targeted debugging purposes.

Currently, we support two sorts of traces: a TRACE_HTML, which tracks the
propagation of HTML tainted strings, and TRACE_SELF, which is a method of
tracking specific strings as they move across the codebase. Both of these
traces share the same trace mechanism. We instrument trace by attaching trees
of stack traces and source strings to any strings under certain conditions.
These trace trees are aggressively shared between tainted strings and the
trace tree nodes do not mutate once they are created (we construct the tree
upwards from the leaves and create new nodes when we pass subtrees from
tainted strings to the TaintObserver, which collect traces along with
taints).  The strings that represent the actual trace information are
meanwhile hashed in a RequestLocal set to enable more sharing.

For TRACE_HTML, the condition for generating a trace is that the current
TaintObserver has encountered both an HTML-tainted and an HTML-untainted
string while in scope. We then drop the TRACE_HTML bit so that our trace
has first-encounter-with-taint semantics. For TRACE_SELF, we propagate the
bit and generate a new trace for each time we propagate.

Semantics for traces are configurable beyond their base definitions; for
example, when tracing the HTML bit, we add new traces whenever a source
string first is exposed to an untainted string, but we can skip tracing or
delay the trace further upstream by configuring TAINT_OBSERVER() calls.
Similar functionality could be instrumented for TAINT_SELF but currently is
not.


/******************************
 * |4| PHP tainting interface *
 ******************************/

In order to provide extensibility to the HPHP tainting implementation, the
taint constants, along with methods to get, set, and unset taint, namely

  bool fb_get_taint(string s, int taint);
  void fb_set_taint(string s, int taint);
  void fb_unset_taint(string s, int taint);

are exposed as PHP extensions.

Meanwhile, the HTML trace can be activated using

  void fb_enable_html_taint_trace()

A system for checking if suppressed warnings were generated also exists.
Currently, we simply tally warnings that were suppressed, but we could
additionally keep other data, such as the outputted string. This is
accessible as an array indexed by taint, retrieved using

  Array fb_get_warning_counts()
