These are only needed by the testsuite. If mydef is unavailable for some
reason, e.g. running autogen.sh from a testsuite-only tarball,
regeneration of the benchmarks will be skipped.
aligned_alloc requires the size to be a multiple of the alignment, so
round up the size parameter. posix_memalign requires the alignment to
be a multiple of sizeof(void *), so add an assertion to check it.
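
A minimal sketch of the two checks described above. The sketch_* helper
names are hypothetical and are not the actual MPL functions; only
aligned_alloc and posix_memalign are real library calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: round size up to the next multiple of alignment. */
static size_t sketch_round_up(size_t size, size_t alignment)
{
    return ((size + alignment - 1) / alignment) * alignment;
}

/* aligned_alloc (C11) wants size to be a multiple of alignment,
 * so round the size up before calling it. */
static void *sketch_aligned_alloc(size_t alignment, size_t size)
{
    return aligned_alloc(alignment, sketch_round_up(size, alignment));
}

/* posix_memalign wants alignment to be a multiple of sizeof(void *),
 * so assert on it before calling. */
static void *sketch_posix_memalign(size_t alignment, size_t size)
{
    void *ptr = NULL;
    assert(alignment % sizeof(void *) == 0);
    if (posix_memalign(&ptr, alignment, size) != 0)
        return NULL;
    return ptr;
}
```
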
Increase MPIR_CVAR_GPU_FAST_COPY_MAX_SIZE due to the improvement
of fast GPU copy by using stream load/store. For D2H, which is
used in the send path when packing from the send buffer to pack_buffer,
the threshold can be set higher to benefit from fast GPU copy.
We have multiple progress loops in various places. They can use the same
DEBUG_PROGRESS_ macros rather than redefining them individually.
Also, add DEBUG_PROGRESS to the posix release_gather progress loop.
Annotate sender requests as well since they may get stuck too due to
lack of receiver progress. Adding the info helps to trace out a more
complete messaging map.
Since MPIDIU_PROGRESS_START declares a local variable, we need to
protect it with an additional scope in case the macro is used multiple
times within a function.
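
To illustrate why the extra scope matters, here is a small sketch with a
hypothetical macro that declares a local variable, loosely in the spirit
of MPIDIU_PROGRESS_START. The SKETCH_* names are made up for this
example and are not the real macros.

```c
#include <assert.h>

/* Hypothetical macro: declares a local counter, like a START macro might. */
#define SKETCH_PROGRESS_START int sketch_progress_count = 0
#define SKETCH_PROGRESS_TICK  (++sketch_progress_count)

static int sketch_spin_twice(int first, int second)
{
    int total = 0;
    {   /* extra scope: the macro's variable only lives inside these braces */
        SKETCH_PROGRESS_START;
        while (SKETCH_PROGRESS_TICK < first) { }
        total += sketch_progress_count;
    }
    {   /* without the braces this second use would redeclare the variable */
        SKETCH_PROGRESS_START;
        while (SKETCH_PROGRESS_TICK < second) { }
        total += sketch_progress_count;
    }
    return total;
}
```

Dropping either pair of braces makes the second SKETCH_PROGRESS_START a
redeclaration error in the same function scope.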
For small messages, IPC adds synchronization on the receiver side,
which is not desirable. For example, applications may assume small
messages will be sent eagerly and contain code that can deadlock if
that eager assumption does not hold.
NOTE: due to amortization of IPC memory registration cost, the IPC path
may show a performance benefit in micro-benchmarks even for tiny
messages. In practice, small buffers are often allocated on the stack
and thus are less likely to benefit from IPC.
The PMIx API has more explicit nodemap construction functions we can
use. The decision whether or not to query the MPICH-style process
mapping string should come from the generic MPIR_pmi layer, not the PMIx
glue code. Instead, we translate the MPICH key "PMI_process_mapping" to
the PMIx format in pmix_get_jobattr.
1. Try to get process mapping string from PMI server. This is how Hydra
provides the information and is thus the preferred method for MPICH.
2. If using PMIx, use discovery functions. This is the preferred method
when using 3rd party PMIx libraries.
3. Use fallback method, i.e. putting and getting node,rank pairs via the
PMI KVS.
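
The ordering above can be sketched as a simple fallback chain. All
sketch_* names and the function signature are hypothetical stand-ins for
illustration; the first two methods are stubbed to fail so the fallback
is exercised.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical signature: fill nodemap[0..size-1], return 0 on success. */
typedef int (*sketch_nodemap_fn)(int *nodemap, int size);

/* 1. Process mapping string from the PMI server (stub: not provided). */
static int sketch_from_mapping_string(int *nodemap, int size)
{
    (void) nodemap;
    (void) size;
    return -1;
}

/* 2. PMIx discovery functions (stub: not available). */
static int sketch_from_pmix_discovery(int *nodemap, int size)
{
    (void) nodemap;
    (void) size;
    return -1;
}

/* 3. Fallback via KVS exchange (stub: pretend two ranks per node). */
static int sketch_from_kvs_exchange(int *nodemap, int size)
{
    for (int rank = 0; rank < size; rank++)
        nodemap[rank] = rank / 2;
    return 0;
}

/* Try each method in order; return the index of the one that worked. */
static int sketch_build_nodemap(int *nodemap, int size)
{
    static const sketch_nodemap_fn methods[] = {
        sketch_from_mapping_string,
        sketch_from_pmix_discovery,
        sketch_from_kvs_exchange,
    };
    for (size_t i = 0; i < sizeof(methods) / sizeof(methods[0]); i++) {
        if (methods[i](nodemap, size) == 0)
            return (int) i;
    }
    return -1;
}
```
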
Define macro PMI_FROM_3RD_PARTY if linked with a 3rd party pmi
library such as Cray, Slurm, and openpmix. They often run into issues
when queried with nonexistent job attribute keys such as
PMI_process_mapping, PMI_hwloc_pmi, etc.
We recently added a few patches to work around openpmix issues with
nonexistent job attributes.
Cray PMI used to work, but we encountered a hang with PALS v1.3.4.
This commit skips querying these keys as job attributes to bypass these
potential issues. They are often not supported by 3rd party pmi anyway.
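
A rough sketch of the idea, assuming the macro is checked at compile
time. The helper sketch_should_query_jobattr is hypothetical, and the
macro is defined inline here only so the example is self-contained; in a
real build it would come from configure.

```c
#include <assert.h>
#include <string.h>

/* Assumption for this sketch: normally set by the build system. */
#define PMI_FROM_3RD_PARTY 1

/* Hypothetical helper: return 1 if key may be queried as a job
 * attribute, 0 if the query should be skipped. */
static int sketch_should_query_jobattr(const char *key)
{
#ifdef PMI_FROM_3RD_PARTY
    /* 3rd party PMI libraries may misbehave on keys they don't know. */
    if (strcmp(key, "PMI_process_mapping") == 0 ||
        strcmp(key, "PMI_hwloc_pmi") == 0) {
        return 0;
    }
#endif
    (void) key;
    return 1;
}
```
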
The ZE H2D copy, such as that used in the recv path when unpacking from
the received pack_buffer to the original device buffer, can benefit from
a larger threshold due to the higher cost of using the yaksa kernel.
The sender and receiver don't have to agree on the datatypes, so
restricting the sender to contiguous datatypes can easily result in
deadlock.
Removing the contig condition seems to work. It now passes the test
pt2pt/sendrecv1 2 arg=-sendmem=device arg=-recvmem=device
env=MPIR_CVAR_CH4_OFI_ENABLE_GPU_PIPELINE=1.
I believe this is an optimization similar to MPL_gpu_fast_memcpy.
We are seeing memory corruption issues, sometimes resulting in a free
of an invalid pointer in finalize and sometimes in a hang. Removing
this optimization resolves those issues.
Since later in ofi_events.h we use MPIR_Localcopy_gpu, which will
perform MPL_gpu_fast_memcpy optimization for small contig data, I don't
think the mmap here is necessary. In any case, the mmap should be
performed at the time of copy and preferably inside MPIR_Localcopy_gpu.
Always use MPL_gpu_attr_is_{dev,strict_dev} and the MPIR_GPU_
equivalents. The subtleties of non-strict device buffers (e.g. ZE) are
not obvious. Using the attr query wrappers makes the semantics explicit.
* Rename MPL_gpu_query_pointer_is_{dev,strict_dev} to
MPL_gpu_attr_is_{dev,strict_dev} and remove the pointer argument.
Such queries should always be two-step calls -- first call
MPIR_GPU_query_pointer_attr, then call MPL_gpu_attr_is_{dev,strict_dev}.
MPIR_CVAR_GPU_ENABLE is reflected in MPIR_GPU_query_pointer_attr.
* Remove the attr argument in MPIR_GPU_query_pointer_is{dev,strict_dev},
thus making both pure pointer queries.
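
The two-step query pattern can be sketched as below. Everything here is
a hypothetical stand-in: the real MPL/MPIR types and signatures may
differ, and the split between "dev" (anything not plain host) and
"strict_dev" (device-resident only) is an assumption made for
illustration. The reported argument stands in for what a GPU runtime
would report for a pointer.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    SKETCH_MEM_HOST,
    SKETCH_MEM_SHARED,      /* e.g. a non-strict device buffer */
    SKETCH_MEM_DEVICE
} sketch_mem_type;

typedef struct {
    sketch_mem_type type;
} sketch_pointer_attr;

/* Step 1: query the attribute. If GPU support is disabled at runtime,
 * every pointer is treated as host memory. */
static void sketch_query_pointer_attr(sketch_mem_type reported,
                                      bool gpu_enabled,
                                      sketch_pointer_attr *attr)
{
    attr->type = gpu_enabled ? reported : SKETCH_MEM_HOST;
}

/* Step 2: interpret the attribute; no pointer argument is needed. */
static bool sketch_attr_is_dev(const sketch_pointer_attr *attr)
{
    return attr->type != SKETCH_MEM_HOST;
}

static bool sketch_attr_is_strict_dev(const sketch_pointer_attr *attr)
{
    return attr->type == SKETCH_MEM_DEVICE;
}
```
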
Add device memory support using mtest_common utilities. This adds a
dependency on the utility libraries, which the makefile already
imports.
However, this removes the simplicity of building a single source file
with mpicc or mydef_run. If one doesn't need to test device memory,
one can simply comment out "$include macros/mtest.def" to restore the
simplicity.
This check does not capture output (so test results will show up in
the console log) and only checks the exit code: zero means success and
nonzero means failure.
We'll use this check for benchmark tests.
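
As an illustration of an exit-code-only check, here is a minimal POSIX
sketch. The helper name sketch_check_exit_code is hypothetical and this
is not the test runner's actual implementation.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/wait.h>

/* Run cmd through the shell without capturing its output.
 * Returns 1 if it exited with code zero, 0 otherwise. */
static int sketch_check_exit_code(const char *cmd)
{
    int status = system(cmd);
    if (status == -1 || !WIFEXITED(status))
        return 0;       /* could not run, or killed by a signal */
    return WEXITSTATUS(status) == 0;
}
```
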
We could add rules to work with mydef code directly in the Makefile, but
converting the code in autogen removes the mydef dependency.
Also fix a spelling error.
Add point-to-point benchmark code in MyDef. The tests have automatic
warm-ups and adjust the number of iterations for measurement accuracy.
They produce latency measurements with standard deviations and
equivalent bandwidths.
    MYDEF_BOOT=[topsrc_dir]/modules/mydef_boot
    export PATH=$MYDEF_BOOT/bin:$PATH
    export PERL5LIB=$MYDEF_BOOT/lib/perl5
    export MYDEFLIB=$MYDEF_BOOT/lib/MyDef
To run:
    mydef_page p2p_latency.def # -> p2p_latency.c
    mpicc p2p_latency.c && mpi_run -n 2 ./a.out
Alternatively use mydef_run (uses settings from config):
    mydef_run p2p_latency.def
Next commit will add "make testing".
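
For reference, the kind of summary the latency benchmark reports can be
sketched as follows. This is a generic illustration, not the generated
benchmark code; the sketch_* names are hypothetical. It reports the
sample variance, and the standard deviation is its square root (sqrt()
from <math.h>).

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double mean_us;         /* mean latency in microseconds */
    double variance_us2;    /* sample variance; stddev is its sqrt */
    double bandwidth_MBps;  /* equivalent bandwidth, 10^6 bytes/sec */
} sketch_stats;

static sketch_stats sketch_summarize(const double *lat_us, size_t n,
                                     size_t msg_bytes)
{
    sketch_stats s = { 0.0, 0.0, 0.0 };
    double sum = 0.0, sq = 0.0;

    if (n == 0)
        return s;
    for (size_t i = 0; i < n; i++)
        sum += lat_us[i];
    s.mean_us = sum / (double) n;

    for (size_t i = 0; i < n; i++) {
        double d = lat_us[i] - s.mean_us;
        sq += d * d;
    }
    if (n > 1)
        s.variance_us2 = sq / (double) (n - 1);

    /* bytes per microsecond equals 10^6 bytes per second */
    if (s.mean_us > 0.0)
        s.bandwidth_MBps = (double) msg_bytes / s.mean_us;
    return s;
}
```
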
Since some launchers hold console output, this commit makes the process
abort on timeout to make debugging progress hangs a bit easier. We
delay the abort after first dumping the stack backtrace so that other
processes can also dump their progress backtraces before they are
killed.
There is no need to hold the datatype unless it is needed for
* recv unpacking, which is held in
MPIDI_OFI_REQUEST(req, noncontig.pack.datatype).
* recv iov type matching check, which is held in
MPIDI_OFI_REQUEST(req, noncontig.nopack.datatype).
* Use a local iovs variable to avoid repeatedly using
MPIDI_OFI_REQUEST(sreq, noncontig.nopack).
* Declare local variables where they are first used.
* It shouldn't be necessary to zero the iovs array.