20090 Commits

Author SHA1 Message Date
Hui Zhou 551708826d Merge pull request #7174 from hzhou/2410_posix_prog
ch4/posix: fix made_progress in MPIDI_POSIX_progress_send

Approved-by: Ken Raffenetti
2024-10-16 13:09:54 -05:00
Hui Zhou 9afd5cc3b1 ch4/posix: fix made_progress in MPIDI_POSIX_progress_send
Only update made_progress when sending the deferred operation is
successful.
2024-10-16 13:05:43 -05:00
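As a rough illustration of the fix described in commit 9afd5cc3b1 above, the following C sketch sets a progress flag only when a deferred send actually succeeds. All names here (sketch_progress_send, try_send_fn, and the stub senders) are hypothetical and are not taken from the MPICH POSIX code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: returns true if the deferred operation was sent. */
typedef bool (*try_send_fn)(void *op);

/* Stub senders, for illustration only. */
static bool send_ok(void *op)
{
    (void) op;
    return true;
}

static bool send_busy(void *op)
{
    (void) op;
    return false;
}

/* Attempt the deferred send; flag progress only if the attempt succeeded. */
static void sketch_progress_send(void *deferred_op, try_send_fn try_send,
                                 int *made_progress)
{
    assert(try_send != NULL && made_progress != NULL);
    if (deferred_op == NULL)
        return;                 /* nothing deferred, nothing to report */
    if (try_send(deferred_op))
        *made_progress = 1;     /* left untouched when the send fails */
}
```

With this shape, a caller that polls in a loop sees made_progress stay 0 for as long as the deferred send keeps failing, so it does not mistake a retry for forward progress.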
Hui Zhou 3ce0e24b85 Merge pull request #7167 from raffenet/move-mydef
build: Move benchmark generation to test/mpi/autogen.sh

Approved-by: Hui Zhou
2024-10-16 11:01:46 -05:00
Ken Raffenetti ec4cc89f5e build: Move benchmark generation to autogen.sh in testsuite
These are only needed by the testsuite. If mydef is unavailable for some
reason, e.g. running autogen.sh from a testsuite-only tarball,
regeneration of the benchmarks will be skipped.
2024-10-16 11:01:24 -05:00
Yanfei Guo c88cc6313f Merge pull request #7168 from zhenggb72/fast-avx
gpu/ze: use stream load/store for GPU fast copy
2024-10-15 13:20:43 -05:00
Yanfei Guo 341542db7d mpl: fix parameter requirements to MPM_aligned_alloc
aligned_alloc requires the size to be a multiple of the alignment, so
we round up the size parameter. posix_memalign requires the alignment
to be a multiple of sizeof(void *), so we add an assertion to check
that.
2024-10-15 10:26:53 -05:00
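The two requirements named in commit 341542db7d above can be sketched in C as follows. This is a minimal sketch, not the MPL implementation: round_up, is_aligned, and sketch_aligned_alloc are hypothetical helper names, and the sketch additionally assumes the alignment is a power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Round size up to the next multiple of alignment.
 * Assumes alignment is a nonzero power of two. */
static size_t round_up(size_t size, size_t alignment)
{
    return (size + alignment - 1) & ~(alignment - 1);
}

/* Check whether a pointer satisfies the given alignment. */
static int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t) p % alignment) == 0;
}

/* Hypothetical wrapper around C11 aligned_alloc. */
static void *sketch_aligned_alloc(size_t alignment, size_t size)
{
    /* Assumption of this sketch: nonzero power-of-two alignment. */
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    /* posix_memalign-style requirement from the commit message. */
    assert(alignment % sizeof(void *) == 0);
    /* aligned_alloc wants size to be a multiple of alignment. */
    return aligned_alloc(alignment, round_up(size, alignment));
}
```

For example, a 100-byte request with 64-byte alignment is rounded up to 128 bytes before the allocation call, and the returned pointer is released with free as usual.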
Gengbin Zheng 05883b6a6c misc: add a CVAR for GPU fast copy threshold for D2H copy direction
Increase MPIR_CVAR_GPU_FAST_COPY_MAX_SIZE due to the improvement
of fast GPU copy by using stream load/store. For D2H, which is
used in send path when packing from send buffer to pack_buffer,
the threshold can be set higher to benefit from fast GPU copy.
2024-10-15 10:26:53 -05:00
Gengbin Zheng 4cc88e3f60 ch4/ofi: allocate pack buffers with alignment
allocate pack buffers with alignment to enable stream instructions
for GPU fast copy.
2024-10-15 10:26:53 -05:00
Gengbin Zheng 34cc352716 gpu/ze: use stream load/store for GPU fast copy when buffers are aligned 2024-10-15 10:26:53 -05:00
Hui Zhou 64788e67f6 Merge pull request #7173 from hzhou/2410_prog_debug
misc: enhance progress debugging

Approved-by: Ken Raffenetti
2024-10-14 21:19:42 -05:00
Hui Zhou 1e1f5a52a1 misc: refactor DEBUG_PROGRESS macros
We have multiple progress loops in various places. They can use the same
DEBUG_PROGRESS_ macros rather than redefine individually.

Also, add DEBUG_PROGRESS to posix release_gather progress loop.
2024-10-14 21:06:17 -05:00
Hui Zhou bf644f3b76 ch4: add more request infos for progress debugging
Annotate sender requests as well since they may get stuck too due to
lack of receiver progress. Adding the info helps to trace out a more
complete messaging map.
2024-10-14 21:06:17 -05:00
Hui Zhou e761b5edce ch4: fix progress debugging in MPIDIU_PROGRESS_WHILE
Since MPIDIU_PROGRESS_START declares a local variable, we need to
protect it with an additional scope in case the macro is used multiple
times within a function.
2024-10-14 21:06:17 -05:00
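The scoping problem described in commit e761b5edce above is a general C macro issue, and can be sketched as follows. The SKETCH_ macros are hypothetical stand-ins and do not reproduce the real MPIDIU_PROGRESS_ macros; they only show why wrapping the expansion in its own block lets the macro be used twice in one function.

```c
#include <assert.h>

/* Hypothetical "start" macro that declares a local variable. */
#define SKETCH_PROGRESS_START int sketch_iter = 0

/* The do { ... } while (0) wrapper gives every expansion its own scope,
 * so the declaration above cannot clash with an earlier expansion in
 * the same function. The iteration count is added to `counter`. */
#define SKETCH_PROGRESS_WHILE(cond, counter) \
    do { \
        SKETCH_PROGRESS_START; \
        while (cond) { \
            sketch_iter++; \
        } \
        (counter) += sketch_iter; \
    } while (0)

/* Uses the macro twice in one function; without the extra scope the
 * second expansion would redeclare sketch_iter and fail to compile. */
static int drain_twice(int a, int b)
{
    int total = 0;
    assert(a >= 0 && b >= 0);
    SKETCH_PROGRESS_WHILE(a-- > 0, total);
    SKETCH_PROGRESS_WHILE(b-- > 0, total);
    return total;
}
```

Calling drain_twice(3, 2) runs the first loop three times and the second twice, returning 5.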
Hui Zhou 77011db544 Merge pull request #7172 from hzhou/2410_xpmem
ch4/xpmem: check MPIR_CVAR_CH4_IPC_XPMEM_P2P_THRESHOLD

Approved-by: Ken Raffenetti
2024-10-14 21:04:50 -05:00
Hui Zhou dd8bd5b6c5 ch4/xpmem: enhance error message for xpmem_attach
Add remote address and data size to the error message when xpmem_attach
fails. This may be useful to reveal some trivial errors.
2024-10-14 10:16:16 -05:00
Hui Zhou 84f577c529 ch4/xpmem: check MPIR_CVAR_CH4_IPC_XPMEM_P2P_THRESHOLD
For small messages, IPC adds the synchronization from receiver side,
which is not desirable. For example, applications may assume small
messages will be sent eagerly and have code with potential dead-lock
issue if that eager-assumption is not true.

NOTE: due to amortization of IPC memory registration cost, IPC path may
show performance benefit in micro-benchmarks even for tiny messages.
In practice, small buffers are often allocated from stack, thus are
less-applicable for the IPC benefit.
2024-10-14 10:08:00 -05:00
Ken Raffenetti 824b5b37c8 Merge pull request #7161 from hzhou/2410_3rd_pmi
mpir/pmi: protect 3rd-party pmi from job attributes

Approved-by: Ken Raffenetti <raffenet@mcs.anl.gov>
Approved-by: Hui Zhou <hzhou321@anl.gov>
2024-10-09 15:56:52 -05:00
Ken Raffenetti 7db04e31bc mpir/pmix: Remove process mapping query in pmix_build_nodemap
The PMIx API has more explicit nodemap construction functions we can
use. The decision whether or not to query the MPICH-style process
mapping string should come from the generic MPIR_pmi layer, not the PMIx
glue code. Instead, we translate the MPICH key "PMI_process_mapping" to
the PMIx format in pmix_get_jobattr.
2024-10-09 14:50:11 -05:00
Ken Raffenetti c25a2c7864 mpir/pmi: Reorder options for building nodemap
1. Try to get process mapping string from PMI server. This is how Hydra
   provides the information and is thus the preferred method for MPICH.
2. If using PMIx, use discovery functions. This is the preferred method
   when using 3rd party PMIx libraries.
3. Use fallback method, i.e. putting and getting node,rank pairs via the
   PMI KVS.
2024-10-09 14:46:43 -05:00
Hui Zhou 7a5f0f9d39 mpir/pmi: protect 3rd-party pmi from job attributes
Define macro PMI_FROM_3RD_PARTY if linked with a 3rd party pmi
library such as Cray, Slurm, and openpmix. They often run into issues
when queried with nonexistent job attribute keys such as
PMI_process_mapping, PMI_hwloc_pmi, etc.

We have had a few patches recently to work around openpmix with
nonexistent job attributes.

Cray PMI used to work, but we encountered a hang with PALS v1.3.4.

This commit skips querying these keys as job attributes to bypass these
potential issues. They are often not supported by 3rd party pmi anyway.
2024-10-09 14:46:43 -05:00
Ken Raffenetti a1e7477f08 Merge pull request #7166 from raffenet/shm-pmi-bcast
mpid/shm: Add error checking for MPIR_pmi_bcast

Approved-by: Hui Zhou <hzhou321@anl.gov>
2024-10-09 10:35:22 -05:00
Ken Raffenetti f93488e7f9 mpid/shm: Add error checking for MPIR_pmi_bcast
If this call fails we need to report the error and exit. Otherwise we
will crash later and it will be hard to diagnose.
2024-10-09 10:35:01 -05:00
Ken Raffenetti 27229e0895 Merge pull request #7165 from giordano/mg/user-path-null
hydra: initialise `test_loc` and `user_path` to NULL

Approved-by: Ken Raffenetti <raffenet@mcs.anl.gov>
2024-10-08 15:24:00 -05:00
Mosè Giordano a38c6576da hydra: initialise test_loc and user_path to NULL 2024-10-08 10:30:41 +01:00
Hui Zhou cd026027b3 Merge pull request #7154 from hzhou/2410_gpu_fix
misc: fix issues revealed from ZE testing

Approved-by: Ken Raffenetti
Approved-by: Alex Brooks
2024-10-04 14:10:22 -05:00
Hui Zhou a5b3647233 misc: adapt MPL_gpu_fast_memcpy to copy direction
The ZE H2D copy, such as used in the recv path when unpacking from the
received pack_buffer to the original device buffer, can benefit from a
larger threshold due to the higher cost of using the yaksa kernel.
2024-10-04 12:30:36 -05:00
Hui Zhou 838bd3dc9b ch4/ofi: remove contig restriction in gpu pipeline
Sender and receiver don't have to agree on the datatypes, so
restricting the sender to contiguous datatypes can easily result in
deadlock.

Removing the contig condition seems to work. It now passes the test
pt2pt/sendrecv1 2 arg=-sendmem=device arg=-recvmem=device
env=MPIR_CVAR_CH4_OFI_ENABLE_GPU_PIPELINE=1.
2024-10-04 12:22:03 -05:00
Hui Zhou 0a8b1fea1d ch4/ofi: check strict_dev before register window base
Non-strict dev memory, i.e. shared memory, cannot be passed to
fi_mr_reg.
2024-10-04 12:22:03 -05:00
Hui Zhou d7f80c2a9f ch4/ofi: remove MPL_ze_mmap_device_pointer in ofi_recv
I believe this is an optimization similar to MPL_gpu_fast_memcpy.

We are seeing memory corruption issues, sometimes resulting in freeing
an invalid pointer in finalize and sometimes resulting in a hang.
Removing this optimization resolves those issues.

Since later in ofi_events.h we use MPIR_Localcopy_gpu, which will
perform MPL_gpu_fast_memcpy optimization for small contig data, I don't
think the mmap here is necessary. In any regard, the mmap should be
performed at the time of copy and preferably inside MPIR_Localcopy_gpu.
2024-10-04 12:22:03 -05:00
Hui Zhou a372db3fe6 misc: Replace all direct usage of MPL_GPU_POINTER_DEV
Always use MPL_gpu_attr_is_{dev,strict_dev} and the MPIR_GPU_
equivalents. The subtleties of non-strict device buffers (e.g. ZE) are
not obvious. Using the attr query wrappers makes the semantics explicit.
2024-10-04 12:22:03 -05:00
Hui Zhou ac0255c56d mpl+mpir: refactor MPL_gpu_query_pointer_is_dev
* Rename MPL_gpu_query_pointer_is_{dev,strict_dev} to
  MPL_gpu_attr_is_{dev,strict_dev} and remove the pointer argument.
  Such queries should always be two-step calls -- first call
  MPIR_GPU_query_pointer_attr, then call MPL_gpu_attr_is_{dev,strict_dev}.
  MPIR_CVAR_GPU_ENABLE is reflected in MPIR_GPU_query_pointer_attr.

* Remove the attr argument in MPIR_GPU_query_pointer_is{dev,strict_dev},
  thus make both pure pointer queries.
2024-10-04 12:22:03 -05:00
Hui Zhou 903da4644c misc: whitespace cleanup
To clear the whitespace check and spell check.
2024-10-04 12:22:03 -05:00
Hui Zhou 96cf5c4ee3 test/coll: fix typo in allgather_gpu.c
The oddmem and evenmem arguments were mistakenly swapped.
2024-10-04 12:22:03 -05:00
Ken Raffenetti b53cc3fb5b Merge pull request #7158 from raffenet/bc-allgather
mpid/bc: Check errors from node roots allgather

Approved-by: Hui Zhou <hzhou321@anl.gov>
2024-10-03 14:14:01 -05:00
Ken Raffenetti 7ba62d4a75 mpid/bc: Check errors from node roots allgather
If an error occurs during this collective, we need to report it.
2024-10-02 21:16:44 -05:00
Hui Zhou 1f359fea0f Merge pull request #6907 from hzhou/2311_bench
test: add p2p benchmark code

Approved-by: Ken Raffenetti
2024-10-02 16:59:17 -05:00
Hui Zhou 99c6adff2a test/bench: add support for device memory
Add device memory support using mtest_common utilities. This will add
the dependency to utility libraries, which the makefile already
imports.

However, this will remove the simplicity of building a single
source with mpicc or mydef_run. If one doesn't need to test device
memory, one can simply comment out "$include macros/mtest.def" to
restore the simplicity.
2024-10-01 22:43:35 -05:00
Hui Zhou f2add2bed1 test/bench: add Makefile and testlist
"make testing" in test/mpi/bench should work.
2024-10-01 22:43:23 -05:00
Hui Zhou e4d96f828e test/runtests: add TestBench result check
This check does not capture output (thus test results will show in
console log) and only checks for exit code - zero means success and
nonzero means failure.

We'll use this check for benchmark tests.
2024-10-01 22:43:23 -05:00
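The pass/fail rule described in commit e4d96f828e above (output not captured, zero exit code means success) can be sketched in C with POSIX facilities. This is only an illustration of the rule, not the testsuite's own implementation; sketch_bench_passed is a hypothetical name.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/wait.h>

/* Run a command through the shell without capturing its output (the
 * child inherits stdout/stderr, so results show up in the console log)
 * and report success only for a normal exit with code zero. */
static int sketch_bench_passed(const char *cmd)
{
    assert(cmd != NULL);
    int status = system(cmd);
    if (status == -1)
        return 0;               /* could not launch the command */
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

A command that exits with a nonzero code, or that is terminated by a signal, is treated as a failure.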
Hui Zhou 6633f0a001 autogen: convert mydef code in autogen
We could add rules to directly work with mydef code in Makefile, but
converting the code in autogen removes the mydef dependency.

Also fix a spelling error.
2024-10-01 22:42:55 -05:00
Hui Zhou 30f2bbd438 test/mpi: add p2p benchmarks in test/mpi/bench
Add point-to-point benchmark code in MyDef. The tests have automatic
warm-ups and adjust the number of iterations for measurement accuracy.
They produce latency measurements with standard deviations and
equivalent bandwidths.

MYDEF_BOOT=[topsrc_dir]/modules/mydef_boot
export PATH=$MYDEF_BOOT/bin:$PATH
export PERL5LIB=$MYDEF_BOOT/lib/perl5
export MYDEFLIB=$MYDEF_BOOT/lib/MyDef

To run:
    mydef_page p2p_latency.def  # -> p2p_latency.c
    mpicc p2p_latency.c && mpi_run -n 2 ./a.out

Alternatively use mydef_run (uses settings from config):
    mydef_run p2p_latency.def

Next commit will add "make testing".
2024-10-01 22:42:00 -05:00
Hui Zhou 3f4988377b modules: add mydef_boot
MyDef provides general templating facilities.
2024-10-01 22:40:02 -05:00
Hui Zhou 9c907a4c73 Merge pull request #7120 from hzhou/2408_req_info
ch4/request: enhance progress debugging

Approved-by: Ken Raffenetti
2024-10-01 21:41:31 -05:00
Hui Zhou d3063cffde ch4: extend progress debugging timeout to window sync 2024-10-01 21:26:47 -05:00
Hui Zhou 9f3bbf33a2 request: abort on progress timeout
Since some launchers hold console output, this commit makes the process
abort on timeout to make debugging progress hangs a bit easier. We
delay the abort until after dumping the stack backtrace, to allow other
processes to also dump their progress backtraces before they are killed.
2024-10-01 21:26:47 -05:00
Hui Zhou 7c69b40be6 mpir: fix MPIR_REQUEST_SET_INFO
For some reason, we don't have MPL_snprintf, but only
MPL_snprintf_nowarn.

Correct the macro signature for when MPICH_DEBUG_PROGRESS is off.
2024-10-01 21:26:47 -05:00
Hui Zhou b62c6bf50f ch4: add recv request info for debug progress
Since progress timeouts are commonly due to a pending recv, add some
request info to help debugging.
2024-10-01 21:26:47 -05:00
Hui Zhou 3faa56acd5 Merge pull request #7142 from hzhou/2409_gpu_send
ch4/ofi: refactor ofi_send.h

Approved-by: Ken Raffenetti
2024-09-30 23:30:34 -05:00
Hui Zhou 8dfff88526 ch4/ofi: remove MPIDI_OFI_REQUEST(req, datatype)
There is no need to hold datatype unless it is needed for

* recv unpacking, which is held in
    MPIDI_OFI_REQUEST(req, noncontig.pack.datatype).

* recv iov type matching check, which is held in
    MPIDI_OFI_REQUEST(req, noncontig.nopack.datatype).
2024-09-30 21:53:16 -05:00
Hui Zhou 3999ac484a ch4/ofi: cleanup MPIDI_OFI_{send,recv}_iov
* Use local iovs variable to avoid repeatedly using
MPIDI_OFI_REQUEST(sreq, noncontig.nopack).

* Declare local variables where they are first used.

* It shouldn't be necessary to zero the iovs array.
2024-09-30 21:53:16 -05:00