Benchmarks

Here you can find some of the benchmarks performed with EPW.

  • Scalability of the under-development EPW v6.0 (to be released with QE v7.5; available at https://gitlab.com/epw/q-e.git ) on Frontera for MoS2 and H3S. For MoS2 we use a 200² k grid and a 40² q grid. For H3S we use an 80³ k grid and a 40³ q grid. EPW v6.0 employs two-level parallelization, with pool parallelization over the k grid and image parallelization over the q grid.

The calculations were performed using Intel 19.1.1 with Intel MPI and MKL on Intel Xeon Platinum 8280 (“Cascade Lake”) @ 2.7 GHz (56 cores per node). The scaling test was done during Texascale Days in December 2024.

../_images/EPW_scalability_Sabya_202412.png

Left: Strong scaling of the interpolation part of EPW on Frontera for MoS2. Here we calculate the imaginary part of the electron self-energy. The two-level parallelization in EPW v6.0 (to be released) is employed with 1000 k pools and the number of q images shown on the top x-axis. The absolute wall time for the interpolation was 617 s at 25,000 cores and 54 s at 400,000 cores. Right: Strong scaling of the isotropic Eliashberg calculation on Frontera for H3S (courtesy S. Mishra, SUNY Binghamton), employing the same two-level EPW v6.0. The absolute wall time for the calculation was 64 min at 25,000 cores and 12 min at 200,000 cores. The raw input/output files can be downloaded here. (S. Tiwari)
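
As a rough cross-check of the MoS2 numbers quoted above, the short Python sketch below derives the number of q images implied by each core count and the parallel efficiency relative to the 25,000-core run. It assumes one MPI rank per core (an assumption on our part, not stated in the benchmark), and the helper names are illustrative only:

    # Illustrative strong-scaling arithmetic for the MoS2 run above.
    # Assumption (not stated in the benchmark): one MPI rank per core,
    # so total ranks = (k pools) x (q images).

    def q_images(total_cores: int, k_pools: int) -> int:
        """q images implied by the core count under the 1-rank-per-core assumption."""
        return total_cores // k_pools

    def efficiency(t_ref: float, cores_ref: int, t: float, cores: int) -> float:
        """Parallel efficiency of a strong-scaling point relative to a reference run."""
        return (t_ref / t) / (cores / cores_ref)

    # 1000 k pools; 617 s at 25,000 cores and 54 s at 400,000 cores (left panel).
    print(q_images(25_000, 1000), q_images(400_000, 1000))   # 25 and 400 q images
    print(f"{efficiency(617, 25_000, 54, 400_000):.0%}")     # ~71% on 16x more cores

The same two functions can be applied to the other strong-scaling results reported below.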

  • Scalability of the interpolation part of EPW v6-alpha2 (to be released) on Frontera for MgB2 with one million k and q points.

The calculations were performed using Intel 19.1.1 with Intel MPI and MKL on Intel Xeon Platinum 8280 (“Cascade Lake”) @ 2.7 GHz (56 cores per node). The scaling test was done during Texascale Days in December 2020.

../_images/EPW_scalability_202012.png

Strong scaling of the interpolation part of EPW on Frontera for MgB2. The hybrid two-level MPI and OpenMP parallelization in EPW v6-alpha2 (to be released) is employed with 56 k pools and 7 OpenMP threads. The absolute wall time for the calculation was 288 min at 27,440 cores and 21 min at 439,040 cores. (H. Lee)
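
For orientation, here is a minimal sketch of how the quoted core counts decompose under the stated 56 k pools and 7 OpenMP threads. It assumes one OpenMP thread per core, and the helper name is illustrative only:

    # Illustrative layout of the hybrid MPI + OpenMP runs above.
    # Assumption: one OpenMP thread per core, so MPI ranks = cores / threads,
    # and the ranks are split evenly into the 56 k pools.

    def mpi_layout(total_cores: int, omp_threads: int, k_pools: int) -> tuple[int, int]:
        """Return (MPI ranks, ranks per k pool) implied by the core count."""
        ranks = total_cores // omp_threads
        return ranks, ranks // k_pools

    print(mpi_layout(27_440, 7, 56))    # (3920, 70)   -> 3,920 ranks, 70 per pool
    print(mpi_layout(439_040, 7, 56))   # (62720, 1120)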

  • Scalability of the interpolation part of EPW v6-alpha1 (to be released) on Summit for MgB2 with 72x72x72 k and q grids.

The calculations were performed using IBM XL 16.1.1 with IBM Spectrum MPI and ESSL on IBM Power9 @ 3.07 GHz (42 cores per node). The scaling test was done in November 2020.

../_images/EPW_scalability_202011.png

Strong scaling of the interpolation part of EPW on Summit for MgB2. The hybrid two-level MPI and OpenMP parallelization in EPW v6-alpha1 (to be released) is employed with 64 q pools and 7 OpenMP threads. The absolute wall time for the calculation was 376 min at 10,752 cores and 51 min at 107,520 cores. (H. Lee)

  • Scalability of the interpolation part of EPW v6-alpha (to be released) on Frontera for MgB2 with one million k and q points.

The calculations were performed using Intel 19.0.5 with Intel MPI and MKL on Intel Xeon Platinum 8280 (“Cascade Lake”) @ 2.7 GHz (56 cores per node). The scaling test was done in March 2020.

../_images/EPW_scalability_202003.png

Strong scaling of the interpolation part of EPW on Frontera for MgB2. The two-level MPI parallelization in EPW v6-alpha (to be released) is employed with 56 q pools. The absolute wall time for the calculation was 1,012 min at 28,000 cores and 285 min at 112,000 cores. (H. Lee)

  • Scalability of the interpolation part of EPW v5.0 on MareNostrum 4 for cubic CsPbI3 with 52,476 k-points and 4,559 q-points.

The calculations were performed using Intel 17.4 with Intel MPI, MKL and FFTW on Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10 GHz (48 cores per node). The scaling test was done in April 2018.

../_images/scaling_epw.png

Strong scaling of the interpolation part of EPW on MareNostrum 4 for CsPbI3. The parallelization is done over k-points using MPI. The absolute time for the calculation was 29,700 s at 240 cores and 1,311 s at 15,360 cores. (S. Poncé)

  • Scalability of the interpolation part of EPW v4.2 on CSD3 for polar SiC on a 64x64x64 k-point grid and an 8x8x8 q-point grid.

The calculations were performed using Intel 17.0.4 with Intel MPI and MKL and the “-xAVX -mavx -axCOMMON-AVX512” vectorization flags on Intel(R) Xeon Phi(TM) CPU 7210 @ 1.30 GHz. The system uses an Intel Omni-Path HPC interconnect and multi-petabyte SSD-accelerated Intel Lustre storage.

../_images/scaling_KNL.png

Strong scaling of the interpolation part of EPW on the CSD3 Xeon Phi nodes for polar SiC. The parallelization is done over k-points using MPI. The absolute time for the calculation was 6 h 01 min at 64 cores and 9 min at 8,192 cores. (S. Poncé)

  • Scalability of the interpolation part of EPW v4.1 on the ARCHER Cray XC30 for polar wurtzite GaN.

The calculations were performed using the Intel 15.0.2.164 compiler on a Cray XC30 machine with 12-core Intel Xeon E5-2697v2 (Ivy Bridge) 2.7 GHz processors sharing 64 GB of memory and joined by two QPI links, connected via the proprietary Cray Aries interconnect (Dragonfly topology). The analysis was performed using Score-P 2.0.2 and Scalasca 2.3.1 instrumentation.

../_images/EPW_speedup_GaN.png

Scalability of the interpolation part of EPW on the ARCHER Cray XC30 for polar wurtzite GaN. The parallelization is done over k-points using MPI. (S. Poncé)

  • Scalability of EPW v4.0 on SiC using 6 × 6 × 6 Γ-centered coarse k- and q-point grids.

The fine grids on which the Wannier interpolation was performed were a 50 × 50 × 50 k-point grid and a 10 × 10 × 10 q-point grid. The test was performed on an Intel Xeon CPU E5620 with a clock frequency of 2.40 GHz. The codes were compiled using ifort 13.0.1 with the following compilation flags -O2 -assume byterecl -g -traceback -nomodule -fpp. The MPI parallelization was performed using Open MPI 1.8.1.

../_images/EPW_scalability3.png

Parallelization in EPW for computing the electronic lifetimes of SiC, comparing v3 and v4 of EPW. The solid blue and red lines show the speedup obtained for a full calculation with the previous and current versions of EPW, respectively. The speedup at 128 processors is 55 for the previous version and 76 for the current one. The interpolation algorithm (the most time-consuming part) has been improved (dashed lines). (S. Poncé)

../_images/EPW_speedup3.png

Comparison of the time required to compute the electronic lifetime of SiC using EPW v3 and EPW v4.0, run on one processor. We show the time required for the calculation of the electron and phonon perturbations using DFPT (QE+PH), the calculation of the electron-phonon matrix elements and their unfolding from the IBZ to the BZ using crystal symmetries (Unfolding), the Wannierization from coarse Bloch space to real space (Wannier), and the interpolation from real space to the fine grids in Bloch space (Interpolation). (S. Poncé)