FLAT Mode Example

Developer Guide

oneAPI GPU Optimization Guide

Download PDF

ID 771772

Date 4/18/2024

Version

Public

A newer version of this document is available. Customers should click here to go to the newest version.

FLAT Mode Example - SYCL

In this section, we use a simple vector addition example to show how to scale the performance using multiple devices in FLAT mode.

Offloading to a single device (stack)

A first look at adding 2 vectors using one device or stack:


    float *da;
    float *db;
    float *dc;

    da = (float *)sycl::malloc_device<float>(gsize, q);
    db = (float *)sycl::malloc_device<float>(gsize, q);
    dc = (float *)sycl::malloc_device<float>(gsize, q);
    q.memcpy(da, ha, gsize);
    q.memcpy(db, hb, gsize);

    q.wait();

    std::cout << "Offloading work to 1 device" << std::endl;

    for (int i = 0; i < 16; i ++) {
        q.parallel_for(sycl::nd_range<1>(gsize, 1024),[=](auto idx) {
            int ind = idx.get_global_id();
            dc[ind] = da[ind] + db[ind];
        });
    }

    q.wait();

    std::cout << "Offloaded work completed" << std::endl;

    q.memcpy(hc, dc, gsize);

The above example adds two float vectors ha and hb and then saves the result in vector hc`.

The example first allocates device memory for the 3 vectors, then copies the data from the host to the device before launching the kernels on the device to do vector addition. After the computation on the device completes, the result in vector dc on the device is copied to the vector hc on the host.

Compilation and run commands:


$ icpx -fsycl flat_sycl_vec_add_single_device.cpp -o flat_sycl_vec_add_single_device
$ ./flat_sycl_vec_add_single_device

Offloading to multiple devices (stacks)

We can scale up the performance of the above example if multiple devices are available.

The following example starts with enumerating all devices on the platform to make sure that at least 2 devices are available.


    auto plat = sycl::platform(sycl::gpu_selector_v);
    auto devs = plat.get_devices();
    auto ctxt = sycl::context(devs);

    if (devs.size() < 2) {
       std::cerr << "No 2 GPU devices found" << std::endl;
       return -1;
    }

    std::cout << devs.size() << " GPU devices are found and 2 will be used" << std::endl;
    sycl::queue q[2];
    q[0] = sycl::queue(ctxt, devs[0], {sycl::property::queue::in_order()});
    q[1] = sycl::queue(ctxt, devs[1], {sycl::property::queue::in_order()});

Next, the example allocates the vectors ha and hb on the host and partitions each vector into 2 parts. Then the first halves of the vectors ha and hb are copied to the first device, and the second halves are copied to the second device.


    constexpr size_t gsize = 1024 * 1024 * 1024L;
    float *ha = (float *)(malloc(gsize * sizeof(float)));
    float *hb = (float *)(malloc(gsize * sizeof(float)));
    float *hc = (float *)(malloc(gsize * sizeof(float)));

    for (size_t i = 0; i < gsize; i++) {
        ha[i] = float(i);
        hb[i] = float(i + gsize);
    }

    float *da[2];
    float *db[2];
    float *dc[2];

    size_t lsize = gsize / 2;

    da[0] = (float *)sycl::malloc_device<float>(lsize, q[0]);
    db[0] = (float *)sycl::malloc_device<float>(lsize, q[0]);
    dc[0] = (float *)sycl::malloc_device<float>(lsize, q[0]);
    q[0].memcpy(da[0], ha, lsize);
    q[0].memcpy(db[0], hb, lsize);

    da[1] = (float *)sycl::malloc_device<float>((lsize + gsize % 2), q[1]);
    db[1] = (float *)sycl::malloc_device<float>((lsize + gsize % 2), q[1]);
    dc[1] = (float *)sycl::malloc_device<float>((lsize + gsize % 2), q[1]);
    q[1].memcpy(da[1], ha + lsize, lsize + gsize % 2);
    q[1].memcpy(db[1], hb + lsize, lsize + gsize % 2);

    q[0].wait();
    q[1].wait();

Once the data is available on the two devices, the vector addition kernels are launched on each device and the devices execute the kernels in parallel. After the computations on both devices complete, the results are copied from both the devices to the host.


    for (int i = 0; i < 16; i ++) {
        q[0].parallel_for(sycl::nd_range<1>(lsize, 1024),[=](auto idx) {
            int ind = idx.get_global_id();
            dc[0][ind] = da[0][ind] + db[0][ind];
        });
        q[1].parallel_for(sycl::nd_range<1>(lsize + gsize % 2, 1024),[=](auto idx) {
            int ind = idx.get_global_id();
            dc[1][ind] = da[1][ind] + db[1][ind];
        });
    }

    q[0].wait();
    q[1].wait();

    std::cout << "Offloaded work completed" << std::endl;

    q[0].memcpy(hc, dc[0], lsize);
    q[1].memcpy(hc + lsize, dc[1], lsize + gsize % 2);

    q[0].wait();
    q[1].wait();

Compilation and run commands:


$ icpx -fsycl flat_sycl_vec_add.cpp -o flat_sycl_vec_add
$ ./flat_sycl_vec_add

Note that this example uses 2 devices. It can easily be extended to use more than 2 devices if more than 2 devices are available. We leave this as an exercise.

Select Your Language

Using Intel.com Search

Quick Links

Recent Searches

Advanced Search

Only search in

oneAPI GPU Optimization Guide

FLAT Mode Example - SYCL