tcreech.com

Dynamic Resource Pools in Illumos/SunOS

Solaris/Illumos/SmartOS have the ability to create persistent “resource pools”, for example, of processors. Trees and groups of processes can be restricted to run within these pools. This can be used to effect space sharing on a system.

More recent versions also have the ability to dynamically change these pools. Now, we have dynamic space sharing of the whole system. At first glance, this does not seem too different as a resource control from CPU percentage caps, but do not be fooled. This is a pretty interesting option for managing CPU resources on a big machine, but certainly a less-traveled road.

For the sake of simplicity, say we have two multi-threaded, CPU-bound processes on a two-processor machine. If we restrict each to a CPU cap of 100%, the thread scheduler will do the usual fine-grained time sharing between them and simply try to ensure that neither gets much more CPU time than the other. In a given second, the threads of both processes may run on both physical CPUs. This is probably the same result as just using the defaults, since schedulers tend to default toward fairness.

If we instead used resource pools, we would grant each process exclusive access to its resources. So, for example, we can split the two CPUs into two pools and run each CPU-bound process in its own pool. Both will still get 100% CPU time, but in this scenario true space sharing is in effect: in a given time period, each process’s thread(s) will ONLY be scheduled on one specific CPU. This is sometimes advantageous when you want to avoid multiple processes contending for cache and memory hardware resources on the CPUs.

How do?

I think you can use any of commercial Solaris 11/12, OpenSolaris/OpenIndiana, OmniOS, or SmartOS for this. Below are some notes for minimally getting an example working. I’ve tested this on SmartOS 20151210T194528Z.

UPDATE: SmartOS cannot actually do dynamic pool management, seemingly because there’s no Java in the global zone. (Probably for the best, anyway.)

Enabling static resource pools

On most of these platforms, if you run pooladm, it will complain that resource pools are disabled. They can be enabled using pooladm -e. Once resource pools are enabled, running pooladm will print its view of the system, including pool management goals, pools, and processor sets:

system default
        string  system.comment 
        int     system.version 1
        boolean system.bind-default true
        string  system.poold.objectives wt-load

        pool pool_default
                int     pool.sys_id 0
                boolean pool.active true
                boolean pool.default true
                int     pool.importance 1
                string  pool.comment 
                pset    pset_default

        pset pset_default
                int     pset.sys_id -1
                boolean pset.default true
                uint    pset.min 1
                uint    pset.max 65536
                string  pset.units population
                uint    pset.load 4329
                uint    pset.size 2
                string  pset.comment 

                cpu
                        int     cpu.sys_id 1
                        string  cpu.comment 
                        string  cpu.status on-line

                cpu
                        int     cpu.sys_id 0
                        string  cpu.comment 
                        string  cpu.status on-line

In the above example, we’ve got a default pool (“pool_default”), which consists of a default processor set (“pset_default”), which in turn consists of two CPUs.
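As an aside, this output is regular enough to script against. Here is a minimal, hypothetical parsing sketch that pulls each processor set’s name and size out of pooladm-style output with awk. The sample input is hard-coded (via the made-up pooladm_sample function) so it runs anywhere; on a real system you would pipe pooladm itself in instead.

```shell
# Hypothetical sketch: emit "<pset name> <size>" for each processor set
# in pooladm-style output. pooladm_sample stands in for `pooladm` so
# this can run on any machine.
pooladm_sample() {
cat <<'EOF'
        pset pset_left
                int     pset.sys_id 1
                uint    pset.size 1

        pset pset_right
                int     pset.sys_id -1
                uint    pset.size 1
EOF
}

pooladm_sample | awk '
        $1 == "pset"      { name = $2 }        # remember the current pset
        $2 == "pset.size" { print name, $3 }   # pair it with its size
'
```

On a real system, `pooladm | awk '…'` with the same awk program would do the same against the live configuration.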

Configuring resource pools

There are many ways to manage pool configurations, but for fiddling around in a transient fashion it is easiest to use poolcfg to control the in-kernel configuration directly.

Using poolcfg to do this looks like:

# poolcfg -dc 'info pool pool_default'

…which prints out just the pool pool_default tree from the pooladm output we saw earlier.

Now, say we want to create two pools, pool_foo and pool_bar, so that we can do some kind of space sharing with them:

# poolcfg -dc 'create pool pool_foo'
# poolcfg -dc 'create pool pool_bar'

Examining the output of pooladm, we see that the pools have been created and assigned the default processor set, which still contains both processors. Not too exciting yet.

Now, for the sake of this demonstration, say we have two zones (IDs “4” and “5”) running, each hosting a long-running, CPU-intensive, multithreaded process, “cg.S.x”. Running the following dtrace script in each zone, we can see how much time the threads of cg.S.x spend on each hardware processor:

dtrace -q -n 'profile:::profile-4997/execname == "cg.S.x"/{ @[cpu] = count();} tick-1hz{ normalize(@, 50); printa(@);clear(@); }'
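A note on the normalize(@, 50) in that one-liner (my reading of the script, not stated in the original): the profile probe fires at ~4997 Hz on each CPU and tick-1hz prints once per second, so dividing the per-CPU counts by 50 scales a fully-busy CPU to roughly 100. The figures it prints therefore read approximately as percent CPU time.

```shell
# 4997 samples/sec divided by the normalization factor of 50 puts a
# fully-busy CPU at roughly 100, so the dtrace counts read as ~percent.
echo $((4997 / 50))
```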

Output looks something like this in both zones:

    1               21
    0               44

    0               47
    1               51

    1               37
    0               44

    1               37
    0               46

In other words, in a given second, both zones are using both CPUs to run cg.S.x. The result is hardware contention, some misguided spinning on barriers, and low performance: about 80 Mop/s from each zone on average. CPU time limits won’t help here: each zone is already using about 100% of a CPU’s time. The problem is not how much time they’re getting, but where they’re being scheduled.

At this point, we can bind process trees to the pools. For example, with our two zones (4 and 5) running:

# poolbind -p pool_foo -i zoneid 4
# poolbind -p pool_bar -i zoneid 5

If you check the dtrace scripts’ outputs, nothing has changed, since both pools still share the default processor set containing all processors. We can now split the CPUs into two processor sets, one CPU each:

# poolcfg -dc 'create pset pset_left (uint pset.min=1; uint pset.max=2)'
# poolcfg -dc 'rename pset pset_default to pset_right'
# poolcfg -dc 'transfer 1 from pset pset_right to pset_left'

And, finally we can associate pool_foo with the non-default processor set:

# poolcfg -dc 'associate pool pool_foo ( pset pset_left )'

Have a look at the configuration now with pooladm:

system default
        string  system.comment 
        int     system.version 1
        boolean system.bind-default true
        string  system.poold.objectives wt-load

        pool pool_foo
                int     pool.sys_id 1
                boolean pool.active true
                boolean pool.default false
                int     pool.importance 1
                string  pool.comment 
                pset    pset_left

        pool pool_default
                int     pool.sys_id 0
                boolean pool.active true
                boolean pool.default true
                int     pool.importance 1
                string  pool.comment 
                pset    pset_right

        pool pool_bar
                int     pool.sys_id 2
                boolean pool.active true
                boolean pool.default false
                int     pool.importance 1
                string  pool.comment 
                pset    pset_right

        pset pset_left
                int     pset.sys_id 1
                boolean pset.default false
                uint    pset.min 1
                uint    pset.max 2
                string  pset.units population
                uint    pset.load 1154
                uint    pset.size 1
                string  pset.comment 

                cpu
                        int     cpu.sys_id 0
                        string  cpu.comment 
                        string  cpu.status on-line

        pset pset_right
                int     pset.sys_id -1
                boolean pset.default true
                uint    pset.min 1
                uint    pset.max 65536
                string  pset.units population
                uint    pset.load 3402
                uint    pset.size 1
                string  pset.comment 

                cpu
                        int     cpu.sys_id 1
                        string  cpu.comment 
                        string  cpu.status on-line

We see now that we’ve got pools pool_foo and pool_bar, each with dedicated processors. Recall that our zones are already restricted to operating within these pools, so we can glance back at that dtrace output. On zone 4, we see:

    1                0
    0               96

    1                0
    0               95

    1                0
    0               96

…and on zone 5:

    0                0
    1               97

    0                0
    1               92

    0                0
    1               96

We see that the zones are now operating on distinct processors. Hooray! How about performance? Well, since the zones can see how many CPUs they “have” (see getconf _NPROCESSORS_CONF, or cat /proc/cpuinfo in an LX zone), they automatically use fewer threads to run cg.S.x, and there is no oversubscription or contention. Instead of less than 80 Mop/s each, we see about 1000 Mop/s each.
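The self-tuning works because well-behaved runtimes size their thread pools from whatever CPU count the OS reports, which inside a pool-bound zone is the processor set size. A trivial sketch of the idea (the echoed message is made up for illustration):

```shell
# Size a worker pool from the number of processors the OS reports.
# Inside a zone bound to a resource pool this reflects the pset size,
# so the process naturally avoids oversubscribing its pool.
nthreads=$(getconf _NPROCESSORS_CONF)
echo "starting $nthreads worker threads"
```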

Dynamic resource pools

(At this point, the demo switches from the 2-core machine to a 64-thread UltraSPARC T2 running Solaris, since SmartOS cannot start pools/dynamic.)

When could dynamic resource pools be useful?

So far in our exploration, the resource pools are not dynamic. For example, if I stop the process in zone 5, zone 4 will still be restricted to one CPU, and the machine will be underutilized. With dynamic resource pools (i.e., dynamic space sharing), zones could exploit otherwise unused CPUs.

OK, so now we’ve got our two zones running cg.S.x, each with 64 threads and spinning synchronization: 128 threads in total competing for 64 hardware contexts. With no pools, each zone achieves a lousy ~5 Mop/s. The increased thread count and spinning synchronization are a disaster here. (The situation is admittedly better with OS-scheduled synchronization, but we can’t always control that.)

With each zone running in its own 32-CPU pool and running a matching 32 threads, we see about 1000 Mop/s each. This is not typical, but avoiding oversubscription in this case has yielded a 200X performance increase!
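That 200X is nothing more mysterious than the ratio of the two throughputs:

```shell
# ~1000 Mop/s with dedicated pools vs ~5 Mop/s oversubscribed:
echo "$((1000 / 5))x"
```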

Next we stop one zone and observe that, without dynamic pools, the remaining zone does not expand to use the other half of the machine, which is left idle: still about 1000 Mop/s.

Enabling dynamic pool resizing

Note that the processor sets have min/max sizes which a daemon could potentially use as constraints for automatically resizing resource pools based on some idea of demand. In fact, this is exactly what the dynamic resource pool implementation does. To enable the poold daemon, try:

# svcadm enable pools/dynamic

And watch (with poolstat, htop, atop, or whatever you like) as the lone running zone’s processor set is automatically grown! This takes a good while (maybe an hour, with 64 processors) at the default settings, but poold is fairly configurable.

At 32 threads, we saw ~1000 Mop/s, about 1200 at 34 threads, 1275 at 38 threads… 1300 at 39 threads… all the way up to 1600 Mop/s at 62 out of 64 threads.
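Some back-of-the-envelope arithmetic on those numbers (my own, not from any measurement tool): growing from 32 to 62 hardware threads is a 1.94x resource increase for a 1.6x speedup, or roughly 83% scaling efficiency.

```shell
# Scaling efficiency going from 32 threads (~1000 Mop/s) to
# 62 threads (~1600 Mop/s): speedup divided by resource growth.
awk 'BEGIN { printf "%.0f%%\n", 100 * (1600 / 1000) / (62 / 32) }'
```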

Start the stopped zone back up, and watch them share again!

In summary, dynamic resource pools are pretty useful if you have malleable workloads. Even without resource contention as severe as in these examples, they’re a sane, simple way to manage processor resources in the face of uncooperative, varying loads.