# Consolidating Linux Power Management on ARM multiprocessor systems

L.Pieralisi, ARM Ltd.
27/10/2011 - ELC Europe

### **Outline**

- 1 Introduction
  - Power Management Fundamentals
- 2 ARM Kernel Power Management Consolidation
  - Motivation
  - ARM Common PM Code
  - Test Cases: Origen
  - CPU idle and CPU hotplug

## **Power Management Fundamentals**

"If you've heard this story before, don't stop me, because I'd like to hear it again" G.Marx

#### **Total Energy Consumption**

$$E = \int_0^t (CV_{dd}^2 f_c + V_{dd} I_{lkg}) dt$$

**Dynamic Energy Consumption**

$$\int_0^t CV_{dd}^2 f_c \, dt$$

**Leakage Energy Consumption**

$$\int_0^t V_{dd} I_{lkg} \, dt$$

#### To minimize Iswitch

- Reduce voltage
- Reduce frequency
- Clock gating
- Reduce switched capacitance

#### To minimize Ilkg

- Reduce voltage
- Less leaky transistors
- Power gating

## **SoC Technology and Power Consumption**

- Dynamic power, frequency scaling
- Static (leakage) power, G (Generic) process, LP (Low Power) process
- Temperature variations
- RAM retention
- CPU vs. IO devices
- Need for more aggressive and holistic power management

### **Power Managed SoC Example**

[Figure: power managed SoC example]

## **Kernel Power Management Mechanics**

| Mechanism      | Role                                 |
|----------------|--------------------------------------|
| System suspend | User space forces system to sleep    |
| CPU idle       | Idle threads trigger sleep states    |
| CPU freq       | CPU frequency scaling                |
| Runtime PM     | Devices power management             |
| CPU hotplug    | Remove a CPU from the running system |

#### Focus on CPU Power Management (PM)

Get code in the kernel to enable efficient and stable core support for hotplug, cpu suspend and cpu idle

## **Motivation**

#### The Case for Generalization

- SMP power down procedure is common and complex
- Code can be shared across different platforms
- Written, debugged, tested once for all

#### **Our Goal**

Merge in the kernel common code reusable by all partners

#### SMP power down procedure

1. save per CPU peripherals (IC, VFP, PMU)
2. save CPU registers
3. clean L1 D\$
4. clean state from L2
5. disable L1 D\$ allocation
6. clean L1 D\$
7. exit coherency
8. programme SCU CPU Power CTRL register
9. call wfi (wait for interrupt)

This is the standard procedure that must be adopted by all platforms, for cpu hotplug (cache cleaning and wfi), suspend and idle
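The ordering of the power down steps above matters more than the individual operations, so it may help to see it written as a small checkable model. The sketch below is a user-space illustration only, not kernel code: the `pd_step` labels, `pd_sequence` and `pd_sequence_ok()` are hypothetical names, and the three ordering rules it checks are an assumed reading of the list (L1 allocation is disabled before the final L1 clean, the final clean precedes the coherency exit, and wfi comes last).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical labels for the steps in the power down procedure. */
enum pd_step {
    PD_SAVE_PERIPHERALS,   /* IC, VFP, PMU */
    PD_SAVE_CPU_REGS,
    PD_CLEAN_L1,
    PD_CLEAN_L2_STATE,
    PD_DISABLE_L1_ALLOC,
    PD_EXIT_COHERENCY,
    PD_PROGRAM_SCU,
    PD_WFI,
};

/* The order given in the list; PD_CLEAN_L1 appears twice on purpose. */
static const enum pd_step pd_sequence[] = {
    PD_SAVE_PERIPHERALS, PD_SAVE_CPU_REGS, PD_CLEAN_L1,
    PD_CLEAN_L2_STATE, PD_DISABLE_L1_ALLOC, PD_CLEAN_L1,
    PD_EXIT_COHERENCY, PD_PROGRAM_SCU, PD_WFI,
};

/* Index of the last occurrence of step in seq, or -1 if it is absent. */
static int pd_last_index(const enum pd_step *seq, size_t n, enum pd_step step)
{
    int found = -1;

    assert(seq != NULL);
    for (size_t i = 0; i < n; i++)
        if (seq[i] == step)
            found = (int)i;
    return found;
}

/*
 * Assumed ordering rules (an interpretation of the list, not kernel code):
 *  - L1 allocation is disabled before the final L1 clean,
 *  - the final L1 clean happens before the CPU exits coherency,
 *  - wfi is the very last step.
 */
static bool pd_sequence_ok(const enum pd_step *seq, size_t n)
{
    int alloc_off = pd_last_index(seq, n, PD_DISABLE_L1_ALLOC);
    int last_clean = pd_last_index(seq, n, PD_CLEAN_L1);
    int coherency = pd_last_index(seq, n, PD_EXIT_COHERENCY);

    if (n == 0 || alloc_off < 0 || last_clean < 0 || coherency < 0)
        return false;
    return alloc_off < last_clean && last_clean < coherency &&
           seq[n - 1] == PD_WFI;
}
```

Under these assumptions the sequence from the list passes the check, while a variant that drops the second L1 clean is rejected, which is one way to see why the clean step is listed twice.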
## **ARM Common PM Code Components**

- CPU PM notifiers
- cpu suspend/resume
- L2 suspend/resume
- CPUs coordination, hotplug

# CPU PM notifiers (1/3)

- Introduced by C.Cross to overcome code duplication in idle and suspend code path
- CPU events and CLUSTER events
- GIC, VFP, PMU

# CPU PM notifiers (2/3)

```
static int cpu_pm_notify(enum cpu_pm_event event, int nr_to_call, int *nr_calls)
{
    int ret;

    ret = __raw_notifier_call_chain(&cpu_pm_notifier_chain, event, NULL,
        nr_to_call, nr_calls);

    return notifier_to_errno(ret);
}

int cpu_pm_enter(void)
{
    [...]
    ret = cpu_pm_notify(CPU_PM_ENTER, -1, &nr_calls);
    if (ret)
        cpu_pm_notify(CPU_PM_ENTER_FAILED, nr_calls - 1, NULL);
    [...]
    return ret;
}

//CPU shutdown
cpu_pm_{enter,exit}();
//Cluster shutdown
cpu_cluster_pm_{enter,exit}();
```

# CPU PM notifiers (3/3)

```
static int gic_notifier(struct notifier_block *self, unsigned long cmd, void *v)
{
    int i;
    [...]
    switch (cmd) {
    case CPU_PM_ENTER:
        gic_cpu_save(i);
        break;
    case CPU_PM_ENTER_FAILED:
    case CPU_PM_EXIT:
        gic_cpu_restore(i);
        break;
    case CPU_CLUSTER_PM_ENTER:
        gic_dist_save(i);
        break;
    case CPU_CLUSTER_PM_ENTER_FAILED:
    case CPU_CLUSTER_PM_EXIT:
        gic_dist_restore(i);
        break;
    }

    return NOTIFY_OK;
}

static struct notifier_block gic_notifier_block = {
    .notifier_call = gic_notifier,
};
```

# CPU suspend (1/3)

- Introduced by R.King to consolidate existing (and duplicated) code across different ARM platforms
- save/restore core registers, clean L1 and some bits of L2
- L2 RAM retention handling poses further challenges

# CPU suspend (2/3)

- 1:1 mapping page tables cloned from init\_mm
- C API, generic for all ARM architectures

# CPU suspend (3/3)

- registers saved on the stack
- L1 complete cleaning
- L2 partial cleaning

```
void __cpu_suspend_save(u32 *ptr, u32 ptrsz, u32 sp, u32 *save_ptr)
{
    *save_ptr = virt_to_phys(ptr);

    /* This must correspond to the LDM in cpu_resume() assembly */
    *ptr++ = virt_to_phys(suspend_pgd);
    *ptr++ = sp;
    *ptr++ = virt_to_phys(cpu_do_resume);

    cpu_do_suspend(ptr);

    flush_cache_all();
}
```

## We Are Not Done, Yet: Cache-to-Cache migration

"Once you eliminate your number one problem, number two gets a promotion." G.Weinberg

- SCU keeps a copy of D\$ cache TAG RAMs
- To avoid data traffic A9 MPCore moves dirty lines across cores
- Lower L1 bus traffic
- Dirty data might be fetched from another core during power-down sequence

## We Are Not Done, Yet: Cache-to-Cache migration

- When the suspend finisher is called L1 is still allocating
- accessing `current` implies accessing the `sp`
- Snooping with Direct Data Intervention (DDI): the CPU might pull a dirty line in

```
ENTRY(disable_clean_inv_dcache_v7_all)
    stmfd   sp!, {r4-r5, r7, r9-r11, lr}
    mrc     p15, 0, r3, c1, c0, 0
    bic     r3, #4                  @ clear C bit
    mcr     p15, 0, r3, c1, c0, 0
    isb
    bl      v7_flush_dcache_all
    mrc     p15, 0, r0, c1, c0, 1
    bic     r0, r0, #0x40           @ exit SMP
    mcr     p15, 0, r0, c1, c0, 1
    ldmfd   sp!, {r4-r5, r7, r9-r11, pc}
ENDPROC(disable_clean_inv_dcache_v7_all)
```
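The two `bic` instructions in the routine above each clear a single bit: `#4` is bit 2 of the system control register (the C bit, read through `c1, c0, 0`) and `#0x40` is bit 6 of the auxiliary control register (the SMP bit on Cortex-A9, read through `c1, c0, 1`). As a rough illustration, the same masking can be written in C on plain integers. The macro and function names below are hypothetical, and the code only models the bit arithmetic; it does not touch any coprocessor register.

```c
#include <assert.h>
#include <stdint.h>

/* Bit masks taken from the immediates used in the assembly routine. */
#define SCTLR_C_BIT   (UINT32_C(1) << 2)  /* #4: data cache enable (C bit) */
#define ACTLR_SMP_BIT (UINT32_C(1) << 6)  /* #0x40: SMP bit on Cortex-A9 */

static_assert(SCTLR_C_BIT == 0x4, "C bit mask must match the bic immediate");
static_assert(ACTLR_SMP_BIT == 0x40, "SMP bit mask must match the bic immediate");

/* Model of "bic r3, #4": the value that would be written back via mcr. */
static uint32_t sctlr_clear_c_bit(uint32_t sctlr)
{
    return sctlr & ~SCTLR_C_BIT;
}

/* Model of "bic r0, r0, #0x40": drop the CPU out of SMP coherency. */
static uint32_t actlr_exit_smp(uint32_t actlr)
{
    return actlr & ~ACTLR_SMP_BIT;
}
```

In the routine itself the order is what counts: the C bit is cleared first so that L1 stops allocating, the cache is then flushed, and only afterwards is the SMP bit cleared.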
## L2 Management: The Odd One Out (1/2)

- L310 memory mapped device (aka outer cache)
- Clearing C bit does NOT prevent allocation
- L2 RAM retention, data sitting in L2, not accessible if MMU is off
- If not invalidated, L2 might contain stale data if resume code runs with L2 off before enabling it
- We could clean some specific bits: which ones?
- If retained, L2 must be resumed before turning MMU on

## L2 Management: The Odd One Out (2/2)

- if L2 content is lost, it must be cleaned on shutdown but can be resumed in C
- if L2 is retained, it must be resumed in assembly before calling `cpu_resume`

## The Missing Bit: Security Management

- L2 management in non-secure mode is fragmented
- No standard, no way to have a unified solution
- Something we should focus on
- Implications (limitations) on hotplug and CPU logical numbering

# Putting Everything Together (1/2)

#### Common Idle Entry

```
int enter_idle(unsigned cstate, unsigned rstate, unsigned flags)
{
    /* Mark this CPU idle; the last CPU to go idle sets the cluster state */
    __cpu_set(cpu_index, cpuidle_mask);
    if (cpumask_weight(cpuidle_mask) == num_online_cpus())
        cluster->power_state = rstate;

    cpu_pm_enter();
    if (cluster->power_state >= SHUTDOWN)
        cpu_cluster_pm_enter();

    cpu_suspend(0, suspend_finisher);

    cpu_pm_exit();
    if (cluster->power_state >= SHUTDOWN)
        cpu_cluster_pm_exit();
    __cpu_clear(cpu_index, cpuidle_mask);
    return 0;
}
```

# Putting Everything Together (2/2)

[Figure: putting everything together]

## **Origen Test Case**

- Standard dual-core A9 system
- Deep idle C-states are hit if one CPU is hotplugged out
- Lots of duplicated code (L2, GIC)
- Lots of cruft removed at Linaro Connect; it took us two hours

## **CPU idle and CPU hotplug**

- ARM CPUs on an SMP cluster can be powered down independently
- Most platforms require hotplug for logical cpu id != 0 before enabling deep C-states
- Mechanism created for a purpose and used for the wrong one (high latencies)

#### CPU idle + sched\_mc

- let the scheduler migrate threads in a power efficient manner
- ... and idle does the rest

#### **Conclusion**

- Single CPU uncoordinated shutdown becoming more and more important as the number of cores grows
- idle + sched\_mc long term solution to manage core idleness
- PM notifiers and CPU/L2 suspend/resume code provide the basis for common PM for all ARM platforms
- Lots of platform code getting cleaned up and consolidated
- Outlook
  - Cluster support
  - Security management

# THANKS !!!