Resource Isolation Techniques: CPU Isolation

朵拉爱生活 · Posted on 2024-10-9 09:41:32

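Author of this issue: SYS-OS

The Operating System (SYS-OS) team in Bilibili's (B站) System Department provides OS-level system software support for the company, covering kernel optimization, system tools, OS images, and hardware/software integration.

01 Colocation: concept and current state

The colocation techniques discussed here deploy online services (usually latency-sensitive, high-priority tasks) and offline tasks (usually latency-insensitive, low-priority tasks) on the same physical machine, in order to raise resource utilization and reduce cost. In a colocation scenario, the goal of CPU isolation is that online tasks "suppress" offline tasks whenever they need to run, and offline tasks use the idle CPU when online tasks are not running.

On an upstream kernel, colocation can rely on cpu shares, priorities, and similar mechanisms so that online services suppress offline tasks as much as possible. In practice the result is not ideal, because kernel CFS protects tasks with two minimum time granularities: sched_min_granularity_ns and sched_wakeup_granularity_ns. When an online task and an offline task are scheduled onto the same core and compete for execution time, an offline task that has taken the CPU can keep running for at least sched_min_granularity_ns, and the wakeup latency of the online task can reach sched_wakeup_granularity_ns. The scheduling latency of the online service can therefore be large, which degrades its performance. In addition, if online and offline tasks land on a pair of sibling hyperthreads (HT), they interfere with each other. Core scheduling [1] exists, but it was not designed for colocation and its design and implementation overhead are high. All of this directly affects online performance.

In "B站云原生混部技术实践" [2] we reported that colocated machines reach an average CPU utilization of 35%. To push colocation to larger scale, we are considering optimizing the scheduling framework in two directions: kernel-level isolation and observability. To that end we studied the Group Identity (GI) feature of the open-source kernel from the OpenAnolis (龙蜥) community. GI attaches an identity to each CPU cgroup to distinguish the priority of the tasks in that cgroup, which gives isolation at the CPU level. This article analyzes the main parts of the feature and then runs a simulated colocation test.

02 The CFS model

The Linux kernel provides five scheduling classes by default. Two are common in real workloads: CFS and the real-time scheduler. Before analyzing GI, here is a brief look at the CFS (Completely Fair Scheduler) model.

The kernel documentation [3] describes CFS as follows: 80% of CFS's design can be summed up in a single sentence: CFS basically models an "ideal, precise multi-tasking CPU" on real hardware. An "ideal multi-tasking CPU" is a CPU that has 100% physical power and can run each task at precisely equal speed, in parallel, each at 1/nr_running speed.

Each CPU has a run queue rq, and each rq has a cfs_rq. The cfs_rq contains a red-black tree rb_tree that links the scheduling entities (se), and only one se can run on the CPU at a time. Keeping the run times of all se on a cfs_rq as close as possible at every moment would require switching between them constantly, but frequent switching has a cost, so the number of switches must be reduced. To lower the cost while keeping run times balanced, the kernel sorts the se of a cfs_rq on the rb_tree by run time, so that taking the leftmost se always yields the one that has run the least.

Entities also have priorities, and entities with different priorities receive different amounts of CPU time (a comment in the kernel source [4] notes that one nice level corresponds to a difference of roughly 10% CPU). The kernel therefore converts real run time, scaled by weight, into a comparable value called virtual runtime, vruntime. CFS then only has to keep the vruntime of all tasks equal.

Each time an se is picked to run and uses up its time slice, it is put back into the cfs_rq at the appropriate position in the red-black tree. A task may also give up the CPU before its slice is used up, for example by sleeping (TASK_INTERRUPTIBLE) or by waiting for a resource (TASK_UNINTERRUPTIBLE). It is then dequeued and waits off the tree until the corresponding event occurs, at which point it is put back into the red-black tree to wait for scheduling.

If an se slept for a long time, its vruntime may be very small when it is re-enqueued, so it would get the CPU quickly. To keep it from running unchecked, the cfs_rq maintains min_vruntime. If the newly woken vruntime(se) [...]

[...] task_tick_fair--->entity_tick--->check_preempt_tick

    static void
    check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
    {
        ...
        ideal_runtime = sched_slice(cfs_rq, curr);                           /* 1 */
        delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
        if (delta_exec > ideal_runtime) {                                    /* 2 */
            resched_curr(rq_of(cfs_rq));
            clear_buddies(cfs_rq, curr);
            return;
        }
        if (should_expel_se(rq_of(cfs_rq), curr)) {                          /* 3 */
            resched_curr(rq_of(cfs_rq));
            return;
        }
    #ifdef CONFIG_GROUP_IDENTITY                                             /* 4 */
        /*
         * The source text is damaged between here and marker 7. The lines
         * below are reconstructed from notes 4-6 and upstream v5.10; the
         * exact GI code, including the #ifdef layout, may differ.
         */
        if (is_highclass(curr) && delta_exec < sysctl_sched_min_granularity)
            return;
    #endif
        se = __pick_first_entity(cfs_rq);                                    /* 5 */
        if (!se)
            return;
        delta = curr->vruntime - se->vruntime;                               /* 6 */
        if (delta < 0)
            return;
        if (delta > ideal_runtime || id_preempt_all(curr, se) == 1)          /* 7 */
            resched_curr(rq_of(cfs_rq));
    }

1. ideal_runtime is the theoretical run time derived from the task's weight; delta_exec is the time the current task has actually run up to this update.
2. If the actual run time exceeds the time slice assigned to the task, resched_curr is called to set the TIF_NEED_RESCHED flag and trigger preemption.
3. should_expel_se is a detail of GI's SMT expeller technique. It is analyzed in section 3.3 (open item 1).
4. To prevent excessive preemption, native CFS compares delta_exec with sysctl_sched_min_granularity so that every task runs for at least the minimum granularity. GI skips this check when the current task is normal or underclass, so that highclass tasks can take the CPU promptly.
5. __pick_first_entity is also modified by GI. It is analyzed in section 3.3 (open item 2).
6. The se with the smallest vruntime is taken from the rb_tree. If the vruntime of the current task is still smaller than that of the leftmost se, no preemption should be triggered.
7. Native CFS triggers preemption only when the vruntime difference exceeds ideal_runtime. GI ignores this condition when se has a higher priority than curr, so that high-priority tasks preempt in time.

3.2 Wakeup granularity

This is how GI handles the logic around sched_wakeup_granularity_ns:

    static int
    wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
    {
        int ret;
        s64 gran, vdiff = curr->vruntime - se->vruntime;

        ret = id_preempt_underclass(curr, se);
        if (ret)
            return ret;

        /*
         * The source text is damaged between "vdiff" and "gran". The lines
         * below are reconstructed from upstream v5.10 and the explanation
         * that follows; the exact GI code may differ.
         */
        if (vdiff <= 0)
            return -1;

        ret = id_preempt_highclass(curr, se);
        if (ret)
            return ret;

        gran = wakeup_gran(se);
        if (vdiff > gran)
            return 1;

        return 0;
    }

    static inline int
    id_preempt_underclass(struct sched_entity *curr, struct sched_entity *se)
    {
        bool under_curr = is_underclass(curr);
        bool under_se = is_underclass(se);

        if (under_curr == under_se) {
    #ifdef CONFIG_SCHED_SMT
            /* Full of expellee is also underclass when on expel */
            if (rq_on_expel(rq_of(cfs_rq_of(curr)))) {
                bool expel_curr = expellee_se(curr);
                bool expel_se = expellee_se(se);

                if (expel_curr != expel_se)
                    return expel_se ? -1 : 1;
            }
    #endif
            return 0;
        }

        return under_se ? -1 : 1;
    }

    static inline int
    id_preempt_highclass(struct sched_entity *curr, struct sched_entity *se)
    {
        bool high_curr = is_highclass(curr);
        bool high_se = is_highclass(se);

        if (high_curr == high_se)
            return 0;

        return high_curr ? -1 : 1;
    }

wakeup_preempt_entity decides whether se may preempt curr. Native CFS decides by comparing vdiff with gran. GI calls id_preempt_underclass and id_preempt_highclass inside wakeup_preempt_entity, with two effects. First, highclass and normal tasks can always preempt underclass tasks unconditionally on wakeup. Second, if the running task is normal and a highclass task wakes up with a vruntime smaller than that of the normal task, the highclass task preempts regardless of the wakeup granularity. Both reduce the wakeup latency of online services.

3.3 Hyperthread interference and the SMT expeller

For hyperthread interference, GI uses the SMT expeller technique. It introduces the SMT_EXPELLER identity so that, while such a task runs, underclass tasks on the SMT sibling are expelled. Concretely, the sibling does not pick underclass tasks to run. Underclass tasks are kept on a separate low-priority rb_tree, so hiding these tasks in pick_next_task is enough to expel them.

    static inline bool should_expel_se(struct rq *rq, struct sched_entity *se)
    {
        return rq_on_expel(rq) && !is_expel_immune(se);
    }

    static inline bool rq_on_expel(struct rq *rq)
    {
        return rq->on_expel;
    }

    static inline bool is_expel_immune(struct sched_entity *se)
    {
        return __is_expel_immune(se, false);
    }

    static inline bool __is_expel_immune(struct sched_entity *se, bool wakeup)
    {
        bool ret = true;

        /* To expel if hierarchy contain underclass identity */
        rcu_read_lock();
        for_each_sched_entity(se) {
            if (is_underclass(se) ||
                (!wakeup && expellee_se(se))) {
                ret = false;
                break;
            }
        }
        rcu_read_unlock();

        return ret;
    }

    static inline bool expellee_se(struct sched_entity *se)
    {
        return se->my_q && !se->my_q->h_nr_expel_immune;
    }

should_expel_se returns true when an ID_SMT_EXPELLER task is running on the SMT sibling and the current se is an underclass task or contains only underclass tasks. In that case the se should be expelled, which is why check_preempt_tick above also triggers preemption when the se needs to be expelled. This closes open item 1.

For the CFS scheduler, picking the next task corresponds to pick_next_task_fair, which calls pick_next_entity to choose the most suitable se from the run queue. The GI version of pick_next_entity is:

    static struct sched_entity *
    pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
    {
        struct sched_entity *left = __pick_first_entity(cfs_rq);             /* 1 */
        struct sched_entity *se;

        if (!left || (curr && id_entity_before(curr, left)))                 /* 2 */
            left = should_expel_se(rq_of(cfs_rq), curr) ? left : curr;

        se = left; /* ideally we run the leftmost entity */

        if (cfs_rq->skip == se) {                                            /* 3 */
            struct sched_entity *second;

            if (se == curr) {
                second = __pick_first_entity(cfs_rq);
            } else {
                second = __pick_next_entity(se);
                if (!second || (curr && id_entity_before(curr, second)))
                    second = curr;
            }

            /* Comparisons below are restored from upstream v5.10. */
            if (second && wakeup_preempt_entity(second, left) < 1)
                se = second;
        }

        if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
            se = cfs_rq->next;
        } else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) {
            se = cfs_rq->last;
        }

        clear_buddies(cfs_rq, se);

        if (rq_on_expel(rq_of(cfs_rq)))
            update_expel_start(cfs_rq, se);

        return se;
    }

The logic of pick_next_entity [5] in native CFS can be summarized as:

1. Take the leftmost node of the red-black tree (smallest vruntime) as left.
2. If left does not exist, or the running task curr satisfies vruntime(curr) < vruntime(left), use curr as left.
3. cfs_rq->skip records the se to be skipped. If the chosen se is cfs_rq->skip, pick the next-best candidate second; if second is better than left, set se to second.
4. cfs_rq->next records an se that really wants to run. Compare cfs_rq->next with left; if cfs_rq->next is better, set se to cfs_rq->next.
5. To exploit cache locality, cfs_rq->last records the se that last held the CPU. Compare cfs_rq->last with left; if cfs_rq->last is better, set se to cfs_rq->last.

Compared with native CFS, GI modifies step 1, where __pick_first_entity chooses the leftmost node. In step 2, which compares the vruntime of left and curr, GI uses should_expel_se: if curr should be expelled, vruntime is ignored so that an underclass task is not chosen. The wakeup_preempt_entity calls in steps 3, 4, and 5 carry the modification described in section 3.2.

In GI, __pick_first_entity calls id_rb_first_cached to choose the leftmost node:

    static inline struct rb_node *id_rb_first_cached(struct cfs_rq *cfs_rq)
    {
        int i;
        struct rb_node *left;
        struct rb_root_cached *roots[2] = {
            &cfs_rq->tasks_timeline,
            &cfs_rq->under_timeline,
        };

        check_expellee_se(cfs_rq);

        if (rq_on_expel(rq_of(cfs_rq)))
            return skip_expellee_se(cfs_rq);

        update_expel_spread(cfs_rq);

        /* "< cfs_rq->" is restored here; it was stripped from the source. */
        if (cfs_rq->min_under_vruntime + get_expel_spread(cfs_rq) < cfs_rq->min_vruntime) {
            roots[0] = &cfs_rq->under_timeline;
            roots[1] = &cfs_rq->tasks_timeline;
        }

        for (i = 0; i /* [...] source lost from here ... */

    /* ... up to this point, inside a list walk over cfs_rq->expel_list: */
        /* [...] */ &cfs_rq->expel_list, expel_node) {
            if (rq_on_expel(rq_of(cfs_rq)) && expellee_se(se))
                continue;

            list_del_init(&se->expel_node);
            place_entity(cfs_rq, se, 0);
            __enqueue_entity(cfs_rq, se);
        }
    }

    static inline struct rb_node *skip_expellee_se(struct cfs_rq *cfs_rq)
    {
        struct rb_node *left = rb_first_cached(&cfs_rq->tasks_timeline);

        while (left) {
            struct sched_entity *se =
                rb_entry(left, struct sched_entity, run_node);

            if (!expellee_se(se))
                break;

            left = rb_next(&se->run_node);
            __dequeue_entity(cfs_rq, se);
            list_add_tail(&se->expel_node, &cfs_rq->expel_list);
        }

        return left;
    }

Consider services that use cgroups: a highclass parent may contain both highclass and underclass children. GI skips a highclass parent whose children are all underclass by temporarily taking it off the red-black tree, until the expel ends or the parent no longer contains only underclass children.

Open item 2 can now be closed as well. Step 5 of check_preempt_tick calls __pick_first_entity. If no se is found, the only possibility is that the rq is on expel and the running task is not an expellee, so no preemption should happen.

GI has further identities and scheduling features that help lower the scheduling latency of highclass tasks. For example, ID_EXPELLER_SHARE_CORE spreads ID_SMT_EXPELLER tasks across different physical cores so that high-priority tasks do not interfere with each other, and load balancing is also adjusted for tasks of different priorities. They are not covered here for reasons of length.

04 Simulated colocation test

4.1 Test description

To measure how the GI approach performs, we used docker to deploy 8 schbench instances as online services and 8 sysbench instances as simulated offline tasks, and collected the "*99.0th" figure reported by schbench for the online services to quantify the isolation effect. cpu.shares was set so that online services have higher priority than offline tasks. To avoid cross-NUMA interference, the tests ran on CPUs of a single NUMA node. The baseline condition was online utilization from 10% to 30% with offline tasks colocated accordingly. We compared four strategies at single-node utilization of 40%, 50%, 60%, and 70%:

No colocation: online and offline run separately.
Plain colocation: online and offline run together, with no CPU restriction.
Colocation with CPU pinning: online and offline run together; online is unrestricted and offline is restricted to a small set of CPUs.
Colocation with GI: online and offline run together, with no CPU restriction and GI enabled.

Test machine:
cpu: Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz, 2 Sockets, 16 Cores per Socket, 2 Threads per Core
OS: Debian 9.13
kernel: upstream Linux 5.10.103+GI

4.2 Test results

We compared the strategies in two ways:

1. With single-node utilization of 40%, 50%, 60%, and 70% as the baseline, compare the online latency of the four strategies.
2. With online utilization of 10%, 15%, 20%, 25%, and 30% as the baseline, compare how raising offline utilization interferes with the online service.

Results with single-node utilization as the baseline:

Figure: single-node utilization 40%, online p99 latency comparison
Figure: single-node utilization 50%, online p99 latency comparison
Figure: single-node utilization 60%, online p99 latency comparison
Figure: single-node utilization 70%, online p99 latency comparison

Results with online utilization as the baseline:

Figure: online 10%, increasing offline colocation rate, online p99 latency comparison
Figure: online 15%, increasing offline colocation rate, online p99 latency comparison
Figure: online 20%, increasing offline colocation rate, online p99 latency comparison
Figure: online 25%, increasing offline colocation rate, online p99 latency comparison
Figure: online 30%, increasing offline colocation rate, online p99 latency comparison

4.3 Test conclusions

From these tests we conclude:

1. With single-node utilization as the baseline, online latency rises with single-node utilization under all four strategies. At the same single-node utilization, online latency ranks as: plain colocation > colocation with CPU pinning > GI > no colocation. Apart from no colocation, GI performs better than both CPU pinning and plain colocation.
2. With online utilization as the baseline, as more offline tasks are colocated, interference with the online service ranks as: plain colocation > colocation with CPU pinning > GI > no colocation. Under plain colocation and CPU pinning, rising offline utilization visibly interferes with the online service and raises its latency, while under GI the online p99 latency stays comparatively stable.

05 Conclusion and outlook

Starting from the CPU isolation problem in colocation, this article analyzed the Group Identity feature of the OpenAnolis open-source kernel and ran a simulated colocation test based on it. The results indicate that Group Identity gives high-priority tasks more scheduling opportunities, which reduces their scheduling latency and limits the impact of low-priority tasks on them. For our future large-scale colocation, it may be a good option at the CPU isolation level.

We also found that the GI feature in the OpenAnolis kernel covers many scenarios, which makes its design complex and heavy. Bilibili's in-house kernel has strengthened observability for CPU scheduling and memory. We use real production data to locate the shortcomings of open-source solutions, and we will build the CPU isolation capability best suited to Bilibili's colocation to address them and keep reducing cost and improving efficiency.

That is all for this issue. If you have thoughts or questions, you are welcome to discuss them with us in the comments, and if you enjoyed this issue, please give us a like.

References:
[1] https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/core-scheduling.html
[2] B站云原生混部技术实践
[3] https://www.kernel.org/doc/html/latest/scheduler/sched-design-CFS.html
[4] https://elixir.bootlin.com/linux/v5.10.103/source/kernel/sched/core.c#L8440
[5] https://elixir.bootlin.com/linux/v5.10.103/source/kernel/sched/fair.c#L4481
[6] https://gitee.com/anolis/cloud-kernel
[7] https://help.aliyun.com/document_detail/338407.html
[8] 深入理解Linux进程调度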
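Appendix: a userspace sketch of the wakeup preemption order from section 3.2

The decision order described in section 3.2 (the underclass check first, then the vruntime comparison, then the highclass check, then the wakeup granularity) can be tried outside the kernel. The following is a minimal model under stated assumptions, not the GI implementation: the names gi_class, model_se, and the model_* functions are hypothetical, each entity carries exactly one identity, the SMT expel branch is left out, the granularity is passed in as a plain number instead of being computed by wakeup_gran, and the position of the highclass check after the "vdiff <= 0" test is inferred from the article's description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
enum gi_class { GI_UNDERCLASS, GI_NORMAL, GI_HIGHCLASS };

struct model_se {
    enum gi_class cls;  /* assumed: one identity per entity */
    int64_t vruntime;   /* virtual runtime in nanoseconds */
};

/* Models id_preempt_underclass without the CONFIG_SCHED_SMT branch. */
static int model_preempt_underclass(const struct model_se *curr,
                                    const struct model_se *se)
{
    bool under_curr = curr->cls == GI_UNDERCLASS;
    bool under_se = se->cls == GI_UNDERCLASS;

    if (under_curr == under_se)
        return 0;
    return under_se ? -1 : 1;
}

/* Models id_preempt_highclass. */
static int model_preempt_highclass(const struct model_se *curr,
                                   const struct model_se *se)
{
    bool high_curr = curr->cls == GI_HIGHCLASS;
    bool high_se = se->cls == GI_HIGHCLASS;

    if (high_curr == high_se)
        return 0;
    return high_curr ? -1 : 1;
}

/*
 * Returns 1 if se should preempt curr, -1 if it must not, and 0 if the
 * vruntime gap is positive but within the granularity.
 */
int model_wakeup_preempt(const struct model_se *curr,
                         const struct model_se *se, int64_t gran)
{
    int64_t vdiff = curr->vruntime - se->vruntime;
    int ret = model_preempt_underclass(curr, se);

    if (ret)
        return ret;          /* identity decides, vruntime is ignored */
    if (vdiff <= 0)
        return -1;           /* se has not run less than curr */
    ret = model_preempt_highclass(curr, se);
    if (ret)
        return ret;          /* highclass skips the granularity check */
    return vdiff > gran ? 1 : 0;
}

/* Small self-check that doubles as a usage example. */
void model_self_check(void)
{
    struct model_se offline = { GI_UNDERCLASS, 0 };
    struct model_se online = { GI_HIGHCLASS, 5000000 };
    struct model_se normal = { GI_NORMAL, 6000000 };
    const int64_t gran = 1000000;

    /* Online wakes while offline runs: preempts despite larger vruntime. */
    assert(model_wakeup_preempt(&offline, &online, gran) == 1);
    /* Offline wakes while online runs: never preempts. */
    assert(model_wakeup_preempt(&online, &offline, gran) == -1);
    /* Highclass wakes with vdiff equal to gran: still preempts. */
    assert(model_wakeup_preempt(&normal, &online, gran) == 1);
}
```

Calling model_self_check() from a main function exercises the three identity cases. For two normal entities the model falls back to the plain granularity comparison, which is the behavior the article attributes to native CFS.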
