Docker series: understanding cgroups

By cikenerd, published 2019-06-28 16:38

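Preface

To understand Docker, it helps to look at five areas: namespaces, cgroups, union file systems, the runtime (runC), and networking. This series covers each of them in turn.

Docker series: understanding namespaces

Docker series: understanding cgroups

Docker series: understanding unionfs

Docker series: understanding runC

Docker series: understanding network modes

Namespaces provide isolation. cgroups provide resource limits. The union file system handles layered storage and management of images. runC is the runtime; it follows the OCI interface and is generally based on libcontainer. Networking covers Docker's single-host networking and its multi-host communication modes.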

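Introduction to cgroups

What are cgroups?

Cgroup is short for control group, a Linux kernel feature used to limit and isolate the system resources consumed by a group of processes, in other words, to provide resource QoS. The resources involved are mainly CPU, memory, block I/O, and network bandwidth. Cgroups entered the mainline kernel in 2.6.24, and all major distributions now enable them by default.
Cgroups provide four main functions:

Resource limitation: cgroups can cap the total resources used by a group of processes. For example, you can set an upper bound on the memory an application may use at run time; once usage exceeds that quota, an OOM (Out of Memory) condition is triggered.

Prioritization: the number of CPU time slices and the amount of disk I/O bandwidth allocated to a group effectively control the priority of its processes.

Accounting: cgroups can record resource usage, such as CPU time consumed and memory used, which makes them well suited to billing.

Control: cgroups can suspend and resume a group of processes.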

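The three components of cgroups

cgroup (control group). A cgroup is a mechanism for managing processes in groups. Each cgroup contains a set of processes, and parameters of the various Linux subsystems can be configured on it, which ties a set of processes to a set of subsystem parameters.

subsystem. A subsystem is a module that controls one kind of resource. Subsystems are described in detail below.

hierarchy. A hierarchy arranges a set of cgroups into a tree; each such tree is one hierarchy, and the tree structure is what lets cgroups inherit settings. For example, suppose a group of scheduled-task processes has its CPU usage limited through cgroup1, and one of them, a job that periodically dumps logs, also needs its disk I/O limited. To avoid affecting the other processes, you can create cgroup2 as a child of cgroup1 and limit disk I/O there. cgroup2 then inherits the CPU limit of cgroup1 and adds the disk I/O limit, without affecting the other processes in cgroup1.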

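cgroups subsystems

The subsystems implemented by cgroups and their purposes are:

devices: controls access permissions to devices.

cpuset: assigns specific CPUs and memory nodes.

cpu: controls the share of CPU time.

cpuacct: reports CPU usage statistics.

memory: limits the maximum memory usage.

freezer: freezes (suspends) the processes in a cgroup.

net_cls: works with tc (traffic control) to limit network bandwidth.

net_prio: sets the priority of a process's network traffic.

hugetlb: limits HugeTLB usage.

perf_event: lets the perf tool do performance monitoring per cgroup.

Each subsystem's directory contains more detailed settings, for example: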
cpu

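CPU control offers two policies. One is the Completely Fair Scheduler (CFS) policy, which supports both hard quotas and proportional shares. The other is the real-time scheduler policy, which grants real-time processes a fixed amount of run time per period. All time values are configured in microseconds (μs), indicated by "us" in the file names.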
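The quota style of CFS control can be illustrated with a little arithmetic. The sketch below assumes the cgroup v1 files cpu.cfs_quota_us and cpu.cfs_period_us, which the text above does not name explicitly; cpuLimit is a hypothetical helper, not part of Docker or runc.

```go
package main

import "fmt"

// cpuLimit converts the values of cpu.cfs_quota_us and cpu.cfs_period_us
// (both in microseconds) into the number of CPUs' worth of time a cgroup may
// consume per period. The second result is false when no quota applies,
// which cgroup v1 expresses as a quota of -1.
func cpuLimit(quotaUs, periodUs int64) (float64, bool) {
	if quotaUs <= 0 || periodUs <= 0 {
		return 0, false
	}
	return float64(quotaUs) / float64(periodUs), true
}

func main() {
	// 50ms of run time in every 100ms period is half of one CPU.
	if cpus, ok := cpuLimit(50000, 100000); ok {
		fmt.Printf("limit: %.2f CPUs\n", cpus)
	}
	// A quota of -1 means the group is not throttled by CFS bandwidth control.
	if _, ok := cpuLimit(-1, 100000); !ok {
		fmt.Println("limit: none")
	}
}
```

A quota larger than the period is valid and simply means more than one CPU, for example 200000/100000 for two CPUs.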

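cpuset (CPU binding):

Besides limiting how much CPU is used, cgroups can bind tasks to specific CPUs so that they run only on those CPUs; this is what the cpuset subsystem does. It can bind memory nodes as well as CPUs.
Before a task is added to the cpuset's tasks file, the cpuset.cpus and cpuset.mems parameters must be set.

cpuset.cpus: the CPUs that tasks in the cgroup may use, as a comma-separated list in which a hyphen (-) denotes a range. For example, 0-2,7 means CPUs 0, 1, 2, and 7.

cpuset.mems: the memory nodes that tasks in the cgroup may use, in the same format as cpuset.cpus.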
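To make the list format concrete, here is a minimal sketch of a parser for values such as 0-2,7. parseCPUList is a hypothetical helper written for this article; it only models the syntax described above and does not check that the CPUs exist on the machine.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUList expands a cpuset-style list such as "0-2,7" into the
// individual CPU (or memory node) numbers it names.
func parseCPUList(s string) ([]int, error) {
	var cpus []int
	s = strings.TrimSpace(s)
	if s == "" {
		return cpus, nil
	}
	for _, part := range strings.Split(s, ",") {
		bounds := strings.SplitN(strings.TrimSpace(part), "-", 2)
		first, err := strconv.Atoi(bounds[0])
		if err != nil {
			return nil, fmt.Errorf("bad CPU number in %q: %w", part, err)
		}
		last := first
		if len(bounds) == 2 {
			if last, err = strconv.Atoi(bounds[1]); err != nil {
				return nil, fmt.Errorf("bad CPU number in %q: %w", part, err)
			}
		}
		if last < first {
			return nil, fmt.Errorf("descending range %q", part)
		}
		for c := first; c <= last; c++ {
			cpus = append(cpus, c)
		}
	}
	return cpus, nil
}

func main() {
	cpus, err := parseCPUList("0-2,7")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(cpus) // prints [0 1 2 7]
}
```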

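memory:

memory.limit_in_bytes: the hard limit on memory usage. The value accepts the suffixes k, m, and g; -1 means no limit.

memory.soft_limit_in_bytes: the soft limit, which is only meaningful when set lower than the hard limit. The format is the same as above. When memory is tight system-wide, tasks are held to their soft limit so that fewer processes are starved of memory. Note that adding memory limits does not eliminate resource contention.

memory.memsw.limit_in_bytes: the limit on memory and swap usage combined. The format is the same as above.

The parameters used for monitoring and statistics deserve a separate mention, for example the ones cAdvisor collects.

memory.usage_in_bytes: reports the total memory currently used by the processes in the cgroup, in bytes.

memory.max_usage_in_bytes: reports the maximum memory used by the processes in the cgroup.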
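The value format accepted by the limit files can be sketched as follows. parseMemLimit is a hypothetical helper, and treating k, m, and g as powers of 1024 is an assumption of this sketch rather than something the text above states.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMemLimit converts a limit such as "512m", "2g", or "-1" into bytes.
// It returns -1 for "-1", meaning no limit. The k/m/g suffixes are treated
// as powers of 1024 (an assumption of this sketch).
func parseMemLimit(s string) (int64, error) {
	s = strings.ToLower(strings.TrimSpace(s))
	if s == "-1" {
		return -1, nil
	}
	mult := int64(1)
	switch {
	case strings.HasSuffix(s, "k"):
		mult = 1 << 10
	case strings.HasSuffix(s, "m"):
		mult = 1 << 20
	case strings.HasSuffix(s, "g"):
		mult = 1 << 30
	}
	digits := s
	if mult != 1 {
		digits = s[:len(s)-1] // drop the one-letter suffix
	}
	n, err := strconv.ParseInt(digits, 10, 64)
	if err != nil || n < 0 {
		return 0, fmt.Errorf("invalid memory limit %q", s)
	}
	return n * mult, nil
}

func main() {
	for _, v := range []string{"512m", "2g", "-1"} {
		b, err := parseMemLimit(v)
		fmt.Println(v, b, err)
	}
}
```

A soft limit parsed this way is only useful when it is smaller than the hard limit, as noted above.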

How Docker uses cgroups

Create a container

# Run a container that will spawn 300 processes.
docker run cirocosta/stress pid  -n 300
Starting to spawn 300 blocking children
[1] Waiting for SIGINT

# Open another window and see that we have 300
# PIDS
docker stats
CONTAINER      …   MEM USAGE / LIMIT          PIDS
a730051832     …   21.02MiB / 1.951GiB     300

Verify that Docker has set up cgroups for this container

# Let's get the ID of the container. Docker uses that ID
# to name things on the host, so we can probably use it to
# find the cgroup created for the container
# under the parent docker cgroup
docker ps
CONTAINER ID        IMAGE               COMMAND       
a730051832e7        cirocosta/stress    "pid -n 300"  

# Having the prefix in hand, let's search for it under the
# mountpoint for cgroups in our system
find /sys/fs/cgroup/ -name "a730051832e7*"
 
/sys/fs/cgroup/cpu,cpuacct/docker/a730051832e7d776442b2e969e057660ad108a7d6e6a30569398ec660a75a959
/sys/fs/cgroup/cpuset/docker/a730051832e7d776442b2e969e057660ad108a7d6e6a30569398ec660a75a959
/sys/fs/cgroup/devices/docker/a730051832e7d776442b2e969e057660ad108a7d6e6a30569398ec660a75a959
/sys/fs/cgroup/pids/docker/a730051832e7d776442b2e969e057660ad108a7d6e6a30569398ec660a75a959
/sys/fs/cgroup/freezer/docker/a730051832e7d776442b2e969e057660ad108a7d6e6a30569398ec660a75a959
/sys/fs/cgroup/perf_event/docker/a730051832e7d776442b2e969e057660ad108a7d6e6a30569398ec660a75a959
/sys/fs/cgroup/blkio/docker/a730051832e7d776442b2e969e057660ad108a7d6e6a30569398ec660a75a959
/sys/fs/cgroup/memory/docker/a730051832e7d776442b2e969e057660ad108a7d6e6a30569398ec660a75a959
/sys/fs/cgroup/net_cls,net_prio/docker/a730051832e7d776442b2e969e057660ad108a7d6e6a30569398ec660a75a959
/sys/fs/cgroup/hugetlb/docker/a730051832e7d776442b2e969e057660ad108a7d6e6a30569398ec660a75a959
/sys/fs/cgroup/systemd/docker/a730051832e7d776442b2e969e057660ad108a7d6e6a30569398ec660a75a959

# There they are! Docker creates a control group with the name
# being the exact ID of the container under all the subsystems.

# What can we discover from this inspection? We can look at the
# subsystem that we want to place constraints on (PIDs), for instance:

 tree /sys/fs/cgroup/pids/docker/a7300518327d...
/sys/fs/cgroup/pids/docker/a73005183...
├── cgroup.clone_children
├── cgroup.procs
├── notify_on_release
├── pids.current
├── pids.events
├── pids.max
└── tasks

# This means that if we want to know how many PIDs are in use right
# now we can look at "pids.current"; to know the limit, "pids.max"; and
# to know which processes have been assigned to this control group,
# "tasks". Let's do it:
cat /sys/fs/cgroup/pids/docker/a730...c660a75a959/tasks 
5329
5371
5372
5373
5374
5375
5376
5377
(...)
# continues until the 300th entry - as we have 300 processes in this container

# 300 pids
cat /sys/fs/cgroup/pids/docker/a730051832e7d7764...9/pids.current
300

# no max set
cat /sys/fs/cgroup/pids/docker/a730051832e7d77.../pids.max 
max
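The same checks can be done programmatically. As a minimal sketch, the hypothetical helper below interprets the content of a pids.max file, where the literal max seen in the output above stands for "no limit"; reading the file from /sys/fs/cgroup is left out so the example runs anywhere.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parsePidsMax interprets the content of a pids.max file. The literal "max"
// means no limit; anything else must be a non-negative integer.
func parsePidsMax(content string) (limit int64, unlimited bool, err error) {
	v := strings.TrimSpace(content)
	if v == "max" {
		return 0, true, nil
	}
	limit, err = strconv.ParseInt(v, 10, 64)
	if err != nil || limit < 0 {
		return 0, false, fmt.Errorf("invalid pids.max value %q", v)
	}
	return limit, false, nil
}

func main() {
	// Sample file contents, including the trailing newline a real file has.
	for _, raw := range []string{"max\n", "500\n"} {
		limit, unlimited, err := parsePidsMax(raw)
		switch {
		case err != nil:
			fmt.Println("error:", err)
		case unlimited:
			fmt.Println("no PID limit set")
		default:
			fmt.Println("PID limit:", limit)
		}
	}
}
```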

When installing k8s, you will often run into the following error:

create kubelet: misconfiguration: kubelet cgroup driver: "cgroupfs"
is different from docker cgroup driver: "systemd"?

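The error message is already quite clear: docker and kubelet are configured with different cgroup drivers. docker supports two drivers, systemd and cgroupfs, and the runc code makes the difference easier to see.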

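cgroups can only limit CPU usage; they cannot guarantee it. With --cpuset-cpus you can make a container run on specific CPUs or cores, but you cannot ensure it has those CPUs to itself. --cpu-shares is a relative value that only takes effect when CPU is scarce: when there is enough CPU, every container gets as much as it needs, and when there is not, CPU time is divided among the containers according to their weights.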
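The proportional behavior of CPU shares under contention comes down to dividing each weight by the sum of all weights. The sketch below assumes every container is busy at the same time and competes for the same CPUs; cpuFractions is a hypothetical helper, and the share values are made-up examples.

```go
package main

import "fmt"

// cpuFractions returns the fraction of CPU time each container would get
// when all of them are busy at once, given their CPU share weights.
func cpuFractions(shares []uint64) []float64 {
	var total uint64
	for _, s := range shares {
		total += s
	}
	out := make([]float64, len(shares))
	if total == 0 {
		return out
	}
	for i, s := range shares {
		out[i] = float64(s) / float64(total)
	}
	return out
}

func main() {
	// Three containers with weights 1024, 512, and 512.
	for i, f := range cpuFractions([]uint64{1024, 512, 512}) {
		fmt.Printf("container %d: %.0f%% of CPU time under contention\n", i, f*100)
	}
}
```

When the machine is not saturated these fractions do not apply, since an idle container's unused time is available to the others.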

For memory, cgroups can cap the amount a container may use. The -m option sets that maximum.

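Reading the code

Readers can follow the runc source for the full details of its cgroups code; only an outline is given here.
A container is created by the factory's create method. For cgroups, the factory provides a way to construct a CgroupsManager according to the cgroup driver option in the configuration, with two implementations, systemd and cgroupfs: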

// SystemdCgroups is an options func to configure a LinuxFactory to return
// containers that use systemd to create and manage cgroups.
func SystemdCgroups(l *LinuxFactory) error {
    l.NewCgroupsManager = func(config *configs.Cgroup, paths map[string]string) cgroups.Manager {
        return &systemd.Manager{
            Cgroups: config,
            Paths:   paths,
        }
    }
    return nil
}

// Cgroupfs is an options func to configure a LinuxFactory to return containers
// that use the native cgroups filesystem implementation to create and manage
// cgroups.
func Cgroupfs(l *LinuxFactory) error {
    l.NewCgroupsManager = func(config *configs.Cgroup, paths map[string]string) cgroups.Manager {
        return &fs.Manager{
            Cgroups: config,
            Paths:   paths,
        }
    }
    return nil
}

The cgroup manager is abstracted as an interface:

type Manager interface {
    // Applies cgroup configuration to the process with the specified pid
    Apply(pid int) error

    // Returns the PIDs inside the cgroup set
    GetPids() ([]int, error)

    // Returns the PIDs inside the cgroup set & all sub-cgroups
    GetAllPids() ([]int, error)

    // Returns statistics for the cgroup set
    GetStats() (*Stats, error)

    // Toggles the freezer cgroup according with specified state
    Freeze(state configs.FreezerState) error

    // Destroys the cgroup set
    Destroy() error

    // The option func SystemdCgroups() and Cgroupfs() require following attributes:
    //     Paths   map[string]string
    //     Cgroups *configs.Cgroup
    // Paths maps cgroup subsystem to path at which it is mounted.
    // Cgroups specifies specific cgroup settings for the various subsystems

    // Returns cgroup paths to save in a state file and to be able to
    // restore the object later.
    GetPaths() map[string]string

    // Sets the cgroup as configured.
    Set(container *configs.Config) error
}

The methods of this interface are called while a container is being created. For example, in container_linux.go:

func (c *linuxContainer) Set(config configs.Config) error {
    c.m.Lock()
    defer c.m.Unlock()
    status, err := c.currentStatus()
    if err != nil {
        return err
    }
    ...
    if err := c.cgroupManager.Set(&config); err != nil {
        // Set configs back
        if err2 := c.cgroupManager.Set(c.config); err2 != nil {
            logrus.Warnf("Setting back cgroup configs failed due to error: %v, your state.json and actual configs might be inconsistent.", err2)
        }
        return err
    }
...
}

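Next, let's focus on the fs implementation.

In fs, almost every subsystem has its own source file, as shown in the figure above.

We will look at memory.go, the memory subsystem; the other subsystems are similar.
The key method: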

func (s *MemoryGroup) Apply(d *cgroupData) (err error) {
    path, err := d.path("memory")
    if err != nil && !cgroups.IsNotFound(err) {
        return err
    } else if path == "" {
        return nil
    }
    if memoryAssigned(d.config) {
        if _, err := os.Stat(path); os.IsNotExist(err) {
            if err := os.MkdirAll(path, 0755); err != nil {
                return err
            }
            // Only enable kernel memory accounting when this cgroup
            // is created by libcontainer, otherwise we might get
            // error when people use `cgroupsPath` to join an existed
            // cgroup whose kernel memory is not initialized.
            if err := EnableKernelMemoryAccounting(path); err != nil {
                return err
            }
        }
    }
    defer func() {
        if err != nil {
            os.RemoveAll(path)
        }
    }()

    // We need to join memory cgroup after set memory limits, because
    // kmem.limit_in_bytes can only be set when the cgroup is empty.
    _, err = d.join("memory")
    if err != nil && !cgroups.IsNotFound(err) {
        return err
    }
    return nil
}

d.path("memory") looks up the cgroup's memory path:

func (raw *cgroupData) path(subsystem string) (string, error) {
    mnt, err := cgroups.FindCgroupMountpoint(subsystem)
    // If we didn't mount the subsystem, there is no point we make the path.
    if err != nil {
        return "", err
    }

    // If the cgroup name/path is absolute do not look relative to the cgroup of the init process.
    if filepath.IsAbs(raw.innerPath) {
        // Sometimes subsystems can be mounted together as "cpu,cpuacct".
        return filepath.Join(raw.root, filepath.Base(mnt), raw.innerPath), nil
    }

    // Use GetOwnCgroupPath instead of GetInitCgroupPath, because the creating
    // process could in container and shared pid namespace with host, and
    // /proc/1/cgroup could point to whole other world of cgroups.
    parentPath, err := cgroups.GetOwnCgroupPath(subsystem)
    if err != nil {
        return "", err
    }

    return filepath.Join(parentPath, raw.innerPath), nil
}

d.join("memory") writes the pid into the memory path:

func (raw *cgroupData) join(subsystem string) (string, error) {
    path, err := raw.path(subsystem)
    if err != nil {
        return "", err
    }
    if err := os.MkdirAll(path, 0755); err != nil {
        return "", err
    }
    if err := cgroups.WriteCgroupProc(path, raw.pid); err != nil {
        return "", err
    }
    return path, nil
}

