Exploring runC (Part 2)

jzman, posted 2019-06-28 16:55 / 408 views


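Recap

This article continues from Exploring runC (Part 1).

As described there, newParentProcess() uses the configuration derived from config.json to build the variable initProcess. The main pieces of information held by initProcess are:

cmd records the executable to run, namely "/proc/self/exe init". Do not confuse this with sleep 5, the command the container itself is meant to run.

cmd.Env records three environment variables: _LIBCONTAINER_FIFOFD=%d holds the descriptor of the named pipe exec.fifo, _LIBCONTAINER_INITPIPE=%d holds the descriptor of the childPipe end of the SocketPair that was created, and _LIBCONTAINER_INITTYPE="standard" records that the process to be created in the container is the initial process.

initProcess's bootstrapData records which types of namespace the new container needs to create.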
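The way these fields fit together can be sketched in Go as follows. This is only an illustration under assumptions, not runc's actual code: the helper name newInitCmd and the constant stdioFdCount are hypothetical names for this sketch, and only two of the three environment variables are set. It relies on the documented behaviour of exec.Cmd.ExtraFiles, where entry i becomes descriptor 3+i in the child.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// stdioFdCount is the number of descriptors (stdin, stdout, stderr) that
// come before anything placed in exec.Cmd.ExtraFiles.
const stdioFdCount = 3

// newInitCmd (hypothetical helper) builds, but does not start, a command
// that re-executes the current binary with the single argument "init".
// childPipe is passed to the child through ExtraFiles, and the descriptor
// number it will have in the child is advertised through the environment.
func newInitCmd(childPipe *os.File) *exec.Cmd {
	cmd := exec.Command("/proc/self/exe", "init")
	cmd.ExtraFiles = append(cmd.ExtraFiles, childPipe)
	fd := stdioFdCount + len(cmd.ExtraFiles) - 1
	cmd.Env = append(cmd.Env,
		fmt.Sprintf("_LIBCONTAINER_INITPIPE=%d", fd),
		"_LIBCONTAINER_INITTYPE=standard",
	)
	return cmd
}

func main() {
	r, w, err := os.Pipe()
	if err != nil {
		panic(err)
	}
	defer r.Close()
	defer w.Close()

	cmd := newInitCmd(r)
	fmt.Println(cmd.Args, cmd.Env)
}
```

The command is deliberately never started here, so the sketch only shows how the argument list and environment are assembled.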

/* libcontainer/container_linux.go */
func (c *linuxContainer) start(process *Process) error {
    parent, err := c.newParentProcess(process) /* 1. create the parentProcess (done) */

    err = parent.start()                       /* 2. start this parentProcess */
    ......

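With the preparation complete, the start() method is called to launch it.

Note: at this point the trail of sleep 5 is stored in the variable parent.

How runC create is implemented (Part 2)

The start() function is far too long, so we will go through it piece by piece.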

/* libcontainer/process_linux.go */
func (p *initProcess) start() error {
     
    p.cmd.Start()                 
    p.process.ops = p    
    io.Copy(p.parentPipe, p.bootstrapData)

    .....
}

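p.cmd.Start() launches the executable set in cmd, /proc/self/exe, with the argument init. This call starts a new process to run the command and does not block.

io.Copy sends the data in p.bootstrapData to the child process through p.parentPipe.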
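The io.Copy step can be tried out in isolation. The following is a minimal sketch, assuming an ordinary os.Pipe in place of runc's socketpair and a goroutine in place of the child process; the function name pipeBootstrap is hypothetical. Unlike the real child, which parses a structured message, the reader here simply reads until EOF.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
)

// pipeBootstrap (hypothetical helper) copies payload into the write end of
// a pipe with io.Copy and returns whatever the reading side received.
func pipeBootstrap(payload []byte) ([]byte, error) {
	r, w, err := os.Pipe()
	if err != nil {
		return nil, err
	}
	defer r.Close()

	type result struct {
		data []byte
		err  error
	}
	done := make(chan result, 1)
	// The goroutine stands in for the child process reading its end.
	go func() {
		data, err := io.ReadAll(r)
		done <- result{data, err}
	}()

	_, err = io.Copy(w, bytes.NewReader(payload))
	w.Close() // closing the write end lets the reader see EOF
	res := <-done
	if err != nil {
		return nil, err
	}
	return res.data, res.err
}

func main() {
	got, err := pipeBootstrap([]byte("bootstrap data"))
	if err != nil {
		panic(err)
	}
	fmt.Printf("reader received %d bytes: %q\n", len(got), got)
}
```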

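/proc/self/exe is the runc program itself, so this amounts to running runc init. In other words, the runc create command we typed implicitly creates a new child process that runs runc init. Why create an extra process? Because the container we are creating will most likely need to run in its own namespaces, for example a user namespace. That is done with the setns() system call, and the setns man page contains the following passage:

A multi-threaded process may not change user namespace with setns(). It is not permitted to use setns() to reenter the caller's current user namespace.

That is, a multi-threaded process cannot change its user namespace through setns(). Unfortunately, the Go runtime is multi-threaded. So what can be done? setns() has to be called before the Go runtime starts, and this is where cgo comes in: the C code embedded at the front of the package is executed first, before the Go runtime starts.

The exact approach is described in the nsenter README. The handler for the runc init command lives in the file init.go, which imports the nsenter package at the top: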

/* init.go */
import (
    "os"
    "runtime"

    "github.com/opencontainers/runc/libcontainer"
    _ "github.com/opencontainers/runc/libcontainer/nsenter"
    "github.com/urfave/cli"
)

The nsenter package in turn begins with a piece of C code embedded through cgo, which calls nsexec():

/* nsenter.go */
package nsenter

/*
#cgo CFLAGS: -Wall
extern void nsexec();
void __attribute__((constructor)) init(void) {
    nsexec();
}
*/
import "C"

Next it is up to nsexec() to create the new namespaces for the container. nsexec() is also very long, so again we go through it piece by piece.

/* libcontainer/nsenter/nsexec.c */
void nsexec(void)
{
    int pipenum;
    jmp_buf env;
    int sync_child_pipe[2], sync_grandchild_pipe[2];
    struct nlconfig_t config = { 0 };

    /*
     * If we don't have an init pipe, just return to the go routine.
     * We'll only get an init pipe for start or exec.
     */
    pipenum = initpipe();
    if (pipenum == -1)
        return;

    /* Parse all of the netlink configuration. */
    nl_parse(pipenum, &config);
   
    ......

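In the C code above, initpipe() reads from the environment the pipe that the parent set up earlier (the file descriptor recorded in _LIBCONTAINER_INITPIPE), and then nl_parse() reads the configuration from this pipe into the variable config. Who writes the configuration into this pipe? The parent process, runc create, of course. Through this pipe the parent sends the new container's configuration to the child.

The data actually sent is packed into netlink message format by the bootstrapData() function of linuxContainer. Ignoring most of the configuration, this article focuses on the namespace configuration, that is, which types of namespace to create. All of this comes from the original config.json file.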
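To make "netlink message format" a little more concrete, here is a rough Go sketch of encoding a single netlink attribute that carries a 32-bit flags value. A netlink attribute starts with a 16-bit length (which includes the 4-byte header) and a 16-bit type, followed by the payload. Everything specific in the sketch is an assumption: the attribute type number is made up for illustration, little-endian byte order is assumed (netlink uses host byte order), and the real layout produced by bootstrapData() is not shown in this article.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// attrHeaderLen is the size of the attribute header: a 16-bit length
// followed by a 16-bit type.
const attrHeaderLen = 4

// exampleCloneFlagsAttr is a made-up attribute type used only in this sketch.
const exampleCloneFlagsAttr uint16 = 1

// Namespace flags as defined by the Linux clone(2) interface.
const (
	cloneNewNS  uint32 = 0x00020000 // CLONE_NEWNS
	cloneNewPID uint32 = 0x20000000 // CLONE_NEWPID
)

// encodeUint32Attr serializes one attribute carrying a 32-bit value.
// Little-endian byte order is an assumption of this sketch.
func encodeUint32Attr(attrType uint16, value uint32) []byte {
	buf := make([]byte, attrHeaderLen+4)
	binary.LittleEndian.PutUint16(buf[0:2], uint16(len(buf))) // length includes the header
	binary.LittleEndian.PutUint16(buf[2:4], attrType)
	binary.LittleEndian.PutUint32(buf[4:8], value)
	return buf
}

// decodeUint32Attr is the inverse of encodeUint32Attr.
func decodeUint32Attr(buf []byte) (uint16, uint32, error) {
	if len(buf) < attrHeaderLen+4 {
		return 0, 0, errors.New("attribute too short")
	}
	if n := binary.LittleEndian.Uint16(buf[0:2]); n != attrHeaderLen+4 {
		return 0, 0, fmt.Errorf("unexpected attribute length %d", n)
	}
	return binary.LittleEndian.Uint16(buf[2:4]), binary.LittleEndian.Uint32(buf[4:8]), nil
}

func main() {
	raw := encodeUint32Attr(exampleCloneFlagsAttr, cloneNewNS|cloneNewPID)
	fmt.Printf("encoded attribute: % x\n", raw)

	typ, flags, err := decodeUint32Attr(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("type=%d flags=%#x\n", typ, flags)
}
```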

At this point the child process has received the namespace configuration from its parent. Continuing on, nsexec() creates two socketpairs. According to the comments, these are used to communicate with its own child and grandchild processes.

void nsexec(void)
{
   .....
    /* Pipe so we can tell the child when we've finished setting up. */
    if (socketpair(AF_LOCAL, SOCK_STREAM, 0, sync_child_pipe) < 0)  //  sync_child_pipe is an out parameter
        bail("failed to setup sync pipe between parent and child");

    /*
     * We need a new socketpair to sync with grandchild so we don't have
     * race condition with child.
     */
    if (socketpair(AF_LOCAL, SOCK_STREAM, 0, sync_grandchild_pipe) < 0)
        bail("failed to setup sync pipe between parent and grandchild");
   
}
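For readers more at home in Go than C, the same socketpair() call can be sketched with the standard syscall package. This is an illustration only, assuming a Linux host: both ends are kept inside a single process, whereas nsexec() hands one end to a cloned child. The function name syncOverSocketpair is hypothetical.

```go
package main

import (
	"fmt"
	"syscall"
)

// syncOverSocketpair (hypothetical helper) creates a connected pair of
// stream sockets, writes msg into one end and reads it back from the other.
func syncOverSocketpair(msg string) (string, error) {
	fds, err := syscall.Socketpair(syscall.AF_UNIX, syscall.SOCK_STREAM, 0)
	if err != nil {
		return "", err
	}
	defer syscall.Close(fds[0])
	defer syscall.Close(fds[1])

	if _, err := syscall.Write(fds[0], []byte(msg)); err != nil {
		return "", err
	}

	buf := make([]byte, len(msg))
	n, err := syscall.Read(fds[1], buf)
	if err != nil {
		return "", err
	}
	return string(buf[:n]), nil
}

func main() {
	got, err := syncOverSocketpair("ready")
	if err != nil {
		panic(err)
	}
	fmt.Println("other end received:", got)
}
```

Unlike a plain pipe, either end of a socketpair can be used for both reading and writing, which is what makes it suitable for the back-and-forth synchronisation described here.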

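Then it is time to create the namespaces. The comments show that three options were actually considered:

1. first clone then clone

2. first unshare then clone

3. first clone then unshare

Option 3 is the one finally adopted. The reasons involve too many considerations to cover here, so I plan to analyze them in a separate article.

Next comes a state machine written as one large switch/case, whose overall structure is shown below. The current process creates a child with the clone() system call, the child in turn creates a grandchild with clone(), and the actual creating/joining of namespaces is done in the child.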

switch (setjmp(env)) {
  case JUMP_PARENT:{
           .....
           clone_parent(&env, JUMP_CHILD);
           .....
       }
  case JUMP_CHILD:{
           ......
           if (config.namespaces)
                join_namespaces(config.namespaces);
           clone_parent(&env, JUMP_INIT);
           ......
       }
  case JUMP_INIT:{
       }

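This article will not analyze the state machine in detail. In what follows, runc init 1 denotes the runc init process started by runc create, runc init 2 its child, and runc init 3 its grandchild. The points to note are:

The namespaces are created in runc init 2.

Both runc init 1 and runc init 2 eventually call exit(0), but runc init 3 does not. It goes on to execute the second half of the runc init command. In the end, therefore, only the runc create process and the runc init 3 process remain.

Now back to the runc create process: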

func (p *initProcess) start() error {

    p.cmd.Start()
    p.process.ops = p
    io.Copy(p.parentPipe, p.bootstrapData);

    p.execSetns()
    ......

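After sending the bootstrapData to runc init, it calls execSetns(), which waits for the runc init 1 process to terminate, reads the pid of runc init 3 from the pipe, and stores that process in p.cmd.Process (with p.process.ops set to p).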

/* libcontainer/process_linux.go */
func (p *initProcess) execSetns() error {
    status, err := p.cmd.Process.Wait()

    var pid *pid
    json.NewDecoder(p.parentPipe).Decode(&pid)

    process, err := os.FindProcess(pid.Pid)

    p.cmd.Process = process
    p.process.ops = p
    return nil
}

Continuing with start():

func (p *initProcess) start() error {

    ...... 
    p.execSetns()
    
    fds, err := getPipeFds(p.pid())
    p.setExternalDescriptors(fds)
    p.createNetworkInterfaces()
    
    p.sendConfig()
    
    parseSync(p.parentPipe, func(sync *syncT) error {
        switch sync.Type {
        case procReady:
            .....
            writeSync(p.parentPipe, procRun);
            sentRun = true
        case procHooks:
            .....
            // Sync with child.
            err := writeSync(p.parentPipe, procResume); 
            sentResume = true
        }

        return nil
    })
    ......

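As you can see, runc create once again starts two-way communication over the pipe, and the peer is naturally the runc init 3 process. The embedded C code has finished by now (it was actually entered by runc init 1, but runc init 3 was cloned indirectly from runc init 1), so runc init 3 starts the Go runtime and begins handling the init command.

sleep 5 is sent to the runc init process through p.sendConfig().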
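The two-way exchange described above (the child reports procReady, the parent answers procRun) can be sketched in Go as below. This is a simplified illustration under assumptions, not runc's implementation: the JSON shape of the message, the string values of the two constants, and the function names are made up for the sketch, and an in-memory net.Pipe with a goroutine stands in for the real pipe and the runc init 3 process.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net"
)

type syncType string

// The string values are assumptions made for this sketch.
const (
	procReady syncType = "procReady"
	procRun   syncType = "procRun"
)

// syncMsg is the (assumed) wire format: one JSON object per message.
type syncMsg struct {
	Type syncType `json:"type"`
}

// writeSync sends a single sync message.
func writeSync(w io.Writer, t syncType) error {
	return json.NewEncoder(w).Encode(syncMsg{Type: t})
}

// readSync reads one message and checks that it has the expected type.
func readSync(dec *json.Decoder, want syncType) error {
	var m syncMsg
	if err := dec.Decode(&m); err != nil {
		return err
	}
	if m.Type != want {
		return fmt.Errorf("expected %q, got %q", want, m.Type)
	}
	return nil
}

// handshake plays both roles and returns the messages in the order the
// parent side handled them.
func handshake() ([]syncType, error) {
	parent, child := net.Pipe()
	defer parent.Close()

	childErr := make(chan error, 1)
	go func() {
		defer child.Close()
		// Child side: announce readiness, then wait for permission to run.
		if err := writeSync(child, procReady); err != nil {
			childErr <- err
			return
		}
		childErr <- readSync(json.NewDecoder(child), procRun)
	}()

	var seen []syncType
	// Parent side: wait for procReady, then reply with procRun.
	if err := readSync(json.NewDecoder(parent), procReady); err != nil {
		return nil, err
	}
	seen = append(seen, procReady)
	if err := writeSync(parent, procRun); err != nil {
		return nil, err
	}
	seen = append(seen, procRun)

	if err := <-childErr; err != nil {
		return nil, err
	}
	return seen, nil
}

func main() {
	seen, err := handshake()
	if err != nil {
		panic(err)
	}
	fmt.Println("parent handled:", seen)
}
```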

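The init command first creates a LinuxFactory through libcontainer.New(""). This method was analyzed in the previous article and is not explained again here. It then calls the LinuxFactory's StartInitialization() method.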

/* libcontainer/factory_linux.go */
// StartInitialization loads a container by opening the pipe fd from the parent to read the configuration and state
// This is a low level implementation detail of the reexec and should not be consumed externally
func (l *LinuxFactory) StartInitialization() (err error) {
    var (
        pipefd, fifofd int
        envInitPipe    = os.Getenv("_LIBCONTAINER_INITPIPE")  
        envFifoFd      = os.Getenv("_LIBCONTAINER_FIFOFD")
    )

    // Get the INITPIPE.
    pipefd, err = strconv.Atoi(envInitPipe)

    var (
        pipe = os.NewFile(uintptr(pipefd), "pipe")
        it   = initType(os.Getenv("_LIBCONTAINER_INITTYPE")) // "standard" or "setns"
    )
    
    // Only init processes have FIFOFD.
    fifofd = -1
    if it == initStandard {
        if fifofd, err = strconv.Atoi(envFifoFd); err != nil {
            return fmt.Errorf("unable to convert _LIBCONTAINER_FIFOFD=%s to int: %s", envFifoFd, err)
        }
    }

    i, err := newContainerInit(it, pipe, consoleSocket, fifofd)

    // If Init succeeds, syscall.Exec will not return, hence none of the defers will be called.
    return i.Init()
}

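StartInitialization() tries to read the values of a series of _LIBCONTAINER_XXX variables from the environment. Remember them? These values were all opened and set by the runc create command. In other words, runc create passes these parameters to its descendant process runc init 3 through environment variables.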
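Recovering a descriptor number from an environment variable is simple enough to sketch on its own. The helper names parseFd and fdFromEnv below are hypothetical, and the error wording only loosely follows the excerpt above.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// parseFd (hypothetical helper) converts the textual value of an
// environment variable into a file descriptor number.
func parseFd(name, val string) (int, error) {
	fd, err := strconv.Atoi(val)
	if err != nil {
		return -1, fmt.Errorf("unable to convert %s=%s to int: %w", name, val, err)
	}
	if fd < 0 {
		return -1, fmt.Errorf("%s=%d is not a valid descriptor", name, fd)
	}
	return fd, nil
}

// fdFromEnv (hypothetical helper) looks the variable up and parses it.
func fdFromEnv(name string) (int, error) {
	val, ok := os.LookupEnv(name)
	if !ok {
		return -1, fmt.Errorf("%s is not set", name)
	}
	return parseFd(name, val)
}

func main() {
	// Simulate what the parent would have placed in the child's environment.
	os.Setenv("_LIBCONTAINER_INITPIPE", "3")

	fd, err := fdFromEnv("_LIBCONTAINER_INITPIPE")
	if err != nil {
		panic(err)
	}
	// os.NewFile wraps an already-open descriptor; it does not open anything.
	fmt.Println("init pipe descriptor:", fd)
}
```

In the excerpt, the resulting number is then wrapped with os.NewFile so that the rest of the code can treat the inherited descriptor as an ordinary *os.File.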

With these environment variables in hand, runc init 3 calls the newContainerInit() function:

/* libcontainer/init_linux.go */
func newContainerInit(t initType, pipe *os.File, consoleSocket *os.File, fifoFd int) (initer, error) {
    var config *initConfig

    /* read config from pipe (from runc process) */
    json.NewDecoder(pipe).Decode(&config)
    populateProcessEnvironment(config.Env)
    switch t {
    ......
    case initStandard:
        return &linuxStandardInit{
            pipe:          pipe,
            consoleSocket: consoleSocket,
            parentPid:     unix.Getppid(),
            config:        config, // <=== config
            fifoFd:        fifoFd,
        }, nil
    }
    return nil, fmt.Errorf("unknown init type %q", t)
}

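newContainerInit() first tries to read the configuration from pipe into the variable config, then stores it in a linuxStandardInit value, which it returns.

   runc create                    runc init 3
       |                               |
  p.sendConfig() --- config -->  newContainerInit()

The trail of sleep 5 is now in the config field of linuxStandardInit.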
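The hand-off in the diagram above can be approximated with a JSON encoder on one end of a pipe and a decoder on the other. This is a sketch under assumptions: the real initConfig has many more fields, the JSON key names here are made up, and both ends live in one process instead of in runc create and runc init 3. The function names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"os"
)

// initConfig is a cut-down stand-in holding only the two fields this
// article mentions; the JSON key names are assumptions.
type initConfig struct {
	Args []string `json:"args"`
	Env  []string `json:"env"`
}

// sendConfig plays the role of the parent side: encode the config as JSON.
func sendConfig(w io.Writer, cfg *initConfig) error {
	return json.NewEncoder(w).Encode(cfg)
}

// receiveConfig plays the role of the child side: decode it again.
func receiveConfig(r io.Reader) (*initConfig, error) {
	var cfg *initConfig
	if err := json.NewDecoder(r).Decode(&cfg); err != nil {
		return nil, err
	}
	return cfg, nil
}

// passConfig pushes cfg through an os.Pipe and returns what came out.
// The encoded config is small, so it fits in the pipe buffer and the
// write completes before anything reads from the other end.
func passConfig(cfg *initConfig) (*initConfig, error) {
	r, w, err := os.Pipe()
	if err != nil {
		return nil, err
	}
	defer r.Close()

	err = sendConfig(w, cfg)
	w.Close()
	if err != nil {
		return nil, err
	}
	return receiveConfig(r)
}

func main() {
	got, err := passConfig(&initConfig{
		Args: []string{"sleep", "5"},
		Env:  []string{"PATH=/usr/bin:/bin"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("child would run:", got.Args)
}
```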

Back in StartInitialization(): once it has obtained the linuxStandardInit, it calls its Init() method.

/* libcontainer/factory_linux.go */
func (l *LinuxFactory) StartInitialization() (err error) {
    ......
    i, err := newContainerInit(it, pipe, consoleSocket, fifofd)

    return i.Init()  
}

This article skips the large amount of other setup at the start of Init() and looks only at the very end:

func (l *linuxStandardInit) Init() error {
   ......
   name, err := exec.LookPath(l.config.Args[0])

   syscall.Exec(name, l.config.Args[0:], os.Environ())
}

As you can see, this is where the sleep 5 originally configured by the user finally gets executed.
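The final step, resolving the command on PATH and then replacing the current process image, can be sketched as a tiny standalone program. This is an illustration rather than runc's code: the function names resolve and execInto are hypothetical, and the program only calls exec when it is given a command on its own command line.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// resolve (hypothetical helper) looks up args[0] using PATH.
func resolve(args []string) (string, error) {
	if len(args) == 0 {
		return "", fmt.Errorf("no command given")
	}
	return exec.LookPath(args[0])
}

// execInto (hypothetical helper) replaces the current process image with
// the given command. On success it never returns, so any deferred
// functions in the caller are not run.
func execInto(args []string) error {
	name, err := resolve(args)
	if err != nil {
		return err
	}
	return syscall.Exec(name, args, os.Environ())
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: pass a command to exec, for example: sleep 5")
		return
	}
	// Reached only if exec fails; otherwise this process has become the command.
	if err := execInto(os.Args[1:]); err != nil {
		fmt.Fprintln(os.Stderr, "exec failed:", err)
		os.Exit(1)
	}
}
```

Because exec replaces the process instead of spawning a child, the command keeps the pid and namespaces of the process that called it, which is why the user's command ends up as the container's first process.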

Copyright of this article belongs to the author; please do not repost without permission.

When reposting, please cite the address of this article: http://specialneedsforspecialkids.com/yun/27647.html
