Introduction to the Linux seccomp Module
Contents
Introduction to seccomp
SECCOMP-BPF
Differences between seccomp and capabilities
Seccomp in Docker
Disabling seccomp
Security risks of disabling seccomp
References
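Introduction to seccomp
Seccomp is short for "secure computing". It was introduced in Linux kernel 2.6.12 (2005) as a kernel security facility that restricts the system calls a program may make. In its original form, a process that enabled seccomp could make only four system calls:
read, write, _exit, sigreturn
Let's look at an example: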
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <linux/seccomp.h>

/* Enter strict mode: from here on only read, write, _exit and sigreturn are allowed. */
void configure_seccomp(void) {
    printf("Configuring seccomp\n");
    prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);
}

int main(int argc, char* argv[]) {
    int infd, outfd;
    if (argc < 3) {
        printf("Usage:\n\t%s <input path> <output_path>\n", argv[0]);
        return -1;
    }
    printf("Starting test seccomp Y/N?");
    int c = getchar();  /* getchar() returns int, not char */
    if (c == 'y' || c == 'Y') configure_seccomp();

    /* In strict mode the process is killed at this open() call. */
    printf("Opening '%s' for reading\n", argv[1]);
    if ((infd = open(argv[1], O_RDONLY)) >= 0) {  /* 0 is a valid descriptor */
        ssize_t read_bytes;
        char buffer[1024];
        printf("Opening '%s' for writing\n", argv[2]);
        if ((outfd = open(argv[2], O_WRONLY | O_CREAT, 0644)) >= 0) {
            while ((read_bytes = read(infd, buffer, sizeof(buffer))) > 0)
                write(outfd, buffer, (size_t)read_bytes);
            close(outfd);  /* close only what was actually opened */
        }
        close(infd);
    }
    printf("End!\n");
    return 0;
}
Compile it with the following command:
gcc seccomp.cpp -o seccomp
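Run the program and answer N so that seccomp is not enabled: in.txt is copied to out.txt, so the copy succeeds.
If we answer Y instead, seccomp is enabled and the process is killed when it reaches the first open() call. In strict mode, making any system call other than the four listed above gets the process killed with SIGKILL.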
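The kill can also be observed from a parent process. The following is a minimal sketch, not part of the original program: run_in_strict_mode is a hypothetical helper that forks a child, puts it into SECCOMP_MODE_STRICT, optionally calls open(), and reports how the child ended. It assumes the child leaves through the raw exit system call, because glibc's _exit() issues exit_group, which strict mode does not allow.

```c
#include <fcntl.h>
#include <linux/seccomp.h>
#include <signal.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child, put it in strict mode, and return
 *   0           if the child exited normally with status 0,
 *   the signal  if the child was killed by a signal (SIGKILL on a violation),
 *   -1          on any other outcome (fork or prctl failure, ...).
 */
int run_in_strict_mode(int try_open) {
    static const char msg[] = "still alive\n";
    pid_t pid = fork();
    if (pid < 0) return -1;

    if (pid == 0) {
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
            _exit(2);                          /* seccomp not available here */
        if (try_open) {
            int fd = open("/dev/null", O_RDONLY);  /* not allowed: SIGKILL */
            (void)fd;
        }
        ssize_t n = write(STDOUT_FILENO, msg, sizeof(msg) - 1);  /* allowed */
        (void)n;
        syscall(SYS_exit, 0);                  /* raw exit, not exit_group */
        /* not reached */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) != pid) return -1;
    if (WIFSIGNALED(status)) return WTERMSIG(status);
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0) return 0;
    return -1;
}
```

On a kernel with seccomp enabled, run_in_strict_mode(0) should return 0 and run_in_strict_mode(1) should return SIGKILL.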
SECCOMP-BPF
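Linux 3.5 introduced a second seccomp mode: SECCOMP_MODE_FILTER. (Below, seccomp-BPF always refers to this filter mode.)
In this mode a process can specify which system calls are allowed, instead of being limited to the original four. The filtering rules are expressed as Berkeley Packet Filter programs, which is where the "BPF" in the name comes from. To install a seccomp-BPF filter, a process must either hold the CAP_SYS_ADMIN capability or first set its no_new_privs bit to 1 with prctl.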
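Differences between seccomp and capabilities
Both are security mechanisms, but they work at different levels. Seccomp restricts which system calls a process may make. Capabilities are sets of privileges held by a process: the all-or-nothing power of root is divided into finer-grained units, each of which can be granted or dropped separately. The seccomp filter runs at system call entry, so it is evaluated before any capability check performed inside the system call itself.
Capabilities divide root's privileges into about 40 units (41 as of Linux 5.9), including: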
CAP_AUDIT_CONTROL (since Linux 2.6.11)
CAP_AUDIT_READ (since Linux 3.16)
CAP_AUDIT_WRITE (since Linux 2.6.11)
CAP_BLOCK_SUSPEND (since Linux 3.5)
CAP_BPF (since Linux 5.8)
CAP_CHECKPOINT_RESTORE (since Linux 5.9)
CAP_CHOWN
CAP_DAC_OVERRIDE
CAP_DAC_READ_SEARCH
CAP_FOWNER
CAP_FSETID
CAP_IPC_LOCK
CAP_IPC_OWNER
CAP_KILL
CAP_LEASE (since Linux 2.4)
CAP_LINUX_IMMUTABLE
CAP_MAC_ADMIN (since Linux 2.6.25)
CAP_MAC_OVERRIDE (since Linux 2.6.25)
CAP_MKNOD (since Linux 2.4)
CAP_NET_ADMIN
CAP_NET_BIND_SERVICE
CAP_NET_BROADCAST
CAP_NET_RAW
CAP_PERFMON (since Linux 5.8)
CAP_SETGID
CAP_SETFCAP (since Linux 2.6.24)
CAP_SETPCAP
CAP_SETUID
CAP_SYS_ADMIN
CAP_SYS_BOOT
CAP_SYS_CHROOT
CAP_SYS_MODULE
CAP_SYS_NICE
CAP_SYS_PACCT
CAP_SYS_PTRACE
CAP_SYS_RAWIO
CAP_SYS_RESOURCE
CAP_SYS_TIME
CAP_SYS_TTY_CONFIG
CAP_SYSLOG (since Linux 2.6.37)
CAP_WAKE_ALARM (since Linux 3.0)
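Both mechanisms are recorded per process in /proc/<pid>/status: the Seccomp field holds the seccomp mode (0 disabled, 1 strict, 2 filter) and CapEff holds the effective capability set as a hexadecimal bit mask. The sketch below reads such a field for the current process; read_status_field is a hypothetical helper name, and it assumes /proc is mounted.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: copy the value of field `name` (e.g. "Seccomp" or
 * "CapEff") from /proc/self/status into `out`.
 * Returns 0 on success, -1 if the file or the field is missing.
 */
int read_status_field(const char *name, char *out, size_t out_len) {
    if (out_len == 0) return -1;

    FILE *fp = fopen("/proc/self/status", "r");
    if (!fp) return -1;

    char line[256];
    size_t name_len = strlen(name);
    int rc = -1;
    while (fgets(line, sizeof(line), fp)) {
        /* Lines look like "Seccomp:\t0" or "CapEff:\t00000000a80425fb". */
        if (strncmp(line, name, name_len) == 0 && line[name_len] == ':') {
            const char *value = line + name_len + 1;
            value += strspn(value, " \t");       /* skip the separator */
            size_t n = strcspn(value, "\n");     /* drop the newline */
            if (n >= out_len) n = out_len - 1;
            memcpy(out, value, n);
            out[n] = '\0';
            rc = 0;
            break;
        }
    }
    fclose(fp);
    return rc;
}
```

Calling read_status_field("Seccomp", buf, sizeof(buf)) inside and outside a container is a quick way to check whether a filter is active.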
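Seccomp restricts the system call interface, so it can manage as many system calls as the kernel exposes. In the x86_64 header unistd_64.h the highest system call number is 427 (the numbering has gaps, and the exact set differs between Linux versions):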
#define __NR_statx 332
#define __NR_io_pgetevents 333
#define __NR_rseq 334
#define __NR_io_uring_setup 425
#define __NR_io_uring_enter 426
#define __NR_io_uring_register 427

#endif /* _ASM_X86_UNISTD_64_H */
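These numbers are what a seccomp filter actually matches on. As a small sketch, the hypothetical helper below bypasses the glibc wrapper and invokes getpid by its number; SYS_getpid is the <sys/syscall.h> alias for __NR_getpid.

```c
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: call getpid through its raw system call number. */
long raw_getpid(void) {
    return syscall(SYS_getpid);
}
```

The result equals what getpid() returns; a filter that denies the number behind SYS_getpid would affect both paths in the same way.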
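Using seccomp in containers
In containers, seccomp is essentially a wrapper around seccomp-BPF: a simple configuration file makes it quick to apply a seccomp policy to many containers. (All examples below use Docker.)
In Docker, a profile.json file tells the container runtime which system calls to restrict, for example: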
{
    "defaultAction": "SCMP_ACT_ALLOW",
    "syscalls": [
        {
            "name": "mkdir",
            "action": "SCMP_ACT_ERRNO",
            "args": []
        }
    ]
}
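With this profile the container may make every system call except mkdir. Running "mkdir /home/test" inside the container therefore fails to create the directory.
The seccomp profile that Docker loads by default can be viewed on GitHub: .json
That default profile is an allowlist: it explicitly allows 300+ system calls and denies the rest, which leaves 40+ system calls disabled.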
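Seccomp in Docker
Docker uses seccomp to disable around 44 system calls by default.
The system calls below are blocked by Docker's default seccomp profile. Note that reboot is among them, so a container cannot reboot the machine.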
Syscall | Description |
---|---|
acct | Accounting syscall which could let containers disable their own resource limits or process accounting. Also gated by CAP_SYS_PACCT . |
add_key | Prevent containers from using the kernel keyring, which is not namespaced. |
bpf | Deny loading potentially persistent bpf programs into kernel, already gated by CAP_SYS_ADMIN . |
clock_adjtime | Time/date is not namespaced. Also gated by CAP_SYS_TIME . |
clock_settime | Time/date is not namespaced. Also gated by CAP_SYS_TIME . |
clone | Deny cloning new namespaces. Also gated by CAP_SYS_ADMIN for CLONE_* flags, except CLONE_NEWUSER . |
create_module | Deny manipulation and functions on kernel modules. Obsolete. Also gated by CAP_SYS_MODULE . |
delete_module | Deny manipulation and functions on kernel modules. Also gated by CAP_SYS_MODULE . |
finit_module | Deny manipulation and functions on kernel modules. Also gated by CAP_SYS_MODULE . |
get_kernel_syms | Deny retrieval of exported kernel and module symbols. Obsolete. |
get_mempolicy | Syscall that modifies kernel memory and NUMA settings. Already gated by CAP_SYS_NICE . |
init_module | Deny manipulation and functions on kernel modules. Also gated by CAP_SYS_MODULE . |
ioperm | Prevent containers from modifying kernel I/O privilege levels. Already gated by CAP_SYS_RAWIO . |
iopl | Prevent containers from modifying kernel I/O privilege levels. Already gated by CAP_SYS_RAWIO . |
kcmp | Restrict process inspection capabilities, already blocked by dropping CAP_SYS_PTRACE . |
kexec_file_load | Sister syscall of kexec_load that does the same thing, slightly different arguments. Also gated by CAP_SYS_BOOT . |
kexec_load | Deny loading a new kernel for later execution. Also gated by CAP_SYS_BOOT . |
keyctl | Prevent containers from using the kernel keyring, which is not namespaced. |
lookup_dcookie | Tracing/profiling syscall, which could leak a lot of information on the host. Also gated by CAP_SYS_ADMIN . |
mbind | Syscall that modifies kernel memory and NUMA settings. Already gated by CAP_SYS_NICE . |
mount | Deny mounting, already gated by CAP_SYS_ADMIN . |
move_pages | Syscall that modifies kernel memory and NUMA settings. |
name_to_handle_at | Sister syscall to open_by_handle_at . Already gated by CAP_DAC_READ_SEARCH . |
nfsservctl | Deny interaction with the kernel nfs daemon. Obsolete since Linux 3.1. |
open_by_handle_at | Cause of an old container breakout. Also gated by CAP_DAC_READ_SEARCH . |
perf_event_open | Tracing/profiling syscall, which could leak a lot of information on the host. |
personality | Prevent container from enabling BSD emulation. Not inherently dangerous, but poorly tested, potential for a lot of kernel vulns. |
pivot_root | Deny pivot_root , should be privileged operation. |
process_vm_readv | Restrict process inspection capabilities, already blocked by dropping CAP_SYS_PTRACE . |
process_vm_writev | Restrict process inspection capabilities, already blocked by dropping CAP_SYS_PTRACE . |
ptrace | Tracing/profiling syscall. Blocked in Linux kernel versions before 4.8 to avoid seccomp bypass. Tracing/profiling arbitrary processes is already blocked by dropping CAP_SYS_PTRACE , because it could leak a lot of information on the host. |
query_module | Deny manipulation and functions on kernel modules. Obsolete. |
quotactl | Quota syscall which could let containers disable their own resource limits or process accounting. Also gated by CAP_SYS_ADMIN . |
reboot | Don’t let containers reboot the host. Also gated by CAP_SYS_BOOT . |
request_key | Prevent containers from using the kernel keyring, which is not namespaced. |
set_mempolicy | Syscall that modifies kernel memory and NUMA settings. Already gated by CAP_SYS_NICE . |
setns | Deny associating a thread with a namespace. Also gated by CAP_SYS_ADMIN . |
settimeofday | Time/date is not namespaced. Also gated by CAP_SYS_TIME . |
stime | Time/date is not namespaced. Also gated by CAP_SYS_TIME . |
swapon | Deny start/stop swapping to file/device. Also gated by CAP_SYS_ADMIN . |
swapoff | Deny start/stop swapping to file/device. Also gated by CAP_SYS_ADMIN . |
sysfs | Obsolete syscall. |
_sysctl | Obsolete, replaced by /proc/sys. |
umount | Should be a privileged operation. Also gated by CAP_SYS_ADMIN . |
umount2 | Should be a privileged operation. Also gated by CAP_SYS_ADMIN . |
unshare | Deny cloning new namespaces for processes. Also gated by CAP_SYS_ADMIN , with the exception of unshare --user . |
uselib | Older syscall related to shared libraries, unused for a long time. |
userfaultfd | Userspace page fault handling, largely needed for process migration. |
ustat | Obsolete syscall. |
vm86 | In kernel x86 real mode virtual machine. Also gated by CAP_SYS_ADMIN . |
vm86old | In kernel x86 real mode virtual machine. Also gated by CAP_SYS_ADMIN . |
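After adding reboot to profile.json, we can start a container with that profile and check whether a reboot is possible from inside it:
docker run --rm -it --security-opt seccomp=/home/profile.json hello-world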
Disabling seccomp
docker run -it --security-opt seccomp=unconfined ubuntu:latest
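Security risks of disabling seccomp
Disabling seccomp enlarges Docker's attack surface. The default profile blocks a set of system calls; once they become reachable again, each one is an additional path into the kernel that may contain a vulnerability (an overflow, for example). CVE-2022-0185 is one such case: exploiting it requires CAP_SYS_ADMIN, which a process can obtain inside a new user namespace through the unshare system call (unshare -Urm). In other words, unshare enlarges the process's capability set, as the following experiment shows: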
sh-3.2# docker run -it --security-opt seccomp=unconfined centos:latest
[root@93bb4e20b766 /]#
[root@93bb4e20b766 /]# capsh --print
Current: = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap+ep
Bounding set =cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap
Ambient set =
Securebits: 00/0x0/1'b0
 secure-noroot: no (unlocked)
 secure-no-suid-fixup: no (unlocked)
 secure-keep-caps: no (unlocked)
 secure-no-ambient-raise: no (unlocked)
uid=0(root)
gid=0(root)
groups=0(root)
[root@93bb4e20b766 /]# unshare -Urm
[root@93bb4e20b766 /]# capsh --print
Current: = cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read,38,39,40+ep
Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read,38,39,40
Ambient set =
Securebits: 00/0x0/1'b0
 secure-noroot: no (unlocked)
 secure-no-suid-fixup: no (unlocked)
 secure-keep-caps: no (unlocked)
 secure-no-ambient-raise: no (unlocked)
uid=0(root)
gid=0(root)
groups=0(root)
[root@93bb4e20b766 /]#
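The two capability sets printed by capsh can also be compared as bit masks, the form used by the CapEff field of /proc/<pid>/status. The masks in the usage note are derived by hand from the listing above (bit n set means capability number n is present), and cap_in_mask is a hypothetical helper, so treat this as a sketch rather than captured output.

```c
#include <linux/capability.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: return 1 if capability number `cap` (for example
 * CAP_SYS_ADMIN from <linux/capability.h>) is set in a hexadecimal mask
 * such as a CapEff value, 0 otherwise.
 */
int cap_in_mask(const char *hex_mask, int cap) {
    if (cap < 0 || cap > 63) return 0;
    uint64_t mask = strtoull(hex_mask, NULL, 16);
    return (int)((mask >> cap) & 1u);
}
```

The set before unshare corresponds to 00000000a80425fb, for which cap_in_mask(mask, CAP_SYS_ADMIN) is 0. The set after unshare contains capabilities 0 through 40, that is 000001ffffffffff, for which the same call returns 1.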
For details, see the article "CVE-2022-0185 价值$3w的 File System Context 内核整数溢出漏洞利用分析" (bsauce's blog, CSDN).
References
Introduction to seccomp: BPF linux syscall filter - tycoon3 - 博客园 (cnblogs)
浅谈Linux SECCOMP安全机制在容器中的使用 - 腾讯云开发者社区-腾讯云 (tencent)
The Route to Host:从内核提权到容器逃逸 – 绿盟科技技术博客 (nsfocus)
Seccomp、BPF与容器安全 - 先知社区 (aliyun)
探究K8S v1.19 GA的Seccomp - 知乎 (zhihu)
云原生安全 — seccomp应用最佳实践-阿里云开发者社区 (aliyun)
Seccomp security profiles for Docker | Docker Documentation
Restrict a Container's Syscalls with seccomp | Kubernetes
seccomp - Wikipedia
容器安全之CVE-2022-0185_新闻中心-网盾网络安全培训中心
capabilities - Difference between linux capabities and seccomp - Information Security Stack Exchange
CVE-2022-0185 价值$3w的 File System Context 内核整数溢出漏洞利用分析_bsauce的博客-CSDN博客