GSoC 2021 - CRIU
Project: Use ~~eBPF~~ nftables to lock/unlock the network
Mentor: Radostin Stoyanov @rst0git
Organization: CRIU
GSoC Page
Acknowledgment
All CRIU members always gave great, constructive & fast feedback, especially Radostin (@rst0git) (my mentor) who guided me through all technical challenges I faced and was very responsive and helpful with his feedback all the time. It has been a great learning experience. I am grateful for this opportunity.
CRIU’s community and GSoC opened my eyes to the immense learning experience that one can gain by participating in open source projects.
Thank you!
Table of contents
- Overview
- Summary of my work with CRIU
- Initial solution (eBPF)
- Current solution (nftables)
- List of commits
- Next steps
Overview
During checkpointing and restoring, CRIU locks the network to make sure no TCP packets are accepted by the network stack during the time the process is checkpointed (to prevent the kernel from sending a RST packet). Currently CRIU calls out to iptables-restore to create and delete the corresponding iptables rules (external binary dependency).
Users have reported that iptables-restore sometimes fails, and ~~eBPF~~ nftables could avoid this external dependency.
As you can probably tell from the strike-through in the title, eBPF was the initial approach, and a lot of work/research went into it, but due to some technical limitations which will be described below, using eBPF was not a viable option. After some more research and feedback from Radostin (@rst0git), we settled on using nftables instead, which turned out to be a simpler solution, less prone to errors and much easier to maintain.
This final approach uses nftables as an alternative to iptables-restore to lock/unlock the network. Using libnftables eliminates the external iptables-restore binary dependency. nftables is also much more efficient than iptables.
Summary of my work with CRIU
Before and during the ~3 months of GSoC I worked on several improvements/issues besides the network locking/unlocking project. A full list of those PRs can be found here.
The highlights of those PRs were:
- [GSoC] Add nftables based network locking/unlocking
- Fix protobuf-c v1.4.0 breaking CRIU’s build
- Add pidfd based pid reuse detection for improved reliability
- Optimize unix socket lookups using hashtables
Stumbling upon issues, trying to fix or improve them, and getting feedback and suggestions from CRIU’s community was a great learning experience.
Also it was a ton of fun. Sometimes, I needed to look into the source code of Linux and other networking tools to figure out something. It got me excited to dig deeper into Linux kernel internals and low level networking.
Initial solution (eBPF)
The initially proposed solution was to use eBPF to lock/unlock the network. A lot of work and research went into eBPF while writing my proposal and during the first month of GSoC.
For the first ~2 weeks of GSoC, I got familiar with RTNETLINK and eBPF assembly, especially loading SCHED_CLS eBPF (TC) programs onto a network interface. I was able to create a rough functional prototype locally for eBPF based locking in CRIU that passed all tests, including the latest network-wide locking test, which I wrote while writing the proposal.
During the 3rd week I started designing the execution flow of how this new network locking method would fit into CRIU. I compiled my ideas in detail into this document and shared it on CRIU’s gitter channel for feedback.
Kumar Kartikeya (@kkdwivedi) gave great feedback and pointed out problems in the eBPF approach:
- Possible race conditions when creating the clsact qdiscs as another process could create a qdisc immediately after CRIU checks that no clsact qdiscs exist.
- Qdiscs are not NAT aware: TC sits at the earliest stage of the packet life-cycle on ingress and at the latest stage on egress (as illustrated here), so TC could miss packets if NATing happens before they hit the qdisc.
Those situations are not common, but they might happen and should be accounted for. Unfortunately this is a fundamental limitation of TC, and despite a lot of research I didn’t find suitable workarounds.
Since the main benefit of the project was using an alternative to iptables to avoid the external binary dependency, I started looking for alternatives and nftables (using libnftables) seemed like a viable option. Turns out that the nftables based solution is a good alternative that is simpler, less prone to errors and easier to maintain. It was time to start over again, but the work and research put into eBPF was definitely a great learning experience. I learned a ton about eBPF, traffic control and Linux internals.
Current solution (nftables)
I will go into details of the nftables approach but for more information, this PR contains most of the feedback and discussions with the CRIU community.
Testing
I wrote tests for network-wide locking and per-socket locking before adding the actual nftables implementation. Since the iptables approach was already there I was able to verify that the tests were valid. This method made the feedback loop much faster, as I was able to detect problems early on.
Kerndat
kerndat is a struct in CRIU used to check whether the needed kernel features exist.
We need to check that nftables concatenations are supported (the has_nftables_concat field).
Feature check
criu check --feature network-lock-nftables
Add ability to check if nftables based locking/unlocking is possible.
This checks the corresponding kerndat fields.
Add --network-lock option
With nftables based locking/unlocking in place, two values are available for this option: iptables and nftables.
It would be better to default to iptables until the nftables method is tested thoroughly.
e.g.
criu dump -t 2221 --tcp-established --network-lock iptables
criu dump -t 2221 --tcp-established --network-lock nftables
The corresponding RPC and libcriu options have been added as well.
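As a rough sketch of how the option value could be mapped to a lock method with iptables as the fallback: the enum and function names below are made up for illustration (loosely following the identifiers quoted in this post) and are not taken from CRIU’s source.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical enum, named after the constants quoted in this post. */
enum network_lock_method {
	NETWORK_LOCK_IPTABLES,
	NETWORK_LOCK_NFTABLES,
};

/*
 * Map the --network-lock argument to a lock method.
 * A missing argument (NULL) falls back to iptables, the proposed default.
 * Returns 0 on success, -1 on an unknown value.
 */
int parse_network_lock(const char *arg, enum network_lock_method *out)
{
	assert(out != NULL);

	if (arg == NULL || strcmp(arg, "iptables") == 0) {
		*out = NETWORK_LOCK_IPTABLES;
		return 0;
	}
	if (strcmp(arg, "nftables") == 0) {
		*out = NETWORK_LOCK_NFTABLES;
		return 0;
	}
	return -1; /* unknown method: let the caller report the error */
}
```

Rejecting unknown values instead of silently falling back keeps a typo such as "nftable" from quietly selecting iptables.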
Algorithm flow
I used the same hooks that invoke iptables locking/unlocking, i.e.
- network_lock_internal, network_unlock_internal for the network-wide case
- nf_lock_connection for the per-socket case (renamed to lock_connection)
Note: The per-socket rule should be loaded in network_lock, as lock_connection will just add connection tuple entries to the associated nftables set.
CRIU just decides in those hooks which method to use, based on the --network-lock option (opts.network_lock_method).
if (opts.network_lock_method == NETWORK_LOCK_IPTABLES)
    ret |= iptables_network_unlock_internal();
else if (opts.network_lock_method == NETWORK_LOCK_NFTABLES)
    ret |= nftables_network_unlock_internal();
This approach abstracts away the network locking/unlocking details. Both methods should behave identically and be seamlessly interchangeable through the --network-lock option.
Locking (netns-wide)
- Create a table named CRIU. Similar to:
  nft create table inet CRIU
- Create INPUT/OUTPUT chains with a default “drop” policy. Similar to:
  nft add chain inet CRIU output { type filter hook output priority 0 ; policy drop; }
- Create a rule to “accept” packets with SOCCR_MARK. Similar to:
  nft add rule inet CRIU output meta mark 0xC114 accept
Note: the root task PID is appended to the table name to avoid having two instances of CRIU create the same table causing them to conflict.
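To illustrate the steps above together with the per-PID table name, here is a minimal sketch that only builds the nft command text. The "CRIU-<pid>" name format, the function name and the input chain commands (mirrored from the output ones) are my assumptions for illustration. The sketch does not call libnftables; presumably the resulting string would be handed to it, for example through nft_run_cmd_from_buffer.

```c
#include <assert.h>
#include <stdio.h>

#define SOCCR_MARK 0xC114u /* mark value used in the nft commands above */

/*
 * Write the nft commands for a netns-wide lock into buf.
 * The table is named "CRIU-<root_pid>" (assumed format) so that two
 * CRIU instances do not create the same table.
 * Returns 0 on success, -1 if buf is too small.
 */
int build_netns_lock_script(char *buf, size_t len, int root_pid)
{
	int n;

	assert(buf != NULL);

	n = snprintf(buf, len,
		"create table inet CRIU-%d\n"
		"add chain inet CRIU-%d output { type filter hook output priority 0; policy drop; }\n"
		"add rule inet CRIU-%d output meta mark 0x%X accept\n"
		"add chain inet CRIU-%d input { type filter hook input priority 0; policy drop; }\n"
		"add rule inet CRIU-%d input meta mark 0x%X accept\n",
		root_pid,
		root_pid,
		root_pid, SOCCR_MARK,
		root_pid,
		root_pid, SOCCR_MARK);

	/* snprintf reports the length it needed; treat truncation as failure */
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Sending all commands in one buffer would also let nftables apply them as a single batch, although I have not shown that part here.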
Locking (per-socket)
Preparation
- Create a table named CRIU. Similar to:
  nft create table inet CRIU
- Create INPUT/OUTPUT chains. Similar to:
  nft add chain inet CRIU output { type filter hook output priority 0 }
- Create a rule to “accept” packets with SOCCR_MARK. Similar to:
  nft add rule inet CRIU output meta mark 0xC114 accept
- Create a connections set, which should contain connection identifying tuples (src_addr, src_port, dst_addr, dst_port). This is possible due to concatenations support for nftables sets, which requires kernel ≥ 4.1 (https://wiki.nftables.org/wiki-nftables/index.php/Concatenations). Similar to:
  nft add set inet CRIU conns { type ipv4_addr . inet_service . ipv4_addr . inet_service ;}
- Create a rule to “drop” packets that match connections in the conns set. Similar to:
  nft add rule inet CRIU output ip saddr . tcp sport . ip daddr . tcp dport @conns drop
Note: A separate set/rule is added for IPv6.
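As a sketch of what the IPv4 and IPv6 variants of the set and the drop rule could look like side by side: the function name and the IPv6 set name "conns6" are placeholders I made up, and the code only builds the command text rather than talking to libnftables.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write the "add set" and "add rule" commands for one address family.
 * ipv6 == 0 reproduces the IPv4 commands listed above; the IPv6 set
 * name "conns6" is a hypothetical placeholder.
 * Returns 0 on success, -1 if buf is too small.
 */
int build_conn_set_script(char *buf, size_t len, const char *table, int ipv6)
{
	const char *set = ipv6 ? "conns6" : "conns";
	const char *addr_type = ipv6 ? "ipv6_addr" : "ipv4_addr";
	const char *proto = ipv6 ? "ip6" : "ip";
	int n;

	assert(buf != NULL && table != NULL);

	n = snprintf(buf, len,
		"add set inet %s %s { type %s . inet_service . %s . inet_service; }\n"
		"add rule inet %s output %s saddr . tcp sport . %s daddr . tcp dport @%s drop\n",
		table, set, addr_type, addr_type,
		table, proto, proto, set);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

The key order in the rule (saddr, sport, daddr, dport) has to match the type order declared for the set, otherwise lookups in the set would never match.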
Locking one socket
CRIU just adds the connection identifying tuple to the conns set in nftables_lock_connection.
Similar to nft add element inet CRIU conns { 192.168.1.96 . 46315 . 1.1.1.1 . 53 }
Nftables sets are very efficient and avoid the need for a separate rule per connection: we only need to add connection entries to the conns set.
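A minimal sketch of turning one connection tuple into such an "add element" command; the struct and function names are illustrative and not CRIU’s, and the addresses are taken as already formatted strings to keep the example short.

```c
#include <assert.h>
#include <stdio.h>

/* Connection identifying tuple; field names are illustrative. */
struct conn_tuple {
	const char *src_addr;
	unsigned short src_port;
	const char *dst_addr;
	unsigned short dst_port;
};

/*
 * Write an "add element" command for one connection into buf.
 * Returns 0 on success, -1 if buf is too small.
 */
int build_add_conn_cmd(char *buf, size_t len, const char *table,
		       const struct conn_tuple *c)
{
	int n;

	assert(buf != NULL && table != NULL && c != NULL);

	n = snprintf(buf, len,
		     "add element inet %s conns { %s . %u . %s . %u }\n",
		     table,
		     c->src_addr, (unsigned int)c->src_port,
		     c->dst_addr, (unsigned int)c->dst_port);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Locking another socket is then just one more element in the same set, while the drop rule itself stays untouched.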
Unlocking
With nftables, unlocking is much easier: we only need to delete the CRIU table, which removes its chains, rules and sets along with it.
Similar to nft delete table inet CRIU
Option documentation
A new criu.org page was created to document the --network-lock option: https://criu.org/CLI/opt/--network-lock
List of commits
Related to GSoC project
- zdtm: add network namespace locking test
- test/jenkins: fix netns_lock test multiple iterations failure
- test/ci: sync netns_lock test and its --post-start hook
- criu: add --network-lock option to allow nftables alternative
- cr-service: add network_lock option to RPC and libcriu
- cr-check: add check for nftables based network locking
- criu: rename iptables network locking/unlocking functions
- criu: add nftables netns-wide locking/unlocking
- zdtm: add nftables network namespace locking test
- zdtm: add iptables per-socket locking test
- criu: add nftables connection locking/unlocking
- zdtm: add nftables per-socket locking test
- zdtm: add ipv6 variants of net_lock_socket_* tests
- inventory: save network lock method to reuse in restore
- criu: use unique table names for nftables based locking
Unrelated to GSoC (During GSoC)
- pidfd_store: move pidfd_store to a separate file
- pidfd_store: tidy up interface and hide unneeded details
- cr-service: move pidfd_store initialization to cr-service
- protobuf: remove leading underscores from protobuf structs
- scripts/build: add a docker file for archlinux
- zdtm: prioritize /lib/* dependencies in some tests
Unrelated to GSoC (Before GSoC)
- criu: optimize find_unix_sk_by_ino()
- ci: run zdtm/transition/pid_reuse with pre-dumps in ci tests
- phaul: fix infinite pre-dump iterations in Migrate()
- cr-service: fix CRIU_REQ_TYPE__FEATURE_CHECK RPC request
- criu: check if pidfd_open syscall is supported
- criu: check if pidfd_getfd syscall is supported
- cr-service: add pidfd_store_sk option to rpc.proto
- cr-check: add ability to check if pidfd_store feature is supported
- criu: add pidfd based pid reuse detection for RPC clients
- zdtm: add --pidfd-store option in RPC mode
- zdtm: add pidfd store based pid reuse test
Next steps
Radostin (@rst0git) suggested we could update go-criu with support for network_lock and perhaps enable support in runc and/or crun. Once the PR is merged I could start working on those.