GSoC 2021 - CRIU


Project: Use ~~eBPF~~ nftables to lock/unlock the network

Mentor: Radostin Stoyanov @rst0git

Organization: CRIU

GSoC Page


Acknowledgment

All CRIU members consistently gave great, constructive, and fast feedback, especially my mentor Radostin (@rst0git), who guided me through every technical challenge I faced and was always responsive and helpful. It has been a great learning experience, and I am grateful for this opportunity.

CRIU’s community and GSoC opened my eyes to the immense learning experience that one can gain by participating in open source projects.

Thank you!



Overview

During checkpointing and restoring, CRIU locks the network to make sure no TCP packets are accepted by the network stack while the process is checkpointed (to prevent the kernel from sending a RST packet). Currently, CRIU calls out to iptables-restore to create and delete the corresponding iptables rules, which makes iptables-restore an external binary dependency.

There have been reports from users that iptables-restore fails in some way, and nftables (originally eBPF) could avoid this external dependency.

As the title suggests, eBPF was the initial approach, and a lot of work and research went into it. Due to some technical limitations, which are described below, eBPF turned out not to be a viable option. After more research and feedback from Radostin (@rst0git), we settled on nftables instead, which turned out to be a simpler solution that is less prone to errors and much easier to maintain.

This final approach uses nftables as an alternative to iptables-restore to lock/unlock the network. Using libnftables eliminates the dependency on the external iptables-restore binary, and nftables is also more efficient than iptables.


Summary of my work with CRIU

Before and during the ~3 months of GSoC, I worked on several improvements and issues besides the network locking/unlocking project. A full list of those PRs can be found here

The highlights of those PRs were:

Stumbling upon issues, trying to fix or improve them, and getting feedback and suggestions from CRIU’s community was a great learning experience.

Also it was a ton of fun. Sometimes, I needed to look into the source code of Linux and other networking tools to figure out something. It got me excited to dig deeper into Linux kernel internals and low level networking.


Initial solution (eBPF)

The initially proposed solution was to use eBPF to lock/unlock the network. A lot of work and research went into eBPF while I was writing my proposal and during the first month of GSoC.

For the first ~2 weeks of GSoC, I got familiar with RTNETLINK and eBPF assembly, especially attaching SCHED_CLS eBPF (TC) programs to a network interface. I was able to create a rough but functional local prototype of eBPF-based locking in CRIU that passed all tests, including the latest network-wide locking test, which I wrote while preparing the proposal.

During the third week, I started designing how this new network locking method would fit into CRIU's execution flow. I compiled my ideas in detail into this document and shared it on CRIU’s Gitter channel for feedback.

Kumar Kartikeya (@kkdwivedi) gave great feedback and pointed out problems in the eBPF approach:

Those situations are not common, but they can happen and should be accounted for. Unfortunately, this is a fundamental limitation of TC, and despite a lot of research I did not find a suitable workaround.

Since the main benefit of the project was replacing iptables to avoid the external binary dependency, I started looking for alternatives, and nftables (using libnftables) seemed like a viable option. It turned out that the nftables-based solution is a good alternative that is simpler, less prone to errors, and easier to maintain. It was time to start over, but the work and research put into eBPF was definitely a great learning experience. I learned a ton about eBPF, traffic control, and Linux internals.


Current solution (nftables)

I go into the details of the nftables approach below; for more information, this PR contains most of the feedback and discussions with the CRIU community.

Testing

I wrote tests for network-wide locking and per-socket locking before adding the actual nftables implementation. Since the iptables approach was already there I was able to verify that the tests were valid. This method made the feedback loop much faster, as I was able to detect problems early on.

Related commits:

Kerndat

kerndat is a struct in CRIU used to check whether the required kernel features exist.

We need to check that nftables concatenations are supported [has_nftables_concat].

Related commits:

Feature check

criu check --feature network-lock-nftables

Add ability to check if nftables based locking/unlocking is possible.

This checks the corresponding kerndat fields.
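As a rough illustration of what such a check boils down to, here is a minimal sketch. It is not CRIU's actual code: struct kerndat_sketch is a hypothetical, trimmed-down stand-in for the real kerndat structure, and the function name and error message are assumptions; only the has_nftables_concat field name comes from the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for CRIU's kerndat structure (assumption). */
struct kerndat_sketch {
	bool has_nftables_concat; /* kernel supports nftables set concatenations */
};

/*
 * Hypothetical feature check: returns 0 if nftables-based locking
 * looks usable, -1 otherwise.
 */
int check_network_lock_nftables(const struct kerndat_sketch *kdat)
{
	assert(kdat != NULL);

	if (!kdat->has_nftables_concat) {
		fprintf(stderr, "nftables set concatenations are not supported\n");
		return -1;
	}

	return 0;
}
```

A caller would fill the structure once at startup and consult it whenever the nftables method is requested.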

Related commits:

Add --network-lock option

With nftables-based locking/unlocking in place, this option accepts two values: iptables and nftables.

It is better to default to iptables until the nftables method has been tested thoroughly.

e.g. criu dump -t 2221 --tcp-established --network-lock iptables

criu dump -t 2221 --tcp-established --network-lock nftables

The corresponding RPC and libcriu options have been added as well.
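To make the option handling concrete, here is a minimal sketch of mapping the option value to a lock method. The NETWORK_LOCK_IPTABLES and NETWORK_LOCK_NFTABLES names come from the code shown later in this post; the enum layout, the NETWORK_LOCK_DEFAULT macro, and the parse_network_lock_method function are assumptions made for illustration and may differ from what CRIU does.

```c
#include <assert.h>
#include <string.h>

/* Lock methods; the enum layout is an assumption. */
enum network_lock_method {
	NETWORK_LOCK_IPTABLES,
	NETWORK_LOCK_NFTABLES,
};

/* iptables stays the default until nftables is tested thoroughly. */
#define NETWORK_LOCK_DEFAULT NETWORK_LOCK_IPTABLES

/*
 * Hypothetical parser for the --network-lock value.
 * A NULL argument (option not given) selects the default.
 * Returns 0 on success, -1 for an unknown value.
 */
int parse_network_lock_method(const char *arg, enum network_lock_method *out)
{
	assert(out != NULL);

	if (arg == NULL) {
		*out = NETWORK_LOCK_DEFAULT;
		return 0;
	}
	if (strcmp(arg, "iptables") == 0) {
		*out = NETWORK_LOCK_IPTABLES;
		return 0;
	}
	if (strcmp(arg, "nftables") == 0) {
		*out = NETWORK_LOCK_NFTABLES;
		return 0;
	}

	return -1; /* unknown method: let the caller report the error */
}
```

Rejecting unknown values instead of silently falling back keeps a typo on the command line from quietly selecting the wrong locking backend.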

Related commits:

Algorithm flow

I reused the hooks that already invoke iptables locking/unlocking:

network_lock_internal and network_unlock_internal for the network-wide case

nf_lock_connection for the per-socket case (renamed to lock_connection)

Note: The per-socket rule should be loaded in network_lock, because lock_connection only adds connection tuple entries to the associated nftables set.

In those hooks, CRIU simply decides which method to use based on the --network-lock option (stored in opts.network_lock_method).

if (opts.network_lock_method == NETWORK_LOCK_IPTABLES)
	ret |= iptables_network_unlock_internal();
else if (opts.network_lock_method == NETWORK_LOCK_NFTABLES)
	ret |= nftables_network_unlock_internal();

This approach abstracts away the network locking/unlocking details. Both methods should behave identically and be seamlessly interchangeable through the --network-lock option.

Locking (netns-wide)

  1. Create a table named CRIU

    Similar to nft create table inet CRIU

  2. Create INPUT/OUTPUT chains with a default “drop” policy

    Similar to nft add chain inet CRIU output { type filter hook output priority 0 ; policy drop; }

  3. Create a rule to “accept” packets with SOCCR_MARK

    Similar to nft add rule inet CRIU output meta mark 0xC114 accept

Note: The root task PID is appended to the table name, so that two instances of CRIU do not create the same table and conflict with each other.
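The steps above can be sketched in C as a function that only builds the nft command text. This is an illustration under stated assumptions, not CRIU's actual code: the function name build_netns_lock_cmds, the "CRIU-<pid>" table name format, and the mirrored input chain are my assumptions. With libnftables, a buffer like this could then be passed to nft_run_cmd_from_buffer().

```c
#include <stdio.h>

/* Mark value taken from the nft commands shown above. */
#define SOCCR_MARK 0xC114u

/*
 * Build the nft commands for netns-wide locking into buf.
 * The "CRIU-<pid>" table name format is an assumption.
 * Returns 0 on success, -1 if the buffer is too small.
 */
int build_netns_lock_cmds(char *buf, size_t size, int root_pid)
{
	int n = snprintf(buf, size,
		/* 1. table named after CRIU plus the root task PID */
		"create table inet CRIU-%d\n"
		/* 2. output chain that drops by default */
		"add chain inet CRIU-%d output { type filter hook output priority 0; policy drop; }\n"
		/* 3. let packets carrying SOCCR_MARK through */
		"add rule inet CRIU-%d output meta mark 0x%X accept\n"
		/* same chain and rule for the input hook */
		"add chain inet CRIU-%d input { type filter hook input priority 0; policy drop; }\n"
		"add rule inet CRIU-%d input meta mark 0x%X accept\n",
		root_pid, root_pid, root_pid, SOCCR_MARK,
		root_pid, root_pid, SOCCR_MARK);

	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

Keeping command construction separate from execution makes the generated rules easy to log and to compare against the equivalent nft command line.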

Related commits:

Locking (per-socket)

Preparation

  1. Create a table named CRIU

    Similar to nft create table inet CRIU

  2. Create INPUT/OUTPUT chains

    Similar to nft add chain inet CRIU output { type filter hook output priority 0 }

  3. Create a rule to “accept” packets with SOCCR_MARK

    Similar to nft add rule inet CRIU output meta mark 0xC114 accept

  4. Create a connections set that holds connection-identifying tuples (src_addr, src_port, dst_addr, dst_port)

    This is possible thanks to concatenation support in nftables sets, which requires kernel ≥ 4.1 (see https://wiki.nftables.org/wiki-nftables/index.php/Concatenations).

    Similar to nft add set inet CRIU conns { type ipv4_addr . inet_service . ipv4_addr . inet_service ;}

  5. Create a rule to “drop” packets that match connections in the conns set

    Similar to nft add rule inet CRIU output ip saddr . tcp sport . ip daddr . tcp dport @conns drop

Note: A separate set/rule is added for IPv6.
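Steps 4 and 5, including the IPv6 variant mentioned in the note, could be generated along these lines. This is a sketch under assumptions rather than the real implementation: append_cmd and build_conn_set_cmds are hypothetical helpers, the IPv6 set name conns6 is made up for illustration, and only the output chain is covered, as in the commands above.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/*
 * Append one formatted nft command to the NUL-terminated string in buf.
 * Returns 0 on success, -1 if the command does not fit.
 */
static int append_cmd(char *buf, size_t size, const char *fmt, ...)
{
	size_t used = strlen(buf);
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = vsnprintf(buf + used, size - used, fmt, ap);
	va_end(ap);

	return (n < 0 || (size_t)n >= size - used) ? -1 : 0;
}

/*
 * Build the set and drop-rule commands for IPv4 and IPv6.
 * "conns6" is a hypothetical name for the IPv6 set.
 */
int build_conn_set_cmds(char *buf, size_t size, const char *table)
{
	if (size == 0)
		return -1;
	buf[0] = '\0';

	if (append_cmd(buf, size,
		       "add set inet %s conns { type ipv4_addr . inet_service . ipv4_addr . inet_service ; }\n",
		       table) ||
	    append_cmd(buf, size,
		       "add rule inet %s output ip saddr . tcp sport . ip daddr . tcp dport @conns drop\n",
		       table) ||
	    append_cmd(buf, size,
		       "add set inet %s conns6 { type ipv6_addr . inet_service . ipv6_addr . inet_service ; }\n",
		       table) ||
	    append_cmd(buf, size,
		       "add rule inet %s output ip6 saddr . tcp sport . ip6 daddr . tcp dport @conns6 drop\n",
		       table))
		return -1;

	return 0;
}
```

Because the drop rules only reference the sets, they can be installed once up front while the sets start out empty.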

Locking one socket

In nftables_lock_connection, CRIU just adds the connection-identifying tuple to the conns set.

Similar to nft add element inet CRIU conns { 192.168.1.96 . 46315 . 1.1.1.1 . 53 }

nftables sets are very efficient and avoid the need for a separate rule per connection; we only need to add connection entries to the conns set.
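To illustrate how a connection tuple turns into a set element, here is a minimal IPv4-only sketch. The function name format_conn_element and its signature are assumptions and not taken from CRIU; it only formats the command text and leaves execution to the caller.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>

/*
 * Format an "add element" command for one IPv4 connection.
 * Hypothetical helper: returns 0 on success, -1 on failure.
 */
int format_conn_element(char *buf, size_t size, const char *table,
			const struct sockaddr_in *src,
			const struct sockaddr_in *dst)
{
	char saddr[INET_ADDRSTRLEN], daddr[INET_ADDRSTRLEN];
	int n;

	/* Convert the binary addresses to dotted-quad text. */
	if (!inet_ntop(AF_INET, &src->sin_addr, saddr, sizeof(saddr)) ||
	    !inet_ntop(AF_INET, &dst->sin_addr, daddr, sizeof(daddr)))
		return -1;

	/* Ports are stored in network byte order, so convert them first. */
	n = snprintf(buf, size,
		     "add element inet %s conns { %s . %d . %s . %d }",
		     table, saddr, ntohs(src->sin_port),
		     daddr, ntohs(dst->sin_port));

	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

For the connection used in the example above (192.168.1.96:46315 to 1.1.1.1:53) and a table named CRIU, this produces the same element as the nft command shown.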

Related commits:

Unlocking

With nftables, unlocking is much easier: we only need to delete the CRIU table.

Similar to nft delete table inet CRIU

Option documentation

A new criu.org page was created to document the --network-lock option: https://criu.org/CLI/opt/--network-lock


List of commits

Unrelated to GSoC (During GSoC)

Unrelated to GSoC (Before GSoC)


Next steps

Radostin (@rst0git) suggested we could update go-criu with support for network_lock and perhaps enable support in runc and/or crun. Once the PR is merged I could start working on those.