Docker meet firewall - finally an answer

One of the most annoying things with Docker has been how it interacts with iptables. And ufw. And firewalld. Most firewall solutions on Linux assume they are the source of truth. But increasingly thats not a sensible assumption. This inevitably leads to collisions - restarting the firewall or Docker will end up clobbering something. Whilst it was possible to get it working, it was a pain. And always a bit dirty. I don’t want to have to restart Docker after tweaking my firewall! Recently a new solution has presented itself and it looks like things are going to get a lot better:

In Docker 17.06 and higher, you can add rules to a new table called DOCKER-USER, and these rules will be loaded before any rules Docker creates automatically. This can be useful if you need to pre-populate iptables rules that need to be in place before Docker runs.

You can read more about it in the pull request that added it.

So how do we make use of that? Searching for an answer is still hard - there are 3 years of people scrambling to work around the issue and not many posts like this one yet. But by the end of this post you will have an iptables based firewall that doesn’t clobber Docker when you apply it. Docker won’t clobber it either. And it will make it easier to write rules that apply to non-container ports and container ports alike.

Starting from an Ubuntu 16.04 VM that has Docker installed but has never had an explicit firewall setup before. If you’ve had any other sort of Docker firewall before, undo those changes. Docker should be allowed to do its own iptables rules. Don’t change the FORWARD chain to ACCEPT from DROP. There is no need any more. On a clean environment before any of our changes this is what iptables-save looks like:

$ sudo iptables-save
# Generated by iptables-save v1.6.0 on Tue Aug 15 04:02:08 2017
*nat
:PREROUTING ACCEPT [1:64]
:INPUT ACCEPT [1:64]
:OUTPUT ACCEPT [8:488]
:POSTROUTING ACCEPT [10:616]
:DOCKER - [0:0]
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
-A POSTROUTING -s 172.19.0.0/16 ! -o br-68428f03a4d1 -j MASQUERADE
-A POSTROUTING -s 172.18.0.0/16 ! -o docker_gwbridge -j MASQUERADE
-A POSTROUTING -s 172.19.0.2/32 -d 172.19.0.2/32 -p tcp -m tcp --dport 9200 -j MASQUERADE
-A DOCKER -i docker0 -j RETURN
-A DOCKER -i br-68428f03a4d1 -j RETURN
-A DOCKER -i docker_gwbridge -j RETURN
-A DOCKER ! -i br-68428f03a4d1 -p tcp -m tcp --dport 9200 -j DNAT --to-destination 172.19.0.2:9200
COMMIT
# Completed on Tue Aug 15 04:02:08 2017
# Generated by iptables-save v1.6.0 on Tue Aug 15 04:02:08 2017
*filter
:INPUT ACCEPT [174:13281]
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [138:16113]
:DOCKER - [0:0]
:DOCKER-ISOLATION - [0:0]
:DOCKER-USER - [0:0]
-A FORWARD -j DOCKER-USER
-A FORWARD -j DOCKER-ISOLATION
-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -o docker0 -j DOCKER
-A FORWARD -i docker0 ! -o docker0 -j ACCEPT
-A FORWARD -i docker0 -o docker0 -j ACCEPT
-A FORWARD -o br-68428f03a4d1 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -o br-68428f03a4d1 -j DOCKER
-A FORWARD -i br-68428f03a4d1 ! -o br-68428f03a4d1 -j ACCEPT
-A FORWARD -i br-68428f03a4d1 -o br-68428f03a4d1 -j ACCEPT
-A FORWARD -o docker_gwbridge -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -o docker_gwbridge -j DOCKER
-A FORWARD -i docker_gwbridge ! -o docker_gwbridge -j ACCEPT
-A FORWARD -i docker_gwbridge -o docker_gwbridge -j DROP
-A DOCKER -d 172.19.0.2/32 ! -i br-68428f03a4d1 -o br-68428f03a4d1 -p tcp -m tcp --dport 9200 -j ACCEPT
-A DOCKER-ISOLATION -i br-68428f03a4d1 -o docker0 -j DROP
-A DOCKER-ISOLATION -i docker0 -o br-68428f03a4d1 -j DROP
-A DOCKER-ISOLATION -i docker_gwbridge -o docker0 -j DROP
-A DOCKER-ISOLATION -i docker0 -o docker_gwbridge -j DROP
-A DOCKER-ISOLATION -i docker_gwbridge -o br-68428f03a4d1 -j DROP
-A DOCKER-ISOLATION -i br-68428f03a4d1 -o docker_gwbridge -j DROP
-A DOCKER-ISOLATION -j RETURN
-A DOCKER-USER -j RETURN
COMMIT

The main points to note are that INPUT has been left alone by Docker and that there is (as documented) a DOCKER-USER chain that has been set up for us. All traffic headed to a container goes to the FORWARD chain and this lets DOCKER-USER filter that traffic before the Docker rules are applied.

A firewall that doesn’t smoosh Docker iptables rules

So a super simple firewall. Create a new /etc/iptables.conf that looks like this:

*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:FILTERS - [0:0]
:DOCKER-USER - [0:0]

-F INPUT
-F DOCKER-USER
-F FILTERS

-A INPUT -i lo -j ACCEPT
-A INPUT -p icmp --icmp-type any -j ACCEPT
-A INPUT -j FILTERS

-A DOCKER-USER -i ens33 -j FILTERS

-A FILTERS -m state --state ESTABLISHED,RELATED -j ACCEPT
-A FILTERS -m state --state NEW -s 1.2.3.4/32
-A FILTERS -m state --state NEW -m tcp -p tcp --dport 22 -j ACCEPT
-A FILTERS -m state --state NEW -m tcp -p tcp --dport 23 -j ACCEPT
-A FILTERS -m state --state NEW -m tcp -p tcp --dport 80 -j ACCEPT
-A FILTERS -m state --state NEW -m tcp -p tcp --dport 443 -j ACCEPT
-A FILTERS -j REJECT --reject-with icmp-host-prohibited

COMMIT

You can load it into the kernel with:

iptables-restore -n /etc/iptables.conf

That -n flag is crucial to avoid breaking Docker.

Whats going on here?

  • This firewall avoids touching areas Docker is likely to interfere with. You can restart Docker over and over again and it will not harm or hinder our rules in INPUT, DOCKER-USER or FILTERS.

  • We explicitly flush INPUT, DOCKER-USER and FILTERS. This means we don’t end up smooshing 2 different versions of our iptables.conf together. Normally this is done implicitly by iptables-restore. But its that implicit flush that that clobbers the rules that Docker manages. So we will only ever load this config with iptables-restore -n /etc/iptables.conf. The -n flag turns off the implicit global flush and only does our manual explicit flush. The Docker rules are preserved - no more restarting Docker when you change your firewall.

  • We have an explicit FILTERS chain. This is used by the INPUT chain. But Docker traffic actually goes via the FORWARD chain. And thats why ufw has always been problematic. This is where DOCKER-USER comes in. We just add a rule that passes any traffic from the external physical network interface to our FILTERS chain. This means that when I want to allow my home IP (in this example 1.2.3.4) access to every port I update the FILTERS chain once. I don’t have to add a rule in INPUT and a rule in DOCKER-USER. I don’t have to think about which part of the firewall a rule will or won’t work in. My FILTERS chain is the place to go.

Starting the firewall at boot

You can load this firewall at boot with systemd. Add a new unit - /etc/system/system/iptables.service:

[Unit]
Description=Restore iptables firewall rules
Before=network-pre.target

[Service]
Type=oneshot
ExecStart=/sbin/iptables-restore -n /etc/iptables.conf

[Install]
WantedBy=multi-user.target

And enable it:

$ sudo systemctl enable --now iptables

The firewall is now active, and it didn’t smoosh your docker managed iptables rules. You can reboot and the firewall will come up as it is right now.

Updating the firewall

Pop open the firwall in your favourite text editor, add or remove a rule from the FILTERS section, then reload the firewall with:

$ sudo systemctl restart iptables

Starting services on hotplug

I want systemd to start a service when a USB device is plugged in and stop it when i remove it.

Use systemctl to get a list of units:

$ systemctl
UNIT                                       LOAD   ACTIVE SUB       DESCRIPTION
sys-subsystem-net-devices-gamelink0.device loaded active plugged   PL25A1 Host-Host Bridge
<snip>

There’s no configuration required here - the .device unit just appears in response to udev events without any configuration. I’ve previously set up udev rules so that my USB host-to-host cable is consistently named gamelink0, but even without that it would show up under its default name.

The simplest way is just to take advantage of WantedBy. In gamelink.service:

$ cat /etc/systemd/systemd/gamelink.service
[Unit]
Description = Gamelink cable autoconf
After=sys-subsystem-net-devices-gamelink0.device
BindsTo=sys-subsystem-net-devices-gamelink0.device

[Service]
Type=simple
Environment=PYTHONUNBUFFERED=1
ExecStart=/usr/bin/python3 /home/john/gamelink.py

[Install]
WantedBy = sys-subsystem-net-devices-gamelink0.device

And then install it:

$ systemctl enable gamelink
Created symlink from /etc/systemd/system/sys-subsystem-net-devices-gamelink0.device.wants/gamelink.service to /etc/systemd/system/gamelink.service.

The WantedBy directive tells systemctl enable to drop the symlink in a .wants directory for the device. Whenever a new unit becomes active systemd will look in .wants to see what other related services needs to be started, and that applies to .device units just as much as .service units or .target units. That behaviour is all we need to start our daemon on hotplug.

The BindsTo directive lets us stop the service when the .device unit goes away (i.e. the device is unplugged). Used in conjunction with After we ensure that the service may never be in active state without a specific device unit also in active state.

Multi-core twisted with systemd socket activation

With a stateless Twisted app you can scale by adding more instances. Unless you are explicitly offloading to subprocesses you will often have spare cores on the same box as your existing instance. But to exploit them you end up running haproxy or faking a load balancer with iptables.

With the SO_REUSEPORT socket flag multiple processes can listen on the same port, but this isn’t available from twisted (yet). But with systemd and socket activation we can use it today.

As a proof of concept we’ll make a 4 core HTTP static web service. In a fresh Ubuntu 16.04 VM install python-twisted-web.

In /etc/systemd/system/[email protected]:

[Unit]
Description=Socket for worker %i

[Socket]
ListenStream = 8080
ReusePort = yes
Service = [email protected]%i.service

[Install]
WantedBy = sockets.target

And in /etc/systemd/system/[email protected]

[Unit]
Description=Worker %i
[email protected]%i.socket

[Service]
Type=simple
ExecStart=/usr/bin/twistd --nodaemon --logfile=- --pidfile= web --port systemd:domain=INET:index:0 --path /tmp
NonBlocking=yes
User=nobody
Group=nobody
Restart=always

Then to get 4 cores:

$ systemctl enable --now [email protected]
$ systemctl enable --now [email protected]
$ systemctl enable --now [email protected]
$ systemctl enable --now [email protected]

Lets test it. In a python shell:

import urllib
import time

while True:
    urllib.urlopen('http://172.16.140.136:8080/').read()
    time.sleep(1)

And in another terminal you can tail the logs with journalctl:

$ sudo journalctl -f -u [email protected]*.service
Apr 26 02:43:51 ubuntu twistd[10441]: 2017-04-26 02:43:51-0700 [-] - - - [26/Apr/2017:09:43:51 +0000] "GET / HTTP/1.0" 200 2081 "-" "Python-urllib/1.17"
Apr 26 02:43:52 ubuntu twistd[10441]: 2017-04-26 02:43:52-0700 [-] - - - [26/Apr/2017:09:43:52 +0000] "GET / HTTP/1.0" 200 2081 "-" "Python-urllib/1.17"
Apr 26 02:43:53 ubuntu twistd[10444]: 2017-04-26 02:43:53-0700 [-] - - - [26/Apr/2017:09:43:53 +0000] "GET / HTTP/1.0" 200 2081 "-" "Python-urllib/1.17"
Apr 26 02:43:54 ubuntu twistd[10452]: 2017-04-26 02:43:54-0700 [-] - - - [26/Apr/2017:09:43:54 +0000] "GET / HTTP/1.0" 200 2081 "-" "Python-urllib/1.17"
Apr 26 02:43:55 ubuntu twistd[10452]: 2017-04-26 02:43:55-0700 [-] - - - [26/Apr/2017:09:43:55 +0000] "GET / HTTP/1.0" 200 2081 "-" "Python-urllib/1.17"
Apr 26 02:43:56 ubuntu twistd[10447]: 2017-04-26 02:43:56-0700 [-] - - - [26/Apr/2017:09:43:56 +0000] "GET / HTTP/1.0" 200 2081 "-" "Python-urllib/1.17"
Apr 26 02:43:57 ubuntu twistd[10450]: 2017-04-26 02:43:57-0700 [-] - - - [26/Apr/2017:09:43:57 +0000] "GET / HTTP/1.0" 200 2081 "-" "Python-urllib/1.17"

As you can see the twisted[pid] changes as different cores handle requests.

If you deploy new code you can systemctl restart [email protected]*.service to restart all cores.

systemctl enable will mean the 4 cores are available on next boot, too.

Building the Linux kernel on a Mac inside Docker: Attempt #2

Todays failure is about xargs. For whatever reason inside a qemu-user-static environment inside docker it can no longer do its part to help build a kernel:

  CLEAN   arch/arm/boot
/usr/bin/xargs: rm: Argument list too long
Makefile:1502: recipe for target 'clean' failed
make[2]: *** [clean] Error 126
make[2]: Leaving directory '/src/debian/linux-source-4.7.0/usr/src/linux-source-4.7.0'
debian/ruleset/targets/source.mk:35: recipe for target 'debian/stamp/install/linux-source-4.7.0' failed
make[1]: *** [debian/stamp/install/linux-source-4.7.0] Error 2
make[1]: Leaving directory '/src'
debian/ruleset/local.mk:96: recipe for target 'kernel_source' failed

It looks like I need to patch the kernels Makefile to workaround some limit qemu is introducing.

Building the Linux kernel on a Mac inside Docker: Attempt #1

I’ve recently been using my Docker for Mac install to try my hand at packaging the latest upstream kernel for my Raspberry Pi. I have a Debian Jessie ARM rootfs with kernel-package installed and the ideas was to run make-kpkg on an OSX-local checkout. It was able to build the main kernel but then:

make[3]: *** Documentation/Kbuild: Is a directory.  Stop.
Makefile:1260: recipe for target '_clean_Documentation' failed
make[2]: *** [_clean_Documentation] Error 2
make[2]: Leaving directory '/src/debian/linux-source-4.7.0/usr/src/linux-source-4.7.0'
debian/ruleset/targets/source.mk:35: recipe for target 'debian/stamp/install/linux-source-4.7.0' failed
make[1]: *** [debian/stamp/install/linux-source-4.7.0] Error 2
make[1]: Leaving directory '/src'
debian/ruleset/local.mk:96: recipe for target 'kernel_source' failed
make: *** [kernel_source] Error 2

The moral of this story is to not try and build kernels on case insensitive filesystems.