The current weekly setting causes every NixOS server to try to renew
its certificate at midnight on the dot on Monday. This contributes to
the general problem of periodic load spikes for Let's Encrypt; NixOS
is probably not a major contributor to that problem, but we can lead by
example by picking good defaults here.
The values here were chosen after consulting with @yuriks, an SRE at
Let's Encrypt:
* Randomize the time certificates are renewed within a 24 hour period.
* Check for renewal every 24 hours, to ensure the certificate is always
renewed before an expiry notice is sent out.
* Increase the AccuracySec (thus lowering the accuracy(!)), so that
systemd can coalesce the renewal with other timers being run.
(You might be worried that this would defeat the purpose of the time
skewing, but systemd is documented as avoiding this by picking a
random time.)
lego already bundles the chain with the certificate,[1] so the current
code, designed for simp_le, was resulting in duplicate certificate
chains, manifesting as "Chain issues: Incorrect order, Extra certs" on
the Qualys SSL Server Test.
cert.pem stays around as a symlink for backwards compatibility.
[1] 5cdc0002e9/acme/api/certificate.go (L40-L44)
Lego allows users to use the DNS-01 challenge to validate their
certificates. It is mostly backwards compatible, with a few
caveats.
- extraDomains can no longer have different webroots to the
main webroot for the cert.
- An email address is now mandatory for account creation
The following other changes were required:
- Deprecate security.acme.certs.<name>.plugins, as this was
specific to simp-le
- Rename security.acme.validMin to validMinDays, to avoid
confusion and errors. Lego requires the TTL to be specified in
days
- Add options to cover DNS challenge (dnsProvider,
credentialsFile, dnsPropagationCheck)
- A shared state directory is now used (/var/lib/acme/.lego)
to avoid account creation rate limits and share credentials
between certs
In 5532065d06, acme was changed to be
RemainAfterExit=true, but `postRun` commands are implemented as
`ExecStopPost`. Systemd now considers the service to be still running
after simp_le is finished, so won't run these commands (e.g. to reload
certificates in a webserver). Change `postRun` to use `ExecStartPost` to
ensure the commands are run in a timely manner.
A centralized list for these renames is not good because:
- It breaks disabledModules for modules that have a rename defined
- Adding/removing renames for a module means having to find them in the
central file
- Merge conflicts due to multiple people editing the central file
Add a new option permitting to point certbot to an ACME Directory
Resource URI other than Let's Encrypt production/staging one.
In the meantime, we are deprecating the now useless Let's Encrypt
production flag.
Previously setting `allowKeysForGroup = true; group = "foo"` would not
apply the group permission change of the certificates until the service
gets restarted. This commit fixes this by making systemd restart the
service every time it changes.
Note that applying this commit to a system with an already running acme
systemd service doesn't fix this immediately and you still need to wait
for the next refresh (or call `systemctl restart acme-<domain>`). Once
everybody's service has restarted once this should be a problem of the
past.
Let's encrypt bumped ACME to V2. We need to update our nixos test to
be compatible with this new protocol version.
We decided to drop the Boulder ACME server in favor of the more
integration test friendly Pebble.
- overriding cacert not necessary
- this avoids rebuilding lots of packages needlessly
- nixos/tests/acme: use pebble's ca for client tests
- pebble always generates its own ca which has to be fetched
TODO: write proper commit msg :)
Updating:
- nixos module to use the new `account_reg.json` file.
- use nixpkgs pebble for integration tests.
Co-authored-by: Florian Klink <flokli@flokli.de>
Replace certbot-embedded pebble
* nixos/acme: Fix ordering of cert requests
When subsequent certificates would be added, they would
not wake up nginx correctly due to target units only being triggered
once. We now added more fine-grained systemd dependencies to make sure
nginx always is aware of new certificates and doesn't restart too early
resulting in a crash.
Furthermore, the acme module has been refactored. Mostly to get
rid of the deprecated PermissionStartOnly systemd options which were
deprecated. Below is a summary of changes made.
* Use SERVICE_RESULT to determine status
This was added in systemd v232. we don't have to keep track
of the EXITCODE ourselves anymore.
* Add regression test for requesting mutliple domains
* Deprecate 'directory' option
We now use systemd's StateDirectory option to manage
create and permissions of the acme state directory.
* The webroot is created using a systemd.tmpfiles.rules rule
instead of the preStart script.
* Depend on certs directly
By getting rid of the target units, we make sure ordering
is correct in the case that you add new certs after already
having deployed some.
Reason it broke before: acme-certificates.target would
be in active state, and if you then add a new cert, it
would still be active and hence nginx would restart
without even requesting a new cert. Not good! We
make the dependencies more fine-grained now. this should fix that
* Remove activationDelay option
It complicated the code a lot, and is rather arbitrary. What if
your activation script takes more than activationDelay seconds?
Instead, one should use systemd dependencies to make sure some
action happens before setting the certificate live.
e.g. If you want to wait until your cert is published in DNS DANE /
TLSA, you could create a unit that blocks until it appears in DNS:
```
RequiredBy=acme-${cert}.service
After=acme-${cert}.service
ExecStart=publish-wait-for-dns-script
```
Previously the script would contain an empty `if` block (which is invalid
syntax) if both `data.activationDelay == null` and `data.postRun == ""`. Fix
this by adding a no-op `true`.
This is needed because simp_le expects two certificates in fullchain.pem, leading to error:
> Not enough PEM encoded messages were found in fullchain.pem; at least 2 were expected, found 1.
We now create a CA and sign the key with it instead, providing correct fullchain.pem.
Also cleanup service a bit -- use PATH and a private temporary directory (which
is more suitable).
* Create "full.pem" from selfsigned certificate
* Tell simp_le to create "full.pem"
* Inject service dependency between lighttpd and the generation of certificates
Side note: According to the internet these servers also use the
"full.pem" format: pound, ejabberd, pure-ftpd.
Commit 75f131da02 added
`chown 'nginx:nginx' '/var/lib/acme'` to the pre-start script,
but since it doesn't use `chown -R`, it is possible that there
are older existing subdirs (like `acme-challenge`)
that are owned to `root` from before that commit went it.