(Re)birth of a homelab

Table of Contents

1. Introduction
2. Considerations and planning
2.1. Goals
2.2. How to make services accessible from the internet
2.3. Requirements overview
2.4. Operating System (OS)
2.5. Virtual Private Server (VPS)
2.6. Tunneling solution
2.7. Virtualization and containerization
2.8. File system
2.8.1. Introduction to ZFS
2.8.2. Protection against hardware failure
2.8.3. Integrity checks and automatic repairs
2.8.4. Extensibility
2.8.5. Hot spares
2.8.6. Portability
2.8.7. Compression
2.9. Backups
2.9.1. The “3-2-1” rule
2.10. Memory
2.11. Uninterruptible Power Supply (UPS)
2.12. Monitoring
2.13. Scope
2.14. Threats and mitigations
3. Choosing and setting up the hardware
3.1. Buying prebuilt vs building your own
3.2. Core components for the main server
3.2.1. CPU
3.2.2. GPU
3.2.3. Motherboard
3.2.4. Memory
3.2.4.1. CPU
3.2.4.2. Motherboard
3.2.4.3. Memory stick
3.2.5. Storage
3.2.5.1. Shucking drives
3.2.6. PSU
3.2.7. Cases
3.2.7.1. Tower cases
3.2.7.2. Rack mounted
3.2.8. Complete builds
3.2.8.1. Build A: cheapest AM4, no ECC
3.2.8.2. Build B: AM4, with ECC memory
3.2.8.3. Build C: AM5, with ECC memory
3.3. Core components for the backup server
3.4. Networking and connectivity
3.5. UPS
3.6. Setting up the physical hardware
4. Configuring the software
4.1. Base OS and environment
4.1.1. Installing Debian
4.1.2. Set static IP
4.1.3. Synchronize time
4.1.4. Setup SSH and local login
4.1.4.1. Use a SSH key instead of a password
4.1.4.2. Change the non-root and root passwords
4.2. Containerization setup
4.2.1. Installing Docker
4.2.2. Container orchestration
4.3. Configuration for the main server
4.3.1. Install ZFS
4.3.2. Create the HDD pool
4.3.3. Create the SSD pool
4.3.4. Setup ZFS snapshots and retention policy
4.3.5. Other useful ZFS commands
4.3.5.1. Rename a pool
4.3.5.2. Change mountpoint
4.3.5.3. Import pool from another server
4.3.5.4. Upgrade a pool
4.3.6. JavaScript
4.3.7. PM2 process manager
4.3.8. Other useful tools
4.4. Configuration for the bastion server
4.4.1. Install Pangolin
4.4.2. Add captcha
4.4.2.1. Get Turnstile API keys
4.4.2.2. Download the captcha HTML template
4.4.2.3. Mount the HTML file into the Traefik container
4.4.2.4. Add captcha settings to the CrowdSec middleware
4.4.2.5. Test it
4.5. Configuration for the backup server
4.5.1. Install ZFS
4.5.2. Create the backup pool
4.5.3. Setup snapshots replication and retention policy
4.5.3.1. Install Sanoid and Syncoid on the backup server
4.5.3.2. Automating the replication process
4.5.3.3. Tunneled SSH connection
4.5.3.4. Hardening the SSH connection
4.5.4. Scrubbing the external HDD
4.5.5. Upgrading Newt
4.6. Deploying services
4.6.1. Troubleshooting
4.6.1.1. DNS resolution issues
4.6.2. Hardening
4.6.2.1. Users, groups, and permissions
4.6.2.2. Resource sharing
4.6.2.3. Nginx
4.7. Monitoring and alerting
4.7.1. The Prometheus/Grafana/Loki stack
4.7.2. Configure Loki with Docker plugin
4.7.3. Sensors and IoT
5. Documentation
5.1. Labeling the drives
5.2. Keeping a list of your service users and groups
6. Showcase
7. Conclusion
7.1. Reflections
7.1.1. The ASRock AM5 fiasco
7.1.2. Other things
7.2. Future improvements
7.3. The end is never the end is never…
7.4. Other resources

Introduction

Hi! Do you enjoy spending time, money, and energy tinkering with servers at home when you could get paid for it? Or maybe you already are a sysadmin, but you want to learn and experiment in a more casual environment? Oooor maybe, you want to stick it to the tech overlords and take back control of your data and everyday tools?

Whatever your reason, you are not alone. I’ve been feeling the same pull toward owning more of my digital life. Part of it is the simple joy of building things, but also, I can’t help but notice how the internet is changing.

Between the UK’s Online Safety Act, the various laws passing in the US, and the recurring “Chat Control” proposals in the EU, it’s impossible to ignore the trend: anonymity and privacy are increasingly under threat.

And on top of that, we’re watching major platforms go through the familiar cycle of enshittification. Propelled by an established and enthusiastic userbase, they go public or get bought up by a bigger company. Months or years later, shareholders or the acquiring company start pushing the platform to squeeze out more profit. That’s when the decline begins: long-standing features disappear without warning, or get locked behind a “premium” subscription tier. Ads creep in, your data becomes training fodder for AI models, and open ecosystems are walled off by blocking third-party clients and tools.

In the face of such trends, it’s up to us to reclaim autonomy over our digital data and the everyday tools we depend on.

At its core, a homelab is a personal playground where you can learn and experiment with servers, applications, and services. It goes hand in hand with another concept, self-hosting: the practice of running services yourself, on hardware you control, instead of relying on big tech. Together, they mean that you can build your own IT infrastructure, tailor the tools and services you use to your needs, and keep your data safe and private.

If this sounds interesting to you, welcome aboard! I would like to share my journey with self-hosting, raise awareness about important considerations, and guide you through the process. As an appetizer, here’s the list of services I like to run on mine:

Vaultwarden: a lightweight password manager, alternative to Bitwarden.
File browser: organize, share, and control access to your files.
Syncthing: synchronize files across devices, perfect for personal backups.
Jellyfin: stream movies and series privately, like a personal Netflix.
Navidrome: private music streaming, similar to Spotify.
Immich: manage and organize your photos and videos, like a private Google Photos.
Sure: track and categorize expenses to manage your finances.
Paperless: organize, search, and share documents, perfect for archiving invoices, medical records, and other important files.
This very blog: my own personal space to express myself and share my knowledge.

With the basics out of the way and a taste of what’s possible, it’s time to roll up our sleeves. Let’s start building a homelab!

About this guide

I wrote this guide as a way to document my journey. As such, it was written over the course of multiple months while building the homelab. I ended up having some trouble, especially with my hardware choices. Make sure to read the conclusion section before making any decisions.

Considerations and planning

When starting any large project, it’s important to define your goals. They will guide you through the entire process and help you make decisions.

Goals are what you want to achieve with the project. They are meant to be broad and high-level.
Requirements translate the goals into concrete actions. They may also include additional constraints.
Scope sets the boundaries of the project. I find it useful to define that boundary by listing what’s out of the scope.

Let’s start with the goals.

Goals

As discussed in the intro, privacy and digital autonomy are key motivations for many people. They may have had one too many bad experiences with enshittification or other egregious decisions made by the companies they rely on. They want to use services that are under their control, evolve at their pace, and do not vanish because a company changed direction.

But for others, it can also start as a simple curiosity and a desire to learn: a way to deepen your understanding of Linux, networking, coding, or infrastructure by actually running the things you usually only read about. A homelab is a safe playground where you can experiment, break things, fix them, and come away with real skills.

The appeal can also be practical. Maybe you have a media library on a hard drive and you think it would be neat to be able to access it from anywhere, including on the go. Or a bunch of photos and videos you want to share with your family and friends. Maybe, like me, you want to have a personal corner of the internet where you can express yourself freely.

While typically not the main motivation, self-hosting can also save money, especially in the long run. This is particularly true if you or your family rely on a number of subscriptions or if you have copious amounts of data you need to store and access.

Those are some of the broader motivations that often bring people into the homelab/self-hosting world. With that in mind, here are the goals that guide the design of my setup (in order of importance):

I want my data to be safe. I want to feel reassured that my data can’t be lost or corrupted.
I want to keep things simple and reliable. I don’t want to spend all my free time troubleshooting or fixing things.
At the same time, I want to experiment with new services and tools without feeling limited or scared to break existing things.
I want to keep my homelab reasonably compact, quiet, and cost effective.
I want to reduce my reliance on any specific provider and to be able to move effortlessly to another provider or offer if needed.

By clarifying your goals upfront, you create a roadmap for every decision you make—from choosing hardware to picking file systems, containers, and backup strategies. Every choice in this guide ties back to these motivations. Yours may differ, but having them written down keeps the project focused and intentional.

How to make services accessible from the internet

Before I start expanding on the requirements, I want to give you a concise explanation of how hosting works and the challenges associated with it. That way, we are all on the same page and the rest of the guide will make more sense. If you are already familiar with the topic, feel free to skip this section.

In this example, we will be hosting my blog, r-entries.com.

When a user, let’s say Alice, tries to access the blog, their browser will send a request to a DNS server to find the IP address of the website. DNS is like a phone book for the internet. In this case, let’s say the returned IP address is 12.34.56.78.

Now that the browser knows the address, it will prepare and send a request to that address. In an ideal world, the server is directly accessible from the internet, so it receives the request, processes it, and sends the response back to the browser.

The browser then receives the response and displays the content of the blog.

And that’s it! Quite simple, don’t you think?
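The lookup step can be sketched in a few lines of Python. This is only an illustration: `socket.gethostbyname` is the standard library call that asks the system resolver for an IPv4 address, while `PHONE_BOOK` and `fake_lookup` are hypothetical stand-ins I made up so the example runs without a network connection.

```python
import socket
from typing import Callable

def resolve(hostname: str, lookup: Callable[[str], str] = socket.gethostbyname) -> str:
    """Return the IPv4 address for hostname, using the system resolver by default."""
    return lookup(hostname)

# A fake phone book standing in for a real DNS server, so the example
# runs offline. 12.34.56.78 is the made-up address used in this guide.
PHONE_BOOK = {"r-entries.com": "12.34.56.78"}

def fake_lookup(hostname: str) -> str:
    """Hypothetical resolver: looks the name up in the local phone book."""
    return PHONE_BOOK[hostname]

print(resolve("r-entries.com", lookup=fake_lookup))
```

Calling `resolve("r-entries.com")` without the `lookup` argument would query the real DNS servers configured on your machine instead.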

How does Alice’s browser know the DNS server’s IP?

No chicken or the egg dilemma here! DNS servers are pre-configured either by the home router, in the device settings or the browser settings. They are configured with an IP address and not a domain name, removing the circular dependency.

But in reality, it’s highly unlikely that the server is directly accessible from the internet. When we say “IP” we typically mean IPv4. It’s the protocol used to uniquely identify devices on the internet. It was first deployed in the 1980s, at a time when the “internet” was still a small community of researchers and universities. With roughly 4.3 billion possible addresses, it was more than enough to cover the needs of the time. But as the internet grew, running out of addresses became an increasingly real possibility.
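To put numbers on the address space, here is a quick back-of-the-envelope calculation. IPv4 addresses are 32 bits long and IPv6 addresses are 128 bits long; the variable names are my own.

```python
# Address sizes in bits, as defined by each protocol.
IPV4_BITS = 32
IPV6_BITS = 128

ipv4_addresses = 2 ** IPV4_BITS
ipv6_addresses = 2 ** IPV6_BITS

print(f"IPv4: {ipv4_addresses:,} addresses")
print(f"IPv6: {ipv6_addresses:.2e} addresses")
# How many complete IPv4 internets would fit inside the IPv6 address space.
print(f"Ratio: 2^{IPV6_BITS - IPV4_BITS}")
```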

In the mid-90s, NAT was introduced to help alleviate the address exhaustion problem. Before the introduction of NAT, every connected device had a globally unique IPv4 address, just like the earlier example with Alice. With NAT, devices behind a router are assigned a local IPv4 address, and the router itself uses a single public IPv4 address.

A quick analogy about NAT

Businesses typically have a single public phone number, instead of dedicated landlines for each employee. When you call it, the receptionist answers and then routes you to the right person. Internally, they may have a local network of phones, with internal numbers for each employee. These internal numbers are unknown to the public, and even if they were, they couldn’t be used to call an employee directly.

So NAT is really the same idea but applied to computers on the internet. Instead of a receptionist, you can configure the NAT table to forward different requests to different internal devices.

Let’s update the diagram to reflect the introduction of NAT:

When the router receives the request, it will look up the request’s destination (in practice, the destination port) in the NAT table. If you configured it correctly, it will find the local IP address (e.g. 192.168.1.200) of the server and forward the request to it.

You’ll also notice that the router, sitting right on the edge between the internet and the local network, possesses both a public and a local IPv4 address. The internal address is used by devices within the local network to reach the router. Meanwhile, external devices use the public address to contact the router.
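As a rough mental model, the forwarding rules on a home router can be pictured as a small lookup table. The sketch below is a toy, not how any particular router is implemented: the port numbers and the `forward` helper are assumptions for illustration, and only the local address 192.168.1.200 comes from the example above.

```python
from typing import Optional, Tuple

# Port-forwarding rules on the home router:
# (protocol, public port) -> (local IP, local port).
NAT_TABLE = {
    ("tcp", 443): ("192.168.1.200", 443),  # HTTPS -> the server
    ("tcp", 80): ("192.168.1.200", 80),    # HTTP  -> the server
}

def forward(protocol: str, public_port: int) -> Optional[Tuple[str, int]]:
    """Return where the router sends an incoming request, or None if it is dropped."""
    return NAT_TABLE.get((protocol, public_port))

print(forward("tcp", 443))  # matches a rule, forwarded to the server
print(forward("tcp", 22))   # no rule for this port, so nothing is forwarded
```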

Unfortunately, this is not enough. With the adoption of mobile devices, the number of devices connected to the internet exploded. Mobile carriers had to find a way to accommodate the flood of new devices and introduced CGNAT (Carrier-Grade NAT). It’s “carrier-grade” because it’s not done at the level of the home router, but at the level of an entire network of routers.

When applied to home ISPs, it means that you are no longer assigned a public IPv4 address, but a local one. The result is nested layers of local networks, with NAT placed at each boundary to forward requests appropriately.

This doesn’t impact the average user, but it has major implications for self-hosting. The problem is that you don’t have access to the carrier’s NAT table. And even if you did, it basically means only one customer could make use of the public IPv4 address.
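One quick way to get a hint about whether you are behind CGNAT is to look at the WAN address shown in your router’s admin page. The block 100.64.0.0/10 is reserved for carrier-grade NAT (RFC 6598), so an address in that range is a strong sign. The sketch below is a minimal check; `is_cgnat` is a name I made up, and a negative result is not conclusive since some ISPs use other private ranges for the same purpose.

```python
import ipaddress

# 100.64.0.0/10 is the shared address space reserved for CGNAT (RFC 6598).
CGNAT_RANGE = ipaddress.ip_network("100.64.0.0/10")

def is_cgnat(wan_ip: str) -> bool:
    """True if the router's WAN address falls inside the CGNAT range."""
    return ipaddress.ip_address(wan_ip) in CGNAT_RANGE

print(is_cgnat("100.72.13.5"))   # inside 100.64.0.0/10
print(is_cgnat("12.34.56.78"))   # the made-up public address from earlier
```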

So… self-hosting is no longer possible on the modern, IPv4-exhausted internet?

There are actually several solutions to circumvent this issue:

Ask your ISP to move you off the CGNAT network and provide you with a dedicated public IPv4 address. They may tell you that such a feature is not available on your standard plan and that you need to upgrade to a business-oriented plan. This typically costs several times more than a standard plan.
Support IPv6 only. IPv6 address space is so much larger than IPv4, it doesn’t have the same address exhaustion problem. With IPv6, say goodbye to NAT and CGNAT: every device has a unique global IPv6 address, just like it used to be with IPv4. It’s the future of the internet, but unfortunately, not all ISPs support IPv6 yet. The adoption is still under 50 % worldwide. If it continues at this rate, it may take another 15 years to reach near-universal adoption.
Use some form of tunneling. Tunneling is a technique to encrypt and forward traffic through a secure channel, crossing the NAT and CGNAT boundaries.

I won’t go into the details of how tunneling works, but if you have ever used a VPN, that’s what it relies on. With a tunnel, you recreate a standard NAT setup, completely bypassing the CGNAT:

The server is technically on both the VPN and the house local network, I just wanted to focus on the part that's relevant for hosting the services.

There are a few different types of tunneling:

Reverse tunnel services like Cloudflare Tunnel. However, you have to trust the provider not to snoop on your traffic, which is temporarily decrypted on their end. Personally, I’m not comfortable with this solution: I find their free tier far too generous and that makes me doubt their intentions.
Use a VPN service like NordVPN, TorGuard, or Windscribe. Look for a provider that can offer a dedicated public IPv4 address (which typically comes with an extra cost).
Rent a VPS from a provider. No need to go into the details, what matters is that it’s a server that can be remotely accessed and controlled. This is the solution I’m going to use to make my services accessible from the internet. I’ll explain the reasoning behind this choice in the next section. Make sure they provide a public IPv4 address as part of the service.

And that’s it for the foundational networking knowledge. I hope you found it interesting and informative. It will be useful for understanding some of the decisions I’ll be making later on.

Requirements overview

To achieve the goals I’ve outlined earlier, I’m planning to use a three-part system:

A main server to host the services and store my data. It will be hosted at my place.
A bastion server, the entrypoint to the homelab and its services. It will be hosted at a VPS provider.
A backup server, where offsite backups will be stored. It will be hosted at my mom’s place.

What I mean by a bastion server is a server that is used to access the homelab and its services from the internet. As explained earlier, because of CGNAT and other network restrictions, the main server cannot be directly accessed from the internet. So we need a bastion server to act as a gateway and to tunnel the requests to the main server.

I could have chosen another type of tunneling, like a VPN service that offers a dedicated public IPv4 address. But I think a VPS is more sensible and flexible. Entry-level VPS plans start under 5 €/month, which is on par with VPN services. Besides, you can get a lot more from a VPS. It’s a full-fledged server after all. Also, unlike a VPN or reverse tunnel service, you typically have full control over the server’s software. Sometimes you even get to install the OS yourself. In conclusion, a VPS seems like the best option among the available solutions.

And lastly we have the backup server. As I stated in my goals, data safety is my top priority. Having a backup server located elsewhere is a great way to protect against disasters or home break-ins. If the main server is stolen or destroyed, the backup server can be used to restore the latest backups. If the backup server is stolen or destroyed, the latest data is still safe on the main server.

In the following sections, I’ll explore different aspects of the project and which choice best aligns with my goals.

Naming the servers

Given this three-part system, I’m thinking of naming the servers Melchior (main), Balthasar (bastion), and Casper (backup). This is a reference to the Magi System from Neon Genesis Evangelion.

Operating System (OS)

This section is a work in progress.

Come back later :)

Virtual Private Server (VPS)

I’m going to use a VPS provider for the bastion server. All I really need is a server with:

Debian 13
SSH access
a public IPv4 address
at least 2 vCores
at least 2 GB of RAM
at least 20 GB of storage
a European company and data center

My internet speed is pretty fast with 2 300 Mbps download and about 1 000 Mbps upload. A provider with 1 Gb/s speed would be ideal for fast downloads. But I’m also thinking that a lower speed—on top of reducing the cost—would reserve more bandwidth for my own home internet usage.

Offer	CPU	RAM	Storage	Speed	Price (no commitment)	Price (12 months commitment)
Contabo (VPS 10)	4 vCores	8 GB	75 GB	200 Mb/s	4.50 € / month	3.60 € / month
IONOS (VPS Linux S)	2 vCores	2 GB	80 GB	1 Gb/s	6.80 € / month	4.20 € / month
OVH (VPS-1)	4 vCores	8 GB	75 GB	400 Mb/s	5.39 € / month	4.58 € / month
OVH (VPS-2)	6 vCores	12 GB	100 GB	1 Gb/s	8.39 € / month	7.14 € / month
o2switch (Grow)	8 vCores	16 GB	Unlimited	1 Gb/s	Not available	8.40 € / year

In the end, I decided to go with OVH. Specifically their VPS-1 offering:

CPU: 4 vCores
Memory: 8 GB
Storage: 75 GB (NVMe)
Bandwidth: 400 Mbit/s
Traffic: Unlimited
Price: 5.39 € / month (no commitment)
Backups: Daily, keep one (keep 7 for 2.20 € / month)
Snapshots: No snapshots (one snapshot for 0.60 € / month)
SLA: 99.9 % uptime
OS: Fedora (40, 42), AlmaLinux (8, 9, 10), CloudLinux (9), Debian (11, 12, 13), Rocky Linux (8, 9, 10), Ubuntu (22.04, 24.04, 24.10, 25.04), FreeBSD (14.3)
Comes with public IPv4 and IPv6 addresses
SSH access

Tunneling solution

On my homelab, I would like to host two types of services:

Private services (such as Immich) that should only be accessible by a limited number of users (mostly me).
Public services (such as r-entries.com) that should be directly accessible without any authentication.

This is typically handled by a reverse proxy, a service that acts as an entrypoint for other services. A reverse proxy typically handles routing, filtering requests, authentication, TLS/HTTPS, logging, and more.

On the other hand, the tunneling solution is responsible for creating a secure channel between two servers, the bastion and the main server in our case.

Both functions can be completely separate, but in my case I think I’m going to use Pangolin, a tunneled reverse proxy.

Setup is pretty straightforward:

On the bastion server, you install Pangolin.
On the main server, you install Newt, a lightweight WireGuard client.

Even if you are behind a CGNAT, you can now access your services from the internet. By default, Pangolin considers all services private. When you try to reach a service, you will be redirected to the Pangolin login page. Once logged in, you are sent back to the URL of the service you were trying to access.

While Pangolin also offers a Cloud option, we will focus on the self-hosted version where everything is handled locally.

Authentication vs Authorization

Authentication is verifying who the user is. Authorization determines what the user is allowed to do. Both can be used in tandem, or you can have one without the other.

You have a full range of options to either limit or extend access to your services.

You can make services completely open to the public.
You can ditch authentication and just use authorization with a password, pin code, or a sharable link.
Or keep it as default, requiring both authentication and authorization.

Furthermore, you can define ranked rules to either block or allow access from specific IPs, geolocation, URL paths, and more.
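To make the idea of ranked rules more concrete, here is a toy model in Python. To be clear, this is not Pangolin’s implementation or configuration format: the `Rule` type, the `decide` function, the "first match wins" behavior, and the fallback to authentication are all assumptions I made to illustrate priority-ordered rules in general.

```python
import ipaddress
from fnmatch import fnmatch
from typing import List, NamedTuple

class Rule(NamedTuple):
    priority: int  # lower number = evaluated first (assumed convention)
    action: str    # "allow" or "block"
    kind: str      # "ip" (CIDR range) or "path" (glob pattern)
    value: str

def decide(rules: List[Rule], client_ip: str, path: str,
           default: str = "authenticate") -> str:
    """Return the action of the first matching rule, or the default if none match."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.kind == "ip":
            matched = ipaddress.ip_address(client_ip) in ipaddress.ip_network(rule.value)
        elif rule.kind == "path":
            matched = fnmatch(path, rule.value)
        else:
            raise ValueError(f"unknown rule kind: {rule.kind}")
        if matched:
            return rule.action
    return default

# Hypothetical rules: block a documentation-only IP range, open up /public/.
rules = [
    Rule(1, "block", "ip", "203.0.113.0/24"),
    Rule(2, "allow", "path", "/public/*"),
]
print(decide(rules, "203.0.113.7", "/public/index.html"))
print(decide(rules, "198.51.100.1", "/public/index.html"))
print(decide(rules, "198.51.100.1", "/admin"))
```

The point of ranking is that the same request can match several rules, and the order decides which one applies.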

If you choose to keep authentication, you can have Role-Based Access Control (RBAC) to grant or block access to services based on the user’s role. This gives you further control over who can access what. And contrary to other solutions like Tailscale, you don’t need to install anything on the user’s side.

It can also work in tandem with CrowdSec, a modern, open-source, collaborative behavior detection engine, integrated with a global IP reputation network. It functions as a massively multiplayer firewall, analyzing visitor behavior and responding appropriately to various types of attacks.

Virtualization and containerization

Virtualization offers strong isolation and the ability to run multiple operating systems on the same hardware. However, it also adds extra overhead and complexity.

Containerization offers a lighter alternative to virtualization. Containers are fast to deploy, there’s barely any overhead, and they are easy to manage. However, they don’t offer the same level of isolation as virtual machines.

In my previous homelab, I was running XCP-ng as the hypervisor. Then I had a few VMs running on it. And each would use Docker containers to run services. But all the VMs were running Debian, so I’m unsure if it was worth the complexity of running a hypervisor. Also, having to choose how many resources to allocate to each VM was a bit of a pain, especially when you realize you need to expand a VM’s storage.

Thus, my plan is to run Debian on bare metal on all three servers and use Docker containers to run services. If in the future I decide that virtualization is needed, I can always install Proxmox VE on the main server. Proxmox VE is itself based on Debian, so installing it on top of an existing Debian system should be simple enough. They even have a guide for that.

File system

Introduction to ZFS

So I’ve heard of ZFS before, but I’ve always thought it looked complicated and intimidating. Probably because it does a lot more than your average filesystem.

ZFS, or more specifically OpenZFS, describes itself as an “open-source storage platform”. Here are a few of its key features:

Protection against hardware failure with mirroring and RAIDZ.
Protection against data corruption with integrity verification and automatic repairs.
Built-in snapshots and replication.
Support for massive files and massive storage capacities (ZFS stands for Zettabyte File System after all).
Support for transparent compression and hardware-accelerated native encryption.

Protection against hardware failure

Here’s a quick overview of how it works. Storage is exposed to the system as pools. Pools may include one or more vdevs. A vdev may include one or more disks of the same capacity.

A vdev can be of multiple types:

single: it’s just a single disk. Offers no redundancy but still detects data corruption.
mirror: two or more disks are mirrored, meaning the exact same data is written to all disks. If one disk fails, the vdev continues to operate normally while you replace the faulty disk.
RAIDZ1: three or more disks are used to store data. Parity is computed for each strip (block of data) and then distributed across the disks. If one disk fails, the vdev continues to operate normally while you replace the faulty disk. The total usable capacity is the total capacity of the disks minus one disk worth of parity. This is similar to RAID5.
RAIDZ2: similar to RAIDZ1 but with two parity strips. This means two disks can fail before data loss occurs.

In general:

Mirror requires at least 2 disks to operate. Mirroring with N disks can sustain N-1 disk failures. The total usable capacity is the capacity of just one of the disks.
RAIDZN requires at least N+2 disks to operate and can sustain N disk failures. The total usable capacity is the sum of the disks’ capacities minus N disks worth of parity.

For example, in RAIDZ2, you need at least 4 disks. Let’s say you have 6 disks with 1 TB of capacity each. The total usable capacity is 4 TB (1 TB × 6 disks - 2 TB of parity). You could lose 2 disks before data loss occurs.
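The rules above translate into a small calculator. This is a simplified sketch that follows the guide’s own formulas (including the N+2 minimum disk count) and ignores metadata and padding overhead, so real-world figures will be somewhat lower. The function name and the vdev type strings are my own.

```python
def usable_capacity_tb(vdev_type: str, disks: int, disk_tb: float) -> float:
    """Approximate usable capacity of a vdev, ignoring metadata and padding overhead."""
    if vdev_type == "mirror":
        if disks < 2:
            raise ValueError("a mirror needs at least 2 disks")
        # Every disk holds the same data, so only one disk's worth is usable.
        return disk_tb
    if vdev_type in ("raidz1", "raidz2", "raidz3"):
        parity = int(vdev_type[-1])  # number of disks' worth of parity
        if disks < parity + 2:
            raise ValueError(f"{vdev_type} needs at least {parity + 2} disks")
        return (disks - parity) * disk_tb
    raise ValueError(f"unknown vdev type: {vdev_type}")

print(usable_capacity_tb("raidz2", disks=6, disk_tb=1))   # the example above
print(usable_capacity_tb("mirror", disks=2, disk_tb=4))
```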

Anytime a faulty disk is replaced, ZFS will automatically start recreating the data that was on the faulty disk using the other disks. In the end, the new disk will be a perfect copy of the faulty disk. This process is called resilvering.

Resilvering can take a while, especially when dealing with terabytes of data. It also depends on the topology of the vdev, and how much the disks are utilized during the process. It’s not uncommon to see resilvering take days.

If you choose a vdev type where only one disk can fail, this resilvering period can be very stressful. It’s unlikely for another disk to fail during this period, but it’s still a risk. And if it happens, all the data is unrecoverable. This is why with larger, more numerous disks it’s highly recommended to choose at least RAIDZ2.

Changing vdev type

Once a vdev is created, you can’t change its type. For example, you can’t change a RAIDZ1 to a RAIDZ2.

Integrity checks and automatic repairs

Protection against hardware failure is cool and all but I think integrity checks and automatic repairs really make ZFS stand out. It is achieved by storing a checksum for each block of data. During read operations the checksum is verified. If the checksum does not match, then ZFS knows that the block is corrupted. It will then attempt to repair the data:

In the case of mirroring, ZFS will use the first copy of the data where the checksum matches to repair the data.
In the case of RAIDZ, parity data is used to repair the data. The same way that RAIDZN can sustain N disk failures, ZFS can repair a corrupted block as long as no more than N of its strips (data or parity) are corrupted. If more than N of them are corrupted, then the data is irreparable.

Even if ZFS cannot repair the data, it will at least notify you that a block is corrupted.
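The mirror case can be mimicked with a toy model. This is only a sketch of the idea, not how ZFS is implemented: ZFS uses its own checksum algorithms and on-disk structures, whereas here SHA-256 from `hashlib` is a stand-in and each "disk" is just a `bytearray`. The function names are hypothetical.

```python
import hashlib
from typing import List, Optional

def checksum(block: bytes) -> str:
    """Stand-in checksum for a block of data."""
    return hashlib.sha256(block).hexdigest()

def read_with_self_heal(copies: List[bytearray], expected: str) -> Optional[bytes]:
    """Read a block from a toy mirror, repairing bad copies from a good one."""
    good = next((bytes(c) for c in copies if checksum(bytes(c)) == expected), None)
    if good is None:
        return None  # every copy is corrupted: detected, but irreparable
    for c in copies:
        if checksum(bytes(c)) != expected:
            c[:] = good  # overwrite the corrupted copy in place
    return good

data = b"family photos"
expected = checksum(data)                     # stored when the block was written
mirror = [bytearray(data), bytearray(data)]   # two "disks" holding the same block
mirror[0][0] ^= 0xFF                          # simulate silent corruption on disk 0

print(read_with_self_heal(mirror, expected))  # returns the intact data
print(mirror[0] == mirror[1])                 # the bad copy has been rewritten
```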

Furthermore, ZFS doesn’t just wait for data to be read to verify the checksum. It can also periodically perform integrity checks and automatic repairs of all the data. This process is called scrubbing.

Extensibility

ZFS mirror and RAIDZ can be expanded by adding new disks to the vdev. In the case of RAIDZ, this is a recent feature that was introduced in OpenZFS 2.3.

You can also replace the disks in a vdev with larger ones. You need to do that one disk at a time and resilver the vdev after each replacement.

Finally, you can simply add more vdevs to a pool.

Hot spares

Disks can be assigned as hot spares for a pool: if a disk fails in any of the vdevs, resilvering can immediately start on one of the spare disks. This can be useful if you can’t readily access the server to replace the faulty disk (e.g: you’re on vacation).

However, if you only have one vdev, I would argue that it’s better to opt for a more resilient vdev type (e.g. choosing RAIDZ2 instead of RAIDZ1).

In general, it’s still recommended to have a spare disk on hand in case of a hardware failure. This way, you can replace the faulty disk faster and the system will return to a nominal state faster.

Portability

Each disk in a pool has ZFS labels written at the start and end of the device. Those labels contain the pool name, vdev layout, and GUIDs.

So, if you decide to move the pool to another server, you just need to move the disks to the new server. Disk order does not matter (you don’t have to plug them into the same SATA ports, or even the same controller brand). As long as all the required disks are present, ZFS can figure out the correct configuration.

Compression

Another great feature is that ZFS can transparently compress the data. Transparent here means that the data is compressed on write and decompressed on read: the user never realizes that the data is compressed.

It’s not necessary to disable compression on datasets which primarily have incompressible data on them, such as folders full of video or audio files. ZFS is smart enough to not store data compressed if doing so wouldn’t save any on-disk blocks.

It may be counter-intuitive but compression can increase read and write performance. Sure the CPU has to compress and decompress the data, but smaller resulting files also means less data to read or write on the disk. If disk I/O is the bottleneck (e.g: you’re using HDDs), compression will increase effective I/O performance.

Backups

ZFS is uniquely suited for backups. Its built-in replication and snapshots make it easy to restore data to a previous state.

Snapshots are a frozen, read-only view of the filesystem at a given point in time. The data is not duplicated: any modification made after the snapshot is created is tracked separately. So, to restore the filesystem to a given snapshot, you just need to discard those modifications. Snapshots are instant to create and space-efficient.

They also offer good protection against user error and ransomware. The snapshots are immutable, so even if all your files get encrypted, you can still restore the filesystem to a previous snapshot.

Importantly, restoring a snapshot means restoring the entire filesystem to that point in time. This can be useful if an update went wrong or if a ransomware attack encrypted most files. If you only need to restore a specific file or folder, this can be achieved as well. Snapshots are read-only but they can be browsed and copied from just like any directory.

Replication is the process of copying snapshots from one ZFS pool to another. It can be used to back up the data to a remote location. Subsequent replications will only copy the differences between the previous snapshot and the current state of the filesystem. So it’s incremental and network-efficient.

Snapshots can be scheduled to be created at a given frequency. Retention policies can be set to automatically delete old snapshots. The same can be done for replications. You can decide to only replicate snapshots at a given frequency. Replication and retention policies can be set on a folder by folder basis if needed.

Snapshot frequency, replication frequency, and retention should reflect your tolerance for data loss. For me, I’m okay with losing a week of data if my home is burned to the ground or a thief breaks in. This means I don’t need replication to be any more frequent than once a week.

This is the proposed policy for the main server:

A snapshot is created every hour.
Hourly snapshots for the last 24 hours → rollback from accidental deletes or ransomware.
Daily snapshots for the last 7 days → covers recent human error.
Weekly snapshots for the last 4 weeks → medium-term rollback.
Monthly snapshots for the last 6 months → long-term baseline.
Replication to the backup server is done weekly.

On the backup server, I can have a longer retention policy:

Weekly snapshots for the last 12 weeks.
Monthly snapshots for the last 24 months.
Yearly snapshots for the last 6 years.
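
To compare the two policies at a glance, here is a minimal Python sketch that turns the retention counts above into a snapshot total and an approximate look-back horizon. The names (INTERVALS, MAIN_POLICY, BACKUP_POLICY, horizon) are my own, and months and years are rounded to 30 and 365 days, so treat the result as a rough estimate rather than what ZFS or any tool will report.

```python
from datetime import timedelta

# Approximate length of each snapshot interval (months and years are rounded).
INTERVALS = {
    "hourly": timedelta(hours=1),
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
    "monthly": timedelta(days=30),
    "yearly": timedelta(days=365),
}

# Retention counts taken from the two policies above.
MAIN_POLICY = {"hourly": 24, "daily": 7, "weekly": 4, "monthly": 6}
BACKUP_POLICY = {"weekly": 12, "monthly": 24, "yearly": 6}


def horizon(policy):
    """Return how far back the oldest retained snapshot reaches."""
    return max(INTERVALS[kind] * count for kind, count in policy.items())


def total_snapshots(policy):
    """Return the number of snapshots kept once the policy is fully populated."""
    return sum(policy.values())


if __name__ == "__main__":
    for name, policy in (("main", MAIN_POLICY), ("backup", BACKUP_POLICY)):
        print(name, total_snapshots(policy), "snapshots,", horizon(policy).days, "days back")
```

With these approximations, the main server keeps about half a year of history and the backup server about six years.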

The “3-2-1” rule

You’ve probably heard of the 3-2-1 backup rule. It’s a simple backup strategy that ensures you have at least 3 copies of your data, 2 on different media, and 1 offsite. So far, we have planned for 2 copies of the data, including 1 offsite.

There are three major media types:

Magnetic storage: HDD, tape
Optical storage: DVD, Blu-ray
Solid-state storage: SSD, NVMe

We are already using magnetic storage in the form of HDDs in the main and backup servers. Besides the cost, it would be silly to use SSDs for backup purposes and not for the live data.

That leaves us with optical storage for the third copy. Blu-ray discs have gone down in price. They sit today at around 30 € per TB. The drive/burner itself costs less than 100 €.

There is also the Optical Disc Archive technology, which promises larger capacity (up to 5.5 TB per cartridge) and a longer lifespan (up to 100 years). But the technology has been officially discontinued since 2023. The cartridges are not that much more expensive at 40 € per TB. But the drive itself is impossible to buy new, and used ones cost more than 1 000 €.

Finally we have M-DISC, which is basically a Blu-ray disc made with a different, more durable material. The claim is that these discs can last up to 100 years. The price is around 180 € per TB. Basically every BD drive can read M-DISC discs, but only compatible drives can burn them. Such a drive costs around 120 €.

Media	Price per TB	Price of the drive	Advertised longevity
Blu-ray	30 €	100 €	10-20 years
Blu-ray M-DISC	180 €	120 €	100 years
Optical Disc Archive	40 €	1 000 €	100 years

Based on the price per TB, if I have a replacement cycle every 10 years, regular BD would be more cost-effective for the first 60 years. In the next 60 years, it’s very likely that new technology will be released and make the M-DISC completely obsolete.
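
The break-even point can be checked with a few lines of Python. This is only a sketch under simplifying assumptions of mine: the prices per TB stay constant over time, the drive cost is ignored (it slightly favors regular Blu-ray anyway), and the M-DISC set is bought once and never replaced.

```python
import math

BD_PRICE_PER_TB = 30       # € per TB, regular Blu-ray
MDISC_PRICE_PER_TB = 180   # € per TB, M-DISC
REPLACEMENT_CYCLE_YEARS = 10


def bd_cost_per_tb(years_covered):
    """Cumulative € per TB when a fresh set of discs is bought every cycle."""
    sets_bought = math.ceil(years_covered / REPLACEMENT_CYCLE_YEARS)
    return sets_bought * BD_PRICE_PER_TB


def break_even_years():
    """Longest period for which regular Blu-ray costs no more than M-DISC."""
    affordable_sets = MDISC_PRICE_PER_TB // BD_PRICE_PER_TB
    return affordable_sets * REPLACEMENT_CYCLE_YEARS


if __name__ == "__main__":
    print(break_even_years(), "years")
```

Six sets of regular discs cost the same as one set of M-DISC, which is where the 60-year figure comes from.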

Besides, if I replace the set of discs every 10 years, I’ll technically have more than one copy of the data on optical discs. In addition to providing more copies, this will give me the opportunity to collect empirical data on disc aging.

So in conclusion, I’ll be using Blu-ray discs for the third copy. Here’s the plan:

Use Verbatim BD-R DataLifePlus 50GB 6X discs.
Burn them at low speed with verification and keep checksums of the data on the discs themselves and on the other storage media.
Integrity checks of the discs every year.
Replace the set of discs every 10 years.
Continue monitoring discs from the previous set to learn more about their actual longevity.
Keep an eye out for new technologies that could replace Blu-ray discs for the next replacement cycle.
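
The checksum part of the plan could look like the sketch below. It is one possible approach, not a finished tool: the function names (sha256_of, build_manifest, verify) are hypothetical, and how the manifest is stored (for example as a JSON file on the disc and on the servers) is left open.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        str(path.relative_to(root)): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify(root, manifest):
    """Return the relative paths that are missing or whose digest changed."""
    root = Path(root)
    failures = []
    for relative, expected in manifest.items():
        target = root / relative
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(relative)
    return failures
```

The idea is to call build_manifest on the source folder before burning, then run verify against the mounted disc right after the burn and again at each yearly check. An empty list means every file still matches.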

Other tips:

Store optical discs vertically in standard jewel cases without additional materials in the case.
If you want to label the discs, use a water-based permanent marker on the clear inner hub, not the top surface.
20-50 % humidity and temperatures between 15-25 °C are ideal.

More info at this article on Optical Discs by the Canadian Conservation Institute.

Memory

Error-correcting code memory, commonly known as ECC Memory, is a type of RAM known for its data integrity and reliability. This specialized type of computer data storage uses a more sophisticated technology than standard RAM to detect and automatically correct internal data corruption.

The ECC procedure typically involves adding a few extra bits to each chunk of data stored in memory. These extra bits, known as parity bits, allow the system to determine if the data has been corrupted. If an error is detected, the ECC memory can often find the exact bit that is incorrect and restore it to the proper value without the user or the running program even realizing anything happened.
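
To make the idea concrete, here is a toy Hamming(7,4) code in Python: 4 data bits are protected by 3 extra bits, and any single flipped bit can be located and repaired. This is only an illustration of the principle. Real ECC memory works on much wider words in hardware, and the function names below are mine.

```python
def encode(data_bits):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def correct(codeword):
    """Return (fixed codeword, 1-based position of the flipped bit or 0)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The three parity checks, read as a binary number, point at the bad bit.
    position = s1 + 2 * s2 + 4 * s3
    if position:
        c[position - 1] ^= 1
    return c, position


def decode(codeword):
    """Extract the 4 data bits from a (corrected) codeword."""
    return [codeword[2], codeword[4], codeword[5], codeword[6]]
```

Flipping any one of the 7 bits of an encoded word and passing it through correct gives back the original codeword, along with the position of the bit that was repaired.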

To use ECC memory, we need:

A CPU that supports ECC
A motherboard that supports ECC
A memory stick that supports ECC

It needs support at all levels because ECC not only protects the data when in RAM, but also as it travels between the CPU registers and the RAM sticks.

Of course the tradeoff is that ECC memory is more expensive. Also, finding a consumer-grade CPU and motherboard that support ECC memory can be difficult.

ECC memory is one more protection against data corruption. It’s complementary to the other protection mechanisms provided by ZFS.

Uninterruptible Power Supply (UPS)

This section is a work in progress.

Come back later :)

Monitoring

As stated earlier, I’m unhappy about my lack of monitoring and alerting on the previous home lab. This time, I want to make sure things are working as expected and be alerted as soon as possible if something goes wrong.

Here’s a list of the metrics I want to monitor:

Servers
- Resource usage: CPU, RAM, storage, disk I/O, network
- Temperature: CPU, disks, room
- Disk SMART status
- System updates
UPS
- Status
- Battery level, temperature, and health
- Temperature
Costs
- Electricity cost
- ISP subscription
- VPS subscription
- Hardware cost
CrowdSec
- Blocked requests count (total, by scenario)
- Blocked IPs (total, by scenario)
Traefik
- Requests count (total, by domain)
- Requests latency (avg, max, min)
- Errors count (total, by status code)
- Geographic distribution (top 10 countries)
ZFS
- Pool health, error counts and scrub status
- Snapshot jobs and current retention
- Replication jobs
Services
- Ping or health-check uptime
- Image updates

Then I can set up alerts when those metrics are outside of acceptable ranges. For example:

The scheduled scrub is not running on time
Storage is almost full
Temperatures have been above a threshold for some time
Power surges or outages
Services seem unavailable
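
An alert like “above a threshold for some time” needs a bit more logic than a simple comparison, otherwise a short spike would trigger it. Here is a minimal sketch of that rule in Python. The function name and the (timestamp, value) sample format are assumptions of mine; a real monitoring stack would express this in its own rule language.

```python
def sustained_above(samples, threshold, min_duration):
    """Return True if readings stay above threshold for at least min_duration seconds.

    samples is an iterable of (timestamp_seconds, value) pairs sorted by time.
    """
    streak_start = None
    for timestamp, value in samples:
        if value > threshold:
            if streak_start is None:
                streak_start = timestamp  # first reading of a new hot streak
            if timestamp - streak_start >= min_duration:
                return True
        else:
            streak_start = None  # streak broken, start over
    return False
```

For example, with one reading per minute, sustained_above(readings, 55, 600) would only fire once the temperature has stayed above 55 °C for ten minutes in a row.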

Scope

Things that are out of scope:

High availability

This section is a work in progress.

Come back later :)

Threats and mitigations

As a recap of everything I’ve mentioned so far, here’s a list of potential threats and how I plan to mitigate them:

Hardware threats
- Disk failures
  - Covered by RAIDZ on both main and backup servers.
  - SMART monitoring and alerts for signs of imminent failure.
  - Alerting when a disk fails so that it can be replaced ASAP.
  - Spare disks ready to swap in if needed.
- Data corruption
  - Covered by ZFS block checksums.
  - Self-healing by ZFS.
  - Checked by monthly scrubbing on both main and backup servers.
  - ECC memory participating in data integrity on the main server.
  - UPS for clean shutdowns on both main and backup servers.
- Power outages and surges
  - Covered by UPS on both main and backup servers.
  - Bi-yearly UPS battery test.
- Overheating
  - Monitor temps, including disk temps.
  - Monitor room temps and adjust fans accordingly.
  - Use dust filters and clean them regularly.
  - Ensure positive pressure in the computer cases to reduce dust accumulation.
Provider threats
- ISP issues (CGNAT, IPv4 change, outage, blocking ports)
  - Tunnel to the bastion server using Pangolin.
  - Could still use my phone or a 4G/5G router as a backup.
- VPS provider failure
  - Backups to the backup server.
  - If the VPS is unreachable, I could relatively quickly switch to another VPS provider using the latest backup.
Malicious threats
- Bot attacks / DDoS
  - Covered by the bastion server (Pangolin + CrowdSec).
  - Reduce attack surface by making sure only a few ports are open.
- Ransomware
  - Recoverable using local ZFS snapshots and remote backups.
- Vulnerabilities
  - Keep packages and Docker images up to date.
  - Enforce SSH key only authentication on all hosts.
  - Enforce 2FA on publicly accessible services.
  - Audit ~/.ssh/authorized_keys and /etc/sudoers regularly.
  - Network segmentation between containers.
  - Network segmentation between the home lab and other devices on the local network.
- Disaster / theft
  - Use locks and chains to secure the servers and slow down / discourage thieves.
  - Offsite backup
Human threats
- Accidental deletion / overwriting
  - Covered by local ZFS snapshots and remote backups.
- Disk misplacement during maintenance
  - Use stickers to mark the disks and their assigned location.
  - Refer to that label when alerting about a disk needing to be replaced.
- Forgetting my own setup
  - Scripts to bring a fresh Debian server to a working state.
  - Docker Compose and documentation for each service.
  - Infrastructure-as-Code.
Operational risks
- Backup failures
  - Monthly restoration test.
  - Monitoring of ZFS replication jobs.
- Over-utilized resources: alerting
- Alerting failures
  - Have weekly alerts just to confirm that things are working as expected.
  - Monthly manual checks.

Choosing and setting up the hardware

Buying prebuilt vs building your own

This section is a work in progress.

Come back later :)

Core components for the main server

I have a few requirements:

I’ll start with 4 × 3.5” HDDs but in terms of upgradability, I’ll aim for 6 × 3.5” HDDs.
Support for 2 × NVMe SSDs (or at least SATA SSDs)
32 GB of memory (upgradeable to 64 GB if needed)
Efficiency is very important considering the system will run 24/7
CPU performance should be sufficient. I like to use PassMark scores to compare CPUs. Let’s aim for a score of 20 000 or more.

And for the nice to have:

ECC memory
Hardware acceleration for video transcoding and AI workloads

CPU

This section is a work in progress.

Come back later :)

GPU

A GPU (dedicated or integrated) is not strictly necessary. But it can be useful for:

Displaying a video output (accessing the BIOS, debugging, etc.)
Image processing (generating thumbnails)
Video transcoding (Jellyfin, Plex)
AI workloads (Immich face recognition, LLMs)

For video transcoding, Jellyfin recommends:

Intel Arc A series or newer
Nvidia GTX16/RTX20 series or newer (excluding GTX1650)
AMD is NOT recommended

NVIDIA provides a list of supported video encoding and decoding features for their GPUs. Overall, NVIDIA GPUs are pricier but well regarded for their gaming and video transcoding capabilities.

Entry-level Intel Arc GPUs are considered mediocre for gaming but provide excellent video transcoding capabilities considering their price, consumption, and size.

AMD GPUs are considered behind the competition when it comes to video transcoding. Even when the codecs are supported, performance or image quality may be lacking.

Depending on the choice of case, we may have to look for a low profile GPU.

Motherboard

For the motherboard, it’s not too difficult to find one with 6 SATA ports. If we want more SATA ports, we have a few options:

Specialized motherboards with more SATA ports (rare, expensive)
M.2 to SATA adapter (up to 6 SATA ports)
PCIe to SATA adapter
PCIe to SAS + SAS to SATA cable (typically 2 SAS ports, which means 8 SATA ports)
If you really want to go crazy, you can use a PCIe x16 to 4x4 bifurcation card to convert a single PCIe x16 slot into 4 × M.2 slots. Then you can use an M.2 to SATA adapter on each M.2 slot. This means 24 SATA ports.

Other recommendations:

Update the BIOS before installing the CPU: there have been user reports of newly bought CPUs (e.g. Ryzen 9000 series) getting overvolted when installed on a motherboard that didn’t support the CPU without a BIOS update.
Prefer ASRock or ASUS motherboards if you want to use ECC memory. Most other brands don’t support it.

Memory

CPU

Ryzen 3000/4000
- Without integrated graphics: support ECC memory
- With integrated graphics:
  - Pro variant: support ECC memory
  - Non-Pro variant: no support
Ryzen 5000
- Without integrated graphics:
  - Ryzen 5 5500: no support
  - Others: support ECC memory
- With integrated graphics:
  - Pro variant: support ECC memory
  - Non-Pro variant: no support
Ryzen 7000: support ECC memory
Ryzen 8000
- Pro variant: support ECC memory
- Non-Pro variant: no support
Ryzen 9000: support ECC memory

For the AM5 platform we have this handy community spreadsheet.

Motherboard

In general, Asus and ASRock motherboards support ECC memory. Other brands typically don’t.

For the AM5 platform we have this handy community spreadsheet.

Memory stick

Look for ECC UDIMM (Unbuffered DIMM).

Avoid RDIMM (Registered DIMM), which is only supported by workstation / server motherboards.
Avoid sticks that only offer “on-die ECC”. It is a required feature of DDR5 memory, but it only corrects single-bit errors within the memory chip itself.

Storage

You can easily check what’s the best price for different storage types and capacities at diskprices.com.

For ZFS, it’s important that the hard drives are not SMR, or else write performance will suffer and resilvering can become much slower. You can use this list to check that a disk is not SMR. CMR is the preferred type.

Shucking drives

https://www.ifixit.com/Guide/How+to+Shuck+a+WD+Elements+External+Hard+Drive/137646

PSU

Considering the rest of the requirements, we can expect the system to draw a maximum of 400 W. The peak happens at startup, because HDDs draw significantly more power while spinning up than during normal operation.

During normal operation, the system should draw around 120 W.

Any PSU rated for 550 W or higher should be sufficient.

Obviously, efficiency is important for a home lab that will run 24/7. For example, the be quiet! PURE POWER 12 M 650 W advertises an 80+ Gold rating. This means:

at 20 % load, the efficiency is at least 90 %.
at 50 % load, the efficiency is at least 92 %.
at 100 % load, the efficiency is at least 89 %.

So the efficiency should be above 90 % most of the time, which is great. I would recommend at least an 80+ Gold rating.
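
Here is a rough way to turn those numbers into power drawn at the wall. It is a sketch with assumptions worth double-checking: I treat the 120 W and 400 W figures as load on the PSU output, and I use the efficiency of the nearest rated load point, even though the rating says nothing about loads below 20 %.

```python
PSU_RATED_W = 650
# Minimum 80+ Gold efficiency at the three rated load points quoted above.
GOLD_EFFICIENCY = {0.20: 0.90, 0.50: 0.92, 1.00: 0.89}


def load_fraction(dc_load_w, rated_w=PSU_RATED_W):
    """Share of the PSU's rated power that the system is using."""
    return dc_load_w / rated_w


def nearest_efficiency(fraction, table=GOLD_EFFICIENCY):
    """Efficiency at the closest rated load point (a rough approximation)."""
    closest = min(table, key=lambda point: abs(point - fraction))
    return table[closest]


def wall_power_w(dc_load_w, rated_w=PSU_RATED_W):
    """Estimated power drawn from the wall for a given output load."""
    return dc_load_w / nearest_efficiency(load_fraction(dc_load_w, rated_w))


def yearly_kwh(power_w):
    """Energy used in a year of 24/7 operation."""
    return power_w * 24 * 365 / 1000


if __name__ == "__main__":
    wall = wall_power_w(120)
    print(f"{load_fraction(120):.0%} load, {wall:.0f} W at the wall, {yearly_kwh(wall):.0f} kWh per year")
```

At 120 W the PSU sits a little below its 20 % load point, so the real efficiency may be slightly lower than the estimate used here.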

It can be difficult to find a PSU with enough SATA or PATA (aka Molex) cables out of the box. If the PSU is modular, you should check whether it has enough “Drives” / “SATA/PATA” ports. If so, the manufacturer may sell additional cables separately.

It’s also possible to buy compatible SATA/PATA cables from CableMod for about 10 € per cable, plus 15 € for shipping.

Considering that I’m not likely to use more than 6 × 3.5” HDDs, I don’t have to worry about how much current is going through the 12 V rails. Look into it if you’re using more disks.

TODO: add my research on power efficiency

Cases

The difficulty is finding a case that will accommodate at least 6 × 3.5” disks. This is becoming increasingly hard, as nowadays most people use an NVMe SSD and maybe one or two 3.5” HDDs.

There are some specialized “PC NAS” cases, but they are typically quite expensive and value compactness. This means that the motherboard will be Mini-ITX or Micro-ATX, which are typically more expensive and provide fewer slots and features.

Then there are rack mounted cases, which are also expensive. Typically, you can use any size of motherboard but you are still limited by the height of the case. This means the CPU cooler and the GPU need to be low profile. You can easily find cases with at least 6 × 3.5” bays and some with hot-swappable bays on the front of the case which makes maintenance very convenient.

Tower cases

Here are a few I like:

Jonsbo N3
- 8 × 3.5” HDD bays
- Mini ITX motherboard
- SFX PSU (up to 105 mm in length)
- Uses an HDD backplane, needs 2 Molex connectors to power all drives
- 2 × 100 mm fans + 2 × 80 mm fans
- 250 mm wide, 210 mm tall, 374 mm deep
Fractal Design Node 304
- 6 × 3.5” HDD bays
- Mini ITX, Mini DTX motherboard (avoid angled SATA connectors)
- ATX PSU (up to 160 mm in length)
- No HDD backplane
- 2 × 92 mm fans + 1 × 140 mm fan
- 233 mm wide, 298 mm tall, 262 mm deep
Fractal Design Node 804
- 10 × 3.5” HDD bays + 2 × 2.5” SSD bays
- Micro ATX, Mini ITX motherboard
- ATX PSU (up to 260 mm in length)
- No HDD backplane
- 2 × 92 mm fans + 1 × 140 mm fan
- 344 mm wide, 307 mm tall, 389 mm deep

Rack mounted

“19-inch” is the most common rack width. The minimum opening width is 450 mm and the width of the cabinet is 24 inches (600 mm).
Rack mounted equipment also comes in standardized U height units. 1U is 1.75 inches (44.45 mm).
- A 1U server is 44.45 mm tall.
- A 2U server is 88.9 mm tall.
- A 3U server is 133.35 mm tall.
- A 4U server is 177.8 mm tall.
Racks also come in U units. So if you buy a 16U rack, you can put 16U worth of equipment in it.
Rack depth is not standardized. If the rack is enclosed, you have to make sure the equipment fits.
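
The U arithmetic is simple enough to script when planning a rack. The sketch below uses names of my own choosing and works in hundredths of a millimetre, so that exact multiples of 44.45 mm do not get rounded up by floating-point noise.

```python
RACK_UNIT_HUNDREDTHS_MM = 4445  # 1U = 44.45 mm, stored in hundredths of a millimetre


def height_mm(units):
    """Height in millimetres of a given number of rack units."""
    return units * RACK_UNIT_HUNDREDTHS_MM / 100


def units_needed(equipment_height_mm):
    """Smallest whole number of rack units that fits the given height."""
    hundredths = round(equipment_height_mm * 100)
    return -(-hundredths // RACK_UNIT_HUNDREDTHS_MM)  # ceiling division on integers


if __name__ == "__main__":
    print(units_needed(132), "U for a 132 mm tall case")
```

A case that is 132 mm tall, for instance, needs 3U, since 3U is 133.35 mm.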

Here’s one I like:

Inter-Tech IPC 3U-3508
- https://www.inter-tech.de/productdetails/3U-3508_EN.html
- 8 × 3.5” hot-swappable HDD bays
- 2 × 2.5” internal SSD bays
- Mini-ITX, Micro-ATX, ATX motherboards
- ATX PSU
- 2 × 80 mm fans + 2 × 60 mm fans
- max. 100 mm tall CPU cooler / GPU
- max. 244 mm long GPU
- 480 mm wide, 132 mm tall, 528 mm deep (3U rack)

Complete builds

Build A: cheapest AM4, no ECC

Component	Name	Price
Rack	Inter-Tech IPC 3U-3508	200 €
Motherboard	MSI B550-A PRO	100 €
CPU	Ryzen 5 5500	80 €
Cooler	Noctua NH-L12Sx77	80 €
RAM	Integral 32 GB DDR4-3200	70 €
GPU	Sparkle Arc A380 GENIE	130 €
PSU	be quiet! Pure Power 12M 750 W	130 €
NVMe	2 × Crucial P310 SSD 500 GB	100 €
HDDs	4 × WD Ultrastar DC HC320 8 TB	530 €
Total		1 420 €

Note: Ryzen 5000 series processors can support DDR4-3200 memory with two sticks, but only 2 666 MHz with four.

Build B: AM4, with ECC memory

Component	Name	Price
Rack	Inter-Tech IPC 3U-3508	200 €
Motherboard	ASRock B550 Pro4	100 €
CPU	Ryzen 7 5700X	150 €
Cooler	Noctua NH-L12Sx77	80 €
RAM	Timetec Hynix IC DDR4-2666 2x16 GB	100 €
GPU	Sparkle Arc A380 GENIE	130 €
PSU	be quiet! Pure Power 12M 750 W	130 €
NVMe	2 × Crucial P310 SSD 500 GB	100 €
HDDs	4 × WD Ultrastar DC HC320 8 TB	530 €
Total		1 520 €

Build C: AM5, with ECC memory

Component	Name	Price
Rack	Inter-Tech IPC 3U-3508	200 €
Motherboard	ASRock B850 Pro RS	200 €
CPU	Ryzen 7 9700X	280 €
Cooler	Noctua NH-L12Sx77	80 €
RAM	Kingston Server Premier 32 GB DDR5-5600 ECC	200 €
GPU	Sparkle Arc A380 GENIE	130 €
PSU	be quiet! Pure Power 12M 750 W	130 €
NVMe	2 × Crucial P310 SSD 500 GB	100 €
HDDs	4 × WD Ultrastar DC HC320 8 TB	530 €
Total		1 850 €

Let’s compare the three builds, with a focus on the two ECC options, which have CPUs of a similar class:

Criteria	Build A	Build B	Build C
Total price	1 420 €	1 520 €	1 850 €
Core / Thread	6 / 12	8 / 16	8 / 16
Single thread score	3 059	3 385	4 656
CPU score	19 321	26 608	37 180
PCIe slot	4.0	4.0	5.0
RAM frequency	3 200 MHz	2 666 MHz	5 600 MHz
ECC	❌	✅	✅

AM4 is almost 10 years old now. While the platform continues to receive new releases like the Ryzen 5 5500X3D in June 2025, AMD has mostly transitioned to AM5. The latter is newer, having been released in September 2022. AMD has officially stated that it will support AM5 through 2027. Thus, the newer platform is more future-proof.

The AM5 motherboard offers a PCIe 5.0 slot and M.2 Gen5, whereas AM4 is limited to PCIe 4.0 and M.2 Gen4. It also has a 2.5 Gb/s LAN port instead of a 1 Gb/s one.

We can start with 4 HDDs in RAIDZ2, which gives us 16 TB of usable storage. This can be expanded in the future to:

24 TB with 5 HDDs
32 TB with 6 HDDs
40 TB with 7 HDDs
48 TB with 8 HDDs
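
These figures follow from a simple rule: RAIDZ2 reserves the equivalent of two disks for parity. A small sketch, with a function name of my own, that ignores ZFS metadata and padding overhead as well as the TB vs TiB difference, so real usable space will be somewhat lower:

```python
def raidz_usable_tb(disk_count, disk_size_tb, parity=2):
    """Nominal usable capacity: the parity disks' worth of space is reserved."""
    if disk_count <= parity:
        raise ValueError("need more disks than the parity level")
    return (disk_count - parity) * disk_size_tb


if __name__ == "__main__":
    for disks in range(4, 9):
        print(f"{raidz_usable_tb(disks, 8)} TB with {disks} HDDs")
```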

Core components for the backup server

I’m going to use a ZimaBlade 3760, an inexpensive single-board x86 computer. The small form factor is reminiscent of the Raspberry Pi, but it has many advantages over it:

The price is basically the same as a Raspberry Pi 4, but it’s way more powerful.
Same power consumption at idle, slightly higher than a Raspberry Pi 4 under load.
x86 architecture which means it’s highly compatible.
It has a PCIe 2.0 x4 slot, 2 SATA ports, and a DDR4 SO-DIMM slot for extensibility.
It has a regular UEFI BIOS, which makes installing another OS easy.

So as long as the use case is a small, low-power server, this is a better choice than a Raspberry Pi. The Pi is better suited for a home automation server thanks to its I/O capabilities.

Networking and connectivity

MikroTik Hex routers

This section is a work in progress.

Come back later :)

UPS

BX1600MI-FR

This section is a work in progress.

Come back later :)

Setting up the physical hardware

This section is a work in progress.

Photos

Take a picture of the drives before inserting them into the server. Note which bay they are in.
I was surprised that the cooler wasn’t aligned with the CPU but offset by 0.7 mm. I thought something was wrong, but it’s actually because the CPU has a chiplet design, with most of the heat being produced at the bottom of the chip. See this video for more details.
BIOS Flashback: https://www.youtube.com/watch?v=fqKs9fekNNY

Drive location for the main server:

Row	Column 1	Column 2
Row 1	VDJR3AGK	VDJRVSZK
Row 2	—	—
Row 3	—	—
Row 4	VDKTSX3K	VDJRUKVK

SSD location for the main server:

The CT500P310SSD8 is placed at the bottom right M.2 slot, right under the chipset.
The Samsung SSD 970 EVO Plus 500GB is placed on a PCIe to M.2 adapter card.

This section is a work in progress.

Show diagrams of the networking and the power delivery through the UPS + smart plug.

Configuring the software

Base OS and environment

Installing Debian

This section is a work in progress.

Make another post on how to install and use Ventoy

Download the netinstall image
Burn it to a USB stick
Connect the USB stick to the server and make sure the Ethernet cable is also connected.
Boot into the installer. Select either “Install” or “Graphical install”.
Language: English
Locale: United States (en_US)
Keymap: French
Hostname: melchior
Keep domain name empty
Provide a password for the root user
Confirm the password
Create a new user: [USERNAME]
Provide a password for the [USERNAME] user
Use “Guided: use entire disk” to partition the disk
Select the right disk to use, based on the size and name of the disk
Select “All files in one partition (recommended for beginners)”
Write changes to disk: yes
Keep default settings for the installation of the packages
Software selection: only keep “SSH server” selected
Install GRUB boot loader to your primary drive: yes
Select the same disk as before for the boot loader installation
When it says “Installation complete”, remove the USB stick and press continue to reboot the server

Set static IP

nano /etc/network/interfaces

You’ll see something like this:

# The primary network interface
allow-hotplug enp6s0
iface enp6s0 inet dhcp

Replace dhcp with static and add the address, netmask, and gateway lines:

# The primary network interface
allow-hotplug enp6s0
iface enp6s0 inet static
        address 192.168.70.67
        netmask 255.255.255.0
        gateway 192.168.70.1

Reboot the server to apply the changes.
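
Since a typo here can leave the server unreachable after the reboot, it can be worth sanity-checking the values first. Here is a small sketch using Python’s standard ipaddress module. The function name and the set of checks are my own choice, not an exhaustive validation.

```python
import ipaddress


def check_static_config(address, netmask, gateway):
    """Return a list of problems found in a static IPv4 configuration."""
    problems = []
    interface = ipaddress.ip_interface(f"{address}/{netmask}")
    network = interface.network
    gateway_ip = ipaddress.ip_address(gateway)
    if gateway_ip not in network:
        problems.append(f"gateway {gateway_ip} is outside {network}")
    if interface.ip in (network.network_address, network.broadcast_address):
        problems.append(f"{interface.ip} is reserved in {network}")
    if interface.ip == gateway_ip:
        problems.append("address and gateway are identical")
    return problems


if __name__ == "__main__":
    print(check_static_config("192.168.70.67", "255.255.255.0", "192.168.70.1"))
```

An empty list means the address and gateway sit in the same subnet and don’t collide.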

Synchronize time

apt-get install systemd-timesyncd
systemctl enable systemd-timesyncd --now

You can check it’s synchronized with:

timedatectl

               Local time: Sun 2025-11-02 16:15:39 CET
           Universal time: Sun 2025-11-02 15:15:39 UTC
                 RTC time: Sun 2025-11-02 15:15:38
                Time zone: Europe/Paris (CET, +0100)
System clock synchronized: yes
              NTP service: active
          RTC in local TZ: no

Setup SSH and local login

SSH should already be installed and running because we selected it during the installation. But let’s secure it further.

Use a SSH key instead of a password

On your personal computer, generate a new SSH key:

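ssh-keygen -f ~/.ssh/id_melchior

The -f option stores the key pair as ~/.ssh/id_melchior and ~/.ssh/id_melchior.pub. If you enter a bare name at the interactive prompt instead, the key is written to the current directory.

I would also recommend using a passphrase for the key as a second factor of authentication. If your device is ever stolen, the attacker will not be able to use the key without the passphrase.

Copy the public key to the server:

ssh-copy-id -i ~/.ssh/id_melchior.pub tom@192.168.70.67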

Test connecting with the key.

ssh -i ~/.ssh/id_melchior tom@192.168.70.67

If everything is working, we can now disable password authentication.

nano /etc/ssh/sshd_config

Uncomment LoginGraceTime and set it to 20s
Uncomment PermitRootLogin and change it to no
Uncomment StrictModes yes
Uncomment MaxAuthTries and set it to 3
Uncomment PubkeyAuthentication yes
Add AuthenticationMethods publickey right below
Set PasswordAuthentication to no
Keep KbdInteractiveAuthentication no
Keep UsePAM yes
Set X11Forwarding to no
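
With that many settings, it’s easy to miss one. The sketch below audits the text of an sshd_config file against the list above. It’s a simplification with assumptions of mine: it ignores Match blocks and Include directives, only handles the “Keyword value” form, and compares values literally.

```python
EXPECTED = {
    "logingracetime": "20s",
    "permitrootlogin": "no",
    "strictmodes": "yes",
    "maxauthtries": "3",
    "pubkeyauthentication": "yes",
    "authenticationmethods": "publickey",
    "passwordauthentication": "no",
    "kbdinteractiveauthentication": "no",
    "usepam": "yes",
    "x11forwarding": "no",
}


def parse_sshd_config(text):
    """Return the effective keyword -> value mapping (first occurrence wins)."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and commented-out defaults
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        keyword, value = parts[0].lower(), parts[1].strip()
        settings.setdefault(keyword, value)
    return settings


def audit(text, expected=EXPECTED):
    """Return {keyword: (expected, found)} for every mismatch."""
    found = parse_sshd_config(text)
    return {
        key: (want, found.get(key))
        for key, want in expected.items()
        if found.get(key) != want
    }
```

Reading /etc/ssh/sshd_config and passing its content to audit should return an empty dictionary once every line has been edited. A setting that is still commented out shows up with None as the found value.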

Validate syntax first:

sshd -t

Restart the service:

systemctl restart ssh

Do not close the current SSH connection

In case something goes wrong, we want to be able to revert the changes. Only close the SSH connection if you validated everything is working.

In a new terminal, try to connect without the key:

ssh tom@192.168.70.67

It should say something like:

tom@192.168.70.67: Permission denied (publickey).

Now try with the key:

ssh -i ~/.ssh/id_melchior tom@192.168.70.67

Enter the passphrase if you used one.

If it works, we are good to go. We know that password authentication is disabled and that we can still connect with the key. To make it easier to connect, we can add a shortcut to the ~/.ssh/config file.

nano ~/.ssh/config

Add the following:

Host melchior
    Hostname 192.168.70.67
    User tom
    IdentityFile ~/.ssh/id_melchior

Now you can connect with just:

ssh melchior

Change the non-root and root passwords

passwd

Consider that you may have to enter these passwords without access to the clipboard. I would recommend using passphrases for these passwords to make entering them easier while keeping the strength high.

Containerization setup

Installing Docker

apt-get update
apt-get install ca-certificates curl
install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/debian/gpg -o /etc/apt/keyrings/docker.asc
chmod a+r /etc/apt/keyrings/docker.asc
 
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/debian \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  tee /etc/apt/sources.list.d/docker.list > /dev/null
apt-get update
apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

You can check the status of the Docker service with:

systemctl status docker

Container orchestration

What I’m looking for in a Docker management tool/orchestrator:

List containers, images, networks, volumes
View utilization stats: CPU, memory, storage, network
View logs, enter terminal inside the container when possible
Create, view, edit projects (docker compose)
Perform simple operations like build, start, pause, stop, restart, on containers and projects
List and get notified of image updates
(Nice to have) Support for multiple hosts

This section is a work in progress.

Come back later :)

Configuration for the main server

Install ZFS

Add backports to your sources.list:

apt install lsb-release
codename=$(lsb_release -cs);echo "deb http://deb.debian.org/debian $codename-backports main contrib non-free"|tee -a /etc/apt/sources.list

Install the packages:

apt update
apt install linux-headers-amd64
apt install -t stable-backports zfsutils-linux

Press Enter in the license agreement prompt.

Create the HDD pool

Get the list of disk IDs:

ls -l /dev/disk/by-id/

lrwxrwxrwx 1 root root  9 Sep 21 13:41 wwn-0x5000cca0bbe63bb3 -> ../../sdd
lrwxrwxrwx 1 root root  9 Sep 21 13:41 wwn-0x5000cca0bbe68f2e -> ../../sdc
lrwxrwxrwx 1 root root  9 Sep 21 13:41 wwn-0x5000cca0bbe693ad -> ../../sdb
lrwxrwxrwx 1 root root  9 Sep 21 13:41 wwn-0x5000cca0bbf58929 -> ../../sda

Determine the right ashift value:

lsblk -o NAME,PHY-SEC,SIZE,TYPE /dev/sd[a-d]

NAME PHY-SEC  SIZE TYPE
sda     4096  7.3T disk
sdb     4096  7.3T disk
sdc     4096  7.3T disk
sdd     4096  7.3T disk

We see that the PHY-SEC is 4 096. This gives us ashift=12 (2^12 = 4 096).
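
The ashift value is just the base-2 logarithm of the sector size, which can be computed like this (the function name is mine):

```python
def ashift_for(sector_size_bytes):
    """Return log2 of the sector size, which is what ashift expresses."""
    if sector_size_bytes <= 0 or sector_size_bytes & (sector_size_bytes - 1):
        raise ValueError("sector size must be a power of two")
    return sector_size_bytes.bit_length() - 1


if __name__ == "__main__":
    print(ashift_for(4096))
```

A disk reporting 512-byte sectors would give 9 by the same rule.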

Create the RAIDZ2 pool (this will erase everything on these drives. Make sure there’s nothing you care about on them):

zpool create \
  -o ashift=12 \
  -O acltype=posixacl \
  -O xattr=sa \
  -O compression=zstd \
  -O atime=off \
  -O relatime=off \
  -O normalization=formD \
  -m /data \
  data raidz2 \
  /dev/disk/by-id/wwn-0x5000cca0bbe63bb3 \
  /dev/disk/by-id/wwn-0x5000cca0bbe68f2e \
  /dev/disk/by-id/wwn-0x5000cca0bbe693ad \
  /dev/disk/by-id/wwn-0x5000cca0bbf58929

You can add -f to force the creation of the pool (it may be necessary if the disks already have a filesystem on them).

Create the SSD pool

lsblk -f

NAME        FSTYPE     FSVER LABEL UUID                                 FSAVAIL FSUSE% MOUNTPOINTS
sda
├─sda1      zfs_member 5000  data  3786879799329643345
└─sda9
sdb
├─sdb1      zfs_member 5000  data  3786879799329643345
└─sdb9
sdc
├─sdc1      zfs_member 5000  data  3786879799329643345
└─sdc9
sdd
├─sdd1      zfs_member 5000  data  3786879799329643345
└─sdd9
nvme1n1
nvme2n1
├─nvme2n1p1 vfat       FAT32       1083-1133                             965.3M     1% /boot/efi
├─nvme2n1p2 ext4       1.0         0201fa06-8444-490c-ba9d-c926c5010a94  393.7G     4% /
└─nvme2n1p3 swap       1           5c7b6301-be42-4e3f-9365-6672853ac470                [SWAP]
nvme0n1
└─nvme0n1p1 ext4       1.0         c4a339bb-78bc-4156-9232-70e9324a1ff1

Here I can see that my 4 disks are listed as zfs_members. The OS NVMe is nvme2n1 because that’s the one with the mounted partitions (boot and root). So I’ll create a new pool with nvme0n1 and nvme1n1.

I can get their ids with this command:

ls -l /dev/disk/by-id/ | grep -w 'nvme[0-1]n1'

lrwxrwxrwx 1 root root 13 Oct  1 20:19 nvme-CT500P310SSD8_25164FAF1C4F -> ../../nvme0n1
lrwxrwxrwx 1 root root 13 Oct  1 20:19 nvme-CT500P310SSD8_25164FAF1C4F_1 -> ../../nvme0n1
lrwxrwxrwx 1 root root 13 Oct  1 20:19 nvme-eui.0025385691b4ebc8 -> ../../nvme1n1
lrwxrwxrwx 1 root root 13 Oct  1 20:19 nvme-eui.00a075014faf1c4f -> ../../nvme0n1
lrwxrwxrwx 1 root root 13 Oct  1 20:19 nvme-Samsung_SSD_970_EVO_Plus_500GB_S4EVNF0M698680B -> ../../nvme1n1
lrwxrwxrwx 1 root root 13 Oct  1 20:19 nvme-Samsung_SSD_970_EVO_Plus_500GB_S4EVNF0M698680B_1 -> ../../nvme1n1

For NVMe disks, it’s common to find multiple symlinks. Any of them should work. I’ll choose the human readable ones (CT500 and Samsung) without the “_1” suffix.

zpool create \
  -o ashift=12 \
  -O acltype=posixacl \
  -O xattr=sa \
  -O compression=zstd \
  -O atime=off \
  -O relatime=off \
  -O normalization=formD \
  -m /fast \
  fast mirror \
  /dev/disk/by-id/nvme-CT500P310SSD8_25164FAF1C4F \
  /dev/disk/by-id/nvme-Samsung_SSD_970_EVO_Plus_500GB_S4EVNF0M698680B

Setup ZFS snapshots and retention policy

Install sanoid:

apt install sanoid

If it doesn’t already exist, create the sanoid folder in /etc:

mkdir /etc/sanoid

Edit the config file:

nano /etc/sanoid/sanoid.conf

[hdd]
    use_template = production
 
[ssd]
    use_template = production
 
[template_production]
    frequently = 0          # every 15 minutes (disabled)
    hourly = 24             # keep 24 hourly snapshots
    daily = 7               # keep 7 daily snapshots
    weekly = 4              # keep 4 weekly snapshots
    monthly = 6             # keep 6 monthly snapshots
    autosnap = yes
    autoprune = yes

Then enable Sanoid with this command:

systemctl enable --now sanoid.timer

Check that it’s running properly with

systemctl status sanoid.timer

List the snapshots with

zfs list -t snapshot

Some info:

Sanoid runs (by default) every 15 minutes via sanoid.timer.
If autosnap is enabled, it will take snapshots periodically. The frequency is determined from the configuration (here, it will take a new snapshot every hour).
Sanoid completely ignores manually taken snapshots when it comes to its snapshot scheduling and retention policy. It only cares about snapshots that follow its naming scheme.
You can force Sanoid to take a snapshot right now: sanoid -c /etc/sanoid/sanoid.conf --run hdd. This one will be subject to Sanoid retention policy.
You can test run the config using sanoid --debug
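
To see which snapshots Sanoid considers its own, you can filter the names printed by zfs list -t snapshot. The sketch below assumes that Sanoid names its snapshots dataset@autosnap_YYYY-MM-DD_HH:MM:SS_kind. That pattern is an assumption to verify against your own output, and the function name is mine.

```python
import re

# Assumed Sanoid naming pattern: dataset@autosnap_YYYY-MM-DD_HH:MM:SS_kind
SANOID_NAME = re.compile(
    r"^(?P<dataset>[^@]+)@autosnap_"
    r"(?P<date>\d{4}-\d{2}-\d{2})_(?P<time>\d{2}:\d{2}:\d{2})_"
    r"(?P<kind>frequently|hourly|daily|weekly|monthly|yearly)$"
)


def split_snapshots(names):
    """Separate Sanoid-managed snapshot names from manually created ones."""
    managed, manual = [], []
    for name in names:
        (managed if SANOID_NAME.match(name) else manual).append(name)
    return managed, manual
```

Anything that ends up in the manual list is a snapshot you will have to clean up yourself, since the retention policy will not touch it.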

Other useful ZFS commands

Rename a pool

Warning: export will unmount the pool. Here I’ll rename data to hdd.

zpool export data
zpool import data hdd

Change mountpoint

zfs set mountpoint=/services ssd

Import pool from another server

If the pools were already created on another server, you can import them with:

zpool import

  pool: ssd
    id: 4664139631714187601
 state: ONLINE
status: The pool was last accessed by another system.
action: The pool can be imported using its name or numeric identifier and
        the '-f' flag.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-EY
config:
 
        ssd                                                      ONLINE
          mirror-0                                               ONLINE
            nvme-CT500P310SSD8_25164FAF1C4F                      ONLINE
            nvme-Samsung_SSD_970_EVO_Plus_500GB_S4EVNF0M698680B  ONLINE
 
  pool: hdd
    id: 3786879799329643345
 state: ONLINE
status: The pool was last accessed by another system.
action: The pool can be imported using its name or numeric identifier and
        the '-f' flag.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-EY
config:
 
        hdd                         ONLINE
          raidz2-0                  ONLINE
            wwn-0x5000cca0bbe63bb3  ONLINE
            wwn-0x5000cca0bbe68f2e  ONLINE
            wwn-0x5000cca0bbe693ad  ONLINE
            wwn-0x5000cca0bbf58929  ONLINE

Create the mount points:

mkdir /data /services

Import each pool:

zpool import -f hdd
zpool import -f ssd

Check that the pools are imported:

zpool list

Upgrade a pool

When checking the status of a pool (e.g. zpool status ssd), you may see this message:

Some supported and requested features are not enabled on the pool.
The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.

This happens because the ZFS version running on the host is newer than the one that created the pool. This is normal, and the pool continues to work because ZFS is backward compatible with older pools.

To upgrade the pool, run:

  zpool upgrade ssd

This procedure is quick and safe. Only the pool metadata is upgraded, not the data. The only downside is that, while ZFS is backward compatible, a system running a version of ZFS older than the pool will not be able to import it.
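
If you want to check for this from a script (for example before deciding whether to upgrade), a small sketch is to look for the hint in the zpool status output. pool_needs_upgrade is a hypothetical helper name, and the matched sentence is the one quoted above; it may be worded differently in other ZFS versions.

```shell
#!/bin/bash
# Succeeds (exit status 0) when the `zpool status` text on stdin
# contains the "zpool upgrade" hint quoted above.
pool_needs_upgrade() {
    grep -q "Enable all features using 'zpool upgrade'"
}

# On a real system (requires ZFS):
#   zpool status ssd | pool_needs_upgrade && echo "ssd can be upgraded"
```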

JavaScript

apt install nodejs npm

This section is a work in progress.

Integrate NVM in the mix

PM2 process manager

Install:

npm install -g pm2

Logrotate is a module for PM2 that automatically manages and rotates log files to prevent them from consuming too much disk space.

pm2 install pm2-logrotate

To preserve the running services after a reboot:

pm2 startup
pm2 save

Run pm2 save after adding or removing services to preserve the configuration after a reboot.

Other useful tools

To monitor system resources:

apt install lm-sensors smartmontools btop

To easily explore the filesystem and see what’s taking up space:

apt install ncdu

Configuration for the bastion server

Follow the Base OS and environment section
Follow the Containerization setup section
No need for ZFS on the bastion server

Install Pangolin

curl -fsSL https://digpangolin.com/get-installer.sh | bash
./installer

Do you want to install Pangolin as a cloud-managed (beta) node? (yes/no): no
Enter your base domain (no subdomain e.g. example.com): barillot.net
Enter the domain for the Pangolin dashboard (default: pangolin.barillot.net):
Enter email for Let's Encrypt certificates: letsencrypt@barillot.net
Do you want to use Gerbil to allow tunneled connections (yes/no) (default: yes):
 
=== Email Configuration ===
Enable email functionality (SMTP) (yes/no) (default: no):
 
=== Advanced Configuration ===
Is your server IPv6 capable? (yes/no) (default: yes):
 
=== Generating Configuration Files ===
 
Configuration files created successfully!
 
=== Starting installation ===
Would you like to install and start the containers? (yes/no) (default: yes):
Would you like to run Pangolin as Docker or Podman containers? (default: docker):

Add captcha

We will configure the Traefik CrowdSec bouncer plugin to serve a Cloudflare Turnstile CAPTCHA challenge instead of a plain 403 when CrowdSec issues a captcha decision.

Get Turnstile API keys

Log in to dash.cloudflare.com
In the left sidebar, go to Turnstile
Click Add widget
Give it a name and add your domain(s)
Copy the Site Key and Secret Key

Download the captcha HTML template

The bouncer plugin needs an HTML file to render the Turnstile widget. Download it into your Traefik config directory:

curl -o ./config/traefik/captcha.html \
  https://raw.githubusercontent.com/maxlerebourg/crowdsec-bouncer-traefik-plugin/main/examples/captcha/captcha.html

Mount the HTML file into the Traefik container

In your docker-compose.yml, add a volume mount to the traefik service:

traefik:
  volumes:
    - ./config/traefik:/etc/traefik:ro
    - ./config/letsencrypt:/letsencrypt
    - ./config/traefik/logs:/var/log/traefik
    - ./config/traefik/captcha.html:/captcha.html

Restart the container:

docker compose down && docker compose up -d

Add captcha settings to the CrowdSec middleware

In ./config/traefik/dynamic_config.yml, add the following lines to the crowdsec plugin block:

http:
  middlewares:
    crowdsec:
      plugin:
        crowdsec:
          # ... your existing config ...
          captchaProvider: turnstile
          captchaSiteKey: YOUR_SITE_KEY
          captchaSecretKey: YOUR_SECRET_KEY
          captchaGracePeriodSeconds: 1800
          captchaHTMLFilePath: /captcha.html

Test it

Add a temporary captcha decision against your own IP:

docker exec crowdsec cscli decisions add --ip YOUR_IP --duration 2m --type captcha --reason "testing turnstile"

Configuration for the backup server

Follow the Base OS and environment section
Follow the Containerization setup section

Install ZFS

Add backports to your sources.list:

apt install lsb-release
codename=$(lsb_release -cs);echo "deb http://deb.debian.org/debian $codename-backports main contrib non-free"|tee -a /etc/apt/sources.list

Install the packages:

apt update
apt install linux-headers-amd64
apt install -t stable-backports zfsutils-linux

Press Enter in the license agreement prompt.

Create the backup pool

Get the list of disk IDs:

ls -l /dev/disk/by-id/
total 0
lrwxrwxrwx 1 root root  9 Mar 25 20:12 ata-Samsung_SSD_860_EVO_250GB_S3YJNX0M510252Z -> ../../sda
lrwxrwxrwx 1 root root 10 Mar 25 20:12 ata-Samsung_SSD_860_EVO_250GB_S3YJNX0M510252Z-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Mar 25 20:12 ata-Samsung_SSD_860_EVO_250GB_S3YJNX0M510252Z-part2 -> ../../sda2
lrwxrwxrwx 1 root root 10 Mar 25 20:12 ata-Samsung_SSD_860_EVO_250GB_S3YJNX0M510252Z-part3 -> ../../sda3
lrwxrwxrwx 1 root root  9 Mar 25 20:12 ata-WDC_WD200EDGZ-11CNKA0_SCH0RX2S -> ../../sdb
lrwxrwxrwx 1 root root 10 Mar 25 20:12 ata-WDC_WD200EDGZ-11CNKA0_SCH0RX2S-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 13 Mar 25 20:12 mmc-C9A551_0xec1635d0 -> ../../mmcblk0
lrwxrwxrwx 1 root root 15 Mar 25 20:12 mmc-C9A551_0xec1635d0-part1 -> ../../mmcblk0p1
lrwxrwxrwx 1 root root 15 Mar 25 20:12 mmc-C9A551_0xec1635d0-part2 -> ../../mmcblk0p2
lrwxrwxrwx 1 root root 15 Mar 25 20:12 mmc-C9A551_0xec1635d0-part3 -> ../../mmcblk0p3
lrwxrwxrwx 1 root root  9 Mar 25 20:12 usb-WD_Elements_25A3_5343483052583253-0:0 -> ../../sdb
lrwxrwxrwx 1 root root 10 Mar 25 20:12 usb-WD_Elements_25A3_5343483052583253-0:0-part1 -> ../../sdb1
lrwxrwxrwx 1 root root  9 Mar 25 20:12 wwn-0x5000cca425ce6d7d -> ../../sdb
lrwxrwxrwx 1 root root 10 Mar 25 20:12 wwn-0x5000cca425ce6d7d-part1 -> ../../sdb1
lrwxrwxrwx 1 root root  9 Mar 25 20:12 wwn-0x5002538e40fbbf28 -> ../../sda
lrwxrwxrwx 1 root root 10 Mar 25 20:12 wwn-0x5002538e40fbbf28-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Mar 25 20:12 wwn-0x5002538e40fbbf28-part2 -> ../../sda2
lrwxrwxrwx 1 root root 10 Mar 25 20:12 wwn-0x5002538e40fbbf28-part3 -> ../../sda3

We are going to use the external HDD. I could use usb-WD_Elements_25A3_5343483052583253-0:0 or wwn-0x5000cca425ce6d7d.

zpool create \
  -o ashift=12 \
  -O acltype=posixacl \
  -O xattr=sa \
  -O compression=zstd \
  -O atime=off \
  -O relatime=off \
  -O normalization=formD \
  -m /backup \
  backup \
  /dev/disk/by-id/wwn-0x5000cca425ce6d7d
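
The listing above shows several names for the same physical disk. As a rough sketch, the following helper prints every by-id name that points at a given kernel device, which makes it easier to confirm that the usb-, ata- and wwn- entries really are the same drive. ids_for_device is a hypothetical name, and the parsing assumes the ls -l layout shown above.

```shell
#!/bin/bash
# Print every /dev/disk/by-id name that points at the given kernel device.
# Reads `ls -l /dev/disk/by-id/` output on stdin.
# $1 is the device name without /dev/, for example: sdb
ids_for_device() {
    awk -v dev="../../$1" 'NF >= 3 && $NF == dev && $(NF-1) == "->" { print $(NF-2) }'
}

# On a real system:
#   ls -l /dev/disk/by-id/ | ids_for_device sdb
```

Partitions (sdb1 and so on) are not matched, since only exact targets are compared.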

Setup snapshots replication and retention policy

Local setup

For this section, we will assume that the main server and the backup server are initially running on the same local network. We will then use Pangolin to tunnel the SSH connection between the two servers, making remote backups possible. It should be possible to set everything up remotely from the get-go if you configure the pangolin-cli first.

On the main server, create a new user

adduser --disabled-password --gecos "Syncoid replication user" syncoid

Give this user the permissions required to replicate the ZFS pools:

zfs allow syncoid send,snapshot,hold hdd
zfs allow syncoid send,snapshot,hold ssd

On the backup server, create a new SSH key for the syncoid user (don’t set a passphrase):

ssh-keygen -f /root/.ssh/id_syncoid

Because we already blocked password authentication, we can’t use the ssh-copy-id command. Instead we can manually copy the public key to the main server:

cat /root/.ssh/id_syncoid.pub

Paste it on the main server:

mkdir -p /home/syncoid/.ssh
nano /home/syncoid/.ssh/authorized_keys

Set the permissions:

chown -R syncoid:syncoid /home/syncoid/.ssh
chmod 700 /home/syncoid/.ssh
chmod 600 /home/syncoid/.ssh/authorized_keys

At this point, you should be able to connect to the main server from the backup server as the syncoid user:

ssh -i /root/.ssh/id_syncoid syncoid@192.168.70.67 zfs list

Let’s add this connection to the ~/.ssh/config file:

nano ~/.ssh/config

Add the following:

Host melchior
    Hostname 192.168.70.67
    User syncoid
    IdentityFile ~/.ssh/id_syncoid

Now you can connect with just:

ssh melchior

Install Sanoid and Syncoid on the backup server

Install sanoid:

apt install sanoid

If it doesn’t already exist, create the sanoid folder in /etc:

mkdir /etc/sanoid

Edit the config file:

nano /etc/sanoid/sanoid.conf

[backup/hdd]
    use_template = backup
 
[backup/ssd]
    use_template = backup
 
[template_backup]
    hourly = 0
    daily = 0
    weekly = 12
    monthly = 24
    yearly = 6
    autosnap = no
    autoprune = yes

Note the autosnap = no: Sanoid will not take any snapshots on the backup server. The pool will only contain snapshots replicated from the main server, but we still use Sanoid to apply the retention policy.

Then enable Sanoid with this command:

systemctl enable --now sanoid.timer

Check that it’s running properly with

systemctl status sanoid.timer

You can now try to run the replication process manually (this can take a while if the pools are large):

syncoid --recursive --no-privilege-elevation --no-sync-snap melchior:ssd backup/ssd
syncoid --recursive --no-privilege-elevation --no-sync-snap melchior:hdd backup/hdd

Automating the replication process

On the backup server, create a script:

nano /root/run_syncoid.sh

#!/bin/bash
set -e
 
echo "[1/4] Reconnect the external HDD"
zpool import backup
 
echo "[2/4] Replicate the snapshots from the main server"
syncoid --recursive --no-privilege-elevation --no-sync-snap melchior:ssd backup/ssd
syncoid --recursive --no-privilege-elevation --no-sync-snap melchior:hdd backup/hdd
 
echo "[3/4] Prune old/unwanted local snapshots"
sanoid --cron --verbose
 
echo "[4/4] Disconnect the Zpool and make the external HDD enter sleep mode"
zpool export backup
hdparm -y /dev/disk/by-id/wwn-0x5000cca425ce6d7d
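
One thing to be aware of with set -e: if syncoid fails halfway, the script stops before the export step, so the pool stays imported and the drive keeps spinning. Below is a sketch of one way around that, where the export always runs and the first failure is still reported to systemd through the exit status. ZPOOL, SYNCOID, SANOID and SPINDOWN are hypothetical override variables I added so the flow can be dry-run with stub commands; by default they call the real tools.

```shell
#!/bin/bash
# Sketch of run_syncoid.sh with an unconditional export step.
# The four variables below are assumptions of this sketch, not part of the
# original script. They default to the real commands.
ZPOOL="${ZPOOL:-zpool}"
SYNCOID="${SYNCOID:-syncoid}"
SANOID="${SANOID:-sanoid}"
SPINDOWN="${SPINDOWN:-hdparm -y /dev/disk/by-id/wwn-0x5000cca425ce6d7d}"

run_backup() {
    local status=0

    echo "[1/4] Reconnect the external HDD"
    $ZPOOL import backup || return 1

    echo "[2/4] Replicate the snapshots, then [3/4] prune old ones"
    {
        $SYNCOID --recursive --no-privilege-elevation --no-sync-snap melchior:ssd backup/ssd &&
        $SYNCOID --recursive --no-privilege-elevation --no-sync-snap melchior:hdd backup/hdd &&
        $SANOID --cron --verbose
    } || status=$?

    echo "[4/4] Export the pool and spin down the drive (always runs)"
    $ZPOOL export backup || status=$?
    $SPINDOWN || true

    # Non-zero if replication, pruning or the export failed.
    return "$status"
}

# In a real script, the last line would simply be: run_backup
```

The trade-off is that a failed replication no longer stops the pruning step from being skipped silently: it is skipped, but the non-zero exit status still marks the systemd unit as failed.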

Let’s create the systemd services and timers:

nano /etc/systemd/system/syncoid.service

[Unit]
Description=Syncoid backup to external drive
After=network.target
 
[Service]
Type=oneshot
ExecStart=/root/run_syncoid.sh
User=root
StandardOutput=journal
StandardError=journal

nano /etc/systemd/system/syncoid.timer

[Unit]
Description=Run Syncoid backup weekly
 
[Timer]
OnCalendar=weekly
Persistent=true
 
[Install]
WantedBy=timers.target

And finally enable and start the timer. The service itself has no [Install] section, so there is nothing to enable; the timer is what starts it:

systemctl daemon-reload
systemctl enable --now syncoid.timer

Tunneled SSH connection

Because we want offsite backups, we need to tunnel the SSH connection between the main server and the backup server. We already have Pangolin, so we can use it to tunnel the SSH connection.

In the Pangolin dashboard, go to Networks -> Clients -> Machines -> Create client. Give it a name and select Docker as the “Operating system”. Copy the “Commands” and paste them into a docker-compose.yml file on the backup server.

Confirm the creation on the Pangolin dashboard and then run the container.

docker compose up -d

The client should be reported as “Connected” now.

Now let’s create a Private Resource to allow the backup server to access the main server via SSH. In the Pangolin dashboard, go to Networks -> Resources -> Private -> Add resource.

Give it a name and select the main server site. Because Newt is running in a Docker container, the destination needs to be host.docker.internal. Give it an Alias even if it will not be used (e.g. melchior.internal).

In Port Restrictions, use TCP Custom 22, UDP Blocked. You can keep ICMP.

In the Access Policy tab, select the Machine Client we created earlier.

After saving, toggle the Alias Address column to see the IP address to use to connect to the main server.

Copy it and try connecting to the main server via SSH from the backup server:

ssh 100.96.128.8

If it asks for a password or gives a normal SSH response, that’s a good sign.

Edit /root/.ssh/config and update the Hostname:

Host melchior
    Hostname 100.96.128.8
    User syncoid
    IdentityFile ~/.ssh/id_syncoid

And this should now work remotely, even if the main server is not on the same network as the backup server:

syncoid --recursive --no-privilege-elevation --no-sync-snap melchior:ssd backup/ssd
syncoid --recursive --no-privilege-elevation --no-sync-snap melchior:hdd backup/hdd

Hardening the SSH connection

Great, but in case the backup server is compromised, we want to make sure this SSH connection cannot be used to execute arbitrary commands on the main server. Let’s first log which commands are required by syncoid to run properly:

nano /home/syncoid/.ssh/authorized_keys

Prepend the following to the line containing the public key:

command="echo \"$SSH_ORIGINAL_COMMAND\" >> /home/syncoid/ssh_commands.log && exec /bin/bash -c \"$SSH_ORIGINAL_COMMAND\"",

Run syncoid again from the backup server and check the log file:

cat /home/syncoid/ssh_commands.log

exit
echo -n
command -v lzop
command -v mbuffer
zpool get -o value -H feature@extensible_dataset 'hdd'
zfs list -o name,origin -t filesystem,volume -Hr 'hdd'
zfs get -H syncoid:sync 'hdd'
zfs get -Hpd 1 -t snapshot guid,creation 'hdd'
zfs send -nvP -I 'hdd@autosnap_2026-03-23_23:30:27_weekly' 'hdd@autosnap_2026-03-28_10:00:10_hourly'
zfs send  -I 'hdd'@'autosnap_2026-03-23_23:30:27_weekly' 'hdd'@'autosnap_2026-03-28_10:00:10_hourly' | lzop  | mbuffer  -q -s 128k -m 16M

We will now create a filter script to block all commands except the ones required by syncoid.

nano /home/syncoid/ssh_filter.sh

#!/bin/bash
set -e
 
LOGFILE="/home/syncoid/ssh_commands.log"
CMD="${SSH_ORIGINAL_COMMAND#"${SSH_ORIGINAL_COMMAND%%[![:space:]]*}"}"
 
log() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') [$1] $CMD" >> "$LOGFILE"
}
 
if [ -z "$CMD" ]; then
    echo "No command provided" >&2
    exit 1
fi
 
case "$CMD" in
    # Connection checks
    "exit")
        log "ALLOWED"
        exit 0
        ;;
    "echo -n")
        log "ALLOWED"
        echo -n
        exit 0
        ;;
    # Tool availability checks
    "command -v lzop"|"command -v mbuffer")
        log "ALLOWED"
        exec /bin/bash -c "$CMD"
        ;;
    # ZFS/zpool commands
    "zfs "*|"zpool "*)
        log "ALLOWED"
        exec /bin/bash -c "$CMD"
        ;;
    *)
        log "BLOCKED"
        logger -t syncoid-ssh "BLOCKED command: $CMD"
        echo "Command not allowed: $CMD" >&2
        exit 1
        ;;
esac

Make sure the script cannot be edited by the syncoid user, but is still executable:

chown root:root /home/syncoid/ssh_filter.sh
chmod 755 /home/syncoid/ssh_filter.sh
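
Note that the "zfs "* and "zpool "* patterns still pass the whole string to bash -c, so something like zfs list; rm -rf / would get through. The ZFS permissions delegated to the syncoid user limit the damage, but the shell itself is not restricted. Here is a sketch of a stricter check that could be called from the filter script before the exec. is_allowed_command is a hypothetical helper, and its program whitelist is based only on the commands seen in my log above; if your syncoid version sends other commands (another compressor, for example), it would need to be extended.

```shell
#!/bin/bash
# Returns 0 when a command looks like one of the logged syncoid commands,
# 1 otherwise. Assumption: only zfs, zpool, lzop and mbuffer are needed.
is_allowed_command() {
    local cmd="$1" rest stage
    local nl='
'
    [ -n "$cmd" ] || return 1

    # Exact matches for the connection and tool checks.
    case "$cmd" in
        "exit"|"echo -n"|"command -v lzop"|"command -v mbuffer") return 0 ;;
    esac

    # Refuse anything that can chain, substitute or redirect.
    case "$cmd" in
        *';'*|*'&'*|*'`'*|*'$'*|*'<'*|*'>'*|*"$nl"*) return 1 ;;
    esac

    # Every stage of the pipeline must start with an expected program.
    rest=$cmd
    while :; do
        stage=${rest%%|*}
        set -f          # no globbing while splitting the stage into words
        set -- $stage   # $1 is now the first word of this stage
        set +f
        case "${1:-}" in
            zfs|zpool|lzop|mbuffer) ;;
            *) return 1 ;;
        esac
        case "$rest" in
            *'|'*) rest=${rest#*|} ;;
            *) return 0 ;;
        esac
    done
}
```

In the filter script, the zfs/zpool branch could then become something like: is_allowed_command "$CMD" || { log "BLOCKED"; exit 1; } before the exec line. This is deliberately conservative: a quoted program name or a pipe character inside a dataset name would be rejected too.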

Edit the authorized keys file again:

nano /home/syncoid/.ssh/authorized_keys

Replace the previous command= option with the following:

command="~/ssh_filter.sh",

On the backup server, run syncoid again and check if everything is still working.

systemctl start syncoid.service

You can check the logs with:

journalctl -u syncoid

Scrubbing the external HDD

Because the external HDD is only running when the replication process is running, it will never be available for scrubbing. Let’s schedule a monthly scrub of the pool.

On the backup server, create a script:

nano /root/run_scrub.sh

#!/bin/bash
set -e
 
echo "[1/4] Reconnect the external HDD"
zpool import backup
 
echo "[2/4] Scrub the Zpool"
zpool scrub backup
 
echo "[3/4] Wait for scrub to complete"
while zpool status backup | grep -q "scrub in progress"; do
    sleep 30
done
 
echo "[4/4] Disconnect the Zpool and make the external HDD enter sleep mode"
zpool export backup
hdparm -y /dev/disk/by-id/wwn-0x5000cca425ce6d7d
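
The script above waits for the scrub but does not look at its result. As a small sketch, the error count can be pulled out of the scan line of zpool status so the script can exit with a failure when it is not zero. scrub_error_count is a hypothetical helper, and the pattern assumes the scan line format shown elsewhere in this guide ("scrub repaired 0B in 02:06:26 with 0 errors on ...").

```shell
#!/bin/bash
# Extract the error count from the "scan:" line of `zpool status`.
# Reads the status text on stdin and prints the number, or nothing if
# no finished scrub is reported.
scrub_error_count() {
    sed -n 's/.*scrub repaired .* with \([0-9][0-9]*\) errors.*/\1/p'
}

# Possible use between steps 3 and 4 (requires ZFS):
#   errors=$(zpool status backup | scrub_error_count)
#   [ "$errors" = "0" ] || echo "Scrub reported ${errors:-unknown} errors" >&2
```

Printing to stderr is enough for the message to show up in journalctl -u scrub; making the unit fail would additionally require a non-zero exit after the export step.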

Let’s create the systemd services and timers:

nano /etc/systemd/system/scrub.service

[Unit]
Description=ZFS scrub on backup pool
After=network.target
 
[Service]
Type=oneshot
ExecStart=/root/run_scrub.sh
User=root
StandardOutput=journal
StandardError=journal

nano /etc/systemd/system/scrub.timer

[Unit]
Description=Run ZFS scrub monthly
 
[Timer]
OnCalendar=monthly
Persistent=true
 
[Install]
WantedBy=timers.target

And finally enable and start the timer. The service itself has no [Install] section, so there is nothing to enable; the timer is what starts it:

systemctl daemon-reload
systemctl enable --now scrub.timer

Upgrading Newt

Because the backup server is remote and we can only access it via the Newt tunnel, updating Newt could lead to “cutting the branch on which one is sitting”. If we simply run

docker compose pull
docker compose down && docker compose up

it will stop the Newt tunnel and the SSH connection will drop mid-command, so docker compose up will never be executed.

There are a few solutions to this problem. We will use a simple systemd oneshot service to run the update process:

nano /etc/systemd/system/upgrade-newt.service

[Unit]
Description=Upgrade Newt
 
[Service]
WorkingDirectory=/services/newt
Type=oneshot
ExecStart=docker compose pull
ExecStart=docker compose down
ExecStart=docker compose up -d

Whenever a new version is available, you can run:

systemctl start upgrade-newt

Unfortunately, you will still be disconnected:

Read from remote host 100.96.128.8: Connection reset by peer
Connection to 100.96.128.8 closed.
client_loop: send disconnect: Broken pipe

But you should be able to reconnect immediately. After you do, you can check that everything worked properly by checking the logs:

journalctl --since "1 hour ago" -u upgrade-newt

Deploying services

This section is a work in progress.

Come back later :)

Troubleshooting

DNS resolution issues

Apparently at startup, if the container starts before the host’s DNS resolution is ready, it will fail, and sometimes it’s unrecoverable unless you restart the container. To fix this, you can tell docker to use the host’s DNS servers by adding the following to the docker compose file:

volumes:
  - /etc/resolv.conf:/etc/resolv.conf:ro

Alternatively, you can also set DNS servers directly in the container’s network settings:

dns:
  - 1.1.1.1
  - 8.8.8.8

Make sure to restart the container after adding the DNS servers.

docker compose down && docker compose up -d

Hardening

This section is a work in progress.

Come back later :)

services:
  service-name:
    image: image-name:latest
    container_name: service-name
    restart: unless-stopped
 
    user: "1000:1000"
 
    mem_limit: 512m
    cpus: 0.5
    pids_limit: 100
 
    read_only: true
    tty: false
    stdin_open: false
    cap_drop:
      - ALL
    security_opt:
      - no-new-privileges:true
    tmpfs:
      - /tmp:rw,noexec,nosuid,nodev,size=128m,uid=1000,gid=1000,mode=1777
 
    volumes:
      - ./volume-name:/volume-name
 
    env_file: .env
 
    logging:
      driver: loki
      options:
        loki-url: "http://localhost:3100/loki/api/v1/push"
        loki-retries: 2
        loki-max-backoff: 800ms
        loki-timeout: 1s
        keep-file: "true"
        mode: "non-blocking"
 
    networks:
      - pangolin_net
 
networks:
  internal_net:
    name: internal_net
    driver: bridge
    internal: true
  pangolin_net:
    external: true

About tmpfs params:

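noexec: Prevents binaries stored on this mount from being executed.
nosuid: Ignores setuid/setgid bits on this mount, which prevents privilege escalation through them.
nodev: Prevents character/block device files on this mount from being used as devices.
size: Limits the maximum size, which helps prevent a process from filling up the host memory.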
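
As a quick sanity check when writing these entries by hand, here is a sketch of a tiny lint that reports which of the three flags are missing from a tmpfs entry. missing_tmpfs_flags is a hypothetical helper, and it only understands the path:options form used in this guide. A missing flag is not necessarily a mistake (some paths legitimately need exec), so treat the output as a reminder rather than a rule.

```shell
#!/bin/bash
# Print the hardening flags (noexec, nosuid, nodev) that are absent from a
# tmpfs entry written as "<path>:<comma-separated options>".
missing_tmpfs_flags() {
    local opts flag
    opts=",${1#*:},"     # keep only the options, wrapped in commas
    for flag in noexec nosuid nodev; do
        case "$opts" in
            *",$flag,"*) ;;
            *) printf '%s\n' "$flag" ;;
        esac
    done
}

# Example:
#   missing_tmpfs_flags '/tmp:rw,nosuid,size=64m'
```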

Users, groups, and permissions

For each service, you should create a dedicated user:

groupadd --gid 2000 service-name
useradd --no-create-home --shell /usr/sbin/nologin --uid 2000 --gid 2000 service-name

Structure and permissions:

/services/service-name/
├── docker-compose.yml # owned by root:root and chmod 640
├── .env # optional, owned by root:root and chmod 600
└── mounts/ # optional, owned by root:root and chmod 750
    └── mounts-name/ # owned by service-name:service-name and chmod 750

And then be careful with the permissions within the mounts themselves.

To set all files to 640 and directories to 750 recursively from the current directory, you can use:

chmod -R 640 .
chmod -R u+X,g+X .
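
The two chmod commands above rely on running as root: the first one temporarily removes the execute bit from the directories, which may prevent a regular user from descending into them. A sketch of an alternative that sets directories and files separately with find, and therefore never leaves a directory without its execute bit (normalize_perms is a hypothetical helper name):

```shell
#!/bin/bash
# Set every directory under $1 to 750 and every regular file to 640.
normalize_perms() {
    find "$1" -type d -exec chmod 750 {} +
    find "$1" -type f -exec chmod 640 {} +
}

# Example, from the service directory:
#   normalize_perms .
```

Unlike the u+X variant, this also strips the execute bit from files that had it, so scripts inside the mounts need to be made executable again afterwards.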

You can check the resource usage of the containers with:

docker stats

Nginx

server {
    listen 80;
    server_name _;
 
    root /usr/share/nginx/html;
    index index.html;
 
    location / {
        try_files $uri $uri/ =404;
    }
}

tmpfs:
  - /tmp:rw,noexec,nosuid,nodev,size=128m,uid=1000,gid=1000,mode=1777
  - /var/cache/nginx:rw,noexec,nosuid,nodev,size=128m,uid=1000,gid=1000,mode=0755
  - /var/run:rw,nosuid,nodev,size=128m,uid=1000,gid=1000,mode=0755
 
volumes:
  - ./default.conf:/etc/nginx/conf.d/default.conf:ro

Monitoring and alerting

The Prometheus/Grafana/Loki stack

This section is a work in progress.

Come back later :)

Configure Loki with Docker plugin

docker plugin install grafana/loki-docker-driver:3.3.2-amd64 --alias loki --grant-all-permissions

Add the following to each docker compose service:

logging:
  driver: loki
  options:
    loki-url: "http://localhost:3100/loki/api/v1/push"
    loki-retries: 2
    loki-max-backoff: 800ms
    loki-timeout: 1s
    keep-file: "true"
    mode: "non-blocking"

Avoid logging Loki or Grafana’s containers as it can create a feedback loop.

Sensors and IoT

SONOFF Zigbee 3.0 USB Dongle Plus MG24, Gateway with EFR32MG24
SONOFF S60ZBTPF Zigbee Smart Plug
SONOFF SNZB-02D Mini ZigBee Smart Temperature Humidity Sensor

You should see the dongle listed in:

ls -l /dev/serial/by-id

lrwxrwxrwx 1 root root 13 Mar 15 12:22 usb-SONOFF_SONOFF_Dongle_Plus_MG24_a4df9b88eba2ef11a7a7906661ce3355-if00-port0 -> ../../ttyUSB0

Create a docker-compose.yml file:

services:
  mosquitto:
    image: eclipse-mosquitto
    container_name: mosquitto
    restart: unless-stopped
    ports:
      - 1883:1883
    volumes:
      - ./mosquitto:/mosquitto/data
    networks:
      - zigbee_net
 
  zigbee2mqtt:
    container_name: zigbee2mqtt
    image: ghcr.io/koenkk/zigbee2mqtt
    restart: unless-stopped
    depends_on:
      - mosquitto
    volumes:
      - ./mounts/data:/app/data
      - /run/udev:/run/udev:ro
    ports:
      - 8080:8080
    environment:
      - TZ=Europe/Paris
    devices:
      - /dev/serial/by-id/usb-SONOFF_SONOFF_Dongle_Plus_MG24_a4df9b88eba2ef11a7a7906661ce3355-if00-port0:/dev/zigbee
    networks:
      - pangolin_net
      - zigbee_net
 
networks:
  zigbee_net:
    name: zigbee_net
    driver: bridge
    internal: true
  pangolin_net:
    external: true

Connect to the frontend

Select the dongle in the “Found Devices” dropdown

Documentation

Labeling the drives

In the event of a drive failure, it’s important to be able to identify the drive quickly. If you accidentally replace the wrong drive, you could end up with data loss. When a drive fails, ZFS will point out the faulty drive by its ID. So we need to match the ID with the drive serial number and location in the server.

zpool status

  pool: hdd
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 02:06:26 with 0 errors on Sun Mar  8 02:30:27 2026
config:
 
        NAME                        STATE     READ WRITE CKSUM
        hdd                         ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
            wwn-0x5000cca0bbe63bb3  ONLINE       0     0     0
            wwn-0x5000cca0bbe68f2e  ONLINE       0     0     0
            wwn-0x5000cca0bbe693ad  ONLINE       0     0     0
            wwn-0x5000cca0bbf58929  ONLINE       0     0     0
 
errors: No known data errors
 
  pool: ssd
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 00:00:35 with 0 errors on Sun Mar  8 00:24:44 2026
config:
 
        NAME                                                     STATE     READ WRITE CKSUM
        ssd                                                      ONLINE       0     0     0
          mirror-0                                               ONLINE       0     0     0
            nvme-CT500P310SSD8_25164FAF1C4F                      ONLINE       0     0     0
            nvme-Samsung_SSD_970_EVO_Plus_500GB_S4EVNF0M698680B  ONLINE       0     0     0

Let’s check the serial numbers of the drives:

apt-get install hdparm

hdparm -I /dev/disk/by-id/wwn-0x5000cca0bbe63bb3 | grep Number:

For NVMe you can use:

smartctl -a /dev/disk/by-id/nvme-CT500P310SSD8_25164FAF1C4F | grep Number:

So now we can match the ID with the drive serial number.

Zpool	ID	Model Number	Serial Number
hdd	wwn-0x5000cca0bbe63bb3	HGST HUS728T8TALE604	VDJR3AGK
hdd	wwn-0x5000cca0bbe68f2e	HGST HUS728T8TALE604	VDJRUKVK
hdd	wwn-0x5000cca0bbe693ad	HGST HUS728T8TALE604	VDJRVSZK
hdd	wwn-0x5000cca0bbf58929	HGST HUS728T8TALE604	VDKTSX3K
ssd	nvme-CT500P310SSD8…	CT500P310SSD8	25164FAF1C4F
ssd	nvme-Samsung_SSD_970_EVO…	Samsung SSD 970 EVO Plus 500GB	S4EVNF0M698680B

To match the serial number with the location in the server, we need to carefully note in which bay the drive was inserted.

Row	Column 1	Column 2
Row 1	VDJR3AGK	VDJRVSZK
Row 2	—	—
Row 3	—	—
Row 4	VDKTSX3K	VDJRUKVK

And finally, we can do the same for the SSDs:

The CT500P310SSD8 is placed at the bottom right M.2 slot, right under the chipset.
The Samsung SSD 970 EVO Plus 500GB is placed on a PCIe to M.2 adapter card.

Now if a disk fails, let’s say wwn-0x5000cca0bbe63bb3, we will be able to know that its serial number is VDJR3AGK and it is located in the first row, first column. We can safely replace it with the spare disk and start the resilvering process.
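
If you prefer to keep this mapping next to the server rather than in a document, here is a sketch of a small lookup script built from the two tables above. drive_location is a hypothetical helper, and the embedded data is specific to my server; yours will differ.

```shell
#!/bin/bash
# Look up the serial number and bay of a drive from its ZFS device ID.
# Columns of the embedded table: ID, serial number, row, column.
drive_location() {
    awk -v id="$1" '$1 == id { print "serial " $2 ", row " $3 ", column " $4 }' <<'EOF'
wwn-0x5000cca0bbe63bb3 VDJR3AGK 1 1
wwn-0x5000cca0bbe68f2e VDJRUKVK 4 2
wwn-0x5000cca0bbe693ad VDJRVSZK 1 2
wwn-0x5000cca0bbf58929 VDKTSX3K 4 1
EOF
}

# Example:
#   drive_location wwn-0x5000cca0bbe63bb3
```

An unknown ID prints nothing, which is a hint that the table needs updating after a drive swap.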

Keeping a list of your service users and groups

Keep track of the users and groups you create for your services. If you have to recreate the server, the UID and GID will need to be exactly the same for the services to work. That’s why it’s important to explicitly set the UID and GID when creating the users and groups.

groupadd --gid 2000 paperless
groupadd --gid 2001 vaultwarden
groupadd --gid 2002 sure
groupadd --gid 2003 mqtt
 
useradd --no-create-home --shell /usr/sbin/nologin --uid 2000 --gid 2000 paperless
useradd --no-create-home --shell /usr/sbin/nologin --uid 2001 --gid 2001 vaultwarden
useradd --no-create-home --shell /usr/sbin/nologin --uid 2002 --gid 2002 sure
useradd --no-create-home --shell /usr/sbin/nologin --uid 2003 --gid 2003 mqtt
 
usermod -aG dialout mqtt
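
To keep the list in one place, a possible approach is to store it as "name id" pairs and generate the commands from it. The sketch below only prints the commands, so they can be reviewed before being run as root. service_user_commands is a hypothetical helper, and it assumes the UID and GID are always identical, as in the list above.

```shell
#!/bin/bash
# Read "name id" pairs on stdin and print the matching groupadd/useradd
# commands. Nothing is executed; the output is meant to be reviewed first.
service_user_commands() {
    local name id
    while read -r name id; do
        [ -n "$name" ] || continue   # skip empty lines
        printf 'groupadd --gid %s %s\n' "$id" "$name"
        printf 'useradd --no-create-home --shell /usr/sbin/nologin --uid %s --gid %s %s\n' "$id" "$id" "$name"
    done
}

# Example:
#   printf '%s\n' 'paperless 2000' 'vaultwarden 2001' | service_user_commands
```

Extra group memberships, such as dialout for mqtt, still have to be added separately with usermod once the user exists.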

Showcase

Todo

Add photos and tests that confirm if the homelab is reaching my initial goals.

Conclusion

Somewhere, talk about NAT loopback / hairpin NAT / NAT reflection.

Reflections

The ASRock AM5 fiasco

So, about a month after I built the main server, I decided to reinstall the OS to make sure every step was perfectly documented. Unfortunately, the server never POSTed after that. The POST status checker LEDs for the CPU and DRAM were constantly on. I tried all the usual troubleshooting steps:

unplugging the PSU and waiting a few minutes
clearing the CMOS
reflashing the BIOS
trying another PSU
taking the motherboard out of the case
trying a minimal setup with only one stick of RAM, the PSU, CPU and motherboard

At that point I grew desperate. It was clear either the CPU, motherboard or RAM was faulty. I even took it to a local repair shop to get it checked. They didn’t go to great lengths to diagnose the issue, but on the other hand, they also didn’t charge me anything: so it’s hard to complain. Their conclusion was that the motherboard was at fault. I was also of the opinion that a motherboard failure was more likely.

After that, I resorted to contacting ASRock support. They told me to try isolating the faulty component by testing a known-good CPU and RAM on the motherboard. Ideally yes, that’s also something I wanted to do, but this is my first AM5 system ever, and I wasn’t going to buy another CPU and RAM just to test it. They recommended that I contact the retailer I bought the motherboard from, which was Amazon.

I went to Amazon’s customer service to ask for a replacement motherboard. Unfortunately, the new motherboard was still not working. I still had the exact same symptoms.

In conclusion, the motherboard was not the issue. Could it be the CPU? The RAM? At that point, I finally decided to check online if anyone else was having the same issue with their AM5 system.

What I found was that I had been living under a rock. Since early 2025, there have been reports of Ryzen 9000 series CPUs failing, especially on ASRock motherboards. There is a megathread on ASRock’s subreddit that nicely summarizes the situation:

Since February 2025, pictures of scorched CPU pins and motherboard sockets have been posted online. The initial investigation suggested that only X3D CPUs and ASRock motherboards were affected. ASRock initially blamed the issue on user error or aggressive overclocking, but later stated that overvoltage due to bad PBO implementations could be the cause. They then released a few BIOS updates to address the issue.
Unfortunately, users have continued to report CPU failures on BIOS versions released after the initial investigation, meaning the root cause is still not identified.
Besides the scorched pins, the reported symptoms match my case perfectly: the system was working fine until a reboot, and since then the POST status checker LEDs for the CPU and DRAM have been constantly on. Seemingly, the motherboard is not damaged; it’s always a CPU failure.
A Korean user reported having the same problem with a B850 Pro RS motherboard and a grand total of 3 Ryzen 7 9700X CPUs, so it seems the problem is not a rare defect in the CPU, but a systemic issue with the motherboard.
Asus also had similar reports, and made an official statement about the issue.
There have been a few cases of CPU failures on other motherboard brands, and even on the 7000-series CPUs. But the low report rate could indicate that these failures are within a normal defect rate.

In conclusion, even a year later, ASRock has not identified the root cause of the issue. I initially chose ASRock because I’ve had good experiences with their products in the past and—along with Asus—they are one of the few manufacturers that support ECC memory on consumer-grade motherboards.

While the failure rate is low, it’s still unacceptable that hardware costing hundreds of dollars could just burn or stop POSTing randomly. I’m not inclined to be ASRock’s beta tester for this long-standing issue.

Also, I’m a little salty that the ASRock technician who responded to my ticket didn’t mention any of this to me. The symptoms I described match the other users’ reports almost perfectly. Knowing that would have saved me the hassle of getting a replacement motherboard just to send it back afterwards. Good thing Amazon’s return policy is pretty generous.

Recommendation

Stay away from 9000-series CPUs until the issue is resolved. If you must buy one, avoid ASRock and Asus motherboards. Keep your motherboard BIOS up to date.
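To know whether the BIOS is up to date, you first need to know which version is installed. On a Linux system such as Debian, the kernel usually exposes that information under `/sys/class/dmi/id`. The following is only a minimal sketch: the `DMI_DIR` variable and the `dmi_field` and `bios_summary` helpers are names made up for this example, and the latest available version still has to be looked up manually on the motherboard vendor’s support page.

```shell
#!/bin/sh
# Print the motherboard and BIOS details exposed by the kernel through sysfs.
# DMI_DIR is an illustrative override (assumption); the default is the usual location.
DMI_DIR="${DMI_DIR:-/sys/class/dmi/id}"

# Print the content of one DMI field, or "unknown" if it is not readable.
dmi_field() {
    if [ -r "$DMI_DIR/$1" ]; then
        cat "$DMI_DIR/$1"
    else
        echo "unknown"
    fi
}

# Print a short summary to compare against the vendor's download page.
bios_summary() {
    echo "Board:        $(dmi_field board_vendor) $(dmi_field board_name)"
    echo "BIOS version: $(dmi_field bios_version)"
    echo "BIOS date:    $(dmi_field bios_date)"
}

bios_summary
```

Running this occasionally, for example after reading about a new BIOS release, makes it easy to tell whether the installed version is older than the one published by the vendor.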

So what about the homelab? After this catastrophic failure, I ended up cannibalizing my gaming PC to keep the homelab running. Now the server is running on a Ryzen 5 5500 CPU and an Asus TUF Gaming B550-PLUS motherboard.

Other things

Overkill much
ECC seems overkill
Useless second PCIe slot because of the PSU
A slightly longer case would make building it easier and could provide better airflow.

Future improvements

Main server
- Upgrade to 64 GB of RAM
- Add a 5.5” AMOLED touch screen in the full-height 5.5” bay in the front of the rack mounted case
  - https://fr.aliexpress.com/item/1005004285318699.html
Backup server
- Use ECC memory. This would require a new motherboard and memory sticks. The CPU already supports ECC.

The end is never the end is never…

Experiment with NixOS
Back up the password database to a USB drive securely
