Table of Contents
- 1. Introduction
- 2. Considerations and planning
- 2.1. Goals
- 2.2. How to make services accessible from the internet
- 2.3. Requirements overview
- 2.4. Operating System (OS)
- 2.5. Virtual Private Server (VPS)
- 2.6. Tunneling solution
- 2.7. Virtualization and containerization
- 2.8. File system
- 2.8.1. Introduction to ZFS
- 2.8.2. Protection against hardware failure
- 2.8.3. Integrity checks and automatic repairs
- 2.8.4. Extensibility
- 2.8.5. Hot spares
- 2.8.6. Portability
- 2.8.7. Compression
- 2.9. Backups
- 2.9.1. The “3-2-1” rule
- 2.10. Memory
- 2.11. Uninterruptible Power Supply (UPS)
- 2.12. Monitoring
- 2.13. Scope
- 2.14. Threats and mitigations
- 3. Choosing and setting up the hardware
- 3.1. Buying prebuilt vs building your own
- 3.2. Core components for the main server
- 3.2.1. CPU
- 3.2.2. GPU
- 3.2.3. Motherboard
- 3.2.4. Memory
- 3.2.4.1. CPU
- 3.2.4.2. Motherboard
- 3.2.4.3. Memory stick
- 3.2.5. Storage
- 3.2.5.1. Shucking drives
- 3.2.6. PSU
- 3.2.7. Cases
- 3.2.7.1. Tower cases
- 3.2.7.2. Rack mounted
- 3.2.8. Complete builds
- 3.2.8.1. Build A: cheapest AM4, no ECC
- 3.2.8.2. Build B: AM4, with ECC memory
- 3.2.8.3. Build C: AM5, with ECC memory
- 3.3. Core components for the backup server
- 3.4. Networking and connectivity
- 3.5. UPS
- 3.6. Setting up the physical hardware
- 4. Configuring the software
- 4.1. Base OS and environment
- 4.1.1. Installing Debian
- 4.1.2. Set static IP
- 4.1.3. Synchronize time
- 4.1.4. Setup SSH and local login
- 4.1.4.1. Use a SSH key instead of a password
- 4.1.4.2. Change the non-root and root passwords
- 4.2. Containerization setup
- 4.2.1. Installing Docker
- 4.2.2. Container orchestration
- 4.3. Configuration for the main server
- 4.3.1. Install ZFS
- 4.3.2. Create the HDD pool
- 4.3.3. Create the SSD pool
- 4.3.4. Setup ZFS snapshots and retention policy
- 4.3.5. Other useful ZFS commands
- 4.3.5.1. Rename a pool
- 4.3.5.2. Change mountpoint
- 4.3.5.3. Import pool from another server
- 4.3.5.4. Upgrade a pool
- 4.3.6. JavaScript
- 4.3.7. PM2 process manager
- 4.3.8. Other useful tools
- 4.4. Configuration for the bastion server
- 4.4.1. Install Pangolin
- 4.4.2. Add captcha
- 4.4.2.1. Get Turnstile API keys
- 4.4.2.2. Download the captcha HTML template
- 4.4.2.3. Mount the HTML file into the Traefik container
- 4.4.2.4. Add captcha settings to the CrowdSec middleware
- 4.4.2.5. Test it
- 4.5. Configuration for the backup server
- 4.5.1. Install ZFS
- 4.5.2. Create the backup pool
- 4.5.3. Setup snapshots replication and retention policy
- 4.5.3.1. Install Sanoid and Syncoid on the backup server
- 4.5.3.2. Automating the replication process
- 4.5.3.3. Tunneled SSH connection
- 4.5.3.4. Hardening the SSH connection
- 4.5.4. Scrubbing the external HDD
- 4.5.5. Upgrading Newt
- 4.6. Deploying services
- 4.6.1. Troubleshooting
- 4.6.1.1. DNS resolution issues
- 4.6.2. Hardening
- 4.6.2.1. Users, groups, and permissions
- 4.6.2.2. Resource sharing
- 4.6.2.3. Nginx
- 4.7. Monitoring and alerting
- 4.7.1. The Prometheus/Grafana/Loki stack
- 4.7.2. Configure Loki with Docker plugin
- 4.7.3. Sensors and IoT
- 5. Documentation
- 5.1. Labeling the drives
- 5.2. Keeping a list of your service users and groups
- 6. Showcase
- 7. Conclusion
- 7.1. Reflections
- 7.1.1. The ASRock AM5 fiasco
- 7.1.2. Other things
- 7.2. Future improvements
- 7.3. The end is never the end is never…
- 7.4. Other resources
Introduction
Hi! Do you enjoy spending time, money, and energy tinkering with servers at home when you could get paid for it? Or maybe you already are a sysadmin, but you want to learn and experiment in a more casual environment? Oooor maybe, you want to stick it to the tech overlords and take back control of your data and everyday tools?
Whatever your reason, you are not alone. I’ve been feeling the same pull toward owning more of my digital life. Part of it is the simple joy of building things, but also, I can’t help but notice how the internet is changing.
Between the UK’s Online Safety Act, the various laws passing in the US, and the recurring “Chat Control” proposals in the EU, it’s impossible to ignore the trend: anonymity and privacy are increasingly under threat.
And on top of that, we’re watching major platforms go through the familiar cycle of enshittification. Propelled by an established and enthusiastic userbase, they go public or get bought up by a bigger company. Months or years later, shareholders or the acquiring company start pushing the platform to squeeze out more profit. That’s when the decline begins: long-standing features disappear without warning, or get locked behind a “premium” subscription tier. Ads creep in, your data becomes training fodder for AI models, and open ecosystems are walled off by blocking third-party clients and tools.
In the face of such trends, it’s up to us to reclaim autonomy over our digital data and the everyday tools we depend on.
At its core, a homelab is a personal playground where you can learn and experiment with servers, applications, and services. It goes hand in hand with another concept, self-hosting: the practice of running services yourself, on hardware you control, instead of relying on big tech. Together, they mean that you can build your own IT infrastructure, tailor the tools and services you use to your needs, and keep your data safe and private.
If this sounds interesting to you, welcome aboard! I would like to share my journey with self-hosting, raise awareness about important considerations, and guide you through the process. As an appetizer, here’s the list of services I like to run on mine:
- Vaultwarden: a lightweight password manager, alternative to Bitwarden.
- File browser: organize, share, and control access to your files.
- Syncthing: synchronize files across devices, perfect for personal backups.
- Jellyfin: stream movies and series privately, like a personal Netflix.
- Navidrome: private music streaming, similar to Spotify.
- Immich: manage and organize your photos and videos, like a private Google Photos.
- Sure: track and categorize expenses to manage your finances.
- Paperless: organize, search, and share documents, perfect for archiving invoices, medical records, and other important files.
- This very blog: my own personal space to express myself and share my knowledge.
With the basics out of the way and a taste of what’s possible, it’s time to roll up our sleeves. Let’s start building a homelab!
I wrote this guide as a way to document my journey. As such, it was written over the course of multiple months while building the homelab. I ended up having some trouble, especially with my hardware choices. Make sure to read the conclusion section before making any decisions.
Considerations and planning
When starting any large project, it’s important to define your goals. They will guide you through the entire process and help you make decisions.
- Goals are what you want to achieve with the project. They are meant to be broad and high-level.
- Requirements translate the goals into concrete actions. They may also include additional constraints.
- Scope sets the boundaries of the project. I find it useful to define that boundary by listing what’s out of the scope.
Let’s start with the goals.
Goals
As discussed in the intro, privacy and digital autonomy are key motivations for many people. They may have had one too many bad experiences with enshittification or other egregious decisions made by the companies they rely on. They want to use services that are under their control, evolve at their pace, and don’t vanish because a company changed direction.
But for others, it can also start as a simple curiosity and a desire to learn: a way to deepen your understanding of Linux, networking, coding, or infrastructure by actually running the things you usually only read about. A homelab is a safe playground where you can experiment, break things, fix them, and come away with real skills.
The appeal can also be practical. Maybe you have a media library on a hard drive and you think it would be neat to be able to access it from anywhere, including on the go. Or a bunch of photos and videos you want to share with your family and friends. Maybe, like me, you want to have a personal corner of the internet where you can express yourself freely.
While typically not the main motivation, self-hosting can also save money, especially in the long run. This is particularly true if you or your family rely on a number of subscriptions or if you have copious amounts of data you need to store and access.
Those are some of the broader motivations that often bring people into the homelab/self-hosting world. With that in mind, here are the goals that guide the design of my setup (in order of importance):
- I want my data to be safe. I want to feel reassured that my data can’t be lost or corrupted.
- I want to keep things simple and reliable. I don’t want to spend all my free time troubleshooting or fixing things.
- At the same time, I want to experiment with new services and tools without feeling limited or scared to break existing things.
- I want to keep my homelab reasonably compact, quiet, and cost effective.
- I want to reduce my reliance on any specific provider and to be able to move effortlessly to another provider or offer if needed.
By clarifying your goals upfront, you create a roadmap for every decision you make—from choosing hardware to picking file systems, containers, and backup strategies. Every choice in this guide ties back to these motivations. Yours may differ, but having them written down keeps the project focused and intentional.
How to make services accessible from the internet
Before I start expanding on the requirements, I want to give you a concise explanation of how hosting works and the challenges associated with it. That way, we are all on the same page and the rest of the guide will make more sense. If you are already familiar with the topic, feel free to skip this section.
In this example, we will be hosting my blog, r-entries.com.
When a user, let’s say Alice, tries to access the blog, their browser will send a request to a DNS server to find the IP address of the website.
DNS is like a phone book for the internet. In this case, let’s say the returned IP address is 12.34.56.78.
Now that the browser knows the address, it will prepare and send a request to that address. In an ideal world, the server is directly accessible from the internet, so it receives the request, processes it, and sends the response back to the browser.
The browser then receives the response and displays the content of the blog.
And that’s it! Quite simple, don’t you think?
No chicken or the egg dilemma here! DNS servers are pre-configured either by the home router, in the device settings or the browser settings. They are configured with an IP address and not a domain name, removing the circular dependency.
But in reality, it’s highly unlikely that the server is directly accessible from the internet. When we say “IP” we typically mean IPv4. It’s the protocol used to uniquely identify devices on the internet. It was first deployed in the 1980s, at a time when the “internet” was still a small community of researchers and universities. With roughly 4.3 billion possible addresses (2^32), it was more than enough to cover the needs of the time. But as the internet grew, running out of addresses became a more and more pressing concern.
In the mid-90s, NAT (Network Address Translation) was introduced to help alleviate the address exhaustion problem. Before the introduction of NAT, every connected device had a globally unique IPv4 address, just like in the earlier example with Alice. With NAT, devices behind a router are assigned a local IPv4 address, and the router itself only uses a single public IPv4 address.
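To put numbers on the address space, here is a quick check using Python's standard `ipaddress` module. It is only meant to illustrate the orders of magnitude involved.

```python
import ipaddress

# The whole IPv4 space is one /0 network of 32-bit addresses.
ipv4_total = ipaddress.ip_network("0.0.0.0/0").num_addresses
# IPv6 addresses are 128 bits wide.
ipv6_total = ipaddress.ip_network("::/0").num_addresses

print(f"IPv4: {ipv4_total:,} addresses")  # IPv4: 4,294,967,296 addresses
print(f"IPv6: {ipv6_total:.3e} addresses")
```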
Businesses typically have a single public phone number, instead of dedicated landlines for each employee. When you call it, the receptionist answers and then routes you to the right person. Internally, they may have a local network of phones, with internal numbers for each employee. These internal numbers are unknown to the public, and even if they were, they couldn’t be used to call an employee directly.
So NAT is really the same idea but applied to computers on the internet. Instead of a receptionist, you can configure the NAT table to forward different requests to different internal devices.
Let’s update the diagram to reflect the introduction of NAT:
When the router receives the request, it will look up the request’s destination (in practice, usually the destination port) in the NAT table.
If you configure it correctly, it will find the local IP address (e.g: 192.168.1.200) of the server and forward the request to it.
You’ll also notice that the router, sitting right on the edge between the internet and the local network, possesses both a public and a local IPv4 address. The internal address is used by devices within the local network to reach the router. Meanwhile, external devices use the public address to contact the router.
Unfortunately, this is not enough. With the adoption of mobile devices, the number of devices connected to the internet exploded. Mobile carriers had to find a way to accommodate the flood of new devices and decided to introduce CGNAT (Carrier-Grade NAT). It’s carrier grade because it’s not done at the level of the home router, but at the level of an entire network of routers.
When applied to home ISPs, it means that you are no longer assigned a public IPv4 address, but a local one. The result is nested layers of local networks, with NAT placed at each boundary to forward requests appropriately.
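The forwarding step can be sketched as a simple lookup table. This is a toy model, not how any particular router implements it: the `NAT_TABLE` entries and the `forward` helper are made up for illustration, and the table is keyed on the destination port, which is how port forwarding is usually configured on home routers.

```python
# Hypothetical port-forwarding table: public port -> (local IP, local port).
NAT_TABLE = {
    80: ("192.168.1.200", 80),
    443: ("192.168.1.200", 443),
}

def forward(public_port):
    """Return the local destination for an incoming request, or None if no rule matches."""
    return NAT_TABLE.get(public_port)

print(forward(443))  # ('192.168.1.200', 443)
print(forward(22))   # None
```

A request arriving on a port with no matching entry has nowhere to go, which is why an unconfigured router keeps your local devices unreachable from the outside.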
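One way to get a hint about whether you are behind CGNAT is to look at the WAN address your router reports. The sketch below assumes the carrier uses the shared address space reserved for this purpose (100.64.0.0/10, RFC 6598); some carriers use ordinary private ranges instead, so treat the result as a clue rather than a verdict. The function name is made up for this example.

```python
import ipaddress

# Shared address space reserved for carrier-grade NAT (RFC 6598).
CGNAT_RANGE = ipaddress.ip_network("100.64.0.0/10")
# Private ranges used on ordinary local networks (RFC 1918).
PRIVATE_RANGES = [
    ipaddress.ip_network(net)
    for net in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]

def classify_wan_address(address):
    """Roughly classify the address a router received from the ISP."""
    ip = ipaddress.ip_address(address)
    if ip in CGNAT_RANGE:
        return "cgnat"
    if any(ip in net for net in PRIVATE_RANGES):
        return "private"
    # Simplification: anything else is treated as publicly routable.
    return "public"

print(classify_wan_address("100.72.1.5"))   # cgnat
print(classify_wan_address("12.34.56.78"))  # public
```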
This doesn’t impact the average user, but it has major implications for self-hosting. The problem is that you don’t have access to the carrier’s NAT table. And even if you did, it basically means only one customer could make use of the public IPv4 address.
So… is self-hosting no longer possible on the modern, IPv4-exhausted internet?
There are actually several solutions to circumvent this issue:
- Ask your ISP to move you off the CGNAT network and provide you with a dedicated public IPv4 address. They may tell you that such a feature is not available on your standard plan and that you need to upgrade to a business-oriented plan. This typically costs several times more than a standard plan.
- Support IPv6 only. The IPv6 address space is so much larger than IPv4’s that it doesn’t have the same address exhaustion problem. With IPv6, say goodbye to NAT and CGNAT: every device has a unique global IPv6 address, just like it used to be with IPv4. It’s the future of the internet, but unfortunately, not all ISPs support IPv6 yet. Adoption is still under 50 % worldwide. If it continues at this rate, it may take another 15 years to reach near-universal adoption.
- Use some form of tunneling. Tunneling is a technique to encrypt and forward traffic through a secure channel, crossing the NAT and CGNAT boundaries.
I won’t go into the details of how tunneling works, but if you ever used a VPN, that’s what they use. With a tunnel, you recreate a standard NAT setup, completely bypassing the CGNAT:
There are a few different types of tunneling:
- Use a reverse tunnel service like Cloudflare Tunnel. However, you have to trust the provider not to snoop on your traffic, which is temporarily decrypted on their end. Personally, I’m not comfortable with this solution: I find their free tier far too generous and that makes me doubt their intentions.
- Use a VPN service like NordVPN, TorGuard, or Windscribe. Look for a provider that can offer a dedicated public IPv4 address (which typically comes at an extra cost).
- Rent a VPS (Virtual Private Server) from a provider. No need to go into the details, what matters is that it’s a server that can be remotely accessed and controlled. Make sure the provider includes a public IPv4 address as part of the service. This is the solution I’m going to use to make my services accessible from the internet. I’ll explain the reasoning behind this choice in the next section.
And that’s it for the foundational networking knowledge. I hope you found it interesting and informative. It will be useful for understanding some of the decisions I’ll be making later on.
Requirements overview
To achieve the goals I’ve outlined earlier, I’m planning to use a three-part system:
- A main server to host the services and store my data. It will be hosted at my place.
- A bastion server, the entrypoint to the homelab and its services. It will be hosted at a VPS provider.
- A backup server, where offsite backups will be stored. It will be hosted at my mom’s place.
What I mean by a bastion server is a server that is used to access the homelab and its services from the internet. As explained earlier, because of CGNAT and other network restrictions, the main server cannot be directly accessed from the internet. So we need a bastion server to act like a gateway and to tunnel the requests to the main server.
I could have chosen another type of tunneling, like a VPN service that offers a dedicated public IPv4 address. But I think a VPS is more sensible and flexible. Entry-level VPS offers start under 5€/month, which is on par with VPN services. Besides, you can get a lot more from a VPS. It’s a full-fledged server after all. Also, unlike a VPN or reverse tunnel service, you typically have full control over the server’s software. Sometimes you even get to install the OS yourself. In conclusion, a VPS seems like the best option among the available solutions.
And lastly we have the backup server. As I stated in my goals, data safety is my top priority. Having a backup server located elsewhere is a great way to protect against disasters or home break-ins. If the main server is stolen or destroyed, the backup server can be used to restore the latest backups. If the backup server is stolen or destroyed, the latest data is still safe on the main server.
In the following sections, I’ll explore different aspects of the project and which choice best aligns with my goals.
Seeing this 3 part system, I’m thinking of naming the servers Melchior (main), Balthasar (bastion), and Casper (backup). This is a reference to the Magi System from Neon Genesis Evangelion.
Operating System (OS)
Come back later :)
Virtual Private Server (VPS)
I’m going to use a VPS provider for the bastion server. All I really need is a server with:
- Debian 13
- SSH access
- a public IPv4 address
- at least 2 vCores
- at least 2 GB of RAM
- at least 20 GB of storage
- a European company and data center
My internet speed is pretty fast with 2 300 Mbps download and about 1 000 Mbps upload. A provider with 1 Gb/s speed would be ideal for fast downloads. But I’m also thinking that a lower speed—on top of reducing the cost—would reserve more bandwidth for my own home internet usage.
| Offer | CPU | RAM | Storage | Speed | Price (no commitment) | Price (12 months commitment) |
|---|---|---|---|---|---|---|
| Contabo (VPS 10) | 4 vCores | 8 GB | 75 GB | 200 Mb/s | 4.50 € / month | 3.60 € / month |
| IONOS (VPS Linux S) | 2 vCores | 2 GB | 80 GB | 1 Gb/s | 6.80 € / month | 4.20 € / month |
| OVH (VPS-1) | 4 vCores | 8 GB | 75 GB | 400 Mb/s | 5.39 € / month | 4.58 € / month |
| OVH (VPS-2) | 6 vCores | 12 GB | 100 GB | 1 Gb/s | 8.39 € / month | 7.14 € / month |
| o2switch (Grow) | 8 vCores | 16 GB | Unlimited | 1 Gb/s | Not available | 8.40 € / year |
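As a first pass, the hard requirements listed earlier can be checked mechanically against the table. The snippet below is only a sketch: it copies the no-commitment prices from the table, leaves out o2switch (which has no no-commitment price), and does not capture softer criteria such as bandwidth or the provider's country.

```python
# Offers copied from the table above (no-commitment price, in euros per month).
OFFERS = [
    {"name": "Contabo (VPS 10)", "vcores": 4, "ram_gb": 8, "storage_gb": 75, "price": 4.50},
    {"name": "IONOS (VPS Linux S)", "vcores": 2, "ram_gb": 2, "storage_gb": 80, "price": 6.80},
    {"name": "OVH (VPS-1)", "vcores": 4, "ram_gb": 8, "storage_gb": 75, "price": 5.39},
    {"name": "OVH (VPS-2)", "vcores": 6, "ram_gb": 12, "storage_gb": 100, "price": 8.39},
]

# Minimum specs from the requirements list.
MINIMUM = {"vcores": 2, "ram_gb": 2, "storage_gb": 20}

def meets_requirements(offer):
    """True if the offer reaches every minimum spec."""
    return all(offer[key] >= minimum for key, minimum in MINIMUM.items())

eligible = sorted(
    (offer for offer in OFFERS if meets_requirements(offer)),
    key=lambda offer: offer["price"],
)
for offer in eligible:
    print(f'{offer["name"]}: {offer["price"]:.2f} € / month')
```

Every offer in the table clears the minimum specs, so the final choice comes down to the criteria that are harder to put in a filter.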
In the end, I decided to go with OVH. Specifically their VPS-1 offering:
- CPU: 4 vCores
- Memory: 8 GB
- Storage: 75 GB (NVMe)
- Bandwidth: 400 Mbit/s
- Traffic: Unlimited
- Price: 5.39 € / month (no commitment)
- Backups: Daily, keep one (keep 7 for 2.20 € / month)
- Snapshots: No snapshots (one snapshot for 0.60 € / month)
- SLA: 99.9 % uptime
- OS: Fedora (40, 42), AlmaLinux (8, 9, 10), CloudLinux (9), Debian (11, 12, 13), Rocky Linux (8, 9, 10), Ubuntu (22.04, 24.04, 24.10, 25.04), FreeBSD (14.3)
- Comes with public IPv4 and IPv6 addresses
- SSH access
Tunneling solution
On my homelab, I would like to host two types of services:
- Private services (such as Immich) that should only be accessible by a limited number of users (mostly me).
- Public services (such as r-entries.com) that should be directly accessible without any authentication.
This is typically handled by a reverse proxy, a service that acts as an entry point for other services. A reverse proxy typically handles routing, filtering requests, authentication, TLS/HTTPS, logging, and more.
On the other hand, the tunneling solution is responsible for creating a secure channel between two servers, the bastion and the main server in our case.
Both functions can be completely separate, but in my case I think I’m going to use Pangolin, a tunneled reverse proxy.
Setup is pretty straightforward:
- On the bastion server, you install Pangolin.
- On the main server, you install Newt, a lightweight WireGuard client.
Even if you are behind a CGNAT, you can now access your services from the internet. By default, Pangolin considers all services private. When you try to reach a service, you will be redirected to the Pangolin login page. Once logged in, you are sent back to the URL of the service you were trying to access.
While Pangolin also offers a Cloud option, we will focus on the self-hosted version where everything is handled locally.
Authentication is verifying who the user is. Authorization determines what the user is allowed to do. Both can be used in tandem, or you can have one without the other.
You have a full range of options to either limit or extend access to your services.
- You can make services completely open to the public.
- You can ditch authentication and just use authorization with a password, pin code, or a sharable link.
- Or keep it as default, requiring both authentication and authorization.
Furthermore, you can define ranked rules to either block or allow access from specific IPs, geolocation, URL paths, and more.
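To make the idea of ranked rules more concrete, here is a toy evaluator where the first matching rule wins. This is not Pangolin's configuration format or implementation: the rule fields, the action names, and the `evaluate` function are all invented for illustration.

```python
import ipaddress

# Hypothetical rules, checked in order of priority (lowest number first).
RULES = [
    {"priority": 1, "match": "ip", "value": "203.0.113.0/24", "action": "block"},
    {"priority": 2, "match": "path", "value": "/public/", "action": "allow"},
]

def evaluate(client_ip, path, rules=RULES, default="authenticate"):
    """Return the action of the first matching rule, or the default action."""
    for rule in sorted(rules, key=lambda r: r["priority"]):
        if rule["match"] == "ip":
            matched = ipaddress.ip_address(client_ip) in ipaddress.ip_network(rule["value"])
        elif rule["match"] == "path":
            matched = path.startswith(rule["value"])
        else:
            matched = False
        if matched:
            return rule["action"]
    return default

print(evaluate("203.0.113.7", "/public/index.html"))   # block
print(evaluate("198.51.100.1", "/public/index.html"))  # allow
print(evaluate("198.51.100.1", "/admin"))              # authenticate
```

The ordering matters: the blocked IP range is ranked above the public path, so a request from that range is rejected even when it targets a public URL.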
If you choose to keep authentication, you can have Role-Based Access Control (RBAC) to grant or block access to services based on the user’s role. This gives you further control over who can access what. And contrary to other solutions like Tailscale, you don’t need to install anything on the user’s side.
It can also work in tandem with CrowdSec, a modern, open-source, collaborative behavior detection engine, integrated with a global IP reputation network. It functions as a massively multiplayer firewall, analyzing visitor behavior and responding appropriately to various types of attacks.
Virtualization and containerization
Virtualization offers strong isolation and the ability to run multiple operating systems on the same hardware. However, it also adds extra overhead and complexity.
Containerization offers a lighter alternative to virtualization. Containers are fast to deploy, there’s barely any overhead, and they are easy to manage. However, they don’t offer the same level of isolation as virtual machines.
In my previous homelab, I was running XCP-ng as the hypervisor. Then I had a few VMs running on it. And each would use Docker containers to run services. But all the VMs were running Debian, so I’m unsure if it was worth the complexity of running a hypervisor. Also, having to choose how many resources to allocate to each VM was a bit of a pain, especially when you realize you need to expand a VM’s storage.
Thus, my plan is to run Debian on bare metal on all three servers and use Docker containers to run services. If in the future I decide that virtualization is needed, I can always install Proxmox VE on the main server. Proxmox VE is itself based on Debian, so installing it on top of an existing Debian system should be simple enough. They even have a guide for that.
File system
Introduction to ZFS
So I’ve heard of ZFS before, but I’ve always thought it looked complicated and intimidating. Probably because it does a lot more than your average filesystem.
ZFS, or more specifically OpenZFS, describes itself as an “open-source storage platform”. Here are a few of its key features:
- Protection against hardware failure with mirroring and RAIDZ.
- Protection against data corruption with integrity verification and automatic repairs.
- Built-in snapshots and replication.
- Support for massive files and massive storage capacities (ZFS stands for Zettabyte File System after all).
- Support for transparent compression and hardware-accelerated native encryption.
Protection against hardware failure
Here’s a quick overview of how it works. Storage is exposed to the system as pools. A pool includes one or more vdevs. A vdev includes one or more disks of the same capacity.
A vdev can be of multiple types:
- single: it’s just a single disk. Offers no redundancy but still detects data corruption.
- mirror: two or more disks are mirrored, meaning the exact same data is written to all disks. If one disk fails, the vdev continues to operate normally while you replace the faulty disk.
- RAIDZ1: three or more disks are used to store data. Parity is computed for each strip (block of data) and then distributed across the disks. If one disk fails, the vdev continues to operate normally while you replace the faulty disk. The total usable capacity is the total capacity of the disks minus one disk worth of parity. This is similar to RAID5.
- RAIDZ2: similar to RAIDZ1 but with two parity strips. This means two disks can fail before data loss occurs.
In general:
- Mirror requires at least 2 disks to operate. Mirroring with N disks can sustain N-1 disk failures. The total usable capacity is the capacity of just one of the disks.
- RAIDZN requires at least N+2 disks to operate and can sustain N disk failures. The total usable capacity is the sum of the disks’ capacities minus N disks worth of parity.
For example, in RAIDZ2, you need at least 4 disks. Let’s say you have 6 disks with 1 TB of capacity each. The total usable capacity is 4 TB (1 TB × 6 disks - 2 TB of parity). You could lose 2 disks before data loss occurs.
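The rules above fit in a small helper. It is a back-of-the-envelope sketch (the function name and the `kind` strings are just for this example): it follows the minimum disk counts given in this section and ignores metadata and padding overhead, so a real pool reports slightly less usable space.

```python
def vdev_summary(kind, disks, disk_tb):
    """Return (usable capacity in TB, number of disk failures tolerated).

    kind is "mirror", "raidz1", "raidz2" or "raidz3".
    """
    if kind == "mirror":
        if disks < 2:
            raise ValueError("a mirror needs at least 2 disks")
        return disk_tb, disks - 1
    if kind.startswith("raidz"):
        parity = int(kind[len("raidz"):])
        if disks < parity + 2:
            raise ValueError(f"{kind} needs at least {parity + 2} disks")
        return (disks - parity) * disk_tb, parity
    raise ValueError(f"unknown vdev type: {kind}")

print(vdev_summary("raidz2", 6, 1))  # (4, 2)
print(vdev_summary("mirror", 2, 4))  # (4, 1)
```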
Anytime a faulty disk is replaced, ZFS will automatically start recreating the data that was on the faulty disk using the other disks. In the end, the new disk will be a perfect copy of the faulty disk. This process is called resilvering.
Resilvering can take a while, especially when dealing with terabytes of data. It also depends on the topology of the vdev, and how much the disks are utilized during the process. It’s not uncommon to see resilvering take days.
If you choose a vdev type where only one disk can fail, this resilvering period can be very stressful. It’s unlikely for another disk to fail during this period, but it’s still a risk. And if it happens, all the data is lost. This is why with larger, more numerous disks it’s highly recommended to choose at least RAIDZ2.
Once a vdev is created, you can’t change its type. For example, you can’t change a RAIDZ1 to a RAIDZ2.
Integrity checks and automatic repairs
Protection against hardware failure is cool and all but I think integrity checks and automatic repairs really make ZFS stand out. It is achieved by storing a checksum for each block of data. During read operations the checksum is verified. If the checksum does not match, then ZFS knows that the block is corrupted. It will then attempt to repair the data:
- In the case of mirroring, ZFS will use the first copy of the data where the checksum matches to repair the data.
- In the case of RAIDZ, parity data is used to repair the data.
The same way that RAIDZN can sustain N disk failures, ZFS can repair a corrupted block as long as one copy of the parity data is available. If more than N copies of the data and its parity are corrupted, then the data is irreparable.
Even if ZFS cannot repair the data, it will at least notify you that a block is corrupted.
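Here is a toy model of the mirror case: verify the checksum on read, serve the first good copy, and overwrite the bad ones. It only illustrates the principle. SHA-256 is used because it ships with Python (ZFS's default checksum algorithm is a different one), and the function names are made up.

```python
import hashlib

def checksum(data):
    return hashlib.sha256(data).hexdigest()

def read_block(copies, expected):
    """Read a mirrored block: return good data and repair bad copies in place."""
    good = next((copy for copy in copies if checksum(copy) == expected), None)
    if good is None:
        # Nothing to repair from, but the corruption is at least detected.
        raise IOError("block is corrupted on every copy")
    for index, copy in enumerate(copies):
        if checksum(copy) != expected:
            copies[index] = good  # self-healing: overwrite the bad copy
    return good

original = b"family photos"
expected = checksum(original)
copies = [b"family phot0s", original]  # the first disk silently corrupted the block

data = read_block(copies, expected)
print(data == original, copies[0] == original)  # True True
```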
Furthermore, ZFS doesn’t just wait for data to be read to verify the checksum. It can also periodically perform integrity checks and automatic repairs of all the data. This process is called scrubbing.
Extensibility
ZFS mirror and RAIDZ vdevs can be expanded by adding new disks to the vdev. In the case of RAIDZ, this is a recent feature that was introduced in OpenZFS 2.3.
You can also replace the disks in a vdev with larger ones. You need to do that one disk at a time and resilver the vdev after each replacement.
Finally, you can simply add more vdevs to a pool.
Hot spares
Disks can be assigned as hot spares for a pool: if a disk fails in any of the vdevs, resilvering can immediately start on one of the spare disks. This can be useful if you can’t readily access the server to replace the faulty disk (e.g: you’re on vacation).
However, if you only have one vdev, I would argue that it’s better to opt for a more resilient vdev type (e.g: choosing RAIDZ2 instead of RAIDZ1).
In general, it’s still recommended to have a spare disk on hand in case of a hardware failure. This way, you can replace the faulty disk faster and the system will return to a nominal state faster.
Portability
Each disk in a pool has ZFS labels written at the start and end of the device. Those labels contain the pool name, vdev layout, and GUIDs.
So, if you decide to move the pool to another server, you just need to move the disks to the new server. Disk order does not matter (you don’t have to plug them into the same SATA ports, or even the same controller brand). As long as all the required disks are present, ZFS can figure out the correct configuration.
Compression
Another great feature is that ZFS can transparently compress the data. Transparent here means that the data is compressed on write and decompressed on read: the user never realizes that the data is compressed.
It’s not necessary to disable compression on datasets which primarily have incompressible data on them, such as folders full of video or audio files. ZFS is smart enough to not store data compressed if doing so wouldn’t save any on-disk blocks.
It may be counter-intuitive but compression can increase read and write performance. Sure the CPU has to compress and decompress the data, but smaller resulting files also means less data to read or write on the disk. If disk I/O is the bottleneck (e.g: you’re using HDDs), compression will increase effective I/O performance.
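The "only keep it compressed if it helps" behavior can be imitated in a few lines. This is a simplified sketch: `zlib` stands in for the algorithms ZFS actually offers, the decision rule is reduced to "smaller wins", and random bytes play the role of already-compressed media.

```python
import os
import zlib

def store_block(data):
    """Return (stored bytes, was_compressed), keeping the raw block if compression doesn't help."""
    compressed = zlib.compress(data)
    if len(compressed) < len(data):
        return compressed, True
    return data, False

text_block = b"homelab " * 512    # highly repetitive, compresses well
random_block = os.urandom(4096)   # stands in for video or audio data

for name, block in (("text", text_block), ("random", random_block)):
    stored, was_compressed = store_block(block)
    print(f"{name}: {len(block)} -> {len(stored)} bytes, compressed={was_compressed}")
```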
Backups
ZFS is uniquely suited for backups. Its built-in snapshots and replication make it easy to restore data to a previous state.
Snapshots are a frozen, read-only view of the filesystem at a given point in time. The data is not duplicated; any modification made after the snapshot is created is tracked separately. So, to restore the filesystem to a given snapshot, you just need to discard those modifications. Snapshots are instant to create and space-efficient.
They also offer good protection against user error and ransomware. The snapshots are immutable, so even if all your files get encrypted, you can still restore the filesystem to a previous snapshot.
Importantly, restoring a snapshot means restoring the entire filesystem to that point in time. This can be useful if an update went wrong or if a ransomware attack encrypted most files. If you only need to restore a specific file or folder, this can be achieved as well. Snapshots are read-only but they can be browsed and copied from just like any directory.
Replication is the process of copying snapshots from one ZFS pool to another. It can be used to backup the data to a remote location. Subsequent replications will only copy the differences between the previous snapshot and the current state of the filesystem. So it’s incremental and network-efficient.
Snapshots can be scheduled to be created at a given frequency. Retention policies can be set to automatically delete old snapshots. The same can be done for replications. You can decide to only replicate snapshots at a given frequency. Replication and retention policies can be set on a folder by folder basis if needed.
Snapshot frequency, replication frequency, and retention should reflect your tolerance for data loss. For me, I’m okay with losing a week of data if my home is burned to the ground or a thief breaks in. This means I don’t need replication to be any more frequent than once a week.
This is the proposed policy for the main server:
- A snapshot is created every hour.
- Hourly snapshots for the last 24 hours → rollback from accidental deletes or ransomware.
- Daily snapshots for the last 7 days → covers recent human error.
- Weekly snapshots for the last 4 weeks → medium-term rollback.
- Monthly snapshots for the last 6 months → long-term baseline.
- Replication to the backup server is done weekly.
On the backup server, I can have a longer retention policy:
- Weekly snapshots for the last 12 weeks.
- Monthly snapshots for the last 24 months.
- Yearly snapshots for the last 6 years.
The “3-2-1” rule
You’ve probably heard of the 3-2-1 backup rule. It’s a simple backup strategy that ensures you have at least 3 copies of your data, 2 on different media, and 1 offsite. So far, we have planned for 2 copies of the data, including 1 offsite.
There are three major media types:
- Magnetic storage: HDD, tape
- Optical storage: DVD, Blu-ray
- Solid-state storage: SSD, NVMe
We are already using magnetic storage in the form of HDDs in the main and backup servers. Besides the cost, it would be silly to use SSDs for backup purposes and not for the live data.
That leaves us with optical storage for the third copy. Blu-ray discs have gone down in price. They sit today at around 30 € per TB. The drive/burner itself costs below 100 €.
There is also the Optical Disc Archive technology, which promises larger capacity (up to 5.5 TB per cartridge) and longer lifespan (up to 100 years). But the technology has been officially discontinued since 2023. The cartridges are not that much more expensive at 40 € per TB. But the drive itself is impossible to buy new, and used ones cost more than 1 000 €.
Finally we have the M-DISC technology: basically Blu-ray discs made with a different, more durable material. The claim is that they can last up to 100 years. The price is around 180 € per TB. Basically every BD drive can read M-DISC discs, but only compatible drives can burn them. Such a drive costs around 120 €.
| Media | Price per TB | Price of the drive | Advertised longevity |
|---|---|---|---|
| Blu-ray | 30 € | 100 € | 10-20 years |
| Blu-ray M-DISC | 180 € | 120 € | 100 years |
| Optical Disc Archive | 40 € | 1 000 € | 100 years |
Based on the price per TB, if I have a replacement cycle every 10 years, regular BD would be more cost-effective for the first 60 years. In the next 60 years, it’s very likely that new technology will be released and make the M-DISC completely obsolete.
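The 60-year figure comes from simple arithmetic on the table prices. Here is a small sketch of it, assuming prices stay constant and ignoring the cost of the drives; the function names are made up for the example.

```shell
# Number of BD-R sets you can buy for the price of one M-DISC set.
# $1 = BD-R price per TB, $2 = M-DISC price per TB (integer euros)
break_even_sets() {
    echo $(( $2 / $1 ))
}

# Years covered by that many sets.
# $1 = number of sets, $2 = replacement cycle in years
years_covered() {
    echo $(( $1 * $2 ))
}

# Example: break_even_sets 30 180   prints 6
#          years_covered 6 10       prints 60
```

Six sets of regular discs at 30 € per TB cost the same 180 € per TB as a single M-DISC set, and with one set every 10 years they cover 60 years.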
Besides, if I replace the set of discs every 10 years, I’ll technically have more than one copy of the data on optical discs. On top of providing more copies, it will give me the opportunity to collect empirical data on disc aging.
So in conclusion, I’ll be using Blu-ray discs for the third copy. Here’s the plan:
- Verbatim BD-R DataLifePlus 50GB 6X
- Burn them at low speed with verification and keep checksums of the data on the discs themselves and on the other storage media.
- Integrity checks of the discs every year.
- Replace the set of discs every 10 years.
- Continue monitoring discs from the previous set to learn more about their actual longevity.
- Keep an eye out for new technologies that could replace Blu-ray discs for the next replacement cycle.
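For the checksum and yearly integrity check steps, one possible approach is a plain SHA-256 manifest built with sha256sum. This is a sketch, not a finished tool: the function names are made up, and it assumes GNU coreutils and findutils as shipped with Debian.

```shell
# Write a SHA-256 manifest for every file under a directory.
# $1 = directory to burn, $2 = manifest file to create.
# Keep the manifest outside $1 while generating it, then copy it next to the data.
make_manifest() {
    ( cd "$1" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum ) > "$2"
}

# Re-check a directory (e.g. a mounted disc) against a manifest.
# Returns non-zero if any file is missing or altered.
verify_manifest() {
    local manifest
    manifest=$(realpath "$2") || return 1
    ( cd "$1" && sha256sum --quiet -c "$manifest" )
}

# Example (hypothetical paths):
#   make_manifest ~/to-burn/disc-01 ~/manifests/disc-01.sha256
#   verify_manifest /media/bluray ~/manifests/disc-01.sha256
```

Because the paths in the manifest are relative, the same file can be used to verify the staging folder, the burned disc, and the copies on the other storage media.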
Other tips:
- Store optical discs vertically in standard jewel cases without additional materials in the case.
- If you want to label the discs, use a water-based permanent marker on the clear inner hub, not the top surface.
- 20-50 % humidity and temperatures between 15-25 °C are ideal.
More info at this article on Optical Discs by the Canadian Conservation Institute.
Memory
Error-correcting code memory, commonly known as ECC Memory, is a type of RAM known for its data integrity and reliability. This specialized type of computer data storage uses a more sophisticated technology than standard RAM to detect and automatically correct internal data corruption.
The ECC procedure typically involves adding a few extra bits to each chunk of data stored in memory. These extra bits, known as parity bits, allow the system to determine if the data has been corrupted. If an error is detected, the ECC memory can often find the exact bit that is incorrect and restore it to the proper value without the user or the running program even realizing anything happened.
To use ECC memory, we need:
- A CPU that supports ECC
- A motherboard that supports ECC
- A memory stick that supports ECC
It needs support at all levels because ECC not only protects the data when in RAM, but also as it travels between the CPU registers and the RAM sticks.
Of course the tradeoff is that ECC memory is more expensive. Also, finding a consumer-grade CPU and motherboard that support ECC memory can be difficult.
ECC memory is one more protection against data corruption. It’s complementary to the other protection mechanisms provided by ZFS.
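Once the hardware is assembled, it is worth checking what the firmware reports, since all three components have to cooperate. The sketch below filters the output of dmidecode -t memory for the “Error Correction Type” field; the helper name is made up, and the field only tells you what the firmware claims, not that errors are actually being corrected.

```shell
# Extract the "Error Correction Type" reported by the firmware.
# Reads `dmidecode -t memory` output on stdin.
ecc_type() {
    awk -F': ' '/Error Correction Type/ { print $2; exit }'
}

# Usage (as root): dmidecode -t memory | ecc_type
```

A value of None means ECC is not active on that system, even if the memory sticks themselves support it.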
Uninterruptible Power Supply (UPS)
Come back later :)
Monitoring
As stated earlier, I’m unhappy about my lack of monitoring and alerting on the previous home lab. This time, I want to make sure things are working as expected and be alerted as soon as possible if something goes wrong.
Here’s a list of the metrics I want to monitor:
- Servers
- Resource usage: CPU, RAM, storage, disk I/O, network
- Temperature: CPU, disks, room
- Disks SMART status
- System updates
- UPS
- Status
- Battery level, temperature, and health
- Temperature
- Costs
- Electricity cost
- ISP subscription
- VPS subscription
- Hardware cost
- CrowdSec
- Blocked requests count (total, by scenario)
- Blocked IPs (total, by scenario)
- Traefik
- Requests count (total, by domain)
- Requests latency (avg, max, min)
- Errors count (total, by status code)
- Geographic distribution (top 10 countries)
- ZFS
- Pool health, error counts and scrub status
- Snapshots jobs and current retention
- Replication jobs
- Services
- Ping or health-check uptime
- Image updates
Then I can set up alerts when those metrics are outside of acceptable ranges. For example:
- The scheduled scrub is not running on time
- Storage is almost full
- Temperatures have been above a threshold for some time
- Power surges or outages
- Services seem unavailable
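As an illustration of the “storage is almost full” alert, here is a minimal sketch that could run from a cron job or a systemd timer. It parses the output of zpool list -H -o name,capacity; the function name and the 80 % threshold are assumptions, and how the warning is delivered (mail, push notification, etc.) is left open.

```shell
# Read "name<TAB>capacity%" lines (the format of `zpool list -H -o name,capacity`)
# and print a warning for every pool at or above the threshold.
# $1 = threshold in percent
pools_over_threshold() {
    awk -v limit="$1" '{
        cap = $2; sub(/%$/, "", cap)
        if (cap + 0 >= limit + 0) printf "WARNING: pool %s is %s%% full\n", $1, cap
    }'
}

# Usage: zpool list -H -o name,capacity | pools_over_threshold 80
```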
Scope
Things that are out of scope:
- High availability
Come back later :)
Threats and mitigations
As a recap of everything I’ve mentioned so far, here’s a list of potential threats and how I plan to mitigate them:
- Hardware threats
- Disk failures
- Covered by RAIDZ on both main and backup servers.
- SMART monitoring and alerts for signs of imminent failure.
- Alerting when a disk fails so that it can be replaced ASAP.
- Spare disks ready to swap in if needed.
- Data corruption
- Covered by ZFS block checksums.
- Self-healing by ZFS.
- Checked by monthly scrubbing on both main and backup servers.
- ECC memory participating in data integrity on the main server.
- UPS for clean shutdowns on both main and backup servers.
- Power outages and surges
- Covered by UPS on both main and backup servers.
- Bi-yearly UPS battery test.
- Overheating
- Monitor temps including disks temps.
- Monitor room temps and adjust fans accordingly.
- Use dust filters and clean them regularly.
- Ensure positive pressure in the computer cases to reduce dust accumulation.
- Provider threats
- ISP issues (CGNAT, IPv4 change, outage, blocking ports)
- Tunnel to the bastion server using Pangolin.
- Could still use my phone or a 4G/5G router as a backup.
- VPS provider failure
- Backups to the backup server.
- If the VPS is unreachable, I could relatively quickly switch to another VPS provider using the latest backup.
- Malicious threats
- Bot attacks / DDoS
- Covered by the bastion server (Pangolin + CrowdSec).
- Reduce attack surface by making sure only a few ports are open.
- Ransomware
- Recoverable using local ZFS snapshots and remote backups.
- Vulnerabilities
- Keep packages and Docker images up to date.
- Enforce SSH key only authentication on all hosts.
- Enforce 2FA on publicly accessible services.
- Audit ~/.ssh/authorized_keys and /etc/sudoers regularly.
- Network segmentation between containers.
- Network segmentation between the home lab and other devices on the local network.
- Disaster / theft
- Use locks and chains to secure the servers and slow down / discourage thieves.
- Offsite backup
- Human threats
- Accidental deletion / overwriting
- Covered by local ZFS snapshots and remote backups.
- Disk misplacement during maintenance
- Use stickers to mark the disks and their assigned location.
- Refer to that label when alerting about a disk needing to be replaced.
- Forgetting my own setup
- Scripts to bring a fresh Debian server to a working state.
- Docker Compose and documentation for each service.
- Infrastructure-as-Code.
- Operational risks
- Backup failures
- Monthly restoration test.
- Monitoring of ZFS replication jobs.
- Over-utilized resources: alerting
- Alerting failures
- Have weekly alerts just to confirm that things are working as expected.
- Monthly manual checks.
Choosing and setting up the hardware
Buying prebuilt vs building your own
Come back later :)
Core components for the main server
I have a few requirements:
- I’ll start with 4 × 3.5” HDDs but in terms of upgradability, I’ll aim for 6 × 3.5” HDDs.
- Support for 2 × NVMe SSDs (or at least SATA SSDs)
- 32 GB of memory (upgradeable to 64 GB if needed)
- Efficiency is very important considering the system will run 24/7
- CPU performance should be sufficient. I like to use PassMark scores to compare CPUs. Let’s aim for a score of 20 000 or more.
And for the nice to have:
- ECC memory
- Hardware acceleration for video transcoding and AI workloads
CPU
Come back later :)
GPU
A GPU (dedicated or integrated) is not strictly necessary. But it can be useful for:
- Simply to display a video output (accessing the BIOS, debugging, etc…)
- Image processing (generating thumbnails)
- Video transcoding (Jellyfin, Plex)
- AI workloads (Immich face recognition, LLMs)
For video transcoding, Jellyfin recommends:
- Intel Arc A series or newer
- Nvidia GTX16/RTX20 series or newer (excluding GTX1650)
- AMD is NOT recommended.
NVIDIA provides a list of supported video encoding and decoding features for their GPUs. Overall, NVIDIA GPUs are pricier but well regarded for their gaming and video transcoding capabilities.
Entry-level Intel Arc GPUs are considered mediocre for gaming but provide excellent video transcoding capabilities considering their price, consumption, and size.
AMD GPUs are considered behind the competition when it comes to video transcoding. Even when the codecs are supported, performance or image quality may be lacking.
Depending on the choice of case, we may have to look for a low profile GPU.
Motherboard
For the motherboard, it’s not too difficult to find one with 6 SATA ports. If we want more SATA ports, we have a few options:
- Specialized motherboards with more SATA ports (rare, expensive)
- M.2 to SATA adapter (up to 6 SATA ports)
- PCIe to SATA adapter
- PCIe to SAS + SAS to SATA cable (typically 2 SAS ports, which means 8 SATA ports)
- If you really want to go crazy, you can use a PCIe x16 to 4x4 bifurcation card to convert a single PCIe x16 slot to 4 × M.2 slots. Then you can use an M.2 to SATA adapter on each M.2 slot. This means 24 SATA ports.
Other recommendations:
- Update the BIOS before installing the CPU: users have reported that their newly bought CPU (e.g. Ryzen 9000 series) got overvolted when installed on a motherboard that didn’t support the CPU without a BIOS update.
- Prefer ASRock or ASUS motherboards if you want to use ECC memory. Most other brands don’t support it.
Memory
CPU
- Ryzen 3000/4000
- Without integrated graphics: support ECC memory
- With integrated graphics:
- Pro variant: support ECC memory
- Non-Pro variant: no support
- Ryzen 5000
- Without integrated graphics:
- Ryzen 5 5500: no support
- Others: support ECC memory
- With integrated graphics:
- Pro variant: support ECC memory
- Non-Pro variant: no support
- Ryzen 7000: support ECC memory
- Ryzen 8000
- Pro variant: support ECC memory
- Non-Pro variant: no support
- Ryzen 9000: support ECC memory
For the AM5 platform we have this handy community spreadsheet.
Motherboard
In general, Asus and ASRock motherboards support ECC memory. Other brands typically don’t.
For the AM5 platform we have this handy community spreadsheet.
Memory stick
Look for ECC UDIMM (Unbuffered DIMM).
- Avoid RDIMM (Registered DIMM), which is only available in workstation / server motherboards.
- Avoid “on-die ECC” only. It is a required feature on DDR5 memory sticks, but only corrects single-bit errors within the memory chip itself.
Storage
You can easily check what’s the best price for different storage types and capacities at diskprices.com.
For ZFS, it’s important that the hard drives are not SMR, or else performance will suffer. In particular, SMR can make resilvering much slower. You can use this list to check whether a disk is SMR. CMR is the preferred type.
Shucking drives
https://www.ifixit.com/Guide/How+to+Shuck+a+WD+Elements+External+Hard+Drive/137646
PSU
Considering the rest of the requirements, we can expect the system to draw a maximum of 400 W. This is because HDDs will draw significantly more power at startup than during normal operation.
During normal operation, the system should draw around 120 W.
Any PSU rated for 550 W or higher should be sufficient.
Obviously, efficiency is important for a home lab that will run 24/7. For example, the be quiet! PURE POWER 12 M 650 W advertises an 80+ Gold rating. This means:
- at 20 % load, the efficiency is at least 90 %.
- at 50 % load, the efficiency is at least 92 %.
- at 100 % load, the efficiency is at least 89 %.
So the efficiency should be above 90 % most of the time, which is great. I would recommend at least an 80+ Gold rating.
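To put numbers on it: a 120 W load on a 650 W unit is about 18 % load, close to the 20 % point where efficiency is at least 90 %. The sketch below estimates the draw at the wall and the yearly energy use, assuming the 120 W figure is the load on the PSU output side; the function names are made up.

```shell
# Power drawn at the wall for a given load and PSU efficiency.
# $1 = load in watts, $2 = efficiency in percent
wall_watts() {
    awk -v load="$1" -v eff="$2" 'BEGIN { printf "%.1f\n", load * 100 / eff }'
}

# Energy used over a year of 24/7 operation, in kWh.
# $1 = power at the wall in watts
yearly_kwh() {
    awk -v w="$1" 'BEGIN { printf "%.0f\n", w * 24 * 365 / 1000 }'
}

# Example: wall_watts 120 90   prints 133.3
#          yearly_kwh 133.3    prints 1168
```

Multiply the yearly kWh by your electricity price to compare PSUs with different efficiency ratings.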
It can be difficult to find a PSU with enough SATA or PATA (aka Molex) cables out of the box. If the PSU is modular, check whether it has enough “Drives” / “SATA/PATA” ports. If so, the manufacturer may sell additional cables separately.
It’s also possible to buy compatible SATA/PATA cables from cablemod for about 10 € per cable, 15 € for shipping.
Considering that I’m not likely to use more than 6 × 3.5” HDDs, I don’t have to worry about how much current is going through the 12 V rails. Look into it if you’re using more disks.
TODO: add my research on power efficiency
Cases
The difficulty is finding a case that will accommodate at least 6 × 3.5” disks. This is becoming increasingly difficult, as nowadays people will likely use an NVMe SSD and maybe one or two 3.5” HDDs.
There are some specialized “PC NAS” cases, but they are typically quite expensive and value compactness. This means that the motherboard will be Mini-ITX or Micro-ATX, which are typically more expensive and provide fewer slots and features.
Then there are rack mounted cases, which are also expensive. Typically, you can use any size of motherboard but you are still limited by the height of the case. This means the CPU cooler and the GPU need to be low profile. You can easily find cases with at least 6 × 3.5” bays and some with hot-swappable bays on the front of the case which makes maintenance very convenient.
Tower cases
Here’s a few I like:
- Jonsbo N3
- 8 × 3.5” HDD bays
- Mini ITX motherboard
- SFX PSU (up to 105 mm in length)
- Uses an HDD backplane, needs 2 Molex connectors to power all drives
- 2 × 100 mm fans + 2 × 80 mm fans
- 250 mm wide, 210 mm tall, 374 mm deep
- Fractal Design Node 304
- 6 × 3.5” HDD bays
- Mini ITX, Mini DTX motherboard (avoid angled SATA connectors)
- ATX PSU (up to 160 mm in length)
- No HDD backplane
- 2 × 92 mm fans + 1 × 140 mm fan
- 233 mm wide, 298 mm tall, 262 mm deep
- Fractal Design Node 804
- 10 × 3.5” HDD bays + 2 × 2.5” SSD bays
- Micro ATX, Mini ITX motherboard
- ATX PSU (up to 260 mm in length)
- No HDD backplane
- 2 × 92 mm fans + 1 × 140 mm fan
- 344 mm wide, 307 mm tall, 389 mm deep
Rack mounted
- “19-inch” is the most common rack width. The minimum opening width is 450 mm and the width of the cabinet is 24 inches (600 mm).
- Rack mounted equipment also comes in standardized U height units. 1U is 1.75 inches (44.45 mm).
- A 1U server is 44.45 mm tall.
- A 2U server is 88.9 mm tall.
- A 3U server is 133.35 mm tall.
- A 4U server is 177.8 mm tall.
- Racks also come in U units. If you buy a 16U rack, you can put 16U worth of equipment in it.
- Rack depth is not standardized. If the rack is enclosed, you have to make sure the equipment fits.
Here’s one I like:
- Inter-Tech IPC 3U-3508
- https://www.inter-tech.de/productdetails/3U-3508_EN.html
- 8 × 3.5” hot-swappable HDD bays
- 2 × 2.5” internal SSD bays
- Mini-ITX, Micro-ATX, ATX motherboards
- ATX PSU
- 2 × 80 mm fans + 2 × 60 mm fans
- max. 100 mm tall CPU cooler / GPU
- max. 244 mm long GPU
- 480 mm wide, 132 mm tall, 528 mm deep (3U rack)
Complete builds
Build A: cheapest AM4, no ECC
| Component | Name | Price |
|---|---|---|
| Rack | Inter-Tech IPC 3U-3508 | 200 € |
| Motherboard | MSI B550-A PRO | 100 € |
| CPU | Ryzen 5 5500 | 80 € |
| Cooler | Noctua NH-L12Sx77 | 80 € |
| RAM | Integral 32 GB DDR4-3200 | 70 € |
| GPU | Sparkle Arc A380 GENIE | 130 € |
| PSU | be quiet! Pure Power 12M 750 W | 130 € |
| NVMe | 2 × Crucial P310 SSD 500 GB | 100 € |
| HDDs | 4 × WD Ultrastar DC HC320 8 TB | 530 € |
| Total | | 1 420 € |
Note: Ryzen 5000 series processors can support DDR4-3200 memory with two sticks but only 2 666 MHz with four.
Build B: AM4, with ECC memory
| Component | Name | Price |
|---|---|---|
| Rack | Inter-Tech IPC 3U-3508 | 200 € |
| Motherboard | ASRock B550 Pro4 | 100 € |
| CPU | Ryzen 7 5700X | 150 € |
| Cooler | Noctua NH-L12Sx77 | 80 € |
| RAM | Timetec Hynix IC DDR4-2666 2x16 GB | 100 € |
| GPU | Sparkle Arc A380 GENIE | 130 € |
| PSU | be quiet! Pure Power 12M 750 W | 130 € |
| NVMe | 2 × Crucial P310 SSD 500 GB | 100 € |
| HDDs | 4 × WD Ultrastar DC HC320 8 TB | 530 € |
| Total | | 1 520 € |
Build C: AM5, with ECC memory
| Component | Name | Price |
|---|---|---|
| Rack | Inter-Tech IPC 3U-3508 | 200 € |
| Motherboard | ASRock B850 Pro RS | 200 € |
| CPU | Ryzen 7 9700X | 280 € |
| Cooler | Noctua NH-L12Sx77 | 80 € |
| RAM | Kingston Server Premier 32 GB DDR5-5600 ECC | 200 € |
| GPU | Sparkle Arc A380 GENIE | 130 € |
| PSU | be quiet! Pure Power 12M 750 W | 130 € |
| NVMe | 2 × Crucial P310 SSD 500 GB | 100 € |
| HDDs | 4 × WD Ultrastar DC HC320 8 TB | 530 € |
| Total | | 1 850 € |
Let’s compare the three builds:
| Criteria | Build A | Build B | Build C |
|---|---|---|---|
| Total price | 1 420 € | 1 520 € | 1 850 € |
| Core / Thread | 6 / 12 | 8 / 16 | 8 / 16 |
| Single thread score | 3 059 | 3 385 | 4 656 |
| CPU score | 19 321 | 26 608 | 37 180 |
| PCIe slot | 4.0 | 4.0 | 5.0 |
| RAM frequency | 3 200 MHz | 2 666 MHz | 5 600 MHz |
| ECC | ❌ | ✅ | ✅ |
AM4 is almost 10 years old now. While the platform continues to be supported with new releases like the Ryzen 5 5500X3D in June 2025, AMD has mostly transitioned to AM5. The latter is newer, having been released in September 2022. AMD has officially committed to supporting AM5 through 2027. Thus, the newer platform is more future-proof.
The AM5 motherboard offers a PCIe 5.0 slot and M.2 Gen5, whereas AM4 is limited to PCIe 4.0 and M.2 Gen4. We also have a 2.5 Gb/s LAN port instead of a 1 Gb/s one.
We can start with 4 HDDs in RAIDZ2, which gives us 16 TB of usable storage. This can be expanded in the future to:
- 24 TB with 5 HDDs
- 32 TB with 6 HDDs
- 40 TB with 7 HDDs
- 48 TB with 8 HDDs
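These figures follow from RAIDZ2 reserving the equivalent of two disks for parity. A quick sketch of the calculation, ignoring metadata and padding overhead (so real-world numbers will be a bit lower); the function name is made up:

```shell
# Usable capacity of a single RAIDZ2 vdev: (disks - 2) * disk size.
# $1 = number of disks, $2 = size of each disk in TB
raidz2_usable_tb() {
    if [ "$1" -le 2 ]; then
        echo "RAIDZ2 needs more than 2 disks to store any data" >&2
        return 1
    fi
    echo $(( ($1 - 2) * $2 ))
}

# Example: raidz2_usable_tb 4 8   prints 16
#          raidz2_usable_tb 8 8   prints 48
```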
Core components for the backup server
I’m going to use a ZimaBlade 3760, an inexpensive single board x86 computer. The small form factor is reminiscent of the Raspberry Pi, but it has many advantages over it:
- The price is basically the same as a Raspberry Pi 4, but it’s way more powerful.
- Same power consumption at idle, slightly higher than a Raspberry Pi 4 when under load.
- x86 architecture which means it’s highly compatible.
- It has a PCIe 2.0 x4 slot, 2 SATA ports, and a DDR4 SO-DIMM slot for extensibility.
- It has a regular UEFI BIOS, which makes installing another OS easy.
So as long as the use case is a nice low-power server, this is a better choice than a Raspberry Pi. The Pi is better suited for a home automation server with its IO capabilities.
Networking and connectivity
- MikroTik Hex routers
Come back later :)
UPS
- BX1600MI-FR
Come back later :)
Setting up the physical hardware
Photos
- Take a picture of the drives before inserting them into the server. Note which bay they are in.
- I was surprised that the cooler wasn’t aligned with the CPU but offset by 0.7mm. I thought something was wrong but actually it’s because the CPU has a chiplet design with most of the heat being produced at the bottom of the chip. See this video for more details.
- BIOS Flashback: https://www.youtube.com/watch?v=fqKs9fekNNY
Drive location for the main server:
| Row | Column 1 | Column 2 |
|---|---|---|
| Row 1 | VDJR3AGK | VDJRVSZK |
| Row 2 | — | — |
| Row 3 | — | — |
| Row 4 | VDKTSX3K | VDJRUKVK |
SSD location for the main server:
- The CT500P310SSD8 is placed at the bottom right M.2 slot, right under the chipset.
- The Samsung SSD 970 EVO Plus 500GB is placed on a PCIe to M.2 adapter card.
Show diagrams of the networking and the power delivery through the UPS + smart plug.
Configuring the software
Base OS and environment
Installing Debian
Make another post on how to install and use Ventoy
- Download the netinstall image
- Burn it to a USB stick
- Connect the USB stick to the server, make sure the Ethernet cable is also connected.
- Boot into the installer. Select either “Install” or “Graphical install”.
- Language: English
- Locale: United States (en_US)
- Keymap: French
- Hostname: melchior
- Keep domain name empty
- Provide a password for the root user
- Confirm the password
- Create a new user: [USERNAME]
- Provide a password for the [USERNAME] user
- Use “Guided: use entire disk” to partition the disk
- Select the right disk to use, based on the size and name of the disk
- Select “All files in one partition (recommended for beginners)”
- Write changes to disk: yes
- Keep default settings for the installation of the packages
- Software selection: only keep “SSH server” selected
- Install GRUB boot loader to your primary drive: yes
- Select the same disk as before for the boot loader installation
- When it says “Installation complete”, remove the USB stick and press continue to reboot the server
Set static IP
nano /etc/network/interfaces
You’ll see something like this:
# The primary network interface
allow-hotplug enp6s0
iface enp6s0 inet dhcp
Replace dhcp with static and add the following lines:
# The primary network interface
allow-hotplug enp6s0
iface enp6s0 inet static
    address 192.168.70.67
    netmask 255.255.255.0
    gateway 192.168.70.1
Reboot the server to apply the changes.
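After the reboot, a quick way to confirm the address was applied is to look at the output of ip. The helper below is a hypothetical convenience wrapper around ip -o -4 addr show; the interface name and address in the usage comment are the ones from this guide.

```shell
# Succeed if the interface currently holds the expected IPv4 address.
# $1 = interface name, $2 = expected address (without the /prefix)
has_ipv4() {
    ip -o -4 addr show dev "$1" | grep -qwF "inet $2"
}

# Usage: has_ipv4 enp6s0 192.168.70.67 && echo "static IP is active"
```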
Synchronize time
apt-get install systemd-timesyncd
systemctl enable systemd-timesyncd --now
You can check it’s synchronized with:
timedatectl
Local time: Sun 2025-11-02 16:15:39 CET
Universal time: Sun 2025-11-02 15:15:39 UTC
RTC time: Sun 2025-11-02 15:15:38
Time zone: Europe/Paris (CET, +0100)
System clock synchronized: yes
NTP service: active
RTC in local TZ: no
Setup SSH and local login
SSH should already be installed and running because we selected it during the installation. But let’s secure it further.
Use a SSH key instead of a password
On your personal computer, generate a new SSH key:
ssh-keygen
When prompted for the file, save the key as ~/.ssh/id_melchior
I would also recommend using a passphrase for the key as a second factor of authentication. If your device is ever stolen, the attacker will not be able to use the key without the passphrase.
Copy the public key to the server:
ssh-copy-id -i ~/.ssh/id_melchior.pub tom@192.168.70.67
Test connecting with the key.
ssh -i ~/.ssh/id_melchior tom@192.168.70.67
If everything is working, we can now disable password authentication.
nano /etc/ssh/sshd_config
- Uncomment LoginGraceTime and set it to 20s
- Uncomment PermitRootLogin and change it to no
- Uncomment StrictModes yes
- Uncomment MaxAuthTries and set it to 3
- Uncomment PubkeyAuthentication yes
- Add AuthenticationMethods publickey right below
- Set PasswordAuthentication to no
- Keep KbdInteractiveAuthentication no
- Keep UsePAM yes
- Set X11Forwarding to no
Validate syntax first:
sshd -t
Restart the service:
systemctl restart ssh
In case something goes wrong, we want to be able to revert the changes. Only close the current SSH connection once you have validated that everything is working.
Try in a new terminal to connect without the key:
ssh tom@192.168.70.67
It should say something like:
tom@192.168.70.67: Permission denied (publickey).
Now try with the key:
ssh -i ~/.ssh/id_melchior tom@192.168.70.67
Enter the passphrase if you used one.
If it works, we are good to go. We know that password authentication is disabled and that we can still connect with the key.
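For an extra check on the server side, sshd -T prints the effective configuration (with lowercase keywords), which makes it possible to confirm that the edits were actually picked up. The helper below is a hypothetical sketch that only checks three of the settings changed in sshd_config; extend the list as needed.

```shell
# Read `sshd -T` output on stdin and fail if any expected setting is missing.
check_sshd_hardening() {
    local effective expected
    effective=$(cat)
    for expected in \
        "passwordauthentication no" \
        "permitrootlogin no" \
        "pubkeyauthentication yes"
    do
        # -x matches whole lines only, -F treats the setting as a literal string
        if ! printf '%s\n' "$effective" | grep -qxF "$expected"; then
            echo "missing: $expected" >&2
            return 1
        fi
    done
}

# Usage (as root): sshd -T | check_sshd_hardening && echo "hardening applied"
```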
To make it easier to connect, we can add a shortcut to the ~/.ssh/config file.
nano ~/.ssh/config
Add the following:
Host melchior
    Hostname 192.168.70.67
    User tom
    IdentityFile ~/.ssh/id_melchior
Now you can connect with just:
ssh melchior
Change the non-root and root passwords
passwd
Consider that you may have to enter these passwords without access to the clipboard. I would recommend using passphrases for these passwords to make entering them easier while keeping the strength high.
Containerization setup
Installing Docker
apt-get update
apt-get install ca-certificates curl
install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/debian/gpg -o /etc/apt/keyrings/docker.asc
chmod a+r /etc/apt/keyrings/docker.asc
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/debian \
$(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
tee /etc/apt/sources.list.d/docker.list > /dev/null
apt-get update
apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
You can check the status of the Docker service with:
systemctl status docker
Container orchestration
What I’m looking for in a Docker management tool/orchestrator:
- List containers, images, networks, volumes
- View utilization stats: CPU, memory, storage, network
- View logs, enter terminal inside the container when possible
- Create, view, edit projects (docker compose)
- Perform simple operations like build, start, pause, stop, restart, on containers and projects
- List and get notified of image updates
- (Nice to have) Support for multiple hosts
Come back later :)
Configuration for the main server
Install ZFS
Add backports to your sources.list:
apt install lsb-release
codename=$(lsb_release -cs); echo "deb http://deb.debian.org/debian $codename-backports main contrib non-free" | tee -a /etc/apt/sources.list
Install the packages:
apt update
apt install linux-headers-amd64
apt install -t stable-backports zfsutils-linux
Press Enter in the license agreement prompt.
Create the HDD pool
Get the list of disk IDs:
ls -l /dev/disk/by-id/
lrwxrwxrwx 1 root root 9 Sep 21 13:41 wwn-0x5000cca0bbe63bb3 -> ../../sdd
lrwxrwxrwx 1 root root 9 Sep 21 13:41 wwn-0x5000cca0bbe68f2e -> ../../sdc
lrwxrwxrwx 1 root root 9 Sep 21 13:41 wwn-0x5000cca0bbe693ad -> ../../sdb
lrwxrwxrwx 1 root root 9 Sep 21 13:41 wwn-0x5000cca0bbf58929 -> ../../sda
Determine the right ashift value:
lsblk -o NAME,PHY-SEC,SIZE,TYPE /dev/sd[a-d]
NAME PHY-SEC SIZE TYPE
sda 4096 7.3T disk
sdb 4096 7.3T disk
sdc 4096 7.3T disk
sdd 4096 7.3T disk
We see that the PHY-SEC is 4 096. This gives us ashift=12 (2^12 = 4 096).
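The ashift value is simply the base-2 logarithm of the physical sector size. If your disks report something other than 4 096, a small helper like this one (a sketch with a made-up name) does the conversion:

```shell
# Compute ashift (log2 of the physical sector size).
# $1 = physical sector size in bytes, must be a power of two
ashift_for() {
    local size=$1 shift_bits=0
    if [ "$size" -le 0 ] || [ $(( size & (size - 1) )) -ne 0 ]; then
        echo "sector size must be a power of two" >&2
        return 1
    fi
    while [ "$size" -gt 1 ]; do
        size=$(( size / 2 ))
        shift_bits=$(( shift_bits + 1 ))
    done
    echo "$shift_bits"
}

# Example: ashift_for 4096   prints 12
#          ashift_for 512    prints 9
```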
Create the RAIDZ2 pool (this will erase everything on these drives. Make sure there’s nothing you care about on them):
zpool create \
-o ashift=12 \
-O acltype=posixacl \
-O xattr=sa \
-O compression=zstd \
-O atime=off \
-O relatime=off \
-O normalization=formD \
-m /data \
data raidz2 \
/dev/disk/by-id/wwn-0x5000cca0bbe63bb3 \
/dev/disk/by-id/wwn-0x5000cca0bbe68f2e \
/dev/disk/by-id/wwn-0x5000cca0bbe693ad \
/dev/disk/by-id/wwn-0x5000cca0bbf58929
You can add -f to force the creation of the pool (it may be necessary if the disks already have a filesystem on them).
Create the SSD pool
lsblk -f
NAME FSTYPE FSVER LABEL UUID FSAVAIL FSUSE% MOUNTPOINTS
sda
├─sda1 zfs_member 5000 data 3786879799329643345
└─sda9
sdb
├─sdb1 zfs_member 5000 data 3786879799329643345
└─sdb9
sdc
├─sdc1 zfs_member 5000 data 3786879799329643345
└─sdc9
sdd
├─sdd1 zfs_member 5000 data 3786879799329643345
└─sdd9
nvme1n1
nvme2n1
├─nvme2n1p1 vfat FAT32 1083-1133 965.3M 1% /boot/efi
├─nvme2n1p2 ext4 1.0 0201fa06-8444-490c-ba9d-c926c5010a94 393.7G 4% /
└─nvme2n1p3 swap 1 5c7b6301-be42-4e3f-9365-6672853ac470 [SWAP]
nvme0n1
└─nvme0n1p1 ext4 1.0 c4a339bb-78bc-4156-9232-70e9324a1ff1
Here I can see that my 4 disks are listed as zfs_member.
The OS NVMe is nvme2n1 because that’s the one with the mounted partitions (boot and root).
So I’ll create a new pool with nvme0n1 and nvme1n1.
I can get their ids with this command:
ls -l /dev/disk/by-id/ | grep -w 'nvme[0-1]n1'
lrwxrwxrwx 1 root root 13 Oct 1 20:19 nvme-CT500P310SSD8_25164FAF1C4F -> ../../nvme0n1
lrwxrwxrwx 1 root root 13 Oct 1 20:19 nvme-CT500P310SSD8_25164FAF1C4F_1 -> ../../nvme0n1
lrwxrwxrwx 1 root root 13 Oct 1 20:19 nvme-eui.0025385691b4ebc8 -> ../../nvme1n1
lrwxrwxrwx 1 root root 13 Oct 1 20:19 nvme-eui.00a075014faf1c4f -> ../../nvme0n1
lrwxrwxrwx 1 root root 13 Oct 1 20:19 nvme-Samsung_SSD_970_EVO_Plus_500GB_S4EVNF0M698680B -> ../../nvme1n1
lrwxrwxrwx 1 root root 13 Oct 1 20:19 nvme-Samsung_SSD_970_EVO_Plus_500GB_S4EVNF0M698680B_1 -> ../../nvme1n1
For NVMe disks, it’s common to find multiple symlinks. Any of them should work. I’ll choose the human readable ones (CT500 and Samsung) without the “_1” suffix.
zpool create \
-o ashift=12 \
-O acltype=posixacl \
-O xattr=sa \
-O compression=zstd \
-O atime=off \
-O relatime=off \
-O normalization=formD \
-m /fast \
fast mirror \
/dev/disk/by-id/nvme-CT500P310SSD8_25164FAF1C4F \
/dev/disk/by-id/nvme-Samsung_SSD_970_EVO_Plus_500GB_S4EVNF0M698680B
Setup ZFS snapshots and retention policy
Install sanoid:
apt install sanoid
If it doesn’t already exist, create the sanoid folder in /etc:
mkdir /etc/sanoid
Edit the config file:
nano /etc/sanoid/sanoid.conf
[hdd]
use_template = production
[ssd]
use_template = production
[template_production]
frequently = 0 # every 15 minutes (disabled)
hourly = 24 # keep 24 hourly snapshots
daily = 7 # keep 7 daily snapshots
weekly = 4 # keep 4 weekly snapshots
monthly = 6 # keep 6 monthly snapshots
autosnap = yes
autoprune = yes
Then enable Sanoid with this command:
systemctl enable --now sanoid.timer
Check that it’s running properly with:
systemctl status sanoid.timer
List the snapshots with:
zfs list -t snapshot
Some info:
- Sanoid runs (by default) every 15 minutes via sanoid.timer.
- If autosnap is enabled, it will take snapshots periodically. The frequency is determined by the configuration (here, it will take a new snapshot every hour).
- Sanoid completely ignores manually taken snapshots when it comes to its snapshot scheduling and retention policy. It only cares about snapshots that follow its naming scheme.
- You can force Sanoid to take a snapshot right now: sanoid -c /etc/sanoid/sanoid.conf --run hdd. This one will be subject to Sanoid’s retention policy.
- You can test run the config using sanoid --debug
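To check that the retention policy behaves as configured, you can count the snapshots Sanoid keeps per period. The sketch below relies on Sanoid’s naming scheme, where automatic snapshots look like autosnap_2025-11-02_16:00:01_hourly; the function name is made up, and manually taken snapshots are ignored.

```shell
# Count Sanoid snapshots per period.
# Reads `zfs list -H -t snapshot -o name` output on stdin.
count_sanoid_snapshots() {
    # The period (hourly, daily, ...) is the last underscore-separated field.
    awk -F'_' '/@autosnap_/ { count[$NF]++ }
        END { for (p in count) print p, count[p] }' | sort
}

# Usage: zfs list -H -t snapshot -o name | count_sanoid_snapshots
```

With the template above, the counts should settle at no more than 24 hourly, 7 daily, 4 weekly, and 6 monthly snapshots per dataset.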
Other useful ZFS commands
Rename a pool
Warning: export will unmount the pool. Here I’ll rename data to hdd.
zpool export data
zpool import data hdd
Change mountpoint
zfs set mountpoint=/services ssd
Import pool from another server
If the pools were already created on another server, you can import them. First, list the pools available for import:
zpool import
pool: ssd
id: 4664139631714187601
state: ONLINE
status: The pool was last accessed by another system.
action: The pool can be imported using its name or numeric identifier and
the '-f' flag.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-EY
config:
ssd ONLINE
mirror-0 ONLINE
nvme-CT500P310SSD8_25164FAF1C4F ONLINE
nvme-Samsung_SSD_970_EVO_Plus_500GB_S4EVNF0M698680B ONLINE
pool: hdd
id: 3786879799329643345
state: ONLINE
status: The pool was last accessed by another system.
action: The pool can be imported using its name or numeric identifier and
the '-f' flag.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-EY
config:
hdd ONLINE
raidz2-0 ONLINE
wwn-0x5000cca0bbe63bb3 ONLINE
wwn-0x5000cca0bbe68f2e ONLINE
wwn-0x5000cca0bbe693ad ONLINE
wwn-0x5000cca0bbf58929 ONLINE
Create the mount points:
mkdir /data /services
Import each pool:
zpool import -f hdd
zpool import -f ssd
Check that the pools are imported:
zpool list
Upgrade a pool
When checking the status of a pool (e.g. zpool status ssd), you may see this message:
Some supported and requested features are not enabled on the pool.
The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
the pool may no longer be accessible by software that does not support
the features. See zpool-features(7) for details.
This happens because the ZFS version running on the host is newer than the one that created the pool. This is normal, and the pool continues to work because ZFS is backward compatible with older pools.
To upgrade the pool, run:
zpool upgrade ssd
This procedure is quick and safe. Only the pool metadata is upgraded, not the data. The only downside is that a system running an older version of ZFS that lacks the newly enabled features will not be able to import the pool.
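Before upgrading, it can be useful to see which feature flags would actually change. The sketch below filters the output of zpool get for features still marked as disabled. disabled_features is a hypothetical helper name, and it assumes the tab-separated output of the scripted -H mode.

```shell
# Print the feature flags that are still disabled on a pool.
# Reads "property<TAB>value" lines on stdin, as printed by:
#   zpool get -H -o property,value all ssd
# disabled_features is a hypothetical helper, not a zpool subcommand.
disabled_features() {
    awk -F'\t' '
        $1 ~ /^feature@/ && $2 == "disabled" {
            sub(/^feature@/, "", $1)   # strip the prefix, keep the feature name
            print $1
        }
    '
}
```

If the list is empty, the pool already has every feature supported by the installed ZFS version enabled and there is nothing to upgrade.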
JavaScript
apt install nodejs npm
Integrate NVM in the mix
PM2 process manager
Install:
npm install -g pm2
Logrotate is a module for PM2 that automatically manages and rotates log files to prevent them from consuming too much disk space.
pm2 install pm2-logrotate
To preserve the running services after a reboot:
pm2 startup
pm2 save
Run pm2 save after adding or removing services to preserve the configuration after a reboot.
Other useful tools
To monitor system resources:
apt install lm-sensors smartmontools btop
To easily explore the filesystem and see what’s taking up space:
apt install ncdu
Configuration for the bastion server
- Follow the Base OS and environment section
- Follow the Containerization setup section
- No need for ZFS on the bastion server
Install Pangolin
curl -fsSL https://digpangolin.com/get-installer.sh | bash
./installer
Do you want to install Pangolin as a cloud-managed (beta) node? (yes/no): no
Enter your base domain (no subdomain e.g. example.com): barillot.net
Enter the domain for the Pangolin dashboard (default: pangolin.barillot.net):
Enter email for Let's Encrypt certificates: letsencrypt@barillot.net
Do you want to use Gerbil to allow tunneled connections (yes/no) (default: yes):
=== Email Configuration ===
Enable email functionality (SMTP) (yes/no) (default: no):
=== Advanced Configuration ===
Is your server IPv6 capable? (yes/no) (default: yes):
=== Generating Configuration Files ===
Configuration files created successfully!
=== Starting installation ===
Would you like to install and start the containers? (yes/no) (default: yes):
Would you like to run Pangolin as Docker or Podman containers? (default: docker):
Add captcha
We will configure the Traefik CrowdSec bouncer plugin to serve a Cloudflare Turnstile CAPTCHA challenge instead of a plain 403 when CrowdSec issues a captcha decision.
Get Turnstile API keys
- Log in to dash.cloudflare.com
- In the left sidebar, go to Turnstile
- Click Add widget
- Give it a name and add your domain(s)
- Copy the Site Key and Secret Key
Download the captcha HTML template
The bouncer plugin needs an HTML file to render the Turnstile widget. Download it into your Traefik config directory:
curl -o ./config/traefik/captcha.html \
https://raw.githubusercontent.com/maxlerebourg/crowdsec-bouncer-traefik-plugin/main/examples/captcha/captcha.html
Mount the HTML file into the Traefik container
In your docker-compose.yml, add a volume mount to the traefik service:
traefik:
  volumes:
    - ./config/traefik:/etc/traefik:ro
    - ./config/letsencrypt:/letsencrypt
    - ./config/traefik/logs:/var/log/traefik
    - ./config/traefik/captcha.html:/captcha.html
Restart the container:
docker compose down && docker compose up
Add captcha settings to the CrowdSec middleware
In ./config/traefik/dynamic_config.yml, add the following lines to the crowdsec plugin block:
http:
  middlewares:
    crowdsec:
      plugin:
        crowdsec:
          # ... your existing config ...
          captchaProvider: turnstile
          captchaSiteKey: YOUR_SITE_KEY
          captchaSecretKey: YOUR_SECRET_KEY
          captchaGracePeriodSeconds: 1800
          captchaHTMLFilePath: /captcha.html
Test it
Add a temporary captcha decision against your own IP:
docker exec crowdsec cscli decisions add --ip YOUR_IP --duration 2m --type captcha --reason "testing turnstile"
Configuration for the backup server
- Follow the Base OS and environment section
- Follow the Containerization setup section
Install ZFS
Add backports to your sources.list:
apt install lsb-release
codename=$(lsb_release -cs); echo "deb http://deb.debian.org/debian $codename-backports main contrib non-free" | tee -a /etc/apt/sources.list
Install the packages:
apt update
apt install linux-headers-amd64
apt install -t stable-backports zfsutils-linux
Press Enter in the license agreement prompt.
Create the backup pool
Get the list of disk IDs:
ls -l /dev/disk/by-id/
total 0
lrwxrwxrwx 1 root root 9 Mar 25 20:12 ata-Samsung_SSD_860_EVO_250GB_S3YJNX0M510252Z -> ../../sda
lrwxrwxrwx 1 root root 10 Mar 25 20:12 ata-Samsung_SSD_860_EVO_250GB_S3YJNX0M510252Z-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Mar 25 20:12 ata-Samsung_SSD_860_EVO_250GB_S3YJNX0M510252Z-part2 -> ../../sda2
lrwxrwxrwx 1 root root 10 Mar 25 20:12 ata-Samsung_SSD_860_EVO_250GB_S3YJNX0M510252Z-part3 -> ../../sda3
lrwxrwxrwx 1 root root 9 Mar 25 20:12 ata-WDC_WD200EDGZ-11CNKA0_SCH0RX2S -> ../../sdb
lrwxrwxrwx 1 root root 10 Mar 25 20:12 ata-WDC_WD200EDGZ-11CNKA0_SCH0RX2S-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 13 Mar 25 20:12 mmc-C9A551_0xec1635d0 -> ../../mmcblk0
lrwxrwxrwx 1 root root 15 Mar 25 20:12 mmc-C9A551_0xec1635d0-part1 -> ../../mmcblk0p1
lrwxrwxrwx 1 root root 15 Mar 25 20:12 mmc-C9A551_0xec1635d0-part2 -> ../../mmcblk0p2
lrwxrwxrwx 1 root root 15 Mar 25 20:12 mmc-C9A551_0xec1635d0-part3 -> ../../mmcblk0p3
lrwxrwxrwx 1 root root 9 Mar 25 20:12 usb-WD_Elements_25A3_5343483052583253-0:0 -> ../../sdb
lrwxrwxrwx 1 root root 10 Mar 25 20:12 usb-WD_Elements_25A3_5343483052583253-0:0-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 9 Mar 25 20:12 wwn-0x5000cca425ce6d7d -> ../../sdb
lrwxrwxrwx 1 root root 10 Mar 25 20:12 wwn-0x5000cca425ce6d7d-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 9 Mar 25 20:12 wwn-0x5002538e40fbbf28 -> ../../sda
lrwxrwxrwx 1 root root 10 Mar 25 20:12 wwn-0x5002538e40fbbf28-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Mar 25 20:12 wwn-0x5002538e40fbbf28-part2 -> ../../sda2
lrwxrwxrwx 1 root root 10 Mar 25 20:12 wwn-0x5002538e40fbbf28-part3 -> ../../sda3
We are going to use the external HDD. I could use usb-WD_Elements_25A3_5343483052583253-0:0 or wwn-0x5000cca425ce6d7d.
zpool create \
-o ashift=12 \
-O acltype=posixacl \
-O xattr=sa \
-O compression=zstd \
-O atime=off \
-O relatime=off \
-O normalization=formD \
-m /backup \
backup \
/dev/disk/by-id/wwn-0x5000cca425ce6d7d
Setup snapshots replication and retention policy
For this section, we will assume that both the main server and the backup server are initially running on the same local network. We will then use Pangolin to tunnel the SSH connection between the two servers, making remote backups possible. It should be possible to set everything up remotely from the start if you configure the pangolin-cli first.
On the main server, create a new user:
adduser --disabled-password --gecos "Syncoid replication user" syncoid
Give this user the permissions required to replicate the ZFS pools:
zfs allow syncoid send,snapshot,hold hdd
zfs allow syncoid send,snapshot,hold ssd
On the backup server, create a new SSH key for the syncoid user (don’t set a passphrase):
ssh-keygen -f /root/.ssh/id_syncoid
Because we already blocked password authentication, we can’t use the ssh-copy-id command.
Instead we can manually copy the public key to the main server:
cat /root/.ssh/id_syncoid.pub
Paste it on the main server:
mkdir -p /home/syncoid/.ssh
nano /home/syncoid/.ssh/authorized_keys
Set the permissions:
chown -R syncoid:syncoid /home/syncoid/.ssh
chmod 700 /home/syncoid/.ssh
chmod 600 /home/syncoid/.ssh/authorized_keys
At this point, you should be able to connect to the main server from the backup server as the syncoid user:
ssh -i /root/.ssh/id_syncoid syncoid@192.168.70.67 zfs list
Let’s add this connection to the ~/.ssh/config file:
nano ~/.ssh/config
Add the following:
Host melchior
  Hostname 192.168.70.67
  User syncoid
  IdentityFile ~/.ssh/id_syncoid
Now you can connect with just:
ssh melchior
Install Sanoid and Syncoid on the backup server
Install sanoid:
apt install sanoid
If it doesn’t already exist, create the sanoid folder in /etc:
mkdir -p /etc/sanoid
Edit the config file:
nano /etc/sanoid/sanoid.conf
[backup/hdd]
use_template = backup
[backup/ssd]
use_template = backup
[template_backup]
hourly = 0
daily = 0
weekly = 12
monthly = 24
yearly = 6
autosnap = no
autoprune = yes
Note the autosnap = no: no snapshots are taken automatically on the backup server. The pool only holds snapshots replicated from the main server.
But we will still use Sanoid for its retention policy.
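To get a feel for what a retention template implies, the counts can simply be added up: they give an upper bound on the number of Sanoid-named snapshots kept per dataset. The sketch below does that for one template. max_snapshots is a hypothetical helper name, and it assumes the "key = value" layout with spaces around the equals sign used in the config shown here.

```shell
# Sum the retention counts of one Sanoid template.
# Reads a sanoid.conf on stdin; $1 is the section name without brackets.
# max_snapshots is a hypothetical helper, not a Sanoid command.
max_snapshots() {
    awk -v want="[$1]" '
        # A new section starts: remember whether it is the one we want
        /^\[/ { in_section = ($1 == want); next }
        # Inside the wanted section, add up the per-period retention counts
        in_section && $1 ~ /^(frequently|hourly|daily|weekly|monthly|yearly)$/ && $2 == "=" {
            total += $3
        }
        END { print total + 0 }
    '
}
```

For the backup template above, max_snapshots template_backup < /etc/sanoid/sanoid.conf adds 12 weekly, 24 monthly and 6 yearly snapshots.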
Then enable Sanoid with this command:
systemctl enable --now sanoid.timer
Check that it’s running properly with:
systemctl status sanoid.timer
You can now try to run the replication process manually (this can take a while if the pools are large):
syncoid --recursive --no-privilege-elevation --no-sync-snap melchior:ssd backup/ssd
syncoid --recursive --no-privilege-elevation --no-sync-snap melchior:hdd backup/hdd
Automating the replication process
On the backup server, create a script:
nano /root/run_syncoid.sh
#!/bin/bash
set -e
echo "[1/4] Reconnect the external HDD"
zpool import backup
echo "[2/4] Replicate the snapshots from the main server"
syncoid --recursive --no-privilege-elevation --no-sync-snap melchior:ssd backup/ssd
syncoid --recursive --no-privilege-elevation --no-sync-snap melchior:hdd backup/hdd
echo "[3/4] Prune old/unwanted local snapshots"
sanoid --cron --verbose
echo "[4/4] Disconnect the Zpool and make the external HDD enter sleep mode"
zpool export backup
hdparm -y /dev/disk/by-id/wwn-0x5000cca425ce6d7d
Let’s create the systemd services and timers:
nano /etc/systemd/system/syncoid.service
[Unit]
Description=Syncoid backup to external drive
After=network.target
[Service]
Type=oneshot
ExecStart=/root/run_syncoid.sh
User=root
StandardOutput=journal
StandardError=journal
nano /etc/systemd/system/syncoid.timer
[Unit]
Description=Run Syncoid backup weekly
[Timer]
OnCalendar=weekly
Persistent=true
[Install]
WantedBy=timers.target
And finally enable and start the services and timers:
systemctl daemon-reload
systemctl enable syncoid.service
systemctl enable --now syncoid.timer
Tunneled SSH connection
Because we want offsite backups, we need to tunnel the SSH connection between the main server and the backup server. We already have Pangolin, so we can use it to tunnel the SSH connection.
In the Pangolin dashboard, go to Networks -> Clients -> Machines -> Create client. Give it a name, and select Docker as the “Operating system”. Copy the “Commands” and paste them in a docker-compose.yml file on the backup server.
Confirm the creation on the Pangolin dashboard and then run the container:
docker compose up -d
The client should now be reported as “Connected”.
Now let’s create a Private Resource to allow the backup server to access the main server via SSH. In the Pangolin dashboard, go to Networks -> Resources -> Private -> Add resource.
Give it a name, select the main server site.
Because Newt is running in a Docker container, the destination needs to be host.docker.internal.
Give it an Alias even if it will not be used (e.g. melchior.internal).
In Port Restrictions, use TCP Custom 22, UDP Blocked. You can keep ICMP.
In the Access Policy tab, select the Machine Client we created earlier.
After saving, toggle the Alias Address column to see the IP address to use to connect to the main server.
Copy it and try connecting to the main server via SSH from the backup server:
ssh 100.96.128.8
If it asks for a password, or gives a normal SSH response, this is a good sign.
Edit /root/.ssh/config and replace the local Hostname with the tunnel address (keep a single Hostname line, since ssh uses the first value it finds):
Host melchior
  Hostname 100.96.128.8
  User syncoid
  IdentityFile ~/.ssh/id_syncoid
This should now work remotely, even if the main server is not on the same network as the backup server:
syncoid --recursive --no-privilege-elevation --no-sync-snap melchior:ssd backup/ssd
syncoid --recursive --no-privilege-elevation --no-sync-snap melchior:hdd backup/hdd
Hardening the SSH connection
Great, but in case the backup server is compromised, we want to make sure this SSH connection cannot be used to execute arbitrary commands on the main server. Let’s first log which commands are required by syncoid to run properly:
nano /home/syncoid/.ssh/authorized_keys
Add the following:
command="echo \"$SSH_ORIGINAL_COMMAND\" >> /home/syncoid/ssh_commands.log && exec /bin/bash -c \"$SSH_ORIGINAL_COMMAND\"",
Run syncoid again from the backup server and check the log file:
cat /home/syncoid/ssh_commands.log
exit
echo -n
command -v lzop
command -v mbuffer
zpool get -o value -H feature@extensible_dataset 'hdd'
zfs list -o name,origin -t filesystem,volume -Hr 'hdd'
zfs get -H syncoid:sync 'hdd'
zfs get -Hpd 1 -t snapshot guid,creation 'hdd'
zfs send -nvP -I 'hdd@autosnap_2026-03-23_23:30:27_weekly' 'hdd@autosnap_2026-03-28_10:00:10_hourly'
zfs send -I 'hdd'@'autosnap_2026-03-23_23:30:27_weekly' 'hdd'@'autosnap_2026-03-28_10:00:10_hourly' | lzop | mbuffer -q -s 128k -m 16M
We will now create a filter script to block all commands except the ones required by syncoid.
nano /home/syncoid/ssh_filter.sh
#!/bin/bash
set -e
LOGFILE="/home/syncoid/ssh_commands.log"
CMD="${SSH_ORIGINAL_COMMAND#"${SSH_ORIGINAL_COMMAND%%[![:space:]]*}"}"
log() {
echo "$(date '+%Y-%m-%d %H:%M:%S') [$1] $CMD" >> "$LOGFILE"
}
if [ -z "$CMD" ]; then
echo "No command provided" >&2
exit 1
fi
case "$CMD" in
# Connection checks
"exit")
log "ALLOWED"
exit 0
;;
"echo -n")
log "ALLOWED"
echo -n
exit 0
;;
# Tool availability checks
"command -v lzop"|"command -v mbuffer")
log "ALLOWED"
exec /bin/bash -c "$CMD"
;;
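# ZFS/zpool commands, optionally piped into lzop and mbuffer.
# Matching on the prefix alone would also accept "zfs list; some-other-command",
# so reject shell metacharacters (none of the commands logged above need them)
# and check that every stage of the pipeline is an expected program.
"zfs "*|"zpool "*)
  case "$CMD" in
  *";"*|*"&"*|*'`'*|*'$'*|*"<"*|*">"*|*$'\n'*)
    log "BLOCKED"
    echo "Command not allowed: $CMD" >&2
    exit 1
    ;;
  esac
  IFS='|' read -r -a STAGES <<< "$CMD"
  for STAGE in "${STAGES[@]}"; do
    STAGE="${STAGE#"${STAGE%%[![:space:]]*}"}"
    case "$STAGE" in
    "zfs "*|"zpool "*|"lzop"|"lzop "*|"mbuffer "*) ;;
    *)
      log "BLOCKED"
      echo "Command not allowed: $CMD" >&2
      exit 1
      ;;
    esac
  done
log "ALLOWED"
exec /bin/bash -c "$CMD"
;;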
*)
log "BLOCKED"
logger -t syncoid-ssh "BLOCKED command: $CMD"
echo "Command not allowed: $CMD" >&2
exit 1
;;
esac
Make sure the script cannot be edited by the syncoid user, but is still executable:
chown root:root /home/syncoid/ssh_filter.sh
chmod 755 /home/syncoid/ssh_filter.sh
Edit the authorized keys file again:
nano /home/syncoid/.ssh/authorized_keys
Replace with the following:
command="~/ssh_filter.sh",
On the backup server, run syncoid again and check if everything is still working.
systemctl start syncoid.service
You can check the logs with:
journalctl -u syncoid
Scrubbing the external HDD
Because the external HDD is only running when the replication process is running, it will never be available for scrubbing. Let’s schedule a monthly scrub of the pool.
On the backup server, create a script:
nano /root/run_scrub.sh
#!/bin/bash
set -e
echo "[1/4] Reconnect the external HDD"
zpool import backup
echo "[2/4] Scrub the Zpool"
zpool scrub backup
echo "[3/4] Wait for scrub to complete"
while zpool status backup | grep -q "scrub in progress"; do
sleep 30
done
echo "[4/4] Disconnect the Zpool and make the external HDD enter sleep mode"
zpool export backup
hdparm -y /dev/disk/by-id/wwn-0x5000cca425ce6d7d
Let’s create the systemd services and timers:
nano /etc/systemd/system/scrub.service
[Unit]
Description=ZFS scrub on backup pool
After=network.target
[Service]
Type=oneshot
ExecStart=/root/run_scrub.sh
User=root
StandardOutput=journal
StandardError=journal
nano /etc/systemd/system/scrub.timer
[Unit]
Description=Run ZFS scrub monthly
[Timer]
OnCalendar=monthly
Persistent=true
[Install]
WantedBy=timers.target
And finally enable and start the services and timers:
systemctl daemon-reload
systemctl enable scrub.service
systemctl enable --now scrub.timer
Upgrading Newt
Because the backup server is remote and we can only access it via the Newt tunnel, updating Newt could amount to “cutting the branch on which one is sitting”. If we simply run
docker compose pull
docker compose down && docker compose up
it will stop the Newt tunnel and the SSH connection will drop mid-command.
The docker compose up will not be executed.
There are a few solutions to this problem. We will use a simple systemd oneshot service to run the update process:
nano /etc/systemd/system/upgrade-newt.service
[Unit]
Description=Upgrade Newt
[Service]
WorkingDirectory=/services/newt
Type=oneshot
ExecStart=docker compose pull
ExecStart=docker compose down
ExecStart=docker compose up -d
Whenever a new version is available, you can run:
systemctl start upgrade-newt
Unfortunately, you will still be disconnected:
Read from remote host 100.96.128.8: Connection reset by peer
Connection to 100.96.128.8 closed.
client_loop: send disconnect: Broken pipe
But you should be able to reconnect immediately. After you do, you can check that everything worked properly by checking the logs:
journalctl --since "1 hour ago" -u upgrade-newt
Deploying services
Come back later :)
Troubleshooting
DNS resolution issues
Apparently at startup, if the container starts before the host’s DNS resolution is ready, it will fail, and sometimes it’s unrecoverable unless you restart the container. To fix this, you can tell docker to use the host’s DNS servers by adding the following to the docker compose file:
volumes:
  - /etc/resolv.conf:/etc/resolv.conf:ro
Alternatively, you can also set DNS servers directly in the container’s network settings:
dns:
  - 1.1.1.1
  - 8.8.8.8
Make sure to restart the container after adding the DNS servers.
docker compose down && docker compose up
Hardening
Come back later :)
services:
  service-name:
    image: image-name:latest
    container_name: service-name
    restart: unless-stopped
    user: "1000:1000"
    mem_limit: 512m
    cpus: 0.5
    pids_limit: 100
    read_only: true
    tty: false
    stdin_open: false
    cap_drop:
      - ALL
    security_opt:
      - no-new-privileges:true
    tmpfs:
      - /tmp:rw,noexec,nosuid,nodev,size=128m,uid=1000,gid=1000,mode=1777
    volumes:
      - ./volume-name:/volume-name
    env_file: .env
    logging:
      driver: loki
      options:
        loki-url: "http://localhost:3100/loki/api/v1/push"
        loki-retries: 2
        loki-max-backoff: 800ms
        loki-timeout: 1s
        keep-file: "true"
        mode: "non-blocking"
    networks:
      - pangolin_net
networks:
  internal_net:
    name: internal_net
    driver: bridge
    internal: true
  pangolin_net:
    external: true
About tmpfs params:
- noexec: Prevents execution of binaries.
- nosuid: Prevents escalation of privileges.
- nodev: Prevents character/block devices from being created.
- size: Limits the maximum size, which helps prevent a process from filling up the host memory.
Users, groups, and permissions
For each service, you should create a dedicated user:
groupadd --gid 2000 service-name
useradd --no-create-home --shell /usr/sbin/nologin --uid 2000 --gid 2000 service-name
Structure and permissions:
/services/service-name/
├── docker-compose.yml # owned by root:root and chmod 640
├── .env # optional, owned by root:root and chmod 600
└── mounts/ # optional, owned by root:root and chmod 750
    └── mounts-name/ # owned by service-name:service-name and chmod 750
And then be careful with the permissions within the mounts themselves.
To set all files to 640 and directories to 750 recursively from the current directory, you can use the following as root (the first command briefly strips the execute bit from directories, which would stop a regular user from descending into them):
chmod -R 640 .
chmod -R u+X,g+X .
Resource sharing
You can check the resource usage of the containers with:
docker stats
Nginx
server {
    listen 80;
    server_name _;
    root /usr/share/nginx/html;
    index index.html;
    location / {
        try_files $uri $uri/ =404;
    }
}
tmpfs:
  - /tmp:rw,noexec,nosuid,nodev,size=128m,uid=1000,gid=1000,mode=1777
  - /var/cache/nginx:rw,noexec,nosuid,nodev,size=128m,uid=1000,gid=1000,mode=0755
  - /var/run:rw,nosuid,nodev,size=128m,uid=1000,gid=1000,mode=0755
volumes:
  - ./default.conf:/etc/nginx/conf.d/default.conf:ro
Monitoring and alerting
The Prometheus/Grafana/Loki stack
Come back later :)
Configure Loki with Docker plugin
docker plugin install grafana/loki-docker-driver:3.3.2-amd64 --alias loki --grant-all-permissions
Add the following to each docker compose service:
logging:
  driver: loki
  options:
    loki-url: "http://localhost:3100/loki/api/v1/push"
    loki-retries: 2
    loki-max-backoff: 800ms
    loki-timeout: 1s
    keep-file: "true"
    mode: "non-blocking"
Avoid logging Loki or Grafana’s containers as it can create a feedback loop.
Sensors and IoT
- SONOFF Zigbee 3.0 USB Dongle Plus MG24, Gateway with EFR32MG24
- SONOFF S60ZBTPF Zigbee Smart Plug
- SONOFF SNZB-02D Mini ZigBee Smart Temperature Humidity Sensor
You should see the dongle listed in:
ls -l /dev/serial/by-id
lrwxrwxrwx 1 root root 13 Mar 15 12:22 usb-SONOFF_SONOFF_Dongle_Plus_MG24_a4df9b88eba2ef11a7a7906661ce3355-if00-port0 -> ../../ttyUSB0
Create a docker-compose.yml file:
services:
  mosquitto:
    image: eclipse-mosquitto
    container_name: mosquitto
    restart: unless-stopped
    ports:
      - 1883:1883
    volumes:
      - ./mosquitto:/mosquitto/data
    networks:
      - zigbee_net
  zigbee2mqtt:
    container_name: zigbee2mqtt
    image: ghcr.io/koenkk/zigbee2mqtt
    restart: unless-stopped
    depends_on:
      - mosquitto
    volumes:
      - ./mounts/data:/app/data
      - /run/udev:/run/udev:ro
    ports:
      - 8080:8080
    environment:
      - TZ=Europe/Paris
    devices:
      - /dev/serial/by-id/usb-SONOFF_SONOFF_Dongle_Plus_MG24_a4df9b88eba2ef11a7a7906661ce3355-if00-port0:/dev/zigbee
    networks:
      - pangolin_net
      - zigbee_net
networks:
  zigbee_net:
    name: zigbee_net
    driver: bridge
    internal: true
  pangolin_net:
    external: true
Connect to the frontend
- Select the dongle in the “Found Devices” dropdown
Documentation
Labeling the drives
In the event of a drive failure, it’s important to be able to identify the drive quickly. If you accidentally replace the wrong drive, you could end up with data loss. When a drive fails, ZFS will point out the faulty drive by its ID. So we need to match the ID with the drive serial number and location in the server.
zpool status
pool: hdd
state: ONLINE
status: Some supported and requested features are not enabled on the pool.
The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
the pool may no longer be accessible by software that does not support
the features. See zpool-features(7) for details.
scan: scrub repaired 0B in 02:06:26 with 0 errors on Sun Mar 8 02:30:27 2026
config:
NAME STATE READ WRITE CKSUM
hdd ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
wwn-0x5000cca0bbe63bb3 ONLINE 0 0 0
wwn-0x5000cca0bbe68f2e ONLINE 0 0 0
wwn-0x5000cca0bbe693ad ONLINE 0 0 0
wwn-0x5000cca0bbf58929 ONLINE 0 0 0
errors: No known data errors
pool: ssd
state: ONLINE
status: Some supported and requested features are not enabled on the pool.
The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
the pool may no longer be accessible by software that does not support
the features. See zpool-features(7) for details.
scan: scrub repaired 0B in 00:00:35 with 0 errors on Sun Mar 8 00:24:44 2026
config:
NAME STATE READ WRITE CKSUM
ssd ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
nvme-CT500P310SSD8_25164FAF1C4F ONLINE 0 0 0
nvme-Samsung_SSD_970_EVO_Plus_500GB_S4EVNF0M698680B ONLINE 0 0 0
Let’s check the serial numbers of the drives:
apt-get install hdparm
hdparm -I /dev/disk/by-id/wwn-0x5000cca0bbe63bb3 | grep Number:
For NVMe you can use:
smartctl -a /dev/disk/by-id/nvme-CT500P310SSD8_25164FAF1C4F | grep Number:
So now we can match the ID with the drive serial number.
| Zpool | ID | Model Number | Serial Number |
|---|---|---|---|
| hdd | wwn-0x5000cca0bbe63bb3 | HGST HUS728T8TALE604 | VDJR3AGK |
| hdd | wwn-0x5000cca0bbe68f2e | HGST HUS728T8TALE604 | VDJRUKVK |
| hdd | wwn-0x5000cca0bbe693ad | HGST HUS728T8TALE604 | VDJRVSZK |
| hdd | wwn-0x5000cca0bbf58929 | HGST HUS728T8TALE604 | VDKTSX3K |
| ssd | nvme-CT500P310SSD8… | CT500P310SSD8 | 25164FAF1C4F |
| ssd | nvme-Samsung_SSD_970_EVO… | Samsung SSD 970 EVO Plus 500GB | S4EVNF0M698680B |
To match the serial number with the location in the server, we need to carefully note in which bay the drive was inserted.
| Row | Column 1 | Column 2 |
|---|---|---|
| Row 1 | VDJR3AGK | VDJRVSZK |
| Row 2 | — | — |
| Row 3 | — | — |
| Row 4 | VDKTSX3K | VDJRUKVK |
And finally, we can do the same for the SSDs:
- The CT500P310SSD8 is placed at the bottom right M.2 slot, right under the chipset.
- The Samsung SSD 970 EVO Plus 500GB is placed on a PCIe to M.2 adapter card.
Now if a disk fails, let’s say wwn-0x5000cca0bbe63bb3, we will be able to know that its serial number is VDJR3AGK and it is located in the first row, first column. We can safely replace it with the spare disk and start the resilvering process.
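Filling in the ID to serial number table by hand is error-prone, so the lookup can be scripted. The sketch below pulls a single field out of the hdparm -I or smartctl -a output used above. drive_field is a hypothetical helper name, and it assumes the "Field Name: value" layout both tools print for the model and serial number.

```shell
# Extract one field (e.g. "Serial Number") from `hdparm -I` or `smartctl -a`
# output supplied on stdin. drive_field is a hypothetical helper name.
drive_field() {
    # $1 is the exact field label, without the trailing colon
    awk -v f="$1" -F':' '
        {
            key = $1
            gsub(/^[ \t]+|[ \t]+$/, "", key)             # trim the label
            if (key == f) {
                val = substr($0, index($0, ":") + 1)      # everything after the first colon
                gsub(/^[ \t]+|[ \t]+$/, "", val)          # trim the value
                print val
                exit
            }
        }
    '
}
```

On the server, hdparm -I /dev/disk/by-id/wwn-0x5000cca0bbe63bb3 | drive_field 'Serial Number' would print the serial for that ID, and the same call can be repeated in a loop over each disk of the pool.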
Keeping a list of your service users and groups
Keep track of the users and groups you create for your services. If you have to recreate the server, the UID and GID will need to be exactly the same for the services to work. That’s why it’s important to explicitly set the UID and GID when creating the users and groups.
groupadd --gid 2000 paperless
groupadd --gid 2001 vaultwarden
groupadd --gid 2002 sure
groupadd --gid 2003 mqtt
useradd --no-create-home --shell /usr/sbin/nologin --uid 2000 --gid 2000 paperless
useradd --no-create-home --shell /usr/sbin/nologin --uid 2001 --gid 2001 vaultwarden
useradd --no-create-home --shell /usr/sbin/nologin --uid 2002 --gid 2002 sure
useradd --no-create-home --shell /usr/sbin/nologin --uid 2003 --gid 2003 mqtt
usermod -aG dialout mqtt
Showcase
Add photos and tests that confirm if the homelab is reaching my initial goals.
Conclusion
Somewhere, talk about NAT loopback / hairpin NAT / NAT reflection.
Reflections
The ASRock AM5 fiasco
So, about a month after I built the main server, I decided to reinstall the OS to make sure every step was perfectly documented. Unfortunately, the server never POSTed after that. The POST status checker LEDs for the CPU and DRAM were constantly on. I tried all the usual troubleshooting steps:
- unplugging the PSU and waiting a few minutes
- clearing the CMOS
- reflashing the BIOS
- trying another PSU
- taking the motherboard out of the case
- trying a minimal setup with only one stick of RAM, the PSU, CPU and motherboard
At that point I grew desperate. It was clear either the CPU, motherboard or RAM was faulty. I even took it to a local repair shop to get it checked. They didn’t go to great lengths to diagnose the issue, but on the other hand, they also didn’t charge me anything: so it’s hard to complain. Their conclusion was that the motherboard was at fault. I was also of the opinion that a motherboard failure was more likely.
After that, I resorted to contacting ASRock support. They told me to try isolating the faulty component by trying another known good CPU and RAM on the motherboard. Ideally yes, that’s also something I wanted to do, but this is my first AM5 system ever. And I wasn’t going to buy another CPU and RAM just to test it. They recommended me to contact the retailer I bought the motherboard from, which was Amazon.
I went to Amazon’s customer service to ask for a replacement motherboard. Unfortunately, the new motherboard was still not working. I still had the exact same symptoms.
In conclusion, the motherboard was not the issue. Could it be the CPU? The RAM? At that point, I finally decided to check online if anyone else was having the same issue with their AM5 system.
What I found was that I had been living under a rock. Since early 2025, there have been reports of Ryzen 9000 series CPUs failing, especially on ASRock motherboards. There is a megathread on ASRock’s subreddit that nicely summarizes the situation:
- Since February 2025, pictures of scorched CPU pins and motherboard sockets have been posted online. The initial investigation suggested that only X3D CPUs and ASRock motherboards were affected. ASRock initially blamed the issue on user error or aggressive overclocking, but later made a statement that overvoltage due to bad PBO implementations could be the cause. They later released a few BIOS updates to address the issue.
- Unfortunately, since then, users still reported cases of CPUs failing on BIOS updates released after the initial investigation. Meaning the root cause is still not identified.
- Besides the scorched pins, the reported symptoms match my case perfectly: the system was working fine until a reboot. And since then, the POST status checker LEDs for the CPU and DRAM were constantly on. Seemingly, the motherboard is not damaged, it’s always a CPU failure.
- A Korean user reported having the same problem with a B850 Pro RS motherboard and a grand total of 3 Ryzen 7 9700X CPUs: so it seems the problem is not a rare defect in the CPU, but a systemic issue with the motherboard.
- Asus also had similar reports, and made an official statement about the issue.
- There have been a few cases of CPU failures on other motherboard brands, and even on the 7000-series CPUs. But the low report rate could indicate that these failures are within a normal defect rate.
In conclusion, even a year later, ASRock has not identified the root cause of the issue. I initially chose ASRock because I’ve had good experiences with their products in the past and—along with Asus—they are one of the few manufacturers that support ECC memory on consumer-grade motherboards.
While the failure rate is low, it’s still unacceptable that hardware costing hundreds of dollars could just burn or stop POSTing randomly. I’m not inclined to be ASRock’s beta tester for this long-standing issue.
Also I’m a little salty that the ASRock technician responding to my ticket didn’t mention any of this to me. The symptoms I described match almost perfectly with the other users’ reports. I would have saved myself the hassle of getting a replacement motherboard just to send it back afterwards. Good thing Amazon’s return policy is pretty generous.
Stay away from 9000-series CPUs until reports of the issue are cleared. If you must buy one, avoid ASRock motherboards and Asus motherboards. Keep your motherboard BIOS up to date.
So what about the homelab? After this catastrophic failure, I ended up cannibalizing my gaming PC to keep the homelab running. Now the server is running on a Ryzen 5 5500 CPU and an Asus TUF Gaming B550-PLUS motherboard.
Other things
- Overkill much
- ECC seems overkill
- Useless second PCI-e slot because of the PSU
- A slightly longer case would make building it easier and could provide a better airflow.
Future improvements
- Main server
- Upgrade to 64 GB of RAM
- Add a 5.5” AMOLED touch screen in the full-height 5.5” bay in the front of the rack mounted case
- Backup server
- Use ECC memory. This would require a new motherboard and memory sticks. The CPU already supports ECC.
The end is never the end is never…
- Experiment with NixOS
- Backup the password database to a USB drive securely