this post was submitted on 01 Aug 2024

Stop services while creating snapshots during backup? (lemmy.ca)

submitted 3 months ago* (last edited 3 months ago) by [email protected] to c/selfhosted


It's fairly obvious why stopping a service while backing it up makes sense. Imagine backing up Immich while it's running. You start the backup, db is backed up, now image assets are being copied. That could take an hour. While the assets are being backed up, a new image is uploaded. The live database knows about it but the one you've backed up doesn't. Then your backup process reaches the new image asset and it copies it. If you restore this backup, Immich will contain an asset that isn't known by the database. In order to avoid scenarios like this, you'd stop Immich while the backup is running.
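
To make the failure mode concrete, here is a toy simulation of it. It is only a sketch: the file names are made up, and the sets stand in for the database rows and the asset directory.

```python
def naive_backup(db_rows, asset_files, upload_during_copy):
    """Copy the DB first, then the assets, while the service keeps running."""
    db_backup = set(db_rows)  # step 1: database copied at time T0
    # The service is still live: a new upload lands after the DB copy.
    live_assets = set(asset_files) | {upload_during_copy}
    asset_backup = set(live_assets)  # step 2: assets copied at time T1 > T0
    return db_backup, asset_backup


def orphaned_assets(db_backup, asset_backup):
    """Assets present in the backup that the backed-up DB does not know about."""
    return sorted(asset_backup - db_backup)


db, assets = naive_backup({"a.jpg", "b.jpg"}, {"a.jpg", "b.jpg"}, "c.jpg")
print(orphaned_assets(db, assets))  # prints ['c.jpg']
```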

Now consider a system that can do instant snapshots like ZFS or LVM. Immich is running, you stop it, take a snapshot, then restart it. Then you backup Immich from the snapshot while Immich is running. This should reduce the downtime needed to the time it takes to do the snapshot. The state of Immich data in the snapshot should be equivalent to backing up a stopped Immich instance.
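
A minimal sketch of that sequence, assuming a Docker container and a ZFS dataset. The container name `immich_server`, the dataset `tank/immich` and the snapshot name are hypothetical placeholders, and the script only prints the commands unless `dry_run` is turned off.

```python
import subprocess


def snapshot_backup_plan(container, dataset, snap_name, stop_first=True):
    """Return the commands, in order, for taking a snapshot of a service's data."""
    snapshot = f"{dataset}@{snap_name}"
    plan = []
    if stop_first:
        plan.append(["docker", "stop", container])
    plan.append(["zfs", "snapshot", snapshot])  # near-instant on ZFS
    if stop_first:
        plan.append(["docker", "start", container])
    return plan


def run(plan, dry_run=True):
    """Print the plan, or execute it when dry_run is False."""
    for cmd in plan:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


# Hypothetical names; the slow file copy then reads from the snapshot
# while the service is already running again.
run(snapshot_backup_plan("immich_server", "tank/immich", "nightly"))
```

Passing `stop_first=False` gives the live-snapshot variant. Note that with `check=True` a failed snapshot would leave the container stopped, so a real script would want to restart it in a `finally` block.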

Now consider a case like above without stopping Immich while taking the snapshot. In theory the data you're backing up should represent the complete state of Immich at a point in time eliminating the possibility of divergent data between databases and assets. It would however represent the state of a live Immich instance. E.g. lock files, etc. Wouldn't restoring from such a backup be equivalent to kill -9 or pulling the cable and restarting the service? If a service can recover from a cable pull, is it reasonable to consider it should recover from restoring from a snapshot taken while live? If so, is there much point to stopping services during snapshots?


[–] [email protected] 11 points 3 months ago (1 children)

You start the backup, db is backed up, now image assets are being copied. That could take an hour.

For the initial backup maybe, but subsequent incrementals should only take a minute or two.

I don't bother stopping services, it's too time intensive to deal with setting that up.

I've yet to meet any service that can't recover smoothly from a kill -9 equivalent, any that did sure wouldn't be in my list of stuff I run anymore.

[–] [email protected] 3 points 3 months ago* (last edited 3 months ago) (1 children)

It depends on the dataset. If the dataset itself is very large, just walking it to figure out what the incremental part is can take a while on spinning disks. Concrete example - Immich instance with 600GB of data, hundreds of thousands of files, sitting on a 5-disk RAIDz2 of 7200RPM disks. Just walking the directory structure and getting the ctimes takes over an hour. Suboptimal hardware, suboptimal workload. The only way I could think of speeding it up is using ZFS itself to do the backups with send/recv, thus avoiding the file operations altogether. But if I do that, I must use ZFS on the backup machine too.
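
Roughly what a file-based incremental has to do, as a sketch (this is not how duplicity or rsync are actually implemented, and `changed_since` is a made-up helper). The point is that every file gets a `stat` call even if only a handful changed, which is what hurts on spinning disks.

```python
import os


def changed_since(root, since_epoch):
    """Yield paths under root whose ctime is newer than since_epoch."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # one metadata read per file, changed or not
            except FileNotFoundError:
                continue  # file vanished between listing and stat
            if st.st_ctime > since_epoch:
                yield path
```

ZFS send/recv sidesteps this walk because the filesystem already knows which blocks differ between two snapshots.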

I've yet to meet any service that can't recover smoothly from a kill -9 equivalent, any that did sure wouldn't be in my list of stuff I run anymore.

My thoughts precisely.

[–] [email protected] 2 points 3 months ago (1 children)

Oooh yeah I can imagine RAIDz2 on top of using spinning disks would be very slow, especially with access times enabled on ZFS.

What backup software are you using? I've found restic to be reasonably fast.

[–] [email protected] 1 points 3 months ago* (last edited 3 months ago) (1 children)

Currently duplicity, but rsync took a similar amount of time. The incremental change is typically tens or hundreds of files, hundreds of megabytes total. They take very little time to transfer.

If I can keep the service up while it's backing up, I don't care much how long it takes. Snapshots really solve this well. Even if I stop the service while creating the snapshot, it's only down for a few seconds. I might even get rid of the stopping altogether but there's probably little point to that given how short the downtime is. I don't have to fulfill an SLA. 😂

[–] [email protected] 2 points 3 months ago

Yeah sounds like snapshots is the way to go!

[–] butitsnotme 4 points 3 months ago (1 children)

I don’t bother stopping services during backup, each service is contained to a single LVM volume, so snapshotting is exactly the same as yanking the plug. I haven’t had any issues yet, either with actual power failures or data restores.

[–] [email protected] 3 points 3 months ago* (last edited 3 months ago) (1 children)

And this implies you have tested such backups right?

Side Q, how long do those LVM snapshots take? How long does it take to merge them afterwards?

[–] butitsnotme 2 points 3 months ago (1 children)

Yes, I have. I should probably test them again though, as it’s been a while, and Immich at least has had many potentially significant changes.

LVM snapshots are virtually instant, and there is no merge operation, so deleting the snapshot is also virtually instant. The way it works is by creating a new space where the difference from the main volume are written, so each time the application writes to the main volume the old block will be copied to the snapshot first. This does mean that disk performance will be somewhat lower than without snapshots, however I’ve not really noticed any practical implications. (I believe LVM typically creates my snapshots on a different physical disk from where the main volume lives though.)
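
A toy model of that copy-on-write behaviour, purely for illustration. The classes are invented and real LVM works on fixed-size chunks at the device-mapper level, but the idea is the same: creating a snapshot copies nothing, the first overwrite of a block saves the old content, and dropping the snapshot just discards the saved blocks.

```python
class Snapshot:
    def __init__(self, origin):
        self.origin = origin
        self.saved = {}  # block index -> content at snapshot time

    def read(self, index):
        # Unchanged blocks are read straight from the origin volume.
        return self.saved.get(index, self.origin.blocks[index])


class Volume:
    """Toy block device with copy-on-write snapshots."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.snapshots = []

    def snapshot(self):
        snap = Snapshot(self)
        self.snapshots.append(snap)
        return snap  # instant: nothing is copied yet

    def write(self, index, data):
        for snap in self.snapshots:
            # Preserve the old block the first time it is overwritten.
            snap.saved.setdefault(index, self.blocks[index])
        self.blocks[index] = data

    def drop(self, snap):
        self.snapshots.remove(snap)  # no merge step, just forget saved blocks
```

The extra copy inside `write` is the performance cost mentioned above: it is only paid while a snapshot exists.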

You can find my backup script here.

[–] [email protected] 1 points 3 months ago

Oh interesting. I was under the impression that deletion in LVM was actually merging which took some time but I guess not. Thanks for the info!

[–] [email protected] 3 points 3 months ago* (last edited 3 months ago) (2 children)

Wouldn't restoring from such a backup be equivalent to kill -9 or pulling the cable and restarting the service?

Disclaimer: Not familiar with Immich, but this is what I've experienced generally.

AFAIK, effectively yes. The only thing you might lose is anything in memory that hasn't been written to disk at the time the snapshot was taken (which is still effectively equivalent to kill -9).

At work, we use Veeam which is snapshot based, and database server restores (or spinning up a test DB based off of production) work just fine. That said, we still take scheduled dumps/backups of the database servers just to have known-good states to roll back to if ever the need arises.

[–] [email protected] 2 points 3 months ago* (last edited 3 months ago) (1 children)

Thanks for validating my reasoning. And yeah, this isn't Immich-specific, it would be valid for any process and its data.

[–] [email protected] 2 points 3 months ago (1 children)

What I have seen on corporate servers is that when a backup starts, the database goes into a different mode: a temporary writable partition is used while the read-only database is backed up, and at the end of the backup the blob that was created is also stored.

[–] [email protected] 1 points 3 months ago* (last edited 3 months ago) (1 children)

Yeah, if you're making a backup using the database system itself, then it would make sense for it to do something like that if it stays live while backing up. If you think about it, it's kinda similar to taking a snapshot of the volume where an app's data files are while it still runs. It keeps writing as normal while you copy the data from the snapshot, which is read-only. Of course there's no built-in way to get the newly written data without stopping the process. But you could get the downtime to a small number. 😄

[–] gedhrel 2 points 3 months ago (1 children)

The other thing to watch out for is if you're splitting state between volumes, but i think you've already ruled that out.

[–] [email protected] 1 points 3 months ago* (last edited 3 months ago)

Oh yeah, that would be a disaster. If not handled correctly.

[–] gedhrel 2 points 3 months ago (1 children)

I'd be cautious about the "kill -9" reasoning. It isn't necessarily equivalent to yanking power.

Contents of application memory lost, yes. Contents of unflushed OS buffers, no. Your db will be fsyncing (or moral equivalent thereof) if it's worth the name.
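
To spell out the three layers involved, here is a sketch of a "durable write" on a POSIX system. The function name is made up, and real databases do something more elaborate (write-ahead logs and so on), but the flush/fsync distinction is the same.

```python
import os


def durable_write(path, data: bytes):
    """Write data so that it should survive both kill -9 and a power cut."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)         # in the process's own buffer: lost on kill -9
        f.flush()             # in the OS page cache: survives kill -9, not a power cut
        os.fsync(f.fileno())  # ask the kernel to push it to stable storage
    os.replace(tmp, path)     # atomic rename: readers see old or new, never half
    # fsync the containing directory so the rename itself is durable.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```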

This is an aside; backing up from a volume snapshot is half a reasonable idea. (The other half is ensuring that you can restore from the backup, regularly, automatically, and the third half is ensuring that your automated validation can be relied on.)

[–] [email protected] 1 points 3 months ago* (last edited 3 months ago)

Contents of application memory lost, yes. Contents of unflushed OS buffers, no. Your db will be fsyncing (or moral equivalent thereof) if it's worth the name.

Good point. I guess kill -9 is somewhat less catastrophic than a power-yank. If a service is written well enough to handle the latter it should be able to handle the former. Should, subject to very interesting bugs that can hide in the difference.

This is an aside; backing up from a volume snapshot is half a reasonable idea. (The other half is ensuring that you can restore from the backup, regularly, automatically, and the third half is ensuring that your automated validation can be relied on.)

I'm currently thinking of setting up automatic restore of these backups on the off-site backup machine. That is, the backups are transferred to the off-site machine, restored to the dirs of the services, then the services are started. This should cover the second half I think. Of course those services can't be used to store new data because they'll be regularly overwritten with every backup. In the event of a hard snafu where the main machine disappears, I could stop the auto restore on the off-site machine and start using the services from it, effectively making it the main machine. If this turns out to be reasonable and working, I might trash all of the file-based backup-and-transfer mechanisms and switch to ZFS send/recv. That should allow me to shrink the data delta between main and off-site to minutes instead of hours or days. Does this make any sense?
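
One cheap check that could run after each automatic restore, sketched under the assumption that the source snapshot and the restored directory are both reachable as plain directory trees (the function names are made up). It only proves the files match, not that the service actually starts on them.

```python
import hashlib
import os


def tree_digest(root):
    """Map each file's path relative to root to the SHA-256 of its content."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            digests[os.path.relpath(path, root)] = h.hexdigest()
    return digests


def restore_differences(source_root, restored_root):
    """Return relative paths that are missing or differ between the two trees."""
    src, dst = tree_digest(source_root), tree_digest(restored_root)
    return sorted(p for p in src.keys() | dst.keys() if src.get(p) != dst.get(p))
```

The comparison should be against the snapshot the backup was taken from, not the live data, otherwise normal writes show up as differences.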

[–] MaximilianKohler 2 points 3 months ago* (last edited 3 months ago)

I ran into a similar problem with snapshots of a forum and email server -- if there are scheduled emails when you take the snapshot they get sent out again if you create a new test server from the snapshot. And similarly for the forum.

I'm not sure what the solution is either. The emails are sent via an SMTP so it's not as simple as disabling email (ports, firewall, etc.) on the new test server.

[–] solrize 2 points 3 months ago (1 children)

Stop the whole VM during snapshots.

[–] [email protected] 1 points 3 months ago* (last edited 3 months ago) (1 children)

Not a VM. Consider the service just a program running on the host OS where either the whole OS or just the service data are sitting on ZFS or LVM.

[–] [email protected] 1 points 3 months ago (1 children)

This is one of the reasons Docker exists.

[–] [email protected] 1 points 3 months ago* (last edited 3 months ago) (2 children)

And I'm using Docker, but Docker isn't helping with the stopping/running during backup conundrum.

[–] Hansie211 1 points 3 months ago* (last edited 3 months ago)

It should work that way. If you use the recommended Docker Compose scripts for Immich, you'll notice that only a few volumes are mounted to store your data. These volumes don't include information about running instances. If you take snapshots of these volumes, back them up, remove the containers and volumes, then restore the data and rerun the Compose scripts, you should be right where you left off, without any remnants from previous processes. That's a pro of container process isolation.

[–] [email protected] 1 points 3 months ago (1 children)

Why not?

[–] [email protected] 1 points 3 months ago* (last edited 3 months ago) (1 children)

Docker doesn't change the relationship between a running process and its data. At the end of the day you have a process running in memory that opens, reads, writes and closes files that reside on some filesystem. The process must be presented with a valid POSIX environment (or equivalent). What happens with the files when the process is killed instantly and what happens when it's started afterwards and it re-reads the files doesn't change based on where the files reside or where the process runs. You could run it in docker, in a VM, on Linux, on Unix, or even Windows. You could store the files in a docker volume, you could mount them in, have them on NFS, in the end they're available to the process via filesystem calls. In the end the effects are limited to the interactions between the process and its data. Docker cannot remove this interaction. If it did, the software would break.

[–] [email protected] 2 points 3 months ago (1 children)

docker stop container

Make your snapshot

docker start container

What am I missing?

[–] [email protected] 1 points 3 months ago* (last edited 3 months ago) (1 children)

That's the trivial scenario that we know won't fail: stopping the service during the snapshot. The scenario that I was asking people's opinions on is not stopping the service during the snapshot, and what restoring from such a backup would mean.

Let me contrast the two by completing your example:

docker start container
Time passes
Time to backup
docker stop container
Make your snapshot
docker start container
Time passes
Shit happens and restore from backup is needed
docker stop container
Restore from snapshot
docker start container

Now here's the interesting scenario:

docker start container
Time passes
Time to backup
Make your snapshot
Time passes
Shit happens and restore from backup is needed
docker stop container
Restore from snapshot
docker start container

Notice that in the second scenario we are not stopping the container. The snapshot is taken while it's live. This means databases and other files are open and likely being actively written to. Some files are likely only partially written. There are also likely various temporary lock files present. All of that is stored in the snapshot. When we restore from this snapshot and start the service, it will see all of that.

Contrast this with the trivial scenario where the service is stopped. Upon stopping, all data is synced to disk, in-flight database operations are completed or canceled, partial writes are completed or discarded, and lock files are cleaned up. When we restore from such a snapshot and start the service, it will "think" it is starting from a clean stop, with nothing extra to do.

In the live snapshot scenario the service will have to do cleanup. For example, it will have to decide what to do with existing lock files. Are they there because another instance of the service is running and writing to the database, or did someone kill its process before it had the chance to go through its shutdown procedure? In the former case it might have to log an error and quit. In the latter it would have to remove the lock files. And so on and so forth.
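
As an illustration of the lock-file decision, here is one common convention: the lock file contains the PID of its owner. This is an assumption for the sake of the example (I'm not claiming Immich or any particular service does it this way), and it is POSIX-only.

```python
import os


def lock_holder_alive(lock_path):
    """Return True if the lock file names a PID that is currently running."""
    try:
        with open(lock_path) as f:
            pid = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return False  # missing or unreadable lock: nobody is holding it
    try:
        os.kill(pid, 0)  # signal 0 only checks existence, nothing is delivered
    except ProcessLookupError:
        return False  # no such process: the previous owner died uncleanly
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```

A service starting up on restored data would remove the lock when this returns False and refuse to start when it returns True. PIDs get reused, so this check can be fooled, which is one of the places the "interesting bugs" can hide.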

As for the effect of docker on any of this, whether you use docker stop container, systemctl stop service or pkill service, the effects on the process and its data are all the same. In fact, the docker and systemctl commands end up sending a termination signal (SIGTERM by default) to the service's process anyway.

[–] [email protected] 1 points 3 months ago

Oh I see -- you're asking a hypothetical.

The simple answer is that it's a bad idea to take snapshots of running databases because at best they could be missing info and at worst they can be corrupted.

The short answer: Don't.

[–] [email protected] 2 points 3 months ago (1 children)

Check the "blue-green" deployment strategy. This is done by many businesses, where an interrupted service might mean losing a sale, or a client forever... I tried it once with Nginx but it was more pain than gain (for my personal use).

[–] [email protected] 1 points 3 months ago

Good suggestion. I've done blue-green professionally with services that are built to have high availability and in cloud environments. If I were to actually set up some form of that, I'd probably use ZFS send/recv to keep a backup server always 15 minutes behind and ready to go. I wouldn't deal with file-based backups that take an hour just to walk the dataset and figure out what's new. 😅 Probably not happening for now.

[–] Evotech 2 points 3 months ago (1 children)

Modern image snapshot backups stop the service for an instant and create a local snapshot to back up. The service keeps running on a delta in the meantime, and then you apply the delta to the running image.

[–] [email protected] 1 points 3 months ago* (last edited 3 months ago) (1 children)

When you say stopping the service for an instant you must mean pausing its execution or at least its IO. Actually stopping the service can't be guaranteed to take an instant. It can't be guaranteed to start in an instant. Worst of all, it can't even be guaranteed that it'll be able to start again. When I say stopping I mean systemctl stop or docker stop or pkill etc. In other words, delivering an orderly, graceful kill signal and waiting for the process/es to stop execution.

[–] Evotech 2 points 3 months ago

Correct, just pausing it on the underlying platform

[–] [email protected] 2 points 3 months ago* (last edited 3 months ago)

Acronyms, initialisms, abbreviations, contractions, and other phrases which expand to something larger, that I've seen in this thread:

Fewer Letters	More Letters
LVM	(Linux) Logical Volume Manager for filesystem mapping
NFS	Network File System, a Unix-based file-sharing protocol known for performance and efficiency
SMTP	Simple Mail Transfer Protocol
ZFS	Solaris/Linux filesystem focusing on data integrity

4 acronyms in this thread; the most compressed thread commented on today has 4 acronyms.

[Thread #902 for this sub, first seen 1st Aug 2024, 21:25]

[–] [email protected] 1 points 3 months ago (1 children)

If you're worried about a database being corrupt, I'd recommend doing an actual backup dump of the database and not only backing up the raw disk files for it.

That should help provide some consistency. Of course it takes longer too if it's a big db

[–] [email protected] 1 points 3 months ago (1 children)

I dump the db too.

With that said if backing up the raw files of a db while the service is stopped can produce a bad backup, I think we have bigger problems. That's because restoring the raw files and starting the service is functionally equivalent to just starting the service with its existing raw files. If that could cause a problem then the service can't be trusted to be stopped and restarted either. Am I wrong?

[–] [email protected] 2 points 3 months ago (1 children)

I was talking about dumping the database as an alternative to backing up the raw database files without stopping the database first. Taking a filesystem-level snapshot of the raw database without stopping the database first also isn't guaranteed to be consistent. Most databases are fairly resilient now though and can recover themselves even if the raw files aren't completely consistent. Stopping the database first and then backing up the raw files should be fine.
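
For a feel of what a dump through the database itself looks like, here is a small sketch using SQLite's online backup API from the Python standard library. SQLite is only a stand-in here: Immich uses PostgreSQL, where the equivalent tool would be pg_dump. The function name is made up.

```python
import sqlite3


def dump_database(live_conn, dump_path):
    """Copy a live SQLite database to dump_path as a consistent copy."""
    dest = sqlite3.connect(dump_path)
    try:
        # The online backup API copies the database through the engine,
        # so the source connection can stay open and in use.
        live_conn.backup(dest)
    finally:
        dest.close()
```

Because the copy goes through the database engine rather than the raw files, the result doesn't depend on crash recovery at restore time.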

The important thing is to test restoring :)

[–] [email protected] 2 points 3 months ago

Now this makes perfect sense.