this post was submitted on 09 Nov 2023
Selfhosted


Does anyone know of an off-the-shelf tool (online or offline) to find duplicates across several DNS blocklists and merge them into one?

Context: I am running AdGuard on a GL.iNet router with ~10 blocklists, some of them pretty huge. Most of the time when the lists are updated the router grinds to a halt, and I often have to reboot it the old power-off-and-on way.

I would rather download the lists myself from time to time and merge them into one file with the duplicates removed.

top 12 comments
[–] easeKItMAn 7 points 10 months ago* (last edited 10 months ago) (1 children)

If I'm understanding you correctly, you could use a shell script for this. Use wget to download the lists, combine them into a single large file, and finally create a new file with no duplicates using “awk '!visited[$0]++'”, which prints each line only the first time it is seen.

wget URL1 URL2 URL3
# Overwrites all.list; the different extension keeps the output out of the *.txt glob on later runs
cat *.txt > all.list
awk '!visited[$0]++' all.list > no_duplicates.list
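
A slightly more defensive variant of the same idea, as a minimal sketch rather than a finished tool: it skips comment and blank lines, ignores Windows line endings and letter case when comparing, and lets awk read the files directly so a list without a trailing newline cannot fuse its last line with the next file's first. The function name `merge_lists` and the file names are made up for illustration, and it assumes all lists use the same one-entry-per-line format.

```shell
#!/bin/sh
# merge_lists OUTPUT INPUT...
# Hypothetical helper: keeps the first occurrence of every entry found
# in the INPUT files and writes the result to OUTPUT.
merge_lists() {
    out=$1
    shift
    awk '
        { sub(/[ \t\r]+$/, "") }               # trim trailing blanks and carriage returns
        /^[ \t]*$/ || /^[ \t]*[#!]/ { next }   # skip blank lines and # or ! comments
        !seen[tolower($0)]++                   # print only the first copy, ignoring case
    ' "$@" > "$out"
}
```

It could be called as `merge_lists merged.list lists/*.txt`, with the output written somewhere the input glob does not match.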

[–] BinaryUnit 2 points 10 months ago

When no tool is available, bash to the rescue. Thank you for this; it actually seems simpler than I thought :)

[–] CarbonatedPastaSauce 3 points 10 months ago (1 children)

I doubt you’ll find something off the shelf for this. I wrote a PowerShell script that deduplicates lists and also does a pass over the results to convert any blocks to CIDR notation. If you’re interested, I’ll share it.

But honestly you could probably have ChatGPT whip this up for you in your language of choice. It’s pretty straightforward.

[–] nyar 0 points 10 months ago (1 children)

I'd like to see your script.

[–] CarbonatedPastaSauce 2 points 10 months ago* (last edited 10 months ago)

Sorry it took a while, I'm currently on vacation! But I had some time to reread it and sanitize it for public sharing. Here you go:

ok yikes, Lemmy really didn't like me pasting all that code even in a code block. I'll have to put it up somewhere else, stand by.

Hopefully this works better: Pastebin link

[–] [email protected] 2 points 10 months ago (1 children)

https://sortmylist.com/

I've used this for similar tasks before. Might need a few steps if the formats vary to get them all together but it should be possible.
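
For the case where the formats vary, exact-line deduplication misses entries that name the same domain in different syntax. The sketch below is one possible way to bring lists to a common form first. It assumes only three formats (plain domains, hosts-style lines starting with `0.0.0.0` or `127.0.0.1`, and simple Adblock-style `||domain^` rules) and silently drops everything else, including rules with modifiers. The function name `normalize_domains` is hypothetical.

```shell
#!/bin/sh
# normalize_domains FILE...
# Hypothetical helper: prints one lowercase domain per line, first
# occurrence only, for the three formats described above.
normalize_domains() {
    awk '
        {
            sub(/\r$/, "")            # drop a Windows carriage return
            sub(/^#.*$/, "")          # drop whole-line # comments
            sub(/[ \t]+#.*$/, "")     # drop trailing # comments
        }
        /^[ \t]*$/ || /^[ \t]*!/ { next }   # skip blank lines and ! comments
        {
            $0 = tolower($0)
            if (NF == 1 && substr($1, 1, 2) == "||" && substr($1, length($1)) == "^")
                domain = substr($1, 3, length($1) - 3)   # ||example.com^
            else if (NF >= 2 && ($1 == "0.0.0.0" || $1 == "127.0.0.1"))
                domain = $2                              # 0.0.0.0 example.com
            else if (NF == 1)
                domain = $1                              # example.com
            else
                next
            # Keep only things that look like a dotted hostname
            # (this also drops "localhost" and bare IP addresses).
            if (domain ~ /^[a-z0-9._-]+$/ && domain ~ /\./ && domain ~ /[a-z]/ && !seen[domain]++)
                print domain
        }
    ' "$@"
}
```

Usage would be along the lines of `normalize_domains list1.txt list2.txt > merged.list`. The output is a plain domain list, so it is worth checking that the blocker in use accepts that format before relying on it.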

[–] BinaryUnit 2 points 10 months ago

This is very helpful, thank you :)

[–] [email protected] 2 points 10 months ago* (last edited 10 months ago)

Afaik pihole does parse and then merge the lists into a single block list.

Update: Nevermind. They do it by design (assuming this statement is still correct): https://github.com/pi-hole/pi-hole/issues/2013#issuecomment-817901839

What you could do is manually combine the text files in a text editor such as Notepad++ and deduplicate from there (Notepad++ can do this natively).

[–] [email protected] 2 points 10 months ago (1 children)

Isn't there a tool developed by the AdGuard team to handle exactly this?

Just looked through my files. Look into this tool; it does exactly what you want: https://github.com/AdguardTeam/HostlistCompiler

[–] BinaryUnit 1 points 10 months ago

Thank you, this looks promising

[–] tordenflesk 0 points 10 months ago (1 children)
[–] AtariDump 2 points 10 months ago

If you're looking for blocklists, I use /u/Wally3k's lists as well as the /u/LightSwitch05 “Developer Dan” lists.

I no longer personally use the OISD lists, as the maintainer tells you not to use any lists other than theirs, which makes it difficult or impossible to use the groups feature. Instead, I use a mix of lists and regex blocks. Nor do I recommend the “Quantum Blocklist” that’s been going around - here’s why

I also suggest these regex blocks

Make sure you read what the different symbols mean with Wally’s blocklists before applying every blocklist. If you stick with the check-marked lists you should find that it blocks ads without too many false positives.

More blocklisted entries don’t mean more ads blocked; often, adding too many lists will break legitimate websites.

If you want to, you can reevaluate the added lists after 14-30 days using this tool (not supported by PiHole devs) to audit which lists are actually used. I’ve run this tool and discovered that several lists I added weren’t doing anything at all (If you need help with this tool please use the GitHub page to discuss).

With the release of Pi-hole v5, memory usage has been reduced when using additional block lists. Also note that with v5, lists are no longer “deduped”.
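
Separately from query-based auditing, a rough offline check of how much each list adds on top of the ones before it can be sketched with awk. This only compares list contents, not what actually gets queried, and it assumes the lists share one format. The function name `list_overlap` is made up for illustration.

```shell
#!/bin/sh
# list_overlap FILE...
# Hypothetical helper: for each file, in the order given, prints
#   <file> <entries> <new>
# where <entries> counts non-comment lines and <new> counts those not
# seen earlier (in a previous file or earlier in the same file).
list_overlap() {
    awk '
        FNR == 1 { files[++n] = FILENAME }     # remember the file order
        { sub(/[ \t\r]+$/, "") }               # trim trailing blanks and carriage returns
        /^[ \t]*$/ || /^[ \t]*[#!]/ { next }   # skip blank lines and comments
        {
            total[FILENAME]++
            if (!seen[tolower($0)]++)
                fresh[FILENAME]++
        }
        END {
            for (i = 1; i <= n; i++)
                printf "%s %d %d\n", files[i], total[files[i]], fresh[files[i]]
        }
    ' "$@"
}
```

A list whose `<new>` count is close to zero is mostly covered by the lists named before it, which makes it a candidate for dropping.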