this post was submitted on 16 Dec 2023
96 points (96.2% liked)

So, in the era of increasingly good AI-powered tools and general search engines full of SEO spam, last week I started creating something a little old school and against the trends.

For now it's a have-fun-and-find-out project whose main aim is to provide good search results for general web development queries, with a special focus on independent blog authors.

The thesis is that no SEO spam website is in the index, which already filters out most of the annoying noise you get on Google/Bing.

Search results are grouped per type: docs, blogs and magazines (e.g. blog platforms or bigger websites).

For now it's far from being done in terms of having a full index, but in most cases it already replaces my go-to search engine when I'm looking up some stuff during work.

I'm looking forward to hearing what y'all think, and if you think it makes sense overall, I can only encourage you to post some links to blogs or docs that are still missing from the index. I'm more than happy to add them to the crawler.

Responses like "nei, total shit, who would need that" are also accepted, but constructive critique is more appreciated ;)

EDIT: everyone, many thanks for all your voices and comments. I'm super grateful for all of them and happy that we have such a place as Lemmy!

[–] sznowicki 61 points 1 year ago* (last edited 1 year ago)

I like how the first queries you guys make are attempts at SQL injection and XSS.

EDIT: if you find something let me know, PRs are also welcome ;)

[–] nezbyte 13 points 1 year ago (1 children)

There was a programming search engine called Symbol Hound that allowed for searching for special symbols like << and &&. It was my fallback search engine while programming if I couldn’t find something on the first page of Google. Sadly, that site appears to have disappeared. Does this search engine have optional support for special characters?

[–] sznowicki 3 points 1 year ago (1 children)

If it has, it's totally accidental.

What's the use case for searching for those kinds of symbols? I'll check if I can tune it for this.

[–] [email protected] 8 points 1 year ago (2 children)

When you want to know the name of an operator in a language.
Like "what does & mean in c++?".
& isn't too bad, but some of them can be difficult (like "JavaScript ??").
And if you don't know it's called a reference operator, or a nullish coalescing operator, or whatever... trying to learn what it does can be downright impossible.

[–] sznowicki 4 points 1 year ago

For ?? I guess it already has decent results. I’ll periodically check those kinds of cases once the index gets more languages.

https://kukei.eu/?q=js+%3F%3F+operator
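Side note on why the link reads `%3F%3F`: symbols like `?` have to be percent-encoded inside a query string. A minimal sketch using Python's standard library; `search_url` is a made-up helper name, and the `q` parameter is simply taken from the link above.

```python
from urllib.parse import quote_plus


def search_url(query: str) -> str:
    """Build a search link; spaces become '+', '?' becomes '%3F'."""
    # The base URL and the "q" parameter are copied from the link above.
    return "https://kukei.eu/?q=" + quote_plus(query)


print(search_url("js ?? operator"))
```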

[–] [email protected] 8 points 1 year ago* (last edited 1 year ago) (1 children)

I think the main issue as well as my main question is around scope.

You say it targets web developers, but the current index is quite narrow. So will you accept significant expansion of that, as long as it may be relevant to web developers? Where would you draw lines on mixed content or technologies?

ASP.NET docs is definitely docs for web developers. But maybe not what you had in mind. Would that apply? The docs are hosted on a platform with a lot of other docs of the dotnet space. Some may be relevant to "web developers", others not. And the line is subjective and dynamic.

My website has some technological development resources and blog posts. But also very different things. Would that fit into scope or not?

How narrow or broad would you make the index?

I guess it's an index for search, so noise shouldn't be a problem as long as there are gains through/of quality content.

[–] sznowicki 8 points 1 year ago* (last edited 1 year ago) (1 children)

It's still in MVP, work in progress, hence the index is not "full".

For me "web development" is everything that we might need for, well, the web. Servers, Mongo docs, it all goes into the index (I'm adding to it basically every day, but it also takes some time to index stuff, and I'm observing how this whole thing works as the index grows).

ASP.NET goes into the index of course. If your website has dev resources and blog posts that would go into it as well. Recently one person suggested tons of Haskell blogs and they are being indexed as we speak.

I also have a different problem: dev.to has a lot of good resources but also tons of SEO spam and low-quality content. It's also freaking huge, and while it was in the index for some time, I had to remove it and think about it some more.

Where would you draw lines on mixed content or technologies

For now the line is: does this website have anything that web devs would need? Yes? Then it might get in.

If it's a blog about locomotive CPU programming then maybe not, although mostly due to infrastructure costs. Indexing costs money in the end, but having some unrelated stuff in the index should not hurt the results.

All of what I wrote is the state as of today. I change my mind often, as it's still in the "having fun" state.

PS. also thanks for the feedback!

[–] [email protected] 3 points 1 year ago* (last edited 1 year ago) (1 children)

I also have a different problem: dev.to has a lot of good resources but also tons of SEO spam and low-quality content. It's also freaking huge, and while it was in the index for some time, I had to remove it and think about it some more.

Yeah, a public platform is unlikely to provide consistent content. If curation is not an explicit goal and practice there, I would not include them for the reasons you mentioned.

If indexing could happen not per domain but with more granular filters (URL base paths), that may be viable. For example, indexing specific authors on dev.to.
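To make the base-path idea concrete, here's a rough sketch of what such a filter could look like. It's only an illustration, not how the kukei crawler works: `ALLOWED_PREFIXES`, `should_index`, the author paths and `example-blog.dev` are all made up.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: host -> path prefixes that may be indexed.
# A single "/" prefix means the whole host is allowed.
ALLOWED_PREFIXES = {
    "dev.to": ("/some-author/", "/another-author/"),
    "example-blog.dev": ("/",),
}


def should_index(url: str) -> bool:
    """Return True if the URL's host is known and its path matches a prefix."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path or "/"  # a bare domain has an empty path
    prefixes = ALLOWED_PREFIXES.get(host, ())
    return any(path.startswith(prefix) for prefix in prefixes)
```

With this, `should_index("https://dev.to/some-author/my-post")` passes while any other dev.to path is skipped, so a huge mixed-quality platform can be indexed one curated author at a time.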

[–] sznowicki 2 points 1 year ago (1 children)

Good idea. I once had the thought of doing some narrow indexing of websites. E.g. Stack Overflow is a big issue: indexing all of it is crazy, while picking some specific tags feels like tons of work. In the end I adjust the whole project as it grows, in the hope that it gets better after every tuning.

As long as I have fun with it I’ll continue :D

[–] [email protected] 2 points 1 year ago

Of course - cutting scope is a good call to keep it manageable and fun, and not end up with creep and what you wanted to evade in the first place. :)

[–] bigredgiraffe 6 points 1 year ago (1 children)

This is a cool idea! I did notice that on mobile the search results are wider than the viewport and if I had a feature request it would be to make them way, way more compact but that might just be me hah.

You should also check out the Lenses feature that Kagi has, I think every search engine needs that feature now hah. I bookmarked your site for the next time I am searching for sure though!

[–] sznowicki 1 points 1 year ago* (last edited 1 year ago) (1 children)

Thx for the comments. I’ll fix the mobile view and will definitely redesign it all a bit over the weekend. I see a lot of room for improvement.

Also will check how to submit it to Lenses. Highly appreciate it!

EDIT: mobile view is fixed, also made some small adjustments to the whitespace between result items.

[–] bigredgiraffe 1 points 1 year ago (1 children)

Yeah! Granted I have an iPhone 12 which is small for a modern phone but I figured I should mention it :D

I have been thinking more about this idea and I love it even more, I feel like domain specific search engines are going to be more and more important in the future as the results of the major search engines get even worse and worse.

Awesome work!

[–] sznowicki 2 points 1 year ago

I'm on iPhone 12 mini. I love that small design and I strongly believe phones should be small.

Thanks for the good words! Highly appreciate it!

[–] [email protected] 5 points 1 year ago (1 children)

Index categories are blog, docs, magazines. Have you considered indexing source code websites?

I thought I would remember a second one, but I can't recall right now.

Subpaths on GitHub and GitLab would work in a similar fashion but would require more specific filters, unless they are projects hosted on dedicated instances.

Project issue tickets may also be very relevant to developer searches!?

[–] sznowicki 2 points 1 year ago* (last edited 1 year ago)

Great ideas. For the source code I’m not sure but I’ll put it to the backlog of cool things I get from Lemmy and work on them one by one. Thanks!

[–] [email protected] 4 points 1 year ago (1 children)

I often get annoyed when I Google/ddg something like "python3 sort list in place" and get some blog, w3schools, and GeeksforGeeks before I get the standard Python docs. Just tell me if it's [].sort() or sorted([]) !

Honestly for that kind of question I want the docs more than I want stack overflow.

Maybe I should just bookmark the docs instead of using search, come to think of it. But if your search prioritizes official docs that sounds like a plus.
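For the record, a quick sketch of the answer to that example query, using only Python built-ins (the variable names are just for illustration): `list.sort()` sorts in place and returns `None`, while `sorted()` returns a new list and leaves the original alone.

```python
# list.sort() mutates the list in place and returns None.
numbers = [3, 1, 2]
result = numbers.sort()

# sorted() builds a new sorted list; its argument is left untouched.
original = [3, 1, 2]
new_list = sorted(original)

print(numbers, result)     # the list is now sorted, result is None
print(original, new_list)  # original is unchanged, new_list is sorted
```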

[–] stabbie_mcgee 2 points 1 year ago

Another option aside from bookmarking is DDG bangs. I have DDG as my default search, so I can just type !py sort list into the address bar and go directly to the Python documentation.

[–] DrakeRichards 3 points 1 year ago (1 children)

It’s a good start. I’m curious why you didn’t include a section for social media like StackOverflow or Reddit. If I go to Google with a question, it’s usually for an edge case not covered by the documentation. Maybe add them as a section at the bottom to indicate that they might be less relevant?

Also, this might just be a web developer thing, but why include blogs? Almost all coding blogs I’ve seen are SEO cancer that just copy from the documentation or each other. Are there actually useful blogs out there that I’ve just been missing?

[–] sznowicki 3 points 1 year ago (1 children)
  1. SO and Reddit are on the TODO list. It even had SO once (at the bottom, indeed), not via crawling but via the SO Search API. The results were very poor quality and super slow, so I had to remove it while thinking of a better solution. Crawling the entire SO might be a little too much for this project at this stage tho, but if I have enough courage and hours at night I might parse that 20GB Stack Overflow archive dump and try doing something useful with it.

Same for Reddit, but here I have mixed feelings about it in general and hope it's going to die soon, replaced by amazing Lemmy communities.

I also used to type some question and end it with "reddit" in Google to get good quality content, but here with kukei the experiment is whether the blogosphere can replace it properly when the index promotes it.

  2. Why blogs?

This is my main thing. To promote good quality blogs that I tried to follow via RSS but somehow never did. Having them all indexed (and more, some Mastodon community gave me amazing links to index) makes me actually visit them often.

For the "SEO cancer", that's where curation comes into play. Before crawling, I check blogs that are unknown to me and decide whether something goes in or not.

[–] DrakeRichards 3 points 1 year ago* (last edited 1 year ago)

That makes sense. I really like that the documentation is right at the top; many times all I want to do is find the right page in the official docs. You might want to look at how results are prioritized though: right now when I search for something simple like “how to center a div”, that result from Mozilla’s docs is included but it’s hidden as the second or third result. I would expect the page that’s explicitly about centering a div to be the top result, followed by the docs page for the element itself and maybe pages for flex or grid or something. That’s a really simple example, so maybe it’s not the target of this project, but I would still hope that simple topics are covered just as well as complex ones.

EDIT: I was a bit mistaken: “how to center a div” does bring up the Mozilla documentation for centering an element, but “center a div” brings up a page about accessibility as the top result.

[–] kerneltux 3 points 1 year ago (1 children)

I really like the simple design that separates the results into docs/blogs/magazines. Obviously, the results reflect the current state, but I appreciate your approach in both the design & sourcing the search results. I think there's a lot of potential for this to be a regular part of my toolbox, hopefully this takes off!

[–] sznowicki 1 points 1 year ago

Thanks for the kind words!

[–] [email protected] 3 points 1 year ago (1 children)

Can you add links to each section at the top so you don't have to scroll past ones you might not be interested in?

[–] sznowicki 3 points 1 year ago

Another person in real life told me the same. Adding to the backlog!

https://github.com/Kukei-eu/kukei-web/issues/3

[–] [email protected] 2 points 1 year ago (1 children)

Looks cool, I think I'll add it to firefox and use sometimes

[–] sznowicki 0 points 1 year ago

Thanks! If you have some suggestions in the future, I’m always open to hearing them.

[–] [email protected] 1 points 1 year ago (1 children)

Oh, what an interesting idea! I like this, on Monday I'll test out switching to this as my main search engine for work and try to report back how it goes!

[–] sznowicki 2 points 1 year ago

Thanks, but don't expect too much yet. Many sources are still missing. If you notice something that should be there but isn't even being crawled, feel free to reach me on Mastodon or add it directly via PR here: https://github.com/Kukei-eu/spider/blob/main/index-sources.js

[–] [email protected] 1 points 1 year ago (1 children)

How is it specifically dev focused? How will the crawler know that the site or page is dev related?

[–] sznowicki 2 points 1 year ago

The crawler takes only the sources that are defined in the crawler repo (it’s open source, check the GitHub org or kukei-spider).

So in this way it’s “curated”, in the sense that nothing else gets added to the index.
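Roughly, the idea can be sketched like this. This is an assumption-laden illustration, not the actual spider code: `CURATED_SOURCES`, `group_hits` and the example hosts are hypothetical, and the real list in the repo may be shaped differently.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical curated list: host -> result group shown on the results page.
CURATED_SOURCES = {
    "developer.mozilla.org": "docs",
    "example-blog.dev": "blogs",
    "example-magazine.com": "magazines",
}


def group_hits(urls):
    """Group URLs by their source's type, dropping anything not curated."""
    groups = defaultdict(list)
    for url in urls:
        host = (urlsplit(url).hostname or "").lower()
        category = CURATED_SOURCES.get(host)
        if category is not None:  # unknown hosts never reach the index
            groups[category].append(url)
    return dict(groups)
```

Anything whose host isn't on the list is simply ignored, which is what keeps SEO spam out without any per-page classification.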