Articles

Tracking user IDs in the crawler

We’ve just deployed a change to the crawler: we now store the user IDs of the users who crawled each URL in the index, along with a timestamp of when it was last crawled. Why track user IDs? Until now, we’ve required volunteers to go through a vetting process before they can crawl. Tracking user IDs lets us remove that requirement. The plan is that anyone can agree to the terms of service and generate their own API key to start crawling straight away. (API key generation isn’t implemented yet, but this change is the foundation for it.) ...

Matrix chat with Embed Search Engine

This article was originally posted on theoteno’s Inkwell blog. Where do I even start? Oh, of course, with the long Discord/Amazon outage on a night 2 days ago… but unlike 1-2 weeks ago, I planned to watch anime with a new friend. (We met on a local anime Discord; I needed it for my local “Crunchyroll” registration.) I needed a backup plan for later. Luckily, I was already a user of the Matrix chat protocol with Discord alternative clients, after being betrayed by Stoat (Revolt). Unless there’s a Flatpak for Stoat, I don’t think I can use it in Game mode on my Steam Deck. ...

Silent No Longer

This article was originally posted on my personal blog on 2nd August 2025. Dear friends, I am constantly besieged by the feeling that I am not doing enough. A genocide is unfolding before our eyes. I feel the guilt with every mother holding a starving child, with every doctor killed, with every journalist murdered, with every child shot trying to get food. Last week I sat in the bathroom and cried for half an hour, overwhelmed with grief. ...

Update July 2025

It’s been so long since we’ve had an update on the blog that people are often confused as to whether the project is still active. It definitely is! I’m just bad at updating the blog. Most of the updates have been going to the Matrix channel. So an update is long overdue. Most of the recent work has been about making Mwmbl efficient again. When we started, response times were well under 100ms, but over the years, they’ve slowly crept up. There are several reasons for this: ...

Re-ranking search results on the client side in Rust

By many measures, Mwmbl is doing great. We have indexed over half a billion pages, we have over 4,000 registered users, and over 30,000 curations from those users. Our volunteers are crawling around 5 million pages a day. But the score that I care about most right now is NDCG. This measures the quality of our search results against a “gold standard” which is just Bing search results for the same query. Obviously, we are not ultimately aiming to be like Bing, so eventually we will stop using Bing and start using our curated data, once we have enough and the quality is high enough. But we are far enough away from being good that moving in a Bing-like direction is great, for now. ...
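To make the NDCG idea concrete, here is a minimal sketch in Python. It is not Mwmbl's actual evaluation code. The function names, the example URLs, and the choice to grade relevance by position in the gold list are all assumptions made for illustration.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1), with ranks starting at 1."""
    return sum(gain / math.log2(index + 2) for index, gain in enumerate(gains))


def ndcg(ranked_urls, gold_urls):
    """NDCG of our ranking against a gold-standard ranking (hypothetical helper).

    Assumption: a URL's relevance is higher the earlier it appears in the gold
    list, and URLs missing from the gold list have zero relevance.
    """
    relevance = {url: len(gold_urls) - i for i, url in enumerate(gold_urls)}
    gains = [relevance.get(url, 0) for url in ranked_urls]
    # The ideal ranking puts the most relevant gold URLs first.
    ideal_gains = sorted(relevance.values(), reverse=True)[: len(ranked_urls)]
    ideal = dcg(ideal_gains)
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Made-up example data, not real search results.
gold = ["https://a.example", "https://b.example", "https://c.example"]
ours = ["https://c.example", "https://b.example", "https://a.example"]
score = ndcg(ours, gold)
```

A ranking identical to the gold list scores 1, and a ranking that shares no URLs with it scores 0, so moving the score up means moving closer to the gold standard.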

Indexing a billion pages

It’s two years since we launched Mwmbl, the open source, non-profit search engine, on Boxing Day 2021. A good time to take stock of where we are and where we’re going. We’ve indexed over 100 million pages. Thanks to our volunteers, who crawl the web using the Firefox extension and command line script, we’re crawling up to a million pages a day, as you can see on our stats page. There are around 50-60 users crawling on an average day. ...

Why is curation of web search results important?

Mwmbl is the first search engine to allow users to change the search results: you can add results, delete them, and rerank them. The changes you make are saved instantly to the index and will be shown to other users who run the same query. But what is the point of users changing search results? There are far too many queries to expect them all to be curated by users. ...

We are entering a new era of web search

We recently launched the new version of Mwmbl which includes the long-awaited feature of allowing users to curate search results. This is an experiment, since we don’t know: Will people want to curate search results? How will we determine what is an objectively good search ranking? How will we deal with and prevent spam? How will we build and manage the community? Can we use curated search results as training for learning to rank? Will this be better than the heuristics used by the big search engines? and probably a lot more important things I haven’t thought of. But this is why it is exciting! If we knew everything already, it would be boring. ...

Mwmbl Update - Over 100 million pages indexed

Highlights: We now have 105 million pages in our index. We’re crawling around a million pages a day. Around 60 people are helping to crawl the web each day. The beta version of Mwmbl now allows user curation of search results. Side projects: My main problem with side projects is that I tend to stop working on them. It feels like I’ve started so many things only to abandon them, and that makes me feel bad. Over time, I’ve come to realise that this is a feature, not a bug. I work on side projects because I enjoy working on them. If I stop enjoying them, then I should stop working on them. ...

Fall 2022 Update

The Mwmbl team has taken a bit of a break since August to work on some other areas of life, but we have been quietly planning over the last several months around a couple of areas. Editing search results: The most requested feature is to be able to suggest sites to crawl. We’re planning to go one better and let users edit the whole search results ranking and add their own search results. Promotion ...