Vector database built for scalable similarity search (milvus.io)
184 points by night-rider on March 25, 2023 | hide | past | favorite | 93 comments


I really don't want another database. I just want to have a solution built in for Postgres, and more specifically, RDS, which we use. I know there will be some extra difficulty that I will have to manage (e.g. reindexing to a new model that is outputting different embeddings), but I really don't want another piece of infrastructure.

If anyone from AWS/Google/Azure is listening, please add pgvector [1] into your managed Postgres offerings!

1. https://github.com/pgvector/pgvector
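For context, here's roughly what that built-in solution looks like once pgvector is available - a minimal sketch with a hypothetical `items` table. The SQL assumes the pgvector extension is installed; actually executing it would need psycopg2 and a reachable Postgres instance, so only the literal-formatting helper runs standalone.

```python
def to_vector_literal(vec):
    """Format a Python list as a pgvector input literal, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(str(x) for x in vec) + "]"

# Hypothetical schema: a 3-dimensional embedding column.
SETUP_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS items (id bigserial PRIMARY KEY, embedding vector(3));
"""

# '<->' is pgvector's Euclidean-distance operator ('<=>' is cosine distance).
QUERY_SQL = "SELECT id FROM items ORDER BY embedding <-> %s LIMIT 5;"

print(to_vector_literal([0.1, 0.2, 0.3]))  # -> [0.1,0.2,0.3]
```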



Very cool, thanks.


Totally with you on that - it's annoying to need multiple pieces of infrastructure, especially for solo developers or small teams. Oftentimes, you want to filter further based on scalar fields/metadata such as timestamps, strings, numeric values, etc.

That's why we built attribute filtering into Milvus via a Mongo-esque interface. No SQL and not as performant as an RDBMS, but it's an option: https://milvus.io/docs/hybridsearch.md
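For anyone curious what that looks like in code, here's a rough pymilvus (2.x) sketch, assuming a hypothetical collection with an `embedding` vector field and scalar fields like `word_count` and `category`; a live Milvus server would be needed to actually run the search.

```python
# Hedged sketch of Milvus attribute filtering (field names hypothetical).
SEARCH_PARAMS = {"metric_type": "L2", "params": {"nprobe": 16}}

# Boolean expression evaluated against scalar fields alongside the ANN search:
FILTER_EXPR = 'word_count > 1000 and category == "tech"'

def filtered_search(collection, query_vec, limit=5):
    """Run an ANN search restricted by a scalar-field filter.

    `collection` is a pymilvus Collection connected to a live server.
    """
    return collection.search(
        data=[query_vec],
        anns_field="embedding",
        param=SEARCH_PARAMS,
        limit=limit,
        expr=FILTER_EXPR,
        output_fields=["title"],
    )
```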


Yes, exactly. My company has asked AWS if they will be adding pgvector support to RDS, but they haven't been able to confirm whether that will happen any time soon.

If the vectors are in the same database as the tabular/structured data, then text-to-SQL applications of LLMs are so much more powerful. The generative models will then be able to form complex queries that find similarity as well as perform aggregation, filtering, and joining across datasets. Doing this today with a separate dedicated vector DB is quite painful.
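To make the point concrete, here's the kind of single statement a text-to-SQL model could emit when the embeddings live next to the relational data (hypothetical `documents`/`orders` schema, pgvector syntax):

```python
# One query mixing similarity ranking with a join and aggregation -
# painful to replicate when vectors live in a separate dedicated store.
HYBRID_SQL = """
SELECT d.customer_id,
       COUNT(o.id)                AS order_count,
       MIN(d.embedding <-> %(q)s) AS best_distance
FROM documents d
JOIN orders o ON o.customer_id = d.customer_id
WHERE o.created_at > now() - interval '30 days'
GROUP BY d.customer_id
ORDER BY best_distance
LIMIT 10;
"""
```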


You could write an FDW (foreign data wrapper) that reads/writes to a vector database using Postgres-ID-tagged vectors. You can write to it from Postgres, reference it in queries, join on it, etc. That cuts out a lot of the pain of having separate databases; the only remaining issues are additional maintenance overhead and hidden performance cliffs.


With Postgres, you can do almost everything, including full-text search, but you still have Elasticsearch, Meilisearch, etc. when you need performance and advanced features. The multitool approach is suboptimal in most cases.


In small teams, the infrastructure is often underutilized, so performance is not an issue. Feature richness, however, lets such a team deliver higher-level features faster. Think early-stage startups (one or two engineers) or small businesses like hairdressers (they use a ready-made framework that targets a popular database and limits its features to reach a wide range of users). As a result, you can have a lot of such businesses, creating a very long tail.


For small startups, it's better to just use a managed solution like Pinecone or Qdrant and not think about infra at all.


SaaS products are infrastructure. Each different SaaS used is another piece that needs to be connected to the system and maintained; it thus becomes part of the system infrastructure. Each new SaaS piece has costs (time and money) associated with it.

That said, it's up to the individual company to decide if the added cost is worth it. Just because the cost exists doesn't mean it isn't worth it.


Open source software nowadays is very easy to use.

If your guy couldn't get a single piece of open source software running, you had the wrong guy :(

I can only see a managed service being useful when I have 100X traffic and a strong SLA is required.


I'm with you there. It seems like an extension to existing DBs would be better. I would like something like this for a file-based DB like SQLite.


sqlite-vss is an extension that adds vector search to SQLite: https://observablehq.com/@asg017/introducing-sqlite-vss


txtai combines SQLite and Faiss to enable vector search. It also does a lot more than that.

https://github.com/neuml/txtai


I work at Zilliz (https://zilliz.com) and am a part of the Milvus community. Here are some other resources in case anybody's interested in learning more about embeddings, vector search, and vector databases:

1) Embedding crash course https://developers.google.com/machine-learning/crash-course/...

2) What is a vector database? https://zilliz.com/learn/what-is-vector-database

3) Introduction to vector similarity search https://zilliz.com/blog/vector-similarity-search

4) ANN benchmarks http://ann-benchmarks.com

5) Unstructured data ETL https://github.com/towhee-io/towhee


Does Zilliz do BYOC?


Not yet, but this functionality should be coming soon. We're currently working on adding the capability to call third party embedding APIs directly from a Zilliz Cloud instance.


I was mulling over the idea of building keyword image search (say, with CLIP-based embeddings). However, I'm not really sure about the cost, or whether this is the best solution. Do you have any case studies about large deployments of this software, and what the upper limits of scale might be?


As another commenter noted, Milvus is overkill and a "bit much" if you're learning/playing.

A good intro to the field with progression towards a full Milvus implementation could be starting with towhee[0] (which is also supported by Milvus).

towhee has an example that does exactly what you want with CLIP[1]. As icing on the cake, the example notebook includes deployment of the model with Nvidia Triton Server[2], which IMO is hands down the best way to actually deploy an ML model[3].

[0] - https://towhee.io/

[1] - https://github.com/towhee-io/examples/tree/main/image/text_i...

[2] - https://github.com/triton-inference-server/server

[3] - https://news.ycombinator.com/item?id=35199418#35200266


Don't start with Milvus if you're learning. Too much yak shaving. Try https://github.com/criteo/autofaiss.

Also, TBH, it is a lot cheaper to run a simple faiss index.
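For scale: a flat Faiss index is just exhaustive search. A minimal numpy equivalent of `faiss.IndexFlatL2` (same results, no extra infrastructure, fine for tens of thousands of vectors) looks like:

```python
import numpy as np

def flat_l2_search(xb, xq, k):
    """Return (distances, indices) of the k nearest base vectors per query."""
    # Squared L2 distance between every query and every base vector.
    d2 = ((xq[:, None, :] - xb[None, :, :]) ** 2).sum(-1)
    idx = np.argsort(d2, axis=1)[:, :k]
    return np.take_along_axis(d2, idx, axis=1), idx

xb = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])  # base vectors
xq = np.array([[0.9, 0.1]])                          # one query
dists, ids = flat_l2_search(xb, xq, k=2)
# ids -> [[1, 0]]: [1,0] is nearest, then [0,0]
```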


Milvus standalone is pretty neat. I got it running successfully on the first try.


What is BYOC? Bring your own Cloud?


Learn faiss first. Milvus is very complicated in comparison, and faiss will work in many cases.

I mean look at their system diagram https://milvus.io/static/0bc2e74d0a1b20bbfb91bdbd03f77e5e/bb.... Not fun...


Looks pretty standard to me for a high availability cloud setup with disaggregated storage?


It feels like there is an influx of "vector databases" right now. I haven't had a strong answer out of anyone on why you'd be better off using these over Redis, which offers vector storage with similarity search in a battle-tested OSS solution.


I'm currently evaluating different vector stores and passed on Redis today after spending about half a day looking into it. Here's my reasoning:

    1. The Node.js client is designed to be just a thin wrapper around Redis commands. The client's docs basically just point you straight at the Redis docs.

    2. The `@redis/search` API is slightly different from the FT.SEARCH Redis command's API. The difference is not documented and cost me ~20 minutes. This is a pretty big problem given item 1.

    3. The `@redis/search` TypeScript types didn't allow some valid FT.SEARCH calls.
One thing to note is that for some use cases (IMO probably a large fraction of use cases right now), the vectors are derived data, so you don't necessarily need a super-robust, battle-tested "database" - you just need something with a simple happy-path API.
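For reference, the happy path being discussed - a vector KNN query via RediSearch - looks roughly like this with redis-py's raw-command layer. The index/field names are hypothetical, and the actual search call (commented out) assumes a Redis server with RediSearch >= 2.4; only the blob-packing helper runs standalone.

```python
import struct

def to_blob(vec):
    """Pack floats as the little-endian float32 bytes RediSearch expects."""
    return struct.pack(f"<{len(vec)}f", *vec)

# KNN query in RediSearch syntax; vector queries require DIALECT 2.
KNN_QUERY = "*=>[KNN 5 @embedding $vec AS score]"

# With a live client (r = redis.Redis()):
# r.execute_command("FT.SEARCH", "idx", KNN_QUERY,
#                   "PARAMS", "2", "vec", to_blob(query_vec),
#                   "SORTBY", "score", "DIALECT", "2")

print(len(to_blob([0.0, 0.0, 0.0, 0.0])))  # -> 16 (4 floats x 4 bytes)
```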


1. node-redis or ioredis? If you are using a "modern IDE" (VS Code/WebStorm/any IDE with support for TS language server) you should get autocomplete for all the commands (in both of them).

2. The `@redis/search` package extends `@redis/client` with support for all RediSearch commands (you should get autocomplete for all RediSearch commands as well). If you want the "all-in-one" package (support for vanilla Redis + all Redis modules), you can use the `redis` package instead.

3. Can you please share the command you are trying to run?

BTW, these examples might help:

1. https://github.com/redis/node-redis/blob/master/examples/sea...

2. https://github.com/redis/node-redis/blob/master/examples/sea...

3. Check out `redis-om`: https://github.com/redis/redis-om-node/tree/main


What do you think about Weaviate or Qdrant? There is a nice benchmarks overview with all the major players https://qdrant.tech/benchmarks/


Qdrant v1.1 was released recently and its quantization feature is just fantastic. See: https://github.com/qdrant/qdrant


I haven't really looked into either. So far I've tested Pinecone, Redis, Chroma, Elasticsearch, and pgvector. I'm not really considering performance at all, just looking for something dead simple to deploy and use. At the moment it looks like pgvector on Supabase is the winner.


Yeah, if you don't need performance, then just take a simple ANN library. No need for a database at all. In terms of databases, Qdrant and Pinecone are the simplest I've tried so far. But Pinecone isn't open source, so it's not an option for on-prem. pgvector is too much IMHO - I don't want all the other SQL stuff if I just need an NN search.


Not trying to say you're wrong in your decision, but in the context of the post you're replying to, aren't these points arguments against the node client rather than redis itself?


One positive thing I can say about Weaviate is that it's incredibly simple to get started with it, as it handles talking to the vectorizer(s) for you, and they provide many out of the box container images for them. With that we were able to integrate a similarity search into our product very quickly.

From what I can tell, its competitors generally don't do this - you have to handle generating the embeddings and all the back and forth with them yourself, setting the barrier to integration much higher.

If you are at a level where you want a lot more control over embeddings, I agree that it will often be better to just use the vector search capabilities that existing solutions like Redis/Postgres are gaining.


Correct - this is actually a core focus of Weaviate. That's why it's not limited to only ANN but also has an inverted index, optional modules, etc.


Or Postgres with pgvector

https://github.com/pgvector/pgvector


Redis uses too much memory and supports only two index types: FLAT (very slow exact search) and HNSW. If you start indexing millions of large vectors, you will quickly understand the problem.

That being said, many of these DBs are overcomplicated for most use cases. Redis HNSW will work for many use cases.


Enterprise has tiered memory. We have customers in the 50-100M range, and we've benchmarked more than that as well. Single-query latency will usually be better than most of the field. Also, we're working on other variants of our index that will scale to 500M+.

More to come. Follow https://github.com/redisventures to keep up with our team.


Redis does fairly well single-threaded - iirc it does better than Milvus. The problem is, once you move to larger applications and want higher performance, its architectural limitations make it impossible to scale.


It uses a lot of memory because it's an in-memory DB; that's by design.


We've seen a huge uptick in users of Redis VSS. This thread is great feedback for us so thank you.


Vector search is still a semi active area of research. You can't get "battle tested" and optimal performance at the same time.


Funnily enough, I just started working on a vector search extension for SQLite today: https://github.com/Ronsor/VecDex


That is great! I'll keep an eye on it.

I've been playing with this extension: https://github.com/asg017/sqlite-vss


Related:

Milvus – An Open-Source Vector Similarity Search Engine - https://news.ycombinator.com/item?id=22012300 - Jan 2020 (26 comments)

(for curiosity purposes only – reposts are ok after a year or so: https://news.ycombinator.com/newsfaq.html)


A good set of resources if you wanna check out some other ones - been reviewing some myself. https://github.com/currentslab/awesome-vector-search

I was surprised to see Elastic actually has ok support for some of this stuff, though it appears slower for most of the tasks.


We added HNSW-based vector search to Typesense as well recently: https://typesense.org/docs/0.24.0/api/vector-search.html

So you can combine attribute-based filters along with nearest-neighbor search.

Put together this semantic search + filtering demo just last week: https://github.com/typesense/typesense-instantsearch-semanti...


Thanks for the link. Nice to see Marqo on there (disclaimer: I am a co-founder of Marqo). For anyone interested, it includes a really nice API for handling a lot of the manipulations and operations you want to do (adding, updating, and patching documents; filtering; embedding only a subset of fields; multi-modal querying; multi-modal document representations), which are absent from vector DBs. It also takes care of inference: https://github.com/marqo-ai/marqo


TBH, it looks like a huge, overengineered legacy project. What is the point of having all these ANN indexes in place? Is it some kind of art collection? What is the point when you can just have HNSW in memory, with quantization, or on disk, GPU-accelerated, etc.? There are already better alternatives like Qdrant, which is written in Rust and super performant (https://github.com/qdrant/qdrant), or Weaviate with its GraphQL interface (https://github.com/weaviate/weaviate).


Did you try to serve billion-scale vectors on either of the projects you mentioned? Then you can define what is overengineered.

And what about ScaNN and GPU indexes? The flexibility to support multiple index types is actually a big plus.


What algorithms are these databases using for fast similarity search?

Are they guaranteed to return the optimal nearest neighbors, or will they be sloppy in return for efficiency?


One key differentiator of Milvus is the ability to use a variety of different indexing algorithms. Some will consume very little memory and provide significant speedup at the expense of recall, while others are more powerful at the expense of compute time or memory consumption.

The most commonly used one I've seen is HNSW, which greatly speeds up search at the expense of some extra memory consumption. Recall is also strong - for many large-scale use cases you can get 95%+ with properly tuned parameters.


I haven’t tried many alternatives, but I needed a fast, self-hosted vector similarity search with the ability to cull results based on a second criterion. Milvus has worked really well for me. It does take a ton of memory, though, such that I can’t run it on small VMs, and it runs a large number of supporting services.


hnswlib? Local. Single file. Allows pre-filtering.


Ah I see, they released Milvus 2.0 which is similar to Pinecone. What are the differences?


Milvus is completely open source (https://github.com/milvus-io/milvus) and supports a variety of index types (https://milvus.io/docs/overview.md#Index-types), various consistency levels, scalar/metadata filtering, and time travel. We started working on Milvus back in 2018, with 2.0 being released in January 2022 (https://github.com/milvus-io/milvus/releases/tag/v2.0.0).

For those interested, here's a comparison with other open source vector databases: https://zilliz.com/comparison. For those who don't want to be burdened with installing and maintaining a local database, there's a managed service available as well: https://zilliz.com/cloud.


I'm affiliated with Qdrant and was quite surprised to see us listed in your comparisons with some false statements. If you claim to be the only database with billion-scale vector support, it would be great to make your benchmarks public, as we did: https://qdrant.tech/benchmarks/

BTW, Milvus is described in your comparison as "a fully open source and independent project", while Weaviate and Qdrant, by contrast, are "maintained by a single commercial company offering a cloud version". Why, then, is the suggested path in the Milvus quick start on GitHub to use Zilliz Cloud?


Would you please update the Milvus score in your benchmark with the latest Milvus? Thanks.


Thanks! BTW, it only compares open-source databases, and Pinecone, a very popular cloud-managed option, isn't in it. Would be cool if you could benchmark/compare it.


Biggest difference is Milvus is self-hosted (manage your own infra) while Pinecone is a managed service. There are other differences but that’s often a driving factor.

We (Pinecone) tend to attract customers who want to start, ship, and scale quickly and reliably without worrying about infra and ops overhead.

https://www.pinecone.io


I think this update introduces Milvus 2.0 with "Managed Milvus", which is not self-hosted but cloud-managed. You can create an account and see that it's similar to Pinecone.


Yup! Zilliz Cloud (managed Milvus) provides free credits for you to get started with a full-service (no throttling or other limitations) PoC as well.


I only heard about vector databases with the recent advent of AI. Assuming they've been around for a while, what were the benefits of using them over "normal" search engines (e.g. ElasticSearch)?


Traditional search engines such as ES and Lucene rely primarily on bag-of-words retrieval and keyword matching, e.g. BM25, TF-IDF, etc. Vector databases such as Milvus allow for _semantic_ search.

Here's a highly simplified example: if my query was "fields related to computer science", a semantic engine could return "statistics" and "electrical engineering" while avoiding things such as "social science" and "political science".


OpenSearch (the open-source Elasticsearch fork by AWS) has supported similarity search for embeddings for a couple of years now with its k-NN plugin. It supports two engines - nmslib and Faiss, both offering HNSW - and has post-result filtering support and replication support. A good option, IMO.


ES has support for vector search now too. Really, you want both in use cases where the user expects the top results to contain the search keywords, but also wants results that are synonyms or conceptually similar. TF-IDF and BM25 help with the first part, and vectors help with the second. Theoretically only vectors should be needed, but that isn't my experience in practice.


Totally agree. The thing is that ElasticSearch does not meet our requirements in vector searching.

I am currently running Milvus + ElasticSearch, and it works perfectly. The latest Milvus version is super fast and scalable (>50M vectors). Haven't tried Zilliz Cloud; have to find out what the cost is.

I am old school. IMO ElasticSearch is only good for keyword search, and these so-called "vector database" products are only good for vector search.


If ES doesn't work for you, I recommend Vespa. https://github.com/vespa-engine/vespa

Others have made other suggestions, but Vespa has two unique features: first, it is battle-tested at large scale; second, it supports combining keyword and vector scores in several ways. The latter is something other hybrid systems don't do very well, in my experience.


Didn't even realise Milvus was so lacking. https://github.com/marqo-ai/marqo also has a hybrid approach. It's just a more complete/end-to-end platform than pinecone, so it really just depends on what you're building


I personally like Milvus very much.

My point is I only trust stuff that focuses on its own business. Especially for small startups.


Could you please elaborate on how you utilize both of them together, and for which specific use case? I'm attempting to gain a better understanding of the hybrid approach.


Certainly!

The trick is to make ElasticSearch scores "comparable" to Milvus scores. There are lots of ways to do this, but no single good solution. For example, you could calculate BM25 scores offline, or use TF-IDF scores to do some kind of filtering. Again, there's no single perfect answer. You'd have to do a lot of experimentation on your own use case and your own data to get the best results.

Also a lot of tuning needs to be done during all phases: 1) query pre-processing 2) query tokenizing 3) retrieval 4) ranking and reranking

I personally would not trust any universal "hybrid-search" solutions. All toy demos.

It usually takes 5-10 good engineers to build a decent search engine/system for any real use case. It also requires a lot of tuning, tricks, and hand-written rules to make things work.
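One common way to make the scores comparable, as described above, is to normalize each system's scores before blending (rank-based fusion like RRF is another, since it sidesteps score scales entirely). A toy sketch of normalize-and-blend - not one of the "universal" solutions, since alpha still has to be tuned per dataset:

```python
def minmax(scores):
    """Rescale a {doc_id: score} dict to the [0, 1] range."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {k: (v - lo) / span for k, v in scores.items()}

def fuse(bm25, vector, alpha=0.5):
    """Weighted blend of normalized BM25 and vector scores."""
    b, v = minmax(bm25), minmax(vector)
    docs = set(b) | set(v)
    return {d: alpha * b.get(d, 0.0) + (1 - alpha) * v.get(d, 0.0) for d in docs}

bm25 = {"doc1": 12.0, "doc2": 7.0, "doc3": 1.0}   # keyword scores
vec = {"doc2": 0.95, "doc3": 0.90, "doc4": 0.10}  # vector similarities
ranked = sorted(fuse(bm25, vec).items(), key=lambda kv: -kv[1])
# doc2 wins: it scores well in both systems
```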


Which is why Pinecone supports hybrid search, which has been shown to provide better results for out-of-domain use cases than either semantic search or keyword search alone: https://www.pinecone.io/learn/hybrid-search-intro/


IMO vector databases should not mess with ElasticSearch.

The real focus should be on improving the recall of vector search. Pity that nobody is doing real AI research here. Money is wasted on marketing and branding.


This is a good explanation by Niels Reimers (who created SBERT): https://www.youtube.com/watch?v=ukIYZw3uRX0


Somebody already mentioned it on this thread, but with Weaviate, we aim to create a more end-to-end experience that can be used open source or as (hybrid) SaaS. For example, you can store the complete data object, combine ANN and inverted-index filters (hybrid search), and use optional modules for vectorization. More info about the concepts behind Weaviate here: https://weaviate.io/developers/weaviate/concepts


Surprised to see that no one mentioned https://vespa.ai/ - the best alternative to a vector DB, IMHO.


+1, Vespa is on another level. Three years in production and it's been a dream!


How does this compare to pinecone?


Pinecone is closed-source AND hosted-infra only... which is a non-starter for many companies.


Pinecone offers a free managed tier, which was quite nice until it lost my data last month. To be fair to them, they did eventually recover it a few days later.


Qdrant also offers a 1GB free tier https://cloud.qdrant.io


Check out trychroma.com


I am not sure why Milvus is so popular. There are other pretty good options like Weaviate and Vespa - both are open source.

One important limitation of Milvus that I cannot ignore is that it does not support pre-filtering like Vespa and a few others do. Pre-filtering is critically important for use cases in e-commerce etc. where filters based on structured metadata are applied e.g. brand, category etc.


I'm fairly sure Milvus supports it. Milvus calls it hybrid search, though: https://milvus.io/docs/hybridsearch.md#Conduct-a-hybrid-vect...


Don’t laugh at me, but is there a vector database solution for MySQL/MariaDB? It’s been my database of choice for years.


Can these databases do fast averaging of nearest neighbors for regression or do you have to retrieve the neighbors and manually compute a mean across them?


They are primarily for storing embeddings for similarity search - not general-purpose nearest-neighbor algorithms.


You'll have to manually compute the mean after retrieving the nearest neighbors. Automatic fast averaging could be an interesting capability for clustering, or for using a vector database to generate training data, though.
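A minimal numpy sketch of that retrieve-then-average pattern (k-NN regression), with the exhaustive search standing in for whatever the database would return:

```python
import numpy as np

def knn_regress(xb, y, q, k):
    """Predict by averaging the targets of q's k nearest base vectors."""
    d2 = ((xb - q) ** 2).sum(axis=1)   # squared L2 distance to every base vector
    nearest = np.argsort(d2)[:k]       # indices the DB retrieval step would return
    return float(y[nearest].mean())    # the manual averaging step

xb = np.array([[0.0], [1.0], [2.0], [10.0]])  # base vectors
y = np.array([1.0, 2.0, 3.0, 100.0])          # regression targets
pred = knn_regress(xb, y, np.array([1.1]), k=3)  # averages 1, 2, 3 -> 2.0
```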


Interesting. What systems or algorithms are you referring to by “clustering”?


Could you please add a pause button for the animations on your home page? They jump unexpectedly when you are reading the code snippet.


Tried Milvus first and never really got it off the ground. Ended up with Qdrant for its simplicity.


Don't start with the Milvus clustered version unless you have ~100 million vectors.

Try Milvus standalone instead - much simpler. I also just found their embedded Python version (https://github.com/milvus-io/embd-milvus), which is quite neat.


Also my experience. And since Qdrant now has a cloud service, it is the number one choice. Best of both worlds. Just missing GCP support.



