What is a near-dupe, really? | Clustify Blog – eDiscovery, Document Clustering, Technology-Assisted Review (Predictive Coding), Information Retrieval, and Software Development
This article looks at three reasonable, but different, ways of defining the near-dupe similarity between two documents. It also explains the popular MinHash algorithm, and shows that its results may surprise you in some circumstances. Near-duplicates are documents that are nearly, but not exactly, the same. They could be different revisions of a memo where a few typos were fixed or a few sentences were added. They could be an original email and a reply that quotes the original and adds a few sentences. They could be a Microsoft Word document and a printout of the same document that was scanned and OCRed with a few words not matching due to OCR errors.
5 weeks ago by sharon_howard
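For readers new to MinHash, here is a minimal Python sketch of the general idea mentioned in the entry above: compare two documents by the Jaccard similarity of their word shingles, then estimate that similarity from short signatures. It is not taken from the linked article. The function names, the 3-word shingle size, the 128 hash functions, and the use of seeded BLAKE2b hashes are all assumptions chosen for illustration.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles in text (k=3 is an arbitrary choice)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def jaccard(a, b):
    """Exact Jaccard similarity: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(item, seed):
    """Deterministic 64-bit hash of item; each seed acts as a different hash function."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=128):
    """For each hash function, keep the smallest hash value over a non-empty set."""
    return [min(_hash(x, seed) for x in items) for seed in range(num_hashes)]


def minhash_similarity(sig_a, sig_b):
    """Fraction of matching signature slots; this estimates the Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumped over the lazy dog"
exact = jaccard(shingles(doc_a), shingles(doc_b))
estimate = minhash_similarity(
    minhash_signature(shingles(doc_a)), minhash_signature(shingles(doc_b))
)
print(exact, estimate)
```

The estimate is random around the exact value, and its spread shrinks as `num_hashes` grows. Changing a single word alters every shingle that contains it, which is one way a small edit can lower the measured similarity more than a reader might expect.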
That Looks Oddly Familiar — Jan Stępień
Perceptual hashing is a fascinating technique of summarising media files. It has little in common with cryptographic hashes such as SHA1. Two input files which look similar will end up having different cryptographic hashes but similar perceptual hashes. And by similar, we mean having most bits set the same way.

In this talk we'll combine pHash and a BK-tree to efficiently search through metric spaces of perceptual hashes. We will use Rust to implement a simple command line tool. It will ...
7 weeks ago by badboy
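The talk described above uses Rust and pHash. As a rough illustration of the same idea in Python, the sketch below treats perceptual hashes as plain integers, measures similarity as Hamming distance (the number of differing bits), and indexes them in a BK-tree. The class and method names are hypothetical, and the 4-bit values are made-up stand-ins for real perceptual hashes.

```python
def hamming(a, b):
    """Number of bit positions in which two integer hashes differ."""
    return bin(a ^ b).count("1")


class BKTree:
    """BK-tree over any metric; each node is [value, {distance: child_node}]."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None

    def add(self, value):
        if self.root is None:
            self.root = [value, {}]
            return
        node = self.root
        while True:
            d = self.distance(value, node[0])
            if d == 0:
                return  # value is already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = [value, {}]
                return
            node = child

    def search(self, query, radius):
        """Return sorted (distance, value) pairs with distance <= radius."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            value, children = stack.pop()
            d = self.distance(query, value)
            if d <= radius:
                results.append((d, value))
            # Triangle inequality: only subtrees whose edge label lies in
            # [d - radius, d + radius] can contain a match.
            for edge, child in children.items():
                if d - radius <= edge <= d + radius:
                    stack.append(child)
        return sorted(results)


tree = BKTree(hamming)
for h in (0b0000, 0b0001, 0b0011, 0b1111):
    tree.add(h)
print(tree.search(0b0000, 1))  # [(0, 0), (1, 1)]
```

The pruning step is what makes the tree worthwhile: subtrees whose edge label falls outside the allowed range are never visited, so a query with a small radius usually touches only part of the index.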
Vector and Line Quantization for Billion-scale Similarity Search on GPUs
Hrm, I wonder if this hierarchical inverted index structure could be applied to specialist database software generally, or perhaps a general GPU database?
january 2019 by asteroza
Why Would a Java Engineer Love Frontend Development?
It often happens that backend developers don’t like working with a frontend. Even more, some hate frontend development. The complaints are always the same: J...
december 2018 by gilberto5757
Similarity search (implemented in Python)
It takes a list of sets as input and outputs the pairs that meet the similarity threshold.
november 2018 by GreggInCA
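To make the input and output described above concrete, here is a brute-force Python sketch of an all-pairs search. It is not the bookmarked project's code or API. The function name `all_pairs` and the choice of Jaccard similarity are assumptions, and the sketch compares every pair directly, which a dedicated library would avoid.

```python
from itertools import combinations


def all_pairs(sets, threshold):
    """Return (i, j, similarity) for every pair of sets whose Jaccard
    similarity is at least threshold. Compares all pairs, so it is O(n^2)."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(sets), 2):
        union = len(a | b)
        sim = len(a & b) / union if union else 1.0
        if sim >= threshold:
            pairs.append((i, j, sim))
    return pairs


sets = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
print(all_pairs(sets, 0.5))  # [(0, 1, 0.5)]
```

Each result names the two sets by their positions in the input list, so the caller can map a match back to the original documents.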

