How to Identify Duplicate and Similar Text in PHP

It’s not a common problem, but sometimes you have to check whether two texts are similar. If you have ever aggregated data from multiple sources, you probably know what I’m talking about.

The simplest thing you can try is to compare the two strings directly. A direct comparison will not help if one of the strings contains so much as an extra space, so a more tolerant algorithm is needed for such cases. Fortunately, PHP provides several functions that can be used.
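
As a minimal sketch of why a plain comparison falls short, the snippet below compares two headlines that differ only by a doubled space. The sample strings and the helper normalize_whitespace() are made up for illustration, and collapsing whitespace is only one assumption about what should count as "the same" text.

```php
<?php
// Two sample headlines; the second one has a doubled space after the colon.
$a = "Breaking news: PHP 8 released";
$b = "Breaking news:  PHP 8 released";

// A strict comparison treats them as different strings.
var_dump($a === $b); // bool(false)

// Hypothetical helper: collapse runs of whitespace and trim the ends.
function normalize_whitespace(string $text): string
{
    // preg_replace() returns null only on a regex error; fall back to the input.
    return trim(preg_replace('/\s+/', ' ', $text) ?? $text);
}

// After normalization the two headlines compare as equal.
var_dump(normalize_whitespace($a) === normalize_whitespace($b)); // bool(true)
```

Normalization like this only catches differences you anticipated. For typos or reworded text, the functions described next are a better fit.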

soundex() – accepts a string as a parameter and returns a string 4 characters long, starting with a letter. The result is called the soundex key, and words that are pronounced similarly produce the same soundex key. It should be used only for single words.
levenshtein() – returns the Levenshtein distance between two strings: the lower the number, the more similar the strings. Before PHP 8.0 it could be used only for strings of at most 255 characters.
similar_text() – calculates the similarity between two strings and can return the result as the number of matching characters in both strings or as a similarity percentage.

soundex() is a PHP function that implements the soundex algorithm described by Donald Knuth in “The Art Of Computer Programming, vol. 3: Sorting And Searching”. It can be used to find similar words or misspelled words, or to create indexes that simplify database searches when the pronunciation is known but not the spelling.
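
The indexing idea can be sketched as follows. The function soundex_index() and the word list are hypothetical and exist only for this example; a real application would more likely store the key in a database column next to the word.

```php
<?php
// Hypothetical helper: group words by their soundex key so that
// similarly pronounced words end up in the same bucket.
function soundex_index(array $words): array
{
    $index = [];
    foreach ($words as $word) {
        $index[soundex($word)][] = $word;
    }
    return $index;
}

$index = soundex_index(['Robert', 'Rupert', 'Rubin', 'Knuth', 'Kant']);

// "Robert" and "Rupert" share the key R163, "Knuth" and "Kant" share K530,
// while "Rubin" gets its own key, R150.
print_r($index);

// Look up candidates for a word whose spelling is uncertain.
$candidates = $index[soundex('Rupurt')] ?? [];
print_r($candidates); // the R163 bucket: Robert, Rupert
```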

levenshtein() – the Levenshtein distance is the minimal number of characters you have to replace, insert or delete to transform one string into another. The complexity of the algorithm is O(m*n), where m and n are the lengths of the two strings. Before PHP 8.0 the function could be used only on strings of at most 255 characters (it returned -1 for longer input). If you want to aggregate RSS feeds and eliminate duplicates, this function could be just fine if the RSS items are truncated.
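
A rough sketch of that duplicate check is shown below. The helper is_near_duplicate(), the threshold of 3 edits and the sample titles are assumptions made for this example, not recommended values. Truncating to 255 bytes keeps the call within the limit that applies to older PHP versions.

```php
<?php
// The classic textbook pair: three edits turn "kitten" into "sitting".
echo levenshtein('kitten', 'sitting'), "\n"; // 3

// Hypothetical helper: treat two items as duplicates when their
// (truncated) texts are at most $maxDistance edits apart.
function is_near_duplicate(string $a, string $b, int $maxDistance = 3): bool
{
    // Truncate so the call also works where the 255-character limit applies.
    $a = substr($a, 0, 255);
    $b = substr($b, 0, 255);

    return levenshtein($a, $b) <= $maxDistance;
}

var_dump(is_near_duplicate('PHP 8.0 released', 'PHP 8.0 released!')); // bool(true)
var_dump(is_near_duplicate('PHP 8.0 released', 'Python 3 tutorial')); // bool(false)
```

A fixed threshold is the simplest choice. For texts of very different lengths it may work better to scale the threshold with the string length.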

similar_text() – calculates the similarity between two strings. It returns the number of matching characters in both strings and can also report a percentage of how similar the strings are. The function implements the algorithm recursively and has a complexity of O(N**3), where N is the length of the longer string. It is not as efficient as levenshtein(), but it can be used for strings longer than 255 characters.
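
A short sketch of both return forms follows. The sample strings are made up; the percentage is written into the optional third argument, which is passed by reference.

```php
<?php
// The return value is the number of matching characters;
// the third argument receives the similarity percentage.
$matching = similar_text('World', 'Word', $percent);

echo $matching, "\n";          // 4
echo round($percent, 2), "\n"; // 88.89

// Strings longer than 255 characters are accepted as well.
$long = str_repeat('abcdefghij', 30);          // 300 characters
$longMatching = similar_text($long, $long . '!', $longPercent);

echo $longMatching, "\n"; // 300
```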

There are a few remarks about similar_text() you should consider:
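– if you apply it to two texts of which one is the first half of the other, the similarity is only about 66.7% (twice the number of matching characters divided by the combined length of both strings), even though the shorter text is fully contained in the longer one.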
– the function might take a few seconds to compare strings of more than 20 000 characters on a regular computer.
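
If a truncated copy should count as a full match, one possible workaround for the first remark is to relate the number of matching characters to the length of the shorter string instead of using the built-in percentage. The helper containment_percent() below is a hypothetical name and only a sketch of that idea.

```php
<?php
// Hypothetical helper: percentage of the SHORTER string that is matched.
function containment_percent(string $a, string $b): float
{
    $shorter = min(strlen($a), strlen($b));
    if ($shorter === 0) {
        return 0.0; // nothing to compare against
    }

    return 100.0 * similar_text($a, $b) / $shorter;
}

// The first string is exactly the first half of the second one.
var_dump(containment_percent('abcde', 'abcdefghij')); // float(100)

// Strings with no characters in common score zero.
var_dump(containment_percent('abc', 'xyz')); // float(0)
```

This measure is deliberately lopsided: a very short string that happens to occur inside a long text also scores 100%, so it is best combined with a minimum length check.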

3 Comments

  1. Thanks, had no clue these functions existed within PHP. Will probably come in handy in the future.
