How to compare textual documents

April 4, 2023

Konstantina Lazaridou, Machine Learning Engineer


Working with texts can be a lot of fun, but it can also be tedious at times. Written documents (whether produced by humans or by AI models) can be complex in structure, they might contain creative or unique language, and sometimes even mistakes. Analysing a piece of text automatically can save time. For instance, we could easily compare news articles on a given topic from different publishers and understand how the original content changed over time from one media source to another.

This post is a tutorial for text (string) comparison with the Python libraries difflib and fuzzywuzzy.

We start with a showcase of difflib, which calculates the differences (deltas) between two sequences of items. This library can be used for other types of data as well, like files.

from typing import List, Dict
import difflib


def get_changes_in_text(first_text: str, second_text: str) -> List[Dict]:
    """Compare the first text with the second and return the list of fine-grained changes.

    The list contains one dictionary per change. The second text can be
    reconstructed from the first by applying the listed changes at the
    respective indices.

    Args:
        first_text (str): First sentence or document
        second_text (str): Second sentence or document

    Returns:
        List[Dict]: List of dicts, where each dict describes one change:
        {start index, end index, token, diff}. The diff is either an
        addition (+) or a deletion (-).
    """
    changes = []
    current_token = ""
    start_index = 0
    current_index = 0
    current_diff = ""

    def store_current_change() -> None:
        # store the accumulated change with a spacy-like span format
        changes.append(
            {
                "start_index": start_index,
                "end_index": current_index,
                "token": current_token,
                "diff": current_diff,
            }
        )

    # iterate over the descriptions of all differences between the two texts;
    # each item looks like "+ x" (added), "- x" (deleted) or "  x" (unchanged)
    for i, s in enumerate(difflib.ndiff(first_text, second_text)):
        # print every modified character, to make the examples below easy to follow
        if s[0] != " ":
            print(f"i={i}, s={s}")
        # the character at `s[2]` changed and we are either starting a new
        # change or continuing one of the same type
        if s[0] != " " and (not current_diff or current_diff == s[0]):
            current_token += s[2]
            # newly seen change
            if not current_diff:
                start_index = i
            # consecutive character changes extend the current span
            current_index = i
            current_diff = s[0]
        # an unchanged character completes the current change, if any: store it and reset
        elif s[0] == " ":
            if current_diff:
                store_current_change()
            current_token = ""
            start_index = 0
            current_index = 0
            current_diff = ""
        # a change of the opposite type starts: store the current one and move on to the next
        else:
            store_current_change()
            current_token = s[2]
            start_index = i
            current_index = i
            current_diff = s[0]
    # store a change that runs until the very end of the diff
    if current_diff:
        store_current_change()
    return changes

The above code snippet introduces the function get_changes_in_text(), which collects all delta information between two input texts into a list of Python dictionaries. Each dictionary represents one delta and follows a spaCy-like span format (spaCy being the well-known natural language processing library), in order to indicate exactly where the delta starts and ends.

In particular, we first call the function ndiff from difflib, which computes the deltas between two sequences of items. ndiff is typically applied to lists of lines, but since we pass plain strings here, it compares the two texts character by character. For each delta that is returned, we process it to bring it into our desired format and then aggregate all deltas together.
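To see the raw deltas our function consumes, here is a minimal illustration of what ndiff yields for two short strings (the example words are ours). Each item is a two-character prefix ("+ ", "- " or "  ") followed by the character itself:

import difflib

# each character of the inputs becomes one item in the diff sequence
print(list(difflib.ndiff("cat", "cart")))
# ['  c', '  a', '+ r', '  t']

To make things clearer, here is a call to get_changes_in_text() with a simple example: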

get_changes_in_text(
    "“This is a good decision,” said Robert Habeck, the German vice-chancellor and economy minister",
    "“This is not a good decision,” said Robert Habeck, the German vice-chancellor and economy minister.",
)

The printed output of the function is shown below. By comparing the second sentence with the first, it tells us that the word “not” was added to the text (spanning positions 9 to 11), together with a space right before it (position 8) and a period at the end of the sentence (position 98). Note that these positions are character indices in the aligned diff sequence; since this example contains only additions, they coincide with character positions in the second text.

i=8, s=+ 
i=9, s=+ n 
i=10, s=+ o 
i=11, s=+ t 
i=98, s=+ .
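While the lines above come from the print statement inside the function, the returned list aggregates consecutive character changes into spans. For this pair of sentences, the returned value looks as follows (note that the space before “not” is merged into the same span as the word itself):

[
    {"start_index": 8, "end_index": 11, "token": " not", "diff": "+"},
    {"start_index": 98, "end_index": 98, "token": ".", "diff": "+"},
]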

In the next example, we see a more complex case in the German language, where three words are removed (“sollen”, “Mitte” and the trailing “werden”), two words are introduced (“werden” and “Anfang”), and a period is added at the end:

get_changes_in_text(
    "Die Maßnahmen sollen Mitte Februar überprüft werden",
    "Die Maßnahmen werden Anfang Februar überprüft.",
)

i=14, s=- s
i=15, s=- o
i=16, s=- l
i=17, s=- l
i=18, s=+ w
i=19, s=+ e
i=20, s=+ r
i=21, s=+ d
i=25, s=- M
i=26, s=- i
i=27, s=- t
i=28, s=- t
i=29, s=- e
i=30, s=+ A
i=31, s=+ n
i=32, s=+ f
i=33, s=+ a
i=34, s=+ n
i=35, s=+ g
i=54, s=+ .
i=55, s=-  
i=56, s=- w
i=57, s=- e
i=58, s=- r
i=59, s=- d
i=60, s=- e
i=61, s=- n

Hence, get_changes_in_text() can be applied to find differences in word choice between two or more media outlets, and to locate exactly where these edits occur in the text.

One more Python package that can be used to compare texts in a non-strict way is fuzzywuzzy, also known as TheFuzz (and reimplemented in the faster RapidFuzz). This library compares two texts using the Levenshtein distance (the number of single-character edits necessary to change one string into the other) and also offers multiple ways to compare texts at the word (token) level. The following example uses one of its simpler functions, ratio(), to compute the similarity between the two sentences. It returns a similarity of 97%. This aligns with what we saw above for the same pair of sentences, i.e., the only small differences between them are the word “not” and punctuation.

from fuzzywuzzy import fuzz

print(
    fuzz.ratio(
        "“This is a good decision,” said Robert Habeck, the German vice-chancellor and economy minister",
        "“This is not a good decision,” said Robert Habeck, the German vice-chancellor and economy minister.",
    )
)

If we wanted to make the comparison more lenient, we could compare the two sentences in lowercase. However, the similarity percentage for the above example does not change in this case; generally, lowercasing before comparing makes more sense when the texts are more varied and complex.
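As a minimal sketch, lowercasing is a one-line change before the call:

print(
    fuzz.ratio(
        "“This is a good decision,” said Robert Habeck, the German vice-chancellor and economy minister".lower(),
        "“This is not a good decision,” said Robert Habeck, the German vice-chancellor and economy minister.".lower(),
    )
)
# still 97 here, since the two sentences already use identical casing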
Another function offered by fuzzywuzzy is partial_ratio(), which uses the most similar sub-string between the two given texts to compute their similarity. This function can be useful when we aim to compare two short texts in a more relaxed way, e.g., when two articles refer to the same facts, such as the same public figures, even though they might not use exactly the same words to describe them. For example, if we compare the two following snippets with ratio() we get a similarity of 86%, but if we use partial_ratio() we reach a 98% similarity, which fits our use case better. In addition, if the second text did not include the extra quote character at the beginning, the similarity would be 100%, because one of the texts would be fully contained in the other.

print(
    fuzz.partial_ratio(
        "“Robert Habeck, the German vice-chancellor and economy minister",
        "“the German vice-chancellor and economy minister",
    )
)
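And indeed, dropping the leading quote from the second text makes it a true substring of the first, which partial_ratio() scores as a perfect match:

print(
    fuzz.partial_ratio(
        "“Robert Habeck, the German vice-chancellor and economy minister",
        "the German vice-chancellor and economy minister",
    )
)
# 100: the second string is contained verbatim in the first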

Fuzzywuzzy has more options to compare texts based on their word differences. For instance, token_set_ratio() tokenizes the texts and compares their token sets with ratio(), while token_sort_ratio() sorts the tokens alphabetically before comparing; both come in partial matching versions, partial_token_set_ratio() and partial_token_sort_ratio(). If the order of words does not matter for a given application, the sort methods can be appropriate; specifically, token_sort_ratio() can give high similarity scores to strings that are very similar but have their words in a different order, as the sketch below shows. Furthermore, the token-based methods include some basic string pre-processing as well: trimming/stripping, lowercasing, and keeping only letters and numbers.
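As a quick sketch, with example strings of our own:

from fuzzywuzzy import fuzz

a = "economy minister Robert Habeck"
b = "Robert Habeck economy minister"
# plain ratio() compares characters in order, so the reordered words lower the score
print(fuzz.ratio(a, b))
# token_sort_ratio() lowercases and sorts the tokens first, so it returns 100
print(fuzz.token_sort_ratio(a, b))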

Conclusion

This post introduced two Python libraries for string comparison. The presented code can be used for any kind of text, e.g., fictional or non-fiction articles, and it can be helpful for various applications, such as finding differences in the writing style of two authors or discovering which new facts have been added to an original story. difflib in particular can be useful for data lineage and versioning, while fuzzywuzzy is great for data matching and deduplication.
A challenge to consider and investigate is whether the application deals with documents written in various alphabets, and thus whether those languages are supported by the libraries. Finally, in the case of a large data collection, the speed of the string comparison can be increased by installing the python-Levenshtein library, which fuzzywuzzy picks up automatically when available, or alternatively by using the above-mentioned RapidFuzz, which offers a largely compatible API and reports faster performance.
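As a small sketch, assuming RapidFuzz is installed (pip install rapidfuzz), switching over can be as simple as changing the import:

from rapidfuzz import fuzz

# RapidFuzz mirrors the fuzzywuzzy scorers, but returns floats instead of rounded integers
print(fuzz.ratio("“This is a good decision,”", "“This is not a good decision,”"))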
