Another related and also important operation is to find the minimum edit alignment; that is, once the minimum edit distance between the two strings is found, output the sequence of operations that can be used to change the one string into the other. For example, if we let C mean "correct", S mean "substitution", D mean "deletion" and I mean "insertion", then the edit alignment between the characters in "construction" and "distortions" would be ISSCCISSCCCCD. Here is an explanation of the alignment:
- I: Insert a "c" ->cdistortions
- S: Substitute "d" for "o" ->coistortions
- S: Substitute "i" for "n" -> constortions
- CC: Leave the "st" alone
- I: Insert "r" -> constrortion
- S: Substitute "u" for "o" -> contrurtion
- S: Substitute "c" for "r" -> constructions
- CCCC: Leave "tion" alone
- D: delete "s" -> construction
/** * @return List of Operations representing allignment between list1 and * list2. The allignment represents operations to change list2 into list1. */ public static ListFor word level alignment, you can call it on Strings using the split function like so:levenshteinAllignment(Object[] list1, Object[] list2) { int[][] distanceMatrix = getDistanceMatrix(list1, list2); List ops= new ArrayList ( list1.length > list2.length ? list1.length : list2.length); //think of distance chart as going from bottom left to top right; //current position coordinates; start at top right. int row = list1.length; int col = list2.length; //could have moved to current position from three others; store their scores here. int diag; int left; int below; int current; while (row != 0 || col != 0) { diag = getVal(row-1,col-1,distanceMatrix); left = getVal(row,col-1,distanceMatrix); below = getVal(row-1,col,distanceMatrix); current = distanceMatrix[row][col]; // if the value in the diagonal cell (going up+left) is smaller or equal to the // values found in the other two cells // AND // if this is same or 1 minus the value of the current cell if(diag <= left && diag <= below && (diag == current || diag == current - 1)){ // then "take the diagonal cell" // if the value of the diagonal cell is one less than the current cell: if(diag == current - 1) // Add a SUBSTITUTION operation (from the letters corresponding to // the _current_ cell) ops.add(new Operation(Operation.Type.SUBSTITUTION,list1[row-1],list2[col-1])); else // otherwise: do not add an operation this was a no-operation. ops.add(new Operation(Operation.Type.CORRECT,list1[row-1])); //move diagonally row--; col--; } // // elseif the value in the cell to the left is smaller or equal to the value of // the cell below current cell // AND // if this value is same or 1 minus the value of the current cell else if(left < below && (left == current || left == current - 1)){ // add an INSERTION of the cell to the left ops.add(new Operation(Operation.Type.INSERTION,list2[col-1])); //move left col--; } // else else{ // take the cell below, add // Add a DELETION operation ops.add(new Operation(Operation.Type.DELETION,list1[row-1])); //move down row--; } } Collections.reverse(ops); return ops; } private static int getVal(int row, int col, int[][] distanceMatrix){ if(row < 0 || row > distanceMatrix.length) return Integer.MAX_VALUE; if(col < 0 || col > distanceMatrix.length) return Integer.MAX_VALUE; else return distanceMatrix[row][col]; } public static class Operation{ private Type type; private Object object1; private Object object2; public enum Type{ CORRECT,SUBSTITUTION,DELETION,INSERTION } public Operation(Type t, Object o1){ type = t; object1 = o1; object2 = null; } public Operation(Type t, Object o1, Object o2){ type = t; object1 = o1; object2 = o2; } @Override public String toString(){ if(type == Type.SUBSTITUTION) return "Substitution: " + object1.toString() + " -> " + object2.toString(); if(type == Type.CORRECT) return "Correct: " + object1.toString(); if(type == Type.DELETION) return "Deletion: " + object1.toString(); if(type == Type.INSERTION) return "Insertion: " + object1.toString(); return null; } } /** * * @param array of objects to compare * @param array of objects to compare * @return Levenshtein distance between arrays. * This method uses the equals(Object o) method to compare the * objects in the two arrays, returning the Levenshtein distance between them. */ public static int levenshteinDistance(Object[] list1, Object[] list2) { int[][] distance = getDistanceMatrix(list1, list2); return distance[list1.length][list2.length]; } /** * * @param list1 * @param list2 * @return A completely filled distance matrix; movement from [i-1][j] * represents insertion, from [i][j-1] represents deletion, and from * [i-1][j-1] represents substitution or no operation. */ private static int[][] getDistanceMatrix(Object[] list1, Object[] list2) { int[][] distanceMatrix = new int[list1.length + 1][list2.length + 1]; for (int i = 0; i <= list1.length; i++) distanceMatrix[i][0] = i; for (int j = 0; j <= list2.length; j++) distanceMatrix[0][j] = j; for (int i = 1; i <= list1.length; i++) for (int j = 1; j <= list2.length; j++) distanceMatrix[i][j] = minimum(distanceMatrix[i - 1][j] + 1,// insertion distanceMatrix[i][j - 1] + 1,// deletion distanceMatrix[i - 1][j - 1]// substitution or correct + ((list1[i - 1].equals(list2[j - 1])) ? 0 : 1)); return distanceMatrix; } /** * Same as Math.min, but returns the minimum of three arguments instead of * two. */ private static int minimum(int a, int b, int c) { return Math.min(Math.min(a, b), c); }
for(Operation o : levenshteinAllignment( "What My house gleams with the light of the moon and your face" .split(" "), "Your house with the light of the the moon and my face" .split(" ")) ) System.out.println(o);And the output would be:
Deletion: What
Substitution: My -> Your
Correct: house
Deletion: gleams
Correct: with
Correct: the
Correct: light
Correct: of
Correct: the
Insertion: the
Correct: moon
Correct: and
Substitution: your -> my
Correct: face
Feel free to use and edit this as you like. Many applications disregard the strings that are correct and only output the edit operations, and that should be an easy edit.
No comments:
Post a Comment