The ability to automatically cluster large collections of noisy form images by form type would improve the efficiency of organizations that currently sort them by hand. Some noisy form collections contain form types that are structurally very similar but should nonetheless fall into separate clusters. To address this issue, we propose CONFIRM (Clustering Of Noisy Form Images using Robust Metrics). CONFIRM uses a novel technique that matches form text and rule lines to create a vector representation of each form. A Random Forest classifier is then used to learn a pairwise similarity metric for use in Spectral Clustering. Validation is provided on the NIST tax forms dataset as well as on several historical form datasets.
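The last two stages of the pipeline described above, a learned pairwise similarity fed into Spectral Clustering, can be illustrated with a minimal sketch. This is not the CONFIRM implementation. It assumes scikit-learn and NumPy, substitutes synthetic vectors for the text and rule-line matching features, and uses the absolute difference of two form vectors as the pair representation. The function names (`make_toy_forms`, `learn_similarity`, `similarity_matrix`, `cluster_forms`) are hypothetical.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.ensemble import RandomForestClassifier


def make_toy_forms(seed=0, per_type=15):
    """Synthetic stand-in for per-form feature vectors (assumption, not real form data)."""
    rng = np.random.default_rng(seed)
    centers = np.array([[0.0, 0.0, 0.0, 0.0], [3.0, 3.0, 3.0, 3.0]])
    y = np.repeat([0, 1], per_type)
    X = centers[y] + 0.1 * rng.normal(size=(len(y), centers.shape[1]))
    return X, y


def _pair_features(X):
    """Absolute difference for every unordered pair (an assumed pair representation)."""
    i, j = np.triu_indices(len(X), k=1)
    return i, j, np.abs(X[i] - X[j])


def learn_similarity(X_train, y_train, seed=0):
    """Train a Random Forest to predict whether two forms share a form type."""
    i, j, feats = _pair_features(X_train)
    same_type = (y_train[i] == y_train[j]).astype(int)
    rf = RandomForestClassifier(n_estimators=100, random_state=seed)
    rf.fit(feats, same_type)
    return rf


def similarity_matrix(rf, X):
    """Fill a symmetric matrix with the predicted probability of 'same type'."""
    i, j, feats = _pair_features(X)
    same_col = list(rf.classes_).index(1)
    p = rf.predict_proba(feats)[:, same_col]
    S = np.eye(len(X))  # each form is fully similar to itself
    S[i, j] = p
    S[j, i] = p
    return S


def cluster_forms(rf, X, n_clusters, seed=0):
    """Run Spectral Clustering on the learned similarity matrix."""
    S = similarity_matrix(rf, X)
    sc = SpectralClustering(
        n_clusters=n_clusters, affinity="precomputed", random_state=seed
    )
    return sc.fit_predict(S)
```

In this sketch the forest is trained on labeled pairs from one set of forms and then applied to an unlabeled collection, for example `cluster_forms(learn_similarity(X_train, y_train), X_new, n_clusters=2)`. Passing `affinity="precomputed"` tells scikit-learn to treat the matrix as the similarity graph directly rather than computing a kernel from raw vectors.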
Early version of Master’s thesis.