The aim of this project is to develop and test techniques for detecting plagiarism (copying) in code submitted for computer programming assignments. The standard approach to plagiarism detection measures the similarity between two documents (here, two submitted pieces of code) in terms of sequences of words that occur in both: the more sequences the two documents share, the greater the chance that one was copied from the other. Systems using this approach typically ignore all whitespace (see http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf ). A difficulty with this approach, especially when attempting to detect plagiarism in short programming assignments (e.g. a page or two of code), is that two students working independently on the same assignment may unavoidably produce many common sequences of words. A high level of similarity between two programs submitted for the same assignment therefore need not indicate plagiarism; it may simply reflect the fact that there are only a small number of ways of completing that assignment.
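To make the standard approach concrete, the following is a minimal sketch of a word-sequence similarity measure that ignores whitespace. It is an illustration only, not the fingerprinting scheme described in the paper linked above: the function names, the choice of three-word sequences, and the use of a Jaccard ratio (shared sequences divided by all distinct sequences) are assumptions made for this example.

```python
import re


def word_ngrams(source, n=3):
    """Return the set of n-token sequences in source, ignoring all whitespace."""
    # Tokens are runs of word characters or single punctuation characters.
    # Whitespace only separates tokens and is never recorded.
    tokens = re.findall(r"\w+|[^\w\s]", source)
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity of the two programs' n-token sequences (0.0 to 1.0)."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0  # too short to contain any sequence of length n
    return len(grams_a & grams_b) / len(grams_a | grams_b)
```

Because whitespace is discarded during tokenisation, two programs that differ only in layout receive a score of 1.0 under this sketch, which is exactly the information the alternative approach proposes to use.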
This project will test a different approach to plagiarism detection. In this approach, similarity between two submitted programs will be measured in terms of the presence of unusual patterns of whitespace characters in both programs. For example, if a given line in one program is indented by a tab character, then two space characters, then a tab character, and the corresponding line in the other program is indented by the same pattern of whitespace characters, that would count as evidence of plagiarism. Similarly, if one program contained two spaces between one given word and the next, and the other program also contained two spaces between the two corresponding words, that again would count as evidence of plagiarism. The advantage of evidence of this sort is that it cannot be explained away as a consequence of the two programs being written for the same assignment.
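The whitespace-based idea can be sketched in a few lines. The sketch below rests on several assumptions that the project itself would need to settle: lines are taken to correspond by line number, an indentation counts as unusual when it mixes tabs and spaces, and a gap between words counts as unusual when it is anything other than a single space. All function names are hypothetical.

```python
import re


def indent_of(line):
    """Return the leading run of spaces and tabs on a line."""
    return re.match(r"[ \t]*", line).group(0)


def is_unusual_indent(indent):
    """Assumed definition: an indent is unusual if it mixes tabs and spaces."""
    return " " in indent and "\t" in indent


def interior_runs(line):
    """Return the whitespace runs between the words of a line, in order."""
    return [m.group(0) for m in re.finditer(r"[ \t]+", line.strip())]


def shared_unusual_whitespace(a, b):
    """List the unusual whitespace patterns that two programs have in common.

    Lines are paired by position, which is a simplifying assumption.
    """
    evidence = []
    pairs = zip(a.splitlines(), b.splitlines())
    for number, (line_a, line_b) in enumerate(pairs, start=1):
        indent_a, indent_b = indent_of(line_a), indent_of(line_b)
        if indent_a == indent_b and is_unusual_indent(indent_a):
            evidence.append((number, "indent", indent_a))
        # Compare the k-th gap of one line with the k-th gap of the other.
        for gap_a, gap_b in zip(interior_runs(line_a), interior_runs(line_b)):
            if gap_a == gap_b and gap_a != " ":
                evidence.append((number, "gap", gap_a))
    return evidence
```

Pairing lines by position breaks down as soon as one program has an extra line, so a fuller version would first need some way of aligning the two programs. How strongly each shared pattern should count as evidence is also left open here.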