
About the Algorithm

Q1.1 What is the algorithm of document comparison?
A: Document comparison is based on the text-similarity evaluation algorithm developed at National Chiao Tung University. For each sentence in the evaluated document, the system finds the most similar sentence in the compared source document, then finds the common-word subsequences in each sentence pair. The similarity percentage of the documents is obtained by counting the words in those common-word subsequences; it denotes the proportion of suspected plagiarism.

Please refer to:
A Hybrid Methodology of Effective Text-Similarity Evaluation
Shu-Kai Yang and Chien Chou, National Yang Ming Chiao Tung University.
ICS 2018: New Trends in Computer Technologies and Applications pp 227-237
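
The steps described in A1.1 can be sketched in a few lines of Python. This is only an illustration, not the published method: it assumes that the "common-word subsequence" of a sentence pair can be approximated by a longest common subsequence (LCS) over words, and that the "most similar" source sentence is the one with the longest LCS. The names words, lcs_length, and similarity are hypothetical, and the tokenizer handles English only.

```python
import re

def words(sentence):
    """Lowercased word tokens (a simplifying assumption for English text)."""
    return re.findall(r"[a-z0-9']+", sentence.lower())

def lcs_length(a, b):
    """Length of the longest common subsequence of two word lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def similarity(evaluated, source):
    """Matched words divided by total words of the evaluated sentences."""
    matched = total = 0
    for sentence in evaluated:
        w = words(sentence)
        total += len(w)
        # Assumption: the "most similar" source sentence has the longest LCS.
        matched += max((lcs_length(w, words(s)) for s in source), default=0)
    return matched / total if total else 0.0

evaluated = [
    "The quick brown fox jumps over the lazy dog.",
    "Completely original text here.",
]
source = [
    "A quick brown fox jumped over a lazy dog.",
    "Unrelated sentence.",
]
print(f"{similarity(evaluated, source):.1%}")  # 6 of 13 words matched
```

In this sketch the first evaluated sentence shares six words with the first source sentence ("quick brown fox ... over ... lazy dog"), and the second shares none, so the result is 6/13. The actual system probably applies additional rules, such as a minimum match length, that are not modeled here.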

Q1.2 What document language is suitable for comparison?
A: Currently the system supports only documents written in English, in Traditional Chinese, or in a mix of both languages.

Q1.3 How does the software detect plagiarism?
A: Besides copying whole paragraphs, plagiarism techniques include partially rewriting and rearranging sentences and paragraphs. A plagiarizer may place the copied pieces anywhere in his or her work, so we cannot assume that plagiarism is merely the duplication and abridgement of documents.

The system detects all of the techniques mentioned above, unless the plagiarizer has rewritten every word of the document. Document-differencing software, such as the built-in comparison function of Microsoft Word, treats documents as different when sentences or paragraphs are rearranged, so this system can detect more instances of plagiarism.

Q1.4 Why does the system evaluate text similarity, and how?
A: A practical indicator of text similarity should show the proportion of plagiarism in the evaluated document, such as how many sentences or words are duplicated from the source documents. The system defines its indicator as the percentage of words that are detected as duplicated from other documents: the count of detected words divided by the total word count of the evaluated document. The resulting percentage is the text similarity of the evaluated document relative to the compared originals.

Q1.5 Why is the calculation of text similarity based on word count rather than sentence count?
A: Some existing plagiarism-detection services are based on the count of imitative sentences, but this system calculates similarity from the count of duplicated words, for three reasons. The first reason is the difference in algorithms. Many existing services segment the compared documents into sentences, then count the consecutive words that two sentences have in common. If two sentences of the compared documents have more than 3-5 words in common, the sentences are paired as "similar". Those services define the text similarity of two documents as the proportion of similar sentences. This system instead evaluates text similarity by counting the words in the common subsequences of sentences.

The second reason is that evaluation by sentence counting is less precise than evaluation by word counting. When imitative sentences are counted, a pair of sentences that are completely duplicated and a pair of sentences that have only a few words in common are treated the same. This causes noticeable inaccuracy in the text-similarity evaluation. By counting words instead of similar sentences, this system reflects the magnitude of plagiarism much more precisely.

The third reason is practical. When a document is segmented into sentences, many cases cause segmentation errors, such as sentences that end with newline characters instead of punctuation marks, or sentences that contain an unknown abbreviation point that is mistaken for a period. Such segmentation errors are unavoidable, and they cause noticeable inaccuracy in the text-similarity indicator if the evaluation is based on counting imitative sentences. An evaluation based on word counting, however, is hardly affected. For these three reasons, this system calculates text similarity based on word count.
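
The difference between the two indicators can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: sentence_based imitates a service that pairs two sentences as "similar" when they share a run of at least three consecutive words, and word_based counts the words in a longest common subsequence, standing in for this system's common-word subsequences. Neither function is the real implementation of any product.

```python
def longest_common_run(a, b):
    """Length of the longest run of consecutive words shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best

def lcs_length(a, b):
    """Length of the longest common subsequence of two word lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def sentence_based(evaluated, source, min_run=3):
    """Proportion of evaluated sentences paired as "similar" (hypothetical rule)."""
    similar = sum(
        1 for e in evaluated
        if any(longest_common_run(e, s) >= min_run for s in source)
    )
    return similar / len(evaluated)

def word_based(evaluated, source):
    """Proportion of evaluated words found in common subsequences."""
    matched = sum(max(lcs_length(e, s) for s in source) for e in evaluated)
    return matched / sum(len(e) for e in evaluated)

source = ["the system counts words in common subsequences of sentences".split()]
evaluated = [
    # Fully duplicated sentence (9 of 9 words in common).
    "the system counts words in common subsequences of sentences".split(),
    # Only "words in common" is shared (3 of 12 words).
    "our new tool measures words in common between two very different files".split(),
]
print(f"sentence-based: {sentence_based(evaluated, source):.0%}")
print(f"word-based:     {word_based(evaluated, source):.0%}")
```

Under these assumptions the sentence-based indicator reports both sentences as similar, while the word-based indicator reports 12 of 21 words, which separates the fully copied sentence from the one that shares only three words.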

Q1.6 In the similar parts of documents shown after comparison (colored in red), why are only a few words of a sentence sometimes marked, not the whole sentence?
A: The user interface of the system is designed to display the common parts between documents. When the sentences are modified from the source documents, only the unmodified parts are colored in red.

Q1.7 Why does the text similarity percentage change when I switch the compared documents between the left and right sides?
A: The left side holds the document to be evaluated, that is, the document whose plagiarized percentage is being measured. The right side holds the source document(s) to compare against. The text similarity percentage is the count of words plagiarized from the right-side document(s) divided by the total word count of the left-side document. Switching the documents between the left and right sides swaps the roles of the examined document and the referenced document, so the denominator changes and the text similarity percentage changes with it.
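
A short worked example in Python shows why the percentage is not symmetric. The word counts are made up for illustration, and similarity_percentage is a hypothetical helper that only applies the definition given above.

```python
def similarity_percentage(matched_words, left_total_words):
    """Matched words divided by the word count of the left-side document."""
    return 100.0 * matched_words / left_total_words

# Hypothetical case: a 50-word abstract copied entirely from a 200-word article.
matched = 50
abstract_words = 50
article_words = 200

# Abstract on the left: all of its words come from the article.
print(similarity_percentage(matched, abstract_words))  # 100.0

# Article on the left: only a quarter of its words appear in the abstract.
print(similarity_percentage(matched, article_words))   # 25.0
```

The number of words the two documents share stays the same in both cases; only the denominator, the word count of the left-side document, changes.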

About System Operation


Q2.1 What does the item "(Compared with All Docs)" do?
A: The evaluated document on the left side may plagiarize several of the documents listed on the right side, so the system supports one-to-many comparison. If you select the item "(Compared with All Docs)" and click the button "Compare...", or simply click the button "Compare with All Docs", the result is the union of the comparisons between the left-side document and each document in the right-side list. Once a sentence is detected as similar to any sentence in the right-side list, its common words are colored red. The text-similarity percentage is the proportion of red words in the left-side document. In other words, the text-similarity percentage of "(Compared with All Docs)" is the proportion of the evaluated document that is plagiarized from the listed documents.
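
The union can be sketched in Python as follows. This is a simplified illustration under stated assumptions: each document is treated as a single list of words rather than being split into sentences, and the red words are taken from one longest common subsequence with each source. The names matched_positions and compare_with_all are hypothetical.

```python
def matched_positions(a, b):
    """Indices of words in `a` that lie on one longest common subsequence with `b`."""
    n, m = len(a), len(b)
    # table[i][j] = LCS length of a[i:] and b[j:]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    positions, i, j = set(), 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            positions.add(i)
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return positions

def compare_with_all(left, sources):
    """Union of red-word positions over all sources, and the resulting percentage."""
    red = set()
    for src in sources:
        red |= matched_positions(left, src)
    return red, 100.0 * len(red) / len(left)

left = "alpha beta gamma delta epsilon zeta".split()
sources = [
    "alpha beta gamma".split(),
    "delta x epsilon".split(),
    "beta gamma".split(),
]
red, percentage = compare_with_all(left, sources)
print(sorted(red), f"{percentage:.1f}%")
```

Because the result is a union of word positions, a word that matches more than one source is counted once. Here "beta" and "gamma" appear in two sources, yet the percentage is 5 of 6 words and can never exceed 100%.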

Q2.2 When the system analyzes a document, does it matter if the sentence segmentation contains errors?
A: When you click sentences in the main window of the system, you may find that some sentences are segmented incorrectly because abbreviation points were misread or punctuation marks are missing. Sentence segmentation errors are unavoidable, but they hardly affect the result of the text-similarity evaluation. Sentences are only the basic units within which common-word subsequences are searched. The similar parts that the system finds are words, and the percentage is calculated by counting words, so incorrect sentence segmentation has no significant effect on the result.

Q2.3 When the system analyzes a document, what should I do if some sentences are categorized incorrectly?
A: If text taken from a referenced document is properly quoted, it should not be treated as plagiarism. When the system is used to detect plagiarism, options are available to exclude the quotation and reference parts of the compared documents.

The system treats sentences that are surrounded by quotation marks as quotations. It also treats the sentences after the sub-title "References" as the reference list of the document. Neither the quotation parts nor the reference parts of documents are compared. If a document does not wrap quoted text in quotation marks properly, or does not include a sub-title for the reference section, the text will not be recognized correctly as quotations or references. When this happens, you have to correct the categories of the sentences manually.
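
The two rules above can be pictured with a rough Python sketch. It is an assumption-laden illustration, not the system's code: the category names ("body", "quotation", "heading", "reference"), the set of quotation marks, and the function categorize are all hypothetical.

```python
OPENING_QUOTES = ('"', "\u201c", "\u300c")   # ", left curly quote, left corner bracket
CLOSING_QUOTES = ('"', "\u201d", "\u300d")   # ", right curly quote, right corner bracket

def categorize(sentences):
    """Assign a hypothetical category to each sentence, in document order."""
    categories = []
    in_references = False
    for sentence in sentences:
        text = sentence.strip()
        if text.lower() == "references":
            in_references = True          # everything after this is the reference list
            categories.append("heading")
        elif in_references:
            categories.append("reference")
        elif len(text) >= 2 and text.startswith(OPENING_QUOTES) and text.endswith(CLOSING_QUOTES):
            categories.append("quotation")
        else:
            categories.append("body")
    return categories

sentences = [
    "Plagiarism takes many forms.",
    '"To be, or not to be, that is the question."',
    "He said to be or not to be was the question.",
    "References",
    "Yang, S.-K. and Chou, C. (2018). A Hybrid Methodology of Effective Text-Similarity Evaluation.",
]
for category, sentence in zip(categorize(sentences), sentences):
    print(f"{category:10} {sentence}")
```

The third sentence shows the failure case described above: the quoted words are not wrapped in quotation marks, so the sketch labels the sentence as ordinary body text and it would still be compared.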

To do so, click the sentence that is wrongly categorized, then press the right mouse button or the F2 key. A dialog for changing the category of the clicked sentence appears; choose the right category and click OK. After the sentences are categorized correctly, click the button "Compare..." to compare the documents again.

Q2.4 Can I just compare a specific chapter of the document, not the whole document?
A: No, the system is designed for file-to-file comparison. If you want to compare a specific chapter, please save that chapter as a separate file.

About File Formats


Q3.1 Can the system compare figures or scanned images in documents?
A: No, the system can only analyze the text parts of a document. If the document consists of scanned image pages, please convert it into text files with OCR. Likewise, the system cannot analyze non-text content such as mathematical formulas, tables, charts, and embedded objects.

Q3.2 Are there Microsoft Word files that cannot be read?
A: The system uses the third-party libraries NPOI and NPOI.HWPF to import Microsoft Word files, so it can read .docx and .doc files without Microsoft Office installed on the computer. These libraries are not developed by Microsoft Corp., so occasionally some files cannot be read correctly. In that case, please convert the files to PDF before comparison.

Q3.3 Why is the window blank after reading some Microsoft Word files?
A: The system handles Microsoft Word files encoded in Traditional Chinese and English. Files encoded in other code pages, e.g. Simplified Chinese, cannot be handled by the system. In that case, please convert the files to PDF before comparison.

Q3.4 Why are the heading styles of the document title and sub-titles sometimes missing?
A: When the system imports a document, it does not obtain heading information from the file format, especially with the PDF file format. It obtains only the fonts, sizes, styles, and locations of text pieces. The system recognizes the title and sub-titles by their larger font sizes and bolder styles, so it may sometimes guess wrong. However, this does not affect the result of document comparison.
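
A heuristic of this kind might look like the following Python sketch. It is purely illustrative and rests on assumptions: the TextPiece fields, the rule "larger than the most common font size, or bold", and the function guess_headings are hypothetical and do not describe the system's real logic.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class TextPiece:
    text: str
    font_size: float
    bold: bool

def guess_headings(pieces):
    """Guess headings: larger than the most common (body) font size, or bold."""
    body_size = Counter(p.font_size for p in pieces).most_common(1)[0][0]
    return [p.text for p in pieces if p.font_size > body_size or p.bold]

pieces = [
    TextPiece("1. Introduction", 16.0, True),
    TextPiece("Plagiarism detection compares documents.", 11.0, False),
    TextPiece("Important:", 11.0, True),          # bold emphasis, not a heading
    TextPiece("quoted text must be marked.", 11.0, False),
]
print(guess_headings(pieces))
```

The sketch returns both "1. Introduction" and "Important:". The second is a wrong guess caused by bold emphasis in the body text, which is the kind of error the answer above describes.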

Troubleshooting


Q4.1 The system cannot start after installation.
A: The system requires .NET Framework 4.7.2 or later at runtime. It is built into Windows 10. If you are running the system on Windows 7, you have to download the latest .NET Framework from the Microsoft Download Center.