Monday, April 1, 2019

Code-based Plagiarism Detection Techniques

Biraj Upadhyaya and Dr. Samarjeet Borah

Abstract

The copying of programming assignments by students, especially at the undergraduate as well as postgraduate level, is a common practice. Efficient mechanisms for detecting plagiarised submissions are therefore needed. Text-based plagiarism detection techniques do not work well with source code. In this paper we analyse a code-based plagiarism detection technique which is employed by various plagiarism detection tools such as JPlag, MOSS and CodeMatch.

Introduction

The word plagiarism is derived from the Latin word plagiare, which means to kidnap or to abduct. In academia or industry, plagiarism refers to the act of copying materials without acknowledging the original source [1]. Plagiarism is considered an ethical offence which may incur serious disciplinary action, such as a sharp reduction in marks or even expulsion from the university in severe cases.

Student plagiarism primarily falls into two categories: text-based plagiarism and code-based plagiarism. Instances of text-based plagiarism include word-for-word copying, paraphrasing, plagiarism of secondary sources, plagiarism of ideas, and blunt plagiarism or authorship plagiarism. Plagiarism is considered code-based when a student copies or modifies a program that is required to be submitted for a programming assignment. Code-based plagiarism includes verbatim copying, changing comments, changing white space and formatting, renaming identifiers, reordering code blocks, changing the order of operators or operands in expressions, changing data types, adding redundant statements or variables, and replacing control structures with equivalent structures [2].

Background

Text-based plagiarism detection techniques do not work well with coded input such as a program.
Experiments have suggested that text-based systems ignore coding syntax, an indispensable part of any programming construct, which poses a serious drawback. To overcome this problem, code-based plagiarism detection techniques were developed. Code-based plagiarism detection techniques can be classified into two categories: attribute-oriented plagiarism detection and structure-oriented plagiarism detection.

Attribute-oriented plagiarism detection systems measure properties of assignment submissions [3]. The following attributes are considered:

- Number of unique operators
- Number of unique operands
- Total number of occurrences of operators
- Total number of occurrences of operands

Based on the above attributes, the degree of similarity of two programs can be estimated.

Structure-oriented plagiarism detection systems deliberately ignore easily modifiable programming elements such as comments, additional white space and variable names. This makes such a system less susceptible to the addition of redundant information than an attribute-oriented plagiarism detection system. A student who is aware that this kind of plagiarism detection system is deployed at his or her institution would rather complete the assignment independently than work on a tedious and time-consuming modification task.

Scalable Plagiarism Detection

Steven Burrows, in his paper Efficient and Effective Plagiarism Detection for Large Code Repositories [3], provided an algorithm for code-based plagiarism detection. The algorithm comprises the following steps.

Tokenization

Let us consider a simple C program:

    #include <stdio.h>
    int main( )
    {
        int var;
        for (var=0; var<...; var++)
        {
            printf("%d\n", var);
        }
        return 0;
    }

Figure 1.0

Table 1.0 Token list for program in Figure 1.0.

Here ALPHANAME refers to any function name, variable name or variable value.
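The tokenization step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Burrows' implementation: the single-character codes in the hypothetical TOKEN_CODES table are inferred from the token stream quoted in this post for Figure 1.0 rather than copied from Table 1.0, the regular expression covers only what this small C program needs, and the loop bound 5 in the sample is a placeholder (any integer maps to the same ALPHANAME code).

```python
import re

# ASSUMPTION: these codes are inferred from the token stream for Figure 1.0,
# not taken from Table 1.0. Anything not listed here is treated as ALPHANAME.
TOKEN_CODES = {
    "int": "S", "for": "R", "return": "g",
    "(": "A", ")": "B", "{": "j", "}": "l",
    "=": "K", "<": "J", "+": "D", ",": "E",
}
ALPHANAME = "N"  # function name, variable name or variable value
STRING = "5"     # characters enclosed in double quotes

TOKEN_PATTERN = re.compile(r'''
    "(?:\\.|[^"\\])*"      # string literal in double quotes
  | [A-Za-z_]\w*           # keyword or identifier
  | \d+                    # integer literal
  | [(){}=<+,]             # single-character operators and punctuation
''', re.VERBOSE)


def tokenize(source):
    """Turn C source into a stream of one-character token codes.

    Preprocessor lines and anything the pattern does not match
    (white space, semicolons) are skipped.
    """
    code_lines = [line for line in source.splitlines()
                  if not line.lstrip().startswith("#")]
    stream = []
    for match in TOKEN_PATTERN.finditer("\n".join(code_lines)):
        lexeme = match.group(0)
        if lexeme.startswith('"'):
            stream.append(STRING)
        else:
            stream.append(TOKEN_CODES.get(lexeme, ALPHANAME))
    return "".join(stream)


# The program of Figure 1.0; the loop bound 5 is a placeholder.
SAMPLE = r'''
#include <stdio.h>
int main( )
{
    int var;
    for (var=0; var<5; var++)
    {
        printf("%d\n", var);
    }
    return 0;
}
'''

print(tokenize(SAMPLE))  # SNABjSNRANKNNJNNDDBjNA5ENBlgNl
```

Because identifiers and literal values all collapse to the same ALPHANAME code, renaming var or changing the loop bound leaves the token stream unchanged, which is the property a structure-oriented system relies on.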
In the token list, STRING refers to one or more characters enclosed in double quotes.

The corresponding token stream for the program in Figure 1.0 is given as:

    SNABjSNRANKNNJNNDDBjNA5ENBlgNl

Now the above token stream is converted to an N-gram representation. In our case the value of N is chosen as 4. The corresponding 4-gram representation of the above token stream is shown below:

    SNAB NABj ABjS BjSN jSNR SNRA NRAN RANK ANKN NKNN KNNJ NNJN NJNN JNND NNDD NDDB DDBj DBjN BjNA jNA5 NA5E A5EN 5ENB ENBl NBlg BlgN lgNl

These 4-grams are generated using the sliding window technique. The sliding window technique generates N-grams by moving a window of size N across the token stream from left to right, one token at a time.

The use of N-grams is an appropriate method of performing structural plagiarism detection because any change to the source code will only affect a few neighbouring N-grams. The modified version of the program will still have a large percentage of unchanged N-grams, hence it will be easy to detect plagiarism in this program.

Index Construction

The second step is to create an inverted index of these N-grams. An inverted index consists of a lexicon and an inverted list. It is shown below.

Table 2.0 Inverted Index

Referring to the inverted index entry for mango in Table 2.0, we can conclude that mango occurs in three documents in the collection. It occurs once in document no. 31, thrice in document no. 33 and twice in document no. 15. Similarly, we can store the 4-gram representation of Figure 1.0 in an inverted index. The inverted index for five of its 4-grams is shown in Table 3.0.

Table 3.0 Inverted Index

Querying

The next step is to query the index. Each query is an N-gram representation of a program. For a token stream of t tokens, we require (t - n + 1) N-grams, where n is the length of the N-gram. For the 30-token stream above and n = 4, this gives 27 4-grams. Each query returns the ten most similar programs matching the query program, and these are organised from most similar to least similar.
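The sliding window, the inverted index and the query step can be sketched together in Python. This is an illustration only: the score used here, which counts the 4-gram occurrences a query shares with each indexed program, is a simple stand-in and not the similarity measure used in Burrows' paper, and the file names a.c, b.c and c.c are hypothetical.

```python
from collections import Counter, defaultdict


def ngrams(token_stream, n=4):
    """Slide a window of size n over the token stream, left to right."""
    return [token_stream[i:i + n] for i in range(len(token_stream) - n + 1)]


def build_index(programs, n=4):
    """Map each N-gram to {program name: occurrences in that program}."""
    index = defaultdict(dict)
    for name, stream in programs.items():
        for gram, count in Counter(ngrams(stream, n)).items():
            index[gram][name] = count
    return index


def query(index, token_stream, n=4, top=10):
    """Rank indexed programs by shared N-gram occurrences.

    ASSUMPTION: this overlap count is an illustrative score, not the
    similarity measure from the paper.
    """
    scores = Counter()
    for gram, wanted in Counter(ngrams(token_stream, n)).items():
        for name, found in index.get(gram, {}).items():
            scores[name] += min(wanted, found)
    return scores.most_common(top)


# Token stream of Figure 1.0, plus two hypothetical variants.
STREAM = "SNABjSNRANKNNJNNDDBjNA5ENBlgNl"
PROGRAMS = {
    "a.c": STREAM,                    # the original program
    "b.c": STREAM.replace("R", "Q"),  # one token changed (Q is a made-up code)
    "c.c": "gNlgNl",                  # a mostly unrelated stream
}
INDEX = build_index(PROGRAMS)

print(len(ngrams(STREAM)))   # 27
print(query(INDEX, STREAM))  # [('a.c', 27), ('b.c', 23), ('c.c', 1)]
```

Changing a single token in b.c alters only the four 4-grams that overlap it, so 23 of the 27 4-grams still match.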
If the query program is one of the indexed programs, we would expect this result to produce the highest score. We assign a similarity score of 100% to the exact or top match [3]. All other programs are given a similarity score relative to the top score.

Table 4.0 presents the top ten results from Burrows' experiment, in which one N-gram program file (0020.c) was compared against an index of 296 programs. In this example, the file scored against itself generates the highest relative score of 100.00%. This score is ignored, but it is used to generate a relative similarity score for all other results. We can also see that the program 0103.c is very similar to program 0020.c, with a score of 93.34%.

Table 4.0 Results of the program 0020.c compared to an index of 296 programs (columns: Rank, Query File, Index File, Raw Score, Similarity Score).

Comparison of Various Plagiarism Detection Tools

4.1 JPlag

The salient features of this tool are presented below:

- JPlag was developed in 1996 by Guido Malpohl.
- It currently supports C, C++, C#, Java, Scheme and natural language text.
- It is a free plagiarism detection tool.
- It is used to detect software plagiarism among multiple sets of source code files.
- JPlag uses the Greedy String Tiling algorithm, which produces matches ranked by average and maximum similarity.
- It is used to compare programs which have a large variation in size, which is probably the result of inserting dead code into the program to disguise its origin.
- The results are displayed as a set of HTML pages in the form of a histogram which presents the statistics for the analysed files.

4.2 CodeMatch

The salient features of this tool are presented below:

- It was developed in 2003 by Bob Zeidman and is licensed by SAFE Corporation.
- The program is available as a standalone application.
- It supports 26 different programming languages including C, C++, C#, Delphi, Flash ActionScript, Java, JavaScript and SQL.
- It has a free version which allows only one trial comparison, where the total size of all files being examined does not exceed 1 megabyte of data.
- It is mainly used as forensic software in copyright infringement cases.
- It determines the most highly correlated files placed in multiple directories and subdirectories by comparing their source code.
- Four types of matching algorithms are used: Statement Matching, Comment Matching, Instruction Sequence Matching and Identifier Matching.
- The results come in the form of a basic HTML report that lists the most highly correlated pairs of files.

4.3 MOSS

The salient features of this plagiarism detection tool are as follows:

- The full form of MOSS is Measure of Software Similarity.
- It was developed by Alex Aiken in 1994.
- It is provided as a free Internet service hosted by Stanford University, and it can be used only if a user creates an account.
- The program can analyse source code written in 26 programming languages including C, C++, Java, C#, Python, Pascal, Visual Basic and Perl.
- Files are submitted through the command line and the processing is performed on the Internet server.
- The current form of the program is available only for UNIX platforms.
- MOSS uses the Winnowing algorithm, which is based on code-sequence matching, and it analyses the syntax or the structure of the observed files.
- MOSS maintains a database that stores an internal representation of programs and then looks for similarities between them.

Comparative Analysis Table

Conclusion

In this paper we learnt a structured code-based plagiarism detection technique known as Scalable Plagiarism Detection. Various processes such as tokenization, indexing and querying were also studied.
We also studied the salient features of various code-based plagiarism detection tools such as JPlag, CodeMatch and MOSS.

References

[1] Gerry McAllister, Karen Fraser, Anne Morris, Stephen Hagen, Hazel White, http//
[2] Cosma, An Approach to Source-Code Plagiarism Detection and Investigation Using Latent Semantic Analysis, University of Warwick, Department of Computer Science, July 2008
[3] Steven Burrows, Efficient and Effective Plagiarism Detection for Large Code Repositories, School of Computer Science and Information Technology, Melbourne, Australia, October 2004
[4] Vedran Juric, Tereza Juric and Marija Tkalec, Performance Evaluation of Plagiarism Detection Method Based on the Intermediate Language, University of Zagreb
