.\"	This file is part of the software similarity tester SIM.
.\"	Written by Dick Grune, Vrije Universiteit, Amsterdam.
.\"	$Id: sim.1,v 2.7 2008/09/23 09:07:12 dick Exp $
.\"
.TH SIM 1 2001/11/13 "Vrije Universiteit"
.SH NAME
sim \- find similarities in C, Java, Pascal, Modula-2, Lisp, Miranda or text files
.SH SYNOPSIS
.B sim_c
[
.B \-[defFinpsST]
.B \-r
.I N
.B \-t
.I N
.B \-w
.I N
.B \-o
.I F
]
file ... [
.B /
[ file ... ] ]
.br
.B sim_c
\&...
.br
.B sim_java
\&...
.br
.B sim_pasc
\&...
.br
.B sim_m2
\&...
.br
.B sim_lisp
\&...
.br
.B sim_mira
\&...
.br
.B sim_text
\&...
.br
.SH DESCRIPTION
.I Sim_c
reads the C files
.I file ...
and looks for pieces of text that are similar; two pieces of program text
are similar if they differ only in layout, comments, identifiers and
the contents of numbers, strings and characters.
If any runs of sufficient length
are found, they are reported on standard output; the number of significant
tokens in the run is given between square brackets.
.PP
.I Sim_java
does the same for Java,
.I sim_pasc
for Pascal,
.I sim_m2
for Modula-2,
.I sim_lisp
for Lisp, and
.I sim_mira
for Miranda.
.I Sim_text
works on arbitrary text; it is occasionally useful on shell scripts.
.PP
The program can be used for finding copied pieces of code in
purportedly unrelated programs (with
.B \-s
or
.BR \-S ),
or for finding accidentally duplicated code in larger projects (with
.BR \-f ).
.PP
If a
.B /
is present between the input files, the latter are divided into a group of
"new" files (before the
.BR / )
and a group of "old" files; if there is no
.BR / ,
all files are "new".
Old files are never compared to each other.
Since the similarity tester
reads the files several times, it cannot read from standard input.
(See, however, the
.B \-i
option.)
.PP
There are the following options:
.TP
.B \-d
The output is in a diff(1)-like format instead of the default
2-column format.
.TP
.B \-e
Each file is compared to each file in isolation; this will find all
similarities between all texts involved, regardless of duplicates.
.TP
.B \-f
Runs are restricted to pieces with balancing parentheses, to isolate
potential functions (C, Java, Pascal, Modula-2 and Lisp only).
.TP
.B \-F
The names of functions in calls are required to match exactly
(C, Java, Pascal, Modula-2 and Lisp only).
.TP
.B \-i
The names of the files to be compared are read from standard input, including
a possible
.BR / ;
the file names must be separated by layout (white space).
This allows a very large number of file names to be specified; it differs from
the @ facility provided by some compilers in that it handles file names only,
and does not recognize option arguments.
.TP
.B \-n
Similarities found are only summarized, not displayed.
.TP
.B "\-o F"
The output is written to the file named
.IR F .
.TP
.B \-p
The output is given in similarity percentages; see below.
.TP
.B "\-r N"
The minimum run length is set to
.I N
(default is
.I N
= 24).
.TP
.B \-s
A file is not compared to itself (\-s for "not self").
.TP
.B \-S
The contents of the new files are compared to the old files only \- not
between themselves.
.TP
.B "\-t N"
In combination with the
.B \-p
option, sets the threshold (in percent) below which similarities are not
reported.
.TP
.B \-T
A more terse and uniform form of output is produced, which may be more
suitable for postprocessing.
.TP
.B "\-w N"
The page width used is set to
.I N
columns (default is
.I N
= 80).
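.PP
The interplay of the
.B /
separator with the
.B \-s
and
.B \-S
options can be illustrated with a small Python sketch.
It is only a model of which file pairs can give rise to reports under the
rules stated above, not a description of how
.I sim
works internally; the function names and file names are hypothetical.
.nf
```python
def split_groups(args):
    """Split the file arguments at the first "/" into (new, old)."""
    if "/" in args:
        i = args.index("/")
        return args[:i], args[i + 1:]
    return list(args), []


def pairs_to_compare(args, not_self=False, new_vs_old_only=False):
    """List the file pairs that may be compared (assumed model).

    not_self models the -s option, new_vs_old_only models -S.
    Old files are never paired with each other.
    """
    new, old = split_groups(args)
    pairs = []
    for i, a in enumerate(new):
        if not new_vs_old_only:
            # A new file against itself and against the later new files.
            for b in new[i:]:
                if not_self and a == b:
                    continue
                pairs.append((a, b))
        # A new file against every old file.
        for b in old:
            pairs.append((a, b))
    return pairs


files = ["a.c", "b.c", "/", "old.c"]
with_s = pairs_to_compare(files, not_self=True)
with_S = pairs_to_compare(files, new_vs_old_only=True)
```
.fi
In this model,
.B \-s
leaves the pairs (a.c, b.c), (a.c, old.c) and (b.c, old.c), while
.B \-S
leaves only (a.c, old.c) and (b.c, old.c).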
.PP
The
.B \-p
option results in lines of the form
.DS
.ft 5
F consists for x % of G material
.ft P
.DE
meaning that \f5x\fP % of \f5F\fP's text can also be found in \f5G\fP.
Note that this relation is not symmetric; it is in fact quite possible for one
file to consist for 100 % of text from another file, while the other file
consists for only 1 % of text of the first file, if their lengths differ
enough.
A threshold can be set using the
.B \-t
option.
Note also that the granularity of the recognized text is still governed by the
.B \-r
option or its default.
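.PP
The asymmetry of the percentages can be made concrete with a small Python
sketch.
The token counts are invented for illustration, and the rounding down to a
whole percentage is an assumption, not necessarily what
.I sim
does.
.nf
```python
# Hypothetical helper, not part of sim: the share of one file's tokens
# that also occur in another file, as a whole percentage (rounded down).
def percent_of(shared_tokens, total_tokens):
    return 100 * shared_tokens // total_tokens


small_total = 50     # assumed token count of the small file F
large_total = 5000   # assumed token count of the large file G
shared = 50          # assumed: all of F also occurs in G

f_of_g = percent_of(shared, small_total)  # F consists for 100 % of G material
g_of_f = percent_of(shared, large_total)  # G consists for 1 % of F material
```
.fi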
.PP
Care has been taken to keep all internal processes linear in the length of the
input, with the exception of the matching process which is almost linear,
using a hash table; various other tables are used for speed-up.
If, however, there is not enough memory for the tables, they are discarded in
order of unimportance, under which conditions the algorithms revert to their
quadratic nature.
.SH AUTHOR
Dick Grune, Vrije Universiteit, Amsterdam.
.SH BUGS
Strong periodicity in the input text (like a table of
.I N
almost identical lines) causes problems.
.I Sim
tries to cope with this but cannot avoid giving approximately
.I log N
messages about it.
The best advice is still to take the offending files out of the game.
.PP
Since it uses
.IR lex (1)
on some systems, it may dump core on any weird construction that overflows
.IR lex 's
internal buffers.