.TH AGREP 1 "Jan 17, 1992"
agrep \- search a file for a string or regular expression, with approximate matching capabilities
.B \-#cdehiklnpstvwxBDGIS
.IR filename ".\|.\|. ]"
(standard input is the default, but see a warning under LIMITATIONS)
for records containing strings which either
\fIexactly\fP or \fIapproximately\fP match a pattern.
A record is by default a line, but it can be defined differently using
the \-d option (see below).
Normally, each record found is copied to the standard output.
Approximate matching allows finding records that contain the pattern
with several errors including substitutions, insertions, and
deletions.
For example, Massechusets matches Massachusetts with two errors
(one substitution and one insertion). Running
.B agrep
\-2 Massechusets foo outputs all lines in foo containing any string with
at most 2 errors from Massechusets.
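.PP
The notion of "at most 2 errors" can be illustrated with a small
dynamic-programming sketch in Python. This is not the algorithm agrep
itself uses (see the reports cited below for that); it is a simple
stand-in that counts unit-cost errors, and the function names
min_errors and record_matches are made up for this illustration.
.nf
```python
def min_errors(pattern, text):
    """Smallest number of unit-cost edits (substitution, insertion,
    deletion) that turn `pattern` into some substring of `text`."""
    # prev[i] = best cost of matching pattern[:i] against a substring
    # of the text that ends at the previous text position.
    prev = list(range(len(pattern) + 1))
    best = prev[-1]
    for ch in text:
        cur = [0]  # a match may start at any text position for free
        for i, pc in enumerate(pattern, start=1):
            cur.append(min(
                prev[i - 1] + (pc != ch),  # match or substitution
                prev[i] + 1,               # extra character in the text
                cur[i - 1] + 1,            # pattern character missing
            ))
        best = min(best, cur[-1])
        prev = cur
    return best


def record_matches(pattern, record, max_errors):
    """True if the record contains the pattern within max_errors."""
    return min_errors(pattern, record) <= max_errors
```
.fi
With these definitions, record_matches("Massechusets", "Massachusetts", 2)
is true, while the same call with a limit of 1 is false.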
.B agrep
supports many kinds of queries including
arbitrary wild cards, sets of patterns, and in general,
regular expressions.
It supports most of the options supported by the
.B grep
family plus several more (but it is not 100% compatible with grep).
For more information on the algorithms used by agrep see
"Fast Text Searching With Errors,"
Technical report #91-11, Department of Computer Science, University
of Arizona, June 1991 (available by anonymous ftp from cs.arizona.edu
in agrep/agrep.ps.1), and
"Agrep -- A Fast Approximate Pattern Searching Tool",
To appear in USENIX Conference 1992 January (available by anonymous ftp
from cs.arizona.edu in agrep/agrep.ps.2).
As with the rest of the \fBgrep\fP family, the characters
can cause unexpected results when included in the
.IR pattern ,
as these characters are also meaningful
to the shell. To avoid these problems, one should always enclose the entire
pattern argument in single quotes, i.e., 'pattern'.
Do not use double quotes (").
When
.B agrep
is applied to more than one input
file, the name of the file is displayed
preceding each line which matches
the pattern. The filename is not displayed
when processing a single
file, so if you actually want the filename
as a second file in the list.
\fI#\fP is a non-negative integer (at most 8)
specifying the maximum number of errors
permitted in finding the approximate matches (defaults to zero).
Generally, each insertion, deletion, or substitution counts as one error.
It is possible to adjust the relative cost of insertions,
deletions, and substitutions (see the \-I, \-D, and \-S options).
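.PP
To show how adjusted costs interact with the error limit, here is a
Python sketch of a weighted variant of approximate matching. It is an
illustration only, not agrep's implementation. It assumes that an
"insertion" is an extra character in the text and a "deletion" is a
pattern character absent from the text; the name min_cost and its
keyword parameters are hypothetical.
.nf
```python
def min_cost(pattern, text, ins=1, dele=1, sub=1):
    """Cheapest way to turn `pattern` into some substring of `text`,
    given per-operation costs (assumed to mirror -I, -D and -S)."""
    # prev[i] = cost of matching pattern[:i] ending at the previous
    # text position; with no text, every pattern char is deleted.
    prev = [i * dele for i in range(len(pattern) + 1)]
    best = prev[-1]
    for ch in text:
        cur = [0]  # a match may begin anywhere in the text
        for i, pc in enumerate(pattern, start=1):
            cur.append(min(
                prev[i - 1] + (0 if pc == ch else sub),  # substitution
                prev[i] + ins,                           # insertion
                cur[i - 1] + dele,                       # deletion
            ))
        best = min(best, cur[-1])
        prev = cur
    return best
```
.fi
Under these assumptions, with a limit of 1 and deletion and
substitution costs of 2, only a single insertion still fits:
min_cost("abcd", "abXcd", dele=2, sub=2) is 1, whereas
min_cost("abcd", "abXd", dele=2, sub=2) is 2.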
Display only the count of matching records.
.B \-d "'\fIdelim\fP'"
Define \fIdelim\fP to be the separator between two records.
The default value is '$', namely a record is by default
a line.
\fIdelim\fP can be a string of at most 8 characters
(with possible use of ^ and $), but not
a regular expression.
Each stretch of text between two \fIdelim\fP's, before the first \fIdelim\fP,
or after the last \fIdelim\fP is considered as one record.
For example, \-d '$$' defines paragraphs as records and \-d '^From\ '
defines mail messages as records.
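.PP
The two delimiters in this example can be pictured with a short Python
sketch that groups lines into records. It is only a model of the idea,
not agrep's code. The function names are invented, and keeping each
"From " line with the message that follows it is an assumption made
here for readability.
.nf
```python
def split_paragraphs(text):
    """Records separated by blank lines, as -d '$$' describes."""
    records, current = [], []
    for line in text.splitlines():
        if line == "":
            if current:
                records.append(current)
            current = []
        else:
            current.append(line)
    if current:
        records.append(current)
    return records


def split_mail(text):
    """A new record starts at each line beginning with 'From ',
    as -d '^From ' describes."""
    records, current = [], []
    for line in text.splitlines():
        if line.startswith("From ") and current:
            records.append(current)
            current = []
        current.append(line)
    if current:
        records.append(current)
    return records
```
.fi
Each record is returned as a list of its lines, so a pattern can then
be tested against one record at a time.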
.B agrep
matches each record separately.
This option does not currently work with regular expressions.
argument, but useful when the
.BI \-f " patternfile"
.I patternfile
contains a set of (simple) patterns.
The output is all lines that match at least one of the patterns in
.IR patternfile .
Currently, the \-f option works only for exact match and for simple
patterns (any meta symbol is interpreted as a regular character);
it is compatible only with the \-c, \-h, \-i, \-l, \-s, \-v, \-w, and \-x options.
See LIMITATIONS for size bounds.
Do not display filenames.
Case-insensitive search \(em e.g., "A" and "a" are considered equivalent.
No symbol in the pattern is treated as a meta character.
For example, agrep \-k 'a(b|c)*d' foo will find
the occurrences of a(b|c)*d in foo whereas agrep 'a(b|c)*d' foo
will find substrings in foo that match the regular expression 'a(b|c)*d'.
List only the files that contain a match.
This option is useful for looking for files containing a certain pattern.
For example, " agrep \-l 'wonderful' * " will list the names of those
files in the current directory that contain the word 'wonderful'.
Each line that is printed is prefixed by its record number in the file.
Find records in the text that contain a supersequence of the pattern.
For example,
\fBagrep \-p DCS foo\fP
will match "Department of Computer Science."
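.PP
A record contains a supersequence of the pattern when the pattern's
characters all occur in it in the same order, with anything in between.
The following Python sketch shows that test for the exact (zero-error)
case; the function name is hypothetical and the code is not taken from
agrep.
.nf
```python
def contains_supersequence(pattern, record):
    """True if every character of `pattern` occurs in `record`
    in the same order, with anything allowed in between."""
    remaining = iter(record)
    # `ch in remaining` consumes the iterator up to and including the
    # first hit, so later pattern characters must appear further right.
    return all(ch in remaining for ch in pattern)
```
.fi
Here contains_supersequence("DCS", "Department of Computer Science.")
is true, while the reordered pattern "DSC" does not match.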
Work silently, that is, display nothing except error messages.
This is useful for checking the error status.
Output the record starting from the end of
.I delim
to (and including) the next
.IR delim .
This is useful for cases where
.I delim
should come at the end of the record.
Inverse mode \(em display only those records that
do not contain the pattern.
Search for the pattern as a word \(em i.e., surrounded by non-alphanumeric
characters. The non-alphanumeric
characters must
surround the match; they cannot be counted as errors.
For example,
.B agrep
\-w \-1 car will match cars, but not characters.
The pattern must match the whole line.
Used with the \-B option. When \-y is on, agrep will always
output the best matches without giving a prompt.
When \-B is specified and no exact matches are found, agrep
will continue to search until the closest matches (i.e., the ones
with the minimum number of errors)
are found, at which point the following message will be shown:
"the best match contains x errors, there are y matches, output them? (y/n)"
The best match mode is not supported for standard input, e.g.,
When the \-#, \-c, or \-l options are specified, the \-B option is ignored.
In general, \-B may be slower than \-#, but not by very much.
Set the cost of a deletion to \fIk\fP (\fIk\fP is a positive integer).
This option does not currently work with regular expressions.
Output the files that contain a match.
Set the cost of an insertion to \fIk\fP (\fIk\fP is a positive integer).
This option does not currently work with regular expressions.
Set the cost of a substitution to \fIk\fP (\fIk\fP is a positive integer).
This option does not currently work with regular expressions.
.B agrep
supports a large variety of patterns, including simple
strings, strings with classes of characters, sets of strings,
wild cards, and regular expressions.
any sequence of characters, including the special symbols
`^' for beginning of line and `$' for end of line.
The special characters listed above (
) should be preceded by `\\' if they are to be matched as regular
characters. For example, \\^abc\\\\ corresponds to the string ^abc\\,
whereas ^abc corresponds to the string abc at the beginning of a
line.
\fBClasses of characters\fP
a list of characters inside [] (in order) corresponds to any character
from the list. For example, [a-ho-z] is any character between a and h
or between o and z. The symbol `^' inside [] complements the list.
For example, [^i-n] denotes any character in the character set except
the characters 'i' through 'n'.
The symbol `^' thus has two meanings, but this is consistent with
The symbol `.' (don't care) stands for any symbol (except for the
newline symbol).
\fBBoolean operations\fP
.B agrep
supports an `and' operation `;'
and an `or' operation `,',
but not a combination of both. For example, 'fast;network' searches
for all records containing both words.
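.PP
The meaning of the two operators can be sketched in Python for the
simplest case of exact substring matching. This is an illustration
with an invented function name, not agrep's matching code, and it
ignores errors and meta characters.
.nf
```python
def matches_boolean(pattern, record):
    """Exact-match sketch of the `;' (and) and `,' (or) operators."""
    if ";" in pattern and "," in pattern:
        # The two operators cannot be combined in one pattern.
        raise ValueError("cannot combine ';' and ',' in one pattern")
    if ";" in pattern:
        return all(part in record for part in pattern.split(";"))
    # A pattern without `,' splits into a single part, so this branch
    # also handles the plain-string case.
    return any(part in record for part in pattern.split(","))
```
.fi
With this sketch, 'fast;network' accepts "a fast local network" but
rejects "a fast car", while 'fast,network' accepts both.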
The symbol '#' is used to denote a wild card. # matches zero or more
arbitrary characters. For example,
ex#e matches example. The symbol # is equivalent to .* in egrep.
In fact, .* will work too, because it is a valid regular expression
(see below), but unless this is part of an actual regular expression,
\fBCombination of exact and approximate matching\fP
any pattern inside angle brackets <> must match the text exactly even
if the match is with errors. For example, <mathemat>ics matches
mathematical with one error (replacing the last s with an a), but
mathe<matics> does not match mathematical no matter how many errors we
allow.
\fBRegular expressions\fP
The syntax of regular expressions in \fBagrep\fP is in general the same as
that for \fBegrep\fP. The union operation `|', Kleene closure `*',
and parentheses () are all supported.
Currently '+' is not supported.
Regular expressions are currently limited to approximately 30
characters (generally excluding meta characters). Some options
(\-d, \-w, \-f, \-t, \-x, \-D, \-I, \-S) do not
currently work with regular expressions.
The maximal number of errors for regular expressions that use '*'
agrep \-2 \-c ABCDEFG foo
gives the number of lines in file foo that contain ABCDEFG
within two errors.
agrep \-1 \-D2 \-S2 'ABCD#YZ' foo
outputs the lines containing ABCD followed, within arbitrary
distance, by YZ, with up to one additional insertion
(\-D2 and \-S2 make deletions and substitutions too "expensive").
agrep \-5 \-p abcdefghij /usr/dict/words
outputs the list of all words containing at least 5 of the first 10
letters of the alphabet \fIin order\fR. (Try it: any list starting
with academia and ending with sacrilegious must mean something!)
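.PP
One way to read the last example is that a word qualifies when the
longest run of pattern letters it contains in order has length at
least 5. The Python sketch below computes that length as a longest
common subsequence. It is an interpretation of the stated result, not
agrep's algorithm, and the function name is made up.
.nf
```python
def longest_in_order(pattern, word):
    """Length of the longest common subsequence of pattern and word."""
    row = [0] * (len(word) + 1)
    for pc in pattern:
        diag = 0  # value for the previous pattern char, previous column
        for j, wc in enumerate(word, start=1):
            above = row[j]  # value for the previous pattern char
            if pc == wc:
                row[j] = diag + 1
            else:
                row[j] = max(row[j], row[j - 1])
            diag = above
    return row[-1]
```
.fi
Both academia (a, c, d, e, i) and sacrilegious (a, c, e, g, i) reach
exactly 5 against abcdefghij.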
agrep \-1 'abc[0-9](de|fg)*[x-z]' foo
outputs the lines containing, within up to one error, the string
that starts with abc followed by one digit, followed by zero or more
repetitions of either de or fg, followed by either x, y, or z.
agrep \-d '^From\ ' 'breakdown;internet' mbox
outputs all mail messages (the pattern '^From\ ' separates mail messages
in a mail file) that contain the keywords 'breakdown' and 'internet'.
agrep \-d '$$' \-1 '<word1> <word2>' foo
finds all paragraphs that contain word1 followed by word2 with one
error in place of the blank.
In particular, if word1 is the last word in a line and word2
is the first word in the next line, then the space will be
substituted by a newline symbol and it will match.
Thus, this is a way to overcome separation by a newline.
Note that \-d '$$' (or another delim which spans more than one line)
is necessary, because otherwise agrep searches
only one line at a time.
agrep '^agrep' <this manual>
outputs all the examples of the use of agrep in this man page.
Any bug reports or comments will be appreciated!
Please mail them to sw@cs.arizona.edu or udi@cs.arizona.edu.
Regular expressions do not support the '+' operator (match 1 or more
instances of the preceding token). These can be searched for by using
this syntax in the pattern:
\&'\fIpattern\fB(\fIpattern\fB)*\fR'
(search for strings containing one instance of the pattern, followed by 0 or
more instances of the pattern).
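.PP
The rewrite can be checked with any regular expression engine that
does support '+'. The Python sketch below uses Python's own re module
as a stand-in (not agrep) and a hypothetical helper, expand_plus, to
compare the two forms on a few sample strings. A token that contains
`|' would need its own parentheses before being repeated this way.
.nf
```python
import re


def expand_plus(token):
    """Build the '+'-free form: one copy of token, then (token)*."""
    return token + "(" + token + ")*"


rewritten = re.compile(expand_plus("ab"))  # the pattern ab(ab)*
with_plus = re.compile("(ab)+")            # the form agrep lacks

samples = ["", "ab", "abab", "aba", "ababab", "ba"]
# Both forms should accept and reject exactly the same strings.
agree = all(
    bool(rewritten.fullmatch(s)) == bool(with_plus.fullmatch(s))
    for s in samples
)
```
.fi
For these samples the two patterns agree on every string.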
The following can cause an infinite loop:
.B agrep
pattern * > output_file.
If the number of matches is high, they may be deposited in
output_file before it is completely read, leading to more matches of
the pattern within output_file (the matches are against the whole
directory). It's not clear whether this is a "bug" (grep will do the
same), but be warned.
The maximum size of the
.I patternfile
is limited to 250Kb, and the maximum number of patterns
is limited to 30,000.
Standard input is the default if no input file is given.
However, if standard input is keyed in directly (as opposed to through
a pipe, for example) agrep may not work for some non-simple patterns.
There is no size limit for simple patterns.
More complicated patterns are currently limited to approximately 30 characters.
Lines are limited to 1024 characters.
Records are limited to 48K, and may be truncated if they are larger
than that.
The limit of record length can be
changed by modifying the parameter Max_record in agrep.h.
Exit status is 0 if any matches are found,
1 if none, 2 for syntax errors or inaccessible files.
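.PP
A calling program can branch on these three values. The Python sketch
below is one possible wrapper, assuming an agrep executable is on the
search path; the helper names describe_exit and has_match are
hypothetical, and only options documented above (\-s and \-#) are used.
.nf
```python
import subprocess

EXIT_MEANING = {
    0: "match found",
    1: "no match",
    2: "syntax error or inaccessible file",
}


def describe_exit(code):
    """Translate an agrep exit status into a short description."""
    return EXIT_MEANING.get(code, "unexpected status")


def has_match(pattern, filename, errors=0):
    """Run agrep silently (-s) and report whether any record matched.
    Assumes an `agrep` binary can be found on PATH."""
    args = ["agrep", "-s"]
    if errors > 0:
        args.append("-" + str(errors))  # for example -2
    args += [pattern, filename]
    result = subprocess.run(args)
    if result.returncode not in (0, 1):
        raise RuntimeError(describe_exit(result.returncode))
    return result.returncode == 0
```
.fi
Passing the arguments as a list avoids the shell, so the quoting
advice given earlier for the pattern does not apply to this call.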
Sun Wu and Udi Manber, Department of Computer Science, University of
Arizona, Tucson, AZ 85721. {sw|udi}@cs.arizona.edu.