@node Variable Typing
@subsubsection String Type versus Numeric Type

Scalar objects in @command{awk} (variables, array elements, and fields)
are @emph{dynamically} typed. This means their type can change as the
program runs, from @dfn{untyped} before any use,@footnote{@command{gawk}
calls this @dfn{unassigned}, as the following example shows.} to string
or number, and then from string to number or number to string, as the
program progresses.

You can't do much with untyped variables, other than tell that they
are untyped. The following program tests @code{a} against @code{""}
and @code{0}; the test succeeds when @code{a} has never been assigned
a value. It also uses the built-in @code{typeof()} function
(not presented yet; @pxref{Type Functions}) to show @code{a}'s type:

@example
$ @kbd{gawk 'BEGIN @{ print (a == "" && a == 0 ?}
> @kbd{"a is untyped" : "a has a type!") ; print typeof(a) @}'}
@print{} a is untyped
@print{} unassigned
@end example

A scalar has numeric type when assigned a numeric value,
such as from a numeric constant, or from another scalar
with numeric type:

@example
$ @kbd{gawk 'BEGIN @{ a = 42 ; print typeof(a)}
> @kbd{b = a ; print typeof(b) @}'}
@print{} number
@print{} number
@end example

Similarly, a scalar has string type when assigned a string
value, such as from a string constant, or from another scalar
with string type:

@example
$ @kbd{gawk 'BEGIN @{ a = "forty two" ; print typeof(a)}
> @kbd{b = a ; print typeof(b) @}'}
@print{} string
@print{} string
@end example

So far, this is all simple and straightforward. What happens, though,
when @command{awk} has to process data from a user? Let's start with
field data. What should the following command produce as output?

@example
echo hello | awk '@{ printf("%s %s < 42\n", $1,
                   ($1 < 42 ? "is" : "is not")) @}'
@end example

Since @samp{hello} is alphabetic data, @command{awk} can only do a string
comparison. Internally, it converts @code{42} into @code{"42"} and compares
the two string values @code{"hello"} and @code{"42"}. Here's the result:

@example
$ @kbd{echo hello | awk '@{ printf("%s %s < 42\n", $1,}
> @kbd{ ($1 < 42 ? "is" : "is not")) @}'}
@print{} hello is not < 42
@end example

However, what happens when data from a user @emph{looks like} a number?
On the one hand, in reality, the input data consists of characters, not
numeric values. But, on the other hand, the data looks numeric, and
@command{awk} really ought to treat it as such. And indeed, it does:

@example
$ @kbd{echo 37 | awk '@{ printf("%s %s < 42\n", $1,}
> @kbd{ ($1 < 42 ? "is" : "is not")) @}'}
@print{} 37 is < 42
@end example

Here are the rules for when @command{awk}
treats data as a number, and for when it treats data as a string.

@cindex numeric, strings
@cindex strings, numeric
@cindex POSIX @command{awk}, numeric strings and
The POSIX standard uses the term @dfn{numeric string} for input data that
looks numeric. The @samp{37} in the previous example is a numeric string.
So what is the type of a numeric string? Answer: numeric.
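
As a rough illustration of the numeric-string idea, the following shell
sketch feeds two fields that are spelled differently but look numeric to
@command{awk} and compares them. It assumes a POSIX-conforming
@command{awk} on the search path; the function name @code{same_number}
is made up for this example and is not part of any tool.

```shell
# Hypothetical helper (name invented for this sketch).
same_number() {
    # Both fields come from input and look numeric, so they are
    # numeric strings and are compared by value, not by spelling.
    # Prints 1 if the comparison is true, 0 otherwise.
    echo '+2 2.0' | awk '{ print ($1 == $2) }'
}

same_number
```

With @command{gawk} this should print @samp{1}. A string comparison of
the same characters would report a difference, because @samp{+2} and
@samp{2.0} are not spelled the same way.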

The type of a variable is important because the types of two variables
determine how they are compared.
Variable typing follows these definitions and rules:

@itemize @value{BULLET}
@item
Fields, @code{getline} input, @code{FILENAME}, @code{ARGV} elements,
@code{ENVIRON} elements, and the elements of an array created by
@code{match()}, @code{split()}, and @code{patsplit()} that are numeric
strings have the @dfn{strnum} attribute.@footnote{Thus, a POSIX
numeric string and @command{gawk}'s strnum are the same thing.}
Otherwise, they have
the @dfn{string} attribute. Uninitialized variables also have the
@dfn{strnum} attribute.
@end itemize
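
The rule for @code{split()} elements can be checked with a short sketch
like the one below. It assumes a POSIX-conforming @command{awk}; the
function name @code{split_vs_constant} and the variable names are
invented for this illustration.

```shell
# Hypothetical helper (name invented for this sketch).
split_vs_constant() {
    awk 'BEGIN {
        split("10 9", a)    # a[1] holds "10", created by split()
        s = "10"            # s is assigned from a string constant
        # First value:  a[1] looks numeric, so it is a strnum and the
        #               comparison with 9 is numeric.
        # Second value: s is a plain string, so the comparison with 9
        #               is a string comparison ("10" sorts before "9").
        printf "%d %d\n", (a[1] > 9), (s > 9)
    }'
}

split_vs_constant
```

With @command{gawk} this should print @samp{1 0}: the same two characters
compare differently depending on where the value came from.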

In short, when one operand is a ``pure'' string, such as a string
constant, then a string comparison is performed. Otherwise, a
numeric comparison is performed.
(The primary difference between a number and a strnum is that
for strnums @command{gawk} preserves the original string value that
the scalar had when it came in.)
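
The following sketch shows both halves of that summary on a single
input field: the kind of comparison depends on the other operand, and
the field keeps its original text. It assumes a POSIX-conforming
@command{awk}; the function name @code{compare_both_ways} is hypothetical.

```shell
# Hypothetical helper (name invented for this sketch).
compare_both_ways() {
    echo '10.0' | awk '{
        # ($1 < 9):    strnum against a number, numeric comparison
        # ($1 < "9"):  strnum against a string constant, string comparison
        # $1:          printed as a string, the input text is preserved
        # $1 + 0:      forced to a number, printed as the value 10
        printf "%d %d %s %s\n", ($1 < 9), ($1 < "9"), $1, $1 + 0
    }'
}

compare_both_ways
```

With @command{gawk} this should print @samp{0 1 10.0 10}. The first two
values differ only because the second comparison involves a string
constant, and the third value shows that the trailing @samp{.0} typed by
the user is not lost.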

This point bears additional emphasis:
Input that looks numeric @emph{is} numeric.
All other input is treated as strings.

Thus, the six-character input string @w{@samp{ +3.14}} receives the
strnum attribute. In contrast, the eight characters
@w{@code{" +3.14"}} appearing in program text comprise a string constant.
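
A small sketch of this distinction, assuming a POSIX-conforming
@command{awk}, compares the same characters once as input and once as a
string constant in the program text. The function name
@code{input_vs_constant} is made up for this example.

```shell
# Hypothetical helper (name invented for this sketch).
input_vs_constant() {
    # The record is the six characters: space, plus sign, 3.14
    printf ' +3.14\n' | awk '{
        # First value:  $0 came from input and looks numeric (strnum),
        #               so it is compared to 3.14 as a number.
        # Second value: the string constant forces a string comparison
        #               against the string form of 3.14.
        # Third value:  the length of the input record.
        printf "%d %d %d\n", ($0 == 3.14), (" +3.14" == 3.14), length($0)
    }'
}

input_vs_constant
```

With @command{gawk} this should print @samp{1 0 6}: the input compares
equal to 3.14, the string constant does not, and the record is six
characters long.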

The API defines several simple @code{struct}s that map values as seen
from @command{awk}. A value can be a @code{double}, a string, or an
array (as in multidimensional arrays, or when creating a new array).

String values maintain both pointer and length, because embedded @sc{nul}
characters are allowed.

@quotation NOTE
By intent, @command{gawk} maintains strings using the current multibyte
encoding (as defined by @env{LC_@var{xxx}} environment variables)
and not using wide characters. This matches how @command{gawk} stores
strings internally and also how characters are likely to be input into
and output from files.

String values passed to an extension by @command{gawk} are always
@sc{nul}-terminated. Thus it is safe to pass such string values to
standard library and system routines. However, because
@command{gawk} allows embedded @sc{nul} characters in string data,
you should check that @samp{strlen(@var{some_string})} matches
the length for that string passed to the extension before using
it as a regular C string.
@end quotation