~jonas-drange/ubuntu-start-page/1252899-mobile-friendly

<li><a href="#internationalization-and-localization" id="id12" name="id12" class="reference">3 Internationalization and Localization</a><ul class="auto-toc">

<li><a href="#getting-started" id="id13" name="id13" class="reference">3.1 Getting Started</a></li>

<li><a href="#testing-the-application" id="id14" name="id14" class="reference">3.2 Testing the Application</a></li>

<li><a href="#missing-translations" id="id15" name="id15" class="reference">3.3 Missing Translations</a></li>

<li><a href="#translations-within-templates" id="id16" name="id16" class="reference">3.4 Translations Within Templates</a></li>

<li><a href="#producing-a-python-egg" id="id17" name="id17" class="reference">3.5 Producing a Python Egg</a></li>

<li><a href="#plural-forms" id="id18" name="id18" class="reference">3.6 Plural Forms</a></li>

</ul>

</li>

<li><a href="#summary" id="id19" name="id19" class="reference">4 Summary</a></li>

<li><a href="#further-reading" id="id20" name="id20" class="reference">5 Further Reading</a></li>

</ul>

</div>

<p>Internationalization and localization are means of adapting software for

non-native environments, especially for other nations and cultures.</p>

<p>Parts of an application which might need to be localized include:</p>
<blockquote>
<ul class="simple">
<li>Language</li>
<li>Date/time format</li>
<li>Formatting of numbers e.g. decimal points, positioning of separators,
character used as separator</li>
<li>Time zones (UTC in internationalized environments)</li>
<li>Currency</li>
<li>Weights and measures</li>
</ul>
</blockquote>

<p>The distinction between internationalization and localization is subtle but

important. Internationalization is the adaptation of products for potential use

virtually everywhere, while localization is the addition of special features

for use in a specific locale.</p>

<p>For example, in terms of language used in software, internationalization is the

process of marking up all strings that might need to be translated whilst

localization is the process of producing translations for a particular locale.</p>

<p>Pylons provides built-in support to enable you to internationalize language but

leaves you to handle any other aspects of internationalization which might be

appropriate to your application.</p>

<p class="last">Internationalization is often abbreviated as I18N (or i18n or I18n) where the

number 18 refers to the number of letters omitted.

Localization is often abbreviated L10n or l10n in the same manner. These

abbreviations also avoid picking one spelling (internationalisation vs.

internationalization, etc.) over the other.</p>

</div>

<p>In order to represent characters from multiple languages, you will need to use

Unicode so this documentation will start with a description of why Unicode is

useful, its history and how to use Unicode in Python.</p>

<h1><a href="#id1" id="understanding-unicode" name="understanding-unicode" class="toc-backref">1 Understanding Unicode</a></h1>

<p>If you've ever come across text in a foreign language that contains lots of

<tt class="docutils literal"><span class="pre">????</span></tt> characters or have written some Python code and received a message

such as <tt class="docutils literal"><span class="pre">UnicodeDecodeError:</span> <span class="pre">'ascii'</span> <span class="pre">codec</span> <span class="pre">can't</span> <span class="pre">decode</span> <span class="pre">byte</span> <span class="pre">0xff</span> <span class="pre">in</span> <span class="pre">position</span>

<span class="pre">6:</span> <span class="pre">ordinal</span> <span class="pre">not</span> <span class="pre">in</span> <span class="pre">range(128)</span></tt> then you have run into a problem with character

sets, encodings, Unicode and the like.</p>

<p>The truth is that many developers are put off by Unicode because most of the
time it is possible to muddle through rather than take the time to learn the
basics. To make the problem worse, if you have a system that manages to fudge
the issues and just about works, and you then start trying to do things properly
with Unicode, it often highlights problems in other parts of your code.</p>

<p>The good news is that Python has great Unicode support, so the rest of
this article will show you how to correctly use Unicode in Pylons to avoid
unwanted <tt class="docutils literal"><span class="pre">?</span></tt> characters and <tt class="docutils literal"><span class="pre">UnicodeDecodeErrors</span></tt>.</p>

<h2><a href="#id2" id="what-is-unicode" name="what-is-unicode" class="toc-backref">1.1 What is Unicode?</a></h2>
<p>When computers were first being used the characters that were most important
were unaccented English letters. Each of these letters could be represented by
a number between 32 and 127 and thus was born ASCII, a character set where
space was 32, the letter "A" was 65 and everything could be stored in 7 bits.</p>

<p>Most computers in those days were using 8-bit bytes so people quickly realized
that they could use the codes 128-255 for their own purposes. Different people
used the codes 128-255 to represent different characters and before long these
different sets of characters were also standardized into <em>code pages</em>. This
meant that if you needed some non-ASCII characters in a document you could also
specify a code page which would define which extra characters were available.
For example, in Israel DOS used a code page called 862, while Greek users used 737.
This just about worked for Western languages provided you didn't want to write
a Hebrew document with Greek characters, but it didn't work at all for Asian
languages where there are many more characters than can be represented in 8
bits.</p>

<p>Unicode is a character set that solves these problems by uniquely defining
<em>every</em> character that is used anywhere in the world. Rather than defining a
character as a particular combination of bits in the way ASCII does, each
character is assigned a <em>code point</em>. For example the word <tt class="docutils literal"><span class="pre">Hello</span></tt> is made
from code points <tt class="docutils literal"><span class="pre">U+0048</span> <span class="pre">U+0065</span> <span class="pre">U+006C</span> <span class="pre">U+006C</span> <span class="pre">U+006F</span></tt>. The full list of code
points can be found at <a href="http://www.unicode.org/charts/" class="reference">http://www.unicode.org/charts/</a>.</p>
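<p>As a quick illustration of code points, the sketch below lists the code point of
each character in a string. It is written for Python 3, where <tt class="docutils literal"><span class="pre">str</span></tt> is already a
Unicode type (an assumption, since the examples in this document target Python 2),
and the helper name <tt class="docutils literal"><span class="pre">code_points</span></tt> is made up for this example.</p>

```python
def code_points(text):
    """Return the code points of ``text`` in U+XXXX notation."""
    return ["U+%04X" % ord(ch) for ch in text]


# Each character maps to exactly one code point, whatever encoding is used later.
print(" ".join(code_points("Hello")))
```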

<p>There are lots of different ways of encoding Unicode code points into bits but
the most popular encoding is UTF-8. Using UTF-8, every code point from 0-127 is
stored in a single byte. Code points 128 and above are stored using 2, 3 or 4
bytes (the original UTF-8 design allowed sequences of up to 6 bytes, but the
current standard, RFC 3629, stops at 4). This has the useful side effect that
English text looks exactly the same in UTF-8 as it did in ASCII, because for every
ASCII character with hexadecimal value 0xXY, the corresponding Unicode
code point is U+00XY. This backwards compatibility is why, if you are developing
an application that is only used by English speakers, you can often get away
without handling characters properly and still expect things to work most of
the time. Of course, if you use a different encoding such as UTF-16 this
doesn't apply since none of the code points are encoded to 8 bits.</p>
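<p>The variable-length behaviour described above can be checked directly. This is a
minimal sketch for Python 3 (an assumption; the rest of this document uses Python 2),
and the sample characters are chosen only for illustration.</p>

```python
# One sample character from each UTF-8 length class.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")
    print("U+%04X takes %d byte(s) in UTF-8" % (ord(ch), len(encoded)))

# Number of bytes each sample needs once encoded as UTF-8.
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]

# Plain English text is byte-for-byte identical in ASCII and UTF-8.
same_as_ascii = "hello".encode("utf-8") == "hello".encode("ascii")
```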

<p>The important things to note from the discussion so far are that:</p>
<ul>
<li><p class="first">Unicode can represent pretty much any character in any writing system in
widespread use today</p>
</li>
<li><p class="first">Unicode uses code points to represent characters and the way these map to bits
in memory depends on the encoding</p>
</li>
<li><dl class="first docutils">
<dt>The most popular encoding is UTF-8 which has several convenient properties:</dt>
<dd><ol class="first last arabic simple">
<li>It can handle any Unicode code point</li>
<li>A Unicode string is turned into a string of bytes containing no embedded
zero bytes. This avoids byte-ordering issues, and means UTF-8 strings can be
processed by C functions such as strcpy() and sent through protocols that can't
handle zero bytes</li>
<li>A string of ASCII text is also valid UTF-8 text</li>
<li>UTF-8 is fairly compact; most commonly used characters are turned into
between one and three bytes, and values less than 128 occupy only a single byte.</li>
<li>If bytes are corrupted or lost, it's possible to determine the start of
the next UTF-8-encoded code point and resynchronize.</li>
</ol>
</dd>
</dl>
</li>
</ul>

<div class="note">
<p class="first admonition-title">Note</p>
<p class="last">Since Unicode 3.1, some extensions have even been defined so that the
defined range is now U+000000 to U+10FFFF (21 bits). Formally, the
character set was originally defined as 31 bits to allow for future expansion,
although current versions of the standards limit code points to U+10FFFF. It is a myth
that there are 65,536 Unicode code points and that every Unicode letter can
really be squeezed into two bytes. It is also incorrect to think that UTF-8 can
represent fewer characters than UTF-16. UTF-8 simply uses a variable number of
bytes for a character, sometimes just one byte (8 bits).</p>
</div>
</div>

<h2><a href="#id3" id="unicode-in-python" name="unicode-in-python" class="toc-backref">1.2 Unicode in Python</a></h2>
<p>In Python, Unicode strings are expressed as instances of the built-in
<tt class="docutils literal"><span class="pre">unicode</span></tt> type. Under the hood, Python represents Unicode strings as sequences
of either 16 or 32 bit integers, depending on how the Python interpreter was compiled.</p>

<p>The <tt class="docutils literal"><span class="pre">unicode()</span></tt> constructor has the signature <tt class="docutils literal"><span class="pre">unicode(string[,</span> <span class="pre">encoding,</span>
<span class="pre">errors])</span></tt>. All of its arguments should be 8-bit strings. The first argument is
converted to Unicode using the specified encoding; if you leave off the
encoding argument, the ASCII encoding is used for the conversion, so characters
greater than 127 will be treated as errors:</p>
<pre class="literal-block">
>>> unicode('hello')
u'hello'
>>> s = unicode('hello')
>>> type(s)
&lt;type 'unicode'&gt;
>>> unicode('hello' + chr(255))
Traceback (most recent call last):
  File "&lt;stdin&gt;", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 5:
                    ordinal not in range(128)
</pre>

<p>The <tt class="docutils literal"><span class="pre">errors</span></tt> argument specifies what to do if the string can't be decoded
using the chosen encoding (ASCII by default). Legal values for this argument are <tt class="docutils literal"><span class="pre">'strict'</span></tt> (raise a
<tt class="docutils literal"><span class="pre">UnicodeDecodeError</span></tt> exception), <tt class="docutils literal"><span class="pre">'replace'</span></tt> (replace the character that
can't be decoded with U+FFFD, the replacement character), or <tt class="docutils literal"><span class="pre">'ignore'</span></tt> (just leave the character
out of the Unicode result).</p>
<blockquote>
<pre class="literal-block">
>>> unicode('\x80abc', errors='strict')
Traceback (most recent call last):
  File "&lt;stdin&gt;", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0:
                    ordinal not in range(128)
>>> unicode('\x80abc', errors='replace')
u'\ufffdabc'
>>> unicode('\x80abc', errors='ignore')
u'abc'
</pre>
</blockquote>
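<p>For readers on Python 3, the same three error handlers are available on
<tt class="docutils literal"><span class="pre">bytes.decode()</span></tt>. The following is a sketch under that assumption (Python 3
rather than the Python 2 used elsewhere in this document); the variable names are
illustrative only.</p>

```python
data = b"\x80abc"  # the byte 0x80 is not valid ASCII

# 'strict' (the default) raises an exception that records where decoding failed.
try:
    data.decode("ascii", errors="strict")
    failed_at = None
except UnicodeDecodeError as exc:
    failed_at = exc.start

# 'replace' substitutes U+FFFD for each undecodable byte.
replaced = data.decode("ascii", errors="replace")

# 'ignore' silently drops undecodable bytes.
ignored = data.decode("ascii", errors="ignore")

print(failed_at, repr(replaced), repr(ignored))
```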

<p>It is important to understand the difference between <em>encoding</em> and <em>decoding</em>.
Unicode strings are considered to be the Unicode code points but any
representation of the Unicode string has to be encoded to something else, for
example UTF-8 or ASCII. So when you are converting an ASCII or UTF-8 string to
Unicode you are <em>decoding</em> it and when you are converting from Unicode to UTF-8
or ASCII you are <em>encoding</em> it. This is why the error in the example above says
that the ASCII codec cannot decode the byte <tt class="docutils literal"><span class="pre">0x80</span></tt>: it is
not in <tt class="docutils literal"><span class="pre">range(128)</span></tt>, that is, 0-127. In fact <tt class="docutils literal"><span class="pre">0x80</span></tt> is hex for 128,
which is the first number outside the ASCII range. However, if we tell Python that
the character <tt class="docutils literal"><span class="pre">0x80</span></tt> is encoded with the <tt class="docutils literal"><span class="pre">'latin-1'</span></tt>, <tt class="docutils literal"><span class="pre">'iso_8859_1'</span></tt> or
<tt class="docutils literal"><span class="pre">'8859'</span></tt> character sets (which incidentally are different names for the same
thing) we get the result we expected:</p>
<pre class="literal-block">
>>> unicode('\x80', encoding='latin-1')
u'\x80'
</pre>
<div class="note">
<p class="first admonition-title">Note</p>
<p class="last">The character encodings Python supports are listed at
<a href="http://docs.python.org/lib/standard-encodings.html" class="reference">http://docs.python.org/lib/standard-encodings.html</a></p>
</div>
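<p>The difference between encoding and decoding can be seen in a short round trip.
This sketch assumes Python 3, where text is <tt class="docutils literal"><span class="pre">str</span></tt> and encoded data is
<tt class="docutils literal"><span class="pre">bytes</span></tt>; the sample word is arbitrary.</p>

```python
text = "fran\u00e7ais"

# Encoding turns code points into bytes; the bytes depend on the encoding.
as_utf8 = text.encode("utf-8")
as_latin1 = text.encode("latin-1")

# Decoding with the matching encoding recovers the original text.
round_trip_ok = (as_utf8.decode("utf-8") == text
                 and as_latin1.decode("latin-1") == text)

# Decoding with the wrong encoding fails (or, with other data, yields garbage).
try:
    as_latin1.decode("utf-8")
    mismatch_detected = False
except UnicodeDecodeError:
    mismatch_detected = True

print(as_utf8, as_latin1, round_trip_ok, mismatch_detected)
```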

<p>Unicode objects in Python have most of the same methods that normal Python
strings provide. Python will try to use the <tt class="docutils literal"><span class="pre">'ascii'</span></tt> codec to convert
strings to Unicode if you do an operation on both types:</p>
<pre class="literal-block">
>>> a = 'hello'
>>> b = unicode(' world!')
>>> a + b
u'hello world!'
</pre>
<p>You can encode a Unicode string using a particular encoding like this:</p>
<pre class="literal-block">
>>> u'Hello World!'.encode('UTF-8')
'Hello World!'
</pre>
</div>

<h2><a href="#id4" id="unicode-literals-in-python-source-code" name="unicode-literals-in-python-source-code" class="toc-backref">1.3 Unicode Literals in Python Source Code</a></h2>
<p>In Python source code, Unicode literals are written as strings prefixed with
the 'u' or 'U' character:</p>
<pre class="literal-block">
>>> u'abcdefghijk'
>>> U'lmnopqrstuv'
</pre>
<p>You can also use the <tt class="docutils literal"><span class="pre">"</span></tt>, <tt class="docutils literal"><span class="pre">"""</span></tt> or <tt class="docutils literal"><span class="pre">'''</span></tt> versions. For example:</p>

<pre class="literal-block">
>>> u"""This
... is a really long
... Unicode string"""
</pre>
<p>Specific code points can be written using the <tt class="docutils literal"><span class="pre">\u</span></tt> escape sequence, which is
followed by four hex digits giving the code point. If you use <tt class="docutils literal"><span class="pre">\U</span></tt> instead
you specify 8 hex digits instead of 4. Unicode literals can also use the same
escape sequences as 8-bit strings, including <tt class="docutils literal"><span class="pre">\x</span></tt>, but <tt class="docutils literal"><span class="pre">\x</span></tt> only takes two
hex digits so it can't express all the available code points. You can create
one-character Unicode strings from a code point using the <tt class="docutils literal"><span class="pre">unichr()</span></tt> built-in function and find
out what the ordinal of a character is with <tt class="docutils literal"><span class="pre">ord()</span></tt>.</p>
<p>Here is an example demonstrating the different alternatives:</p>

<pre class="literal-block">
>>> s = u"\x66\u0072\u0061\U0000006e" + unichr(231) + u"ais"
>>> #     ^^^^ two-digit hex escape
>>> #         ^^^^^^ four-digit Unicode escape
>>> #                     ^^^^^^^^^^ eight-digit Unicode escape
>>> for c in s: print ord(c),
...
102 114 97 110 231 97 105 115
>>> print s
français
</pre>
<p>Using escape sequences for code points greater than 127 is fine in small doses
but Python 2.4 and above support writing Unicode literals in any encoding as
long as you declare the encoding being used by including a special comment as
either the first or second line of the source file:</p>

<pre class="literal-block">
#!/usr/bin/env python
# -*- coding: latin-1 -*-

u = u'abcdé'
print ord(u[-1])
</pre>
<p>If you don't include such a comment, the default encoding used will be ASCII.
Versions of Python before 2.4 were Euro-centric and assumed Latin-1 as a
default encoding for string literals; in Python 2.4, characters greater than
127 still work but result in a warning. For example, the following program has
no encoding declaration:</p>

<pre class="literal-block">
#!/usr/bin/env python
u = u'abcdé'
print ord(u[-1])
</pre>
<p>When you run it with Python 2.4, it will output the following warning:</p>

<pre class="literal-block">
sys:1: DeprecationWarning: Non-ASCII character '\xe9' in file testas.py on line
2, but no encoding declared; see http://www.python.org/peps/pep-0263.html for
details
</pre>
<p>and then the following output:</p>
<pre class="literal-block">
233
</pre>

<p>For real world use it is recommended that you use the UTF-8 encoding for your
file, but you must be sure that your text editor actually saves the file as
UTF-8; otherwise the Python interpreter will try to parse UTF-8 characters but
they will actually be stored as something else.</p>
<div class="note">
<p class="first admonition-title">Note</p>
<p class="last">Windows users who use the <a href="http://www.scintilla.org/SciTE.html" class="reference">SciTE</a>
editor can specify the encoding of their file using the
<tt class="docutils literal"><span class="pre">File->Encoding</span></tt> menu.</p>
</div>
<div class="note">
<p class="first admonition-title">Note</p>
<p class="last">If you are working with Unicode in detail you might also be interested in
the <tt class="docutils literal"><span class="pre">unicodedata</span></tt> module which can be used to find out Unicode properties
such as a character's name, category, numeric value and the like.</p>
</div>
</div>
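<p>As a brief illustration of the <tt class="docutils literal"><span class="pre">unicodedata</span></tt> module mentioned in the note
above, the sketch below looks up a few properties. It assumes Python 3 (the module
exists in Python 2 as well, but takes <tt class="docutils literal"><span class="pre">unicode</span></tt> objects there).</p>

```python
import unicodedata

ch = "\u00e9"  # the last character of u'abcd\xe9' from the examples above

name = unicodedata.name(ch)             # official character name
category = unicodedata.category(ch)     # general category, e.g. lowercase letter
half = unicodedata.numeric("\u00bd")    # numeric value of VULGAR FRACTION ONE HALF

print(name, category, half)
```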

<h2><a href="#id5" id="input-and-output" name="input-and-output" class="toc-backref">1.4 Input and Output</a></h2>
<p>We now know how to use Unicode in Python source code, but input and output
also need care when Unicode is involved. Of course, some libraries natively support
Unicode and if these libraries return Unicode objects you will not have to do
anything special to support them. XML parsers and SQL databases frequently
support Unicode for example.</p>

<p>If you remember from the discussion earlier, Unicode data consists of code
points. In order to send Unicode data via a socket or write it to a file you
usually need to encode it to a series of bytes and then decode the data back to
Unicode when reading it. You can of course perform the encoding manually,
reading a byte at a time, but since encodings such as UTF-8 can have variable
numbers of bytes per character it is usually much easier to use Python's
built-in support in the form of the <tt class="docutils literal"><span class="pre">codecs</span></tt> module.</p>

<p>The codecs module includes a version of the <tt class="docutils literal"><span class="pre">open()</span></tt> function that
returns a file-like object that assumes the file's contents are in a specified
encoding and accepts Unicode parameters for methods such as <tt class="docutils literal"><span class="pre">.read()</span></tt> and
<tt class="docutils literal"><span class="pre">.write()</span></tt>.</p>

<p>The function's parameters are open(filename, mode='rb', encoding=None,
errors='strict', buffering=1). <tt class="docutils literal"><span class="pre">mode</span></tt> can be 'r', 'w', or 'a', just like the
corresponding parameter to the regular built-in <tt class="docutils literal"><span class="pre">open()</span></tt> function. You can
add a <tt class="docutils literal"><span class="pre">+</span></tt> character to update the file. <tt class="docutils literal"><span class="pre">buffering</span></tt> is similar to the
standard function's parameter. <tt class="docutils literal"><span class="pre">encoding</span></tt> is a string giving the encoding to
use; if it is not specified or is specified as <tt class="docutils literal"><span class="pre">None</span></tt>, a regular Python file object
that accepts 8-bit strings is returned. Otherwise, a wrapper object is
returned, and data written to or read from the wrapper object will be converted
as needed. <tt class="docutils literal"><span class="pre">errors</span></tt> specifies the action for encoding errors and can be one
of the usual values of <tt class="docutils literal"><span class="pre">'strict'</span></tt>, <tt class="docutils literal"><span class="pre">'ignore'</span></tt>, or <tt class="docutils literal"><span class="pre">'replace'</span></tt> which we
saw earlier in this document when we were decoding strings with
<tt class="docutils literal"><span class="pre">unicode()</span></tt>.</p>

<p>Here is an example of how to read Unicode from a UTF-8 encoded file:</p>
<pre class="literal-block">
import codecs
f = codecs.open('unicode.txt', encoding='utf-8')
for line in f:
    print repr(line)
</pre>
<p>It's also possible to open files in update mode, allowing both reading and writing:</p>

<pre class="literal-block">
f = codecs.open('unicode.txt', encoding='utf-8', mode='w+')
f.write(u"\x66\u0072\u0061\U0000006e" + unichr(231) + u"ais")
f.seek(0)
print repr(f.readline()[:1])
f.close()
</pre>
<p>Notice that we used the <tt class="docutils literal"><span class="pre">repr()</span></tt> function to display the Unicode data. This
is very useful because if you tried to print the Unicode data directly, Python
would need to encode it before it could be sent to the console and, depending on
which characters were present and the character set used by the console, an
error might be raised. This is avoided if you use <tt class="docutils literal"><span class="pre">repr()</span></tt>.</p>
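<p>On Python 3 the built-in <tt class="docutils literal"><span class="pre">open()</span></tt> accepts an <tt class="docutils literal"><span class="pre">encoding</span></tt> argument
directly, so the same write-then-read round trip can be sketched without the
<tt class="docutils literal"><span class="pre">codecs</span></tt> module. This is an illustration under that assumption (Python 3, and a
temporary directory so that no existing file is touched), not a replacement for the
Python 2 examples above.</p>

```python
import os
import tempfile

text = "fran\u00e7ais\n"
path = os.path.join(tempfile.mkdtemp(), "unicode.txt")

# Text is encoded to UTF-8 on the way out...
with open(path, "w", encoding="utf-8", newline="\n") as f:
    f.write(text)

# ...and decoded back to text on the way in.
with open(path, "r", encoding="utf-8") as f:
    lines = [line for line in f]

# Opening in binary mode shows the bytes that were actually stored.
with open(path, "rb") as f:
    raw = f.read()

print(repr(lines), repr(raw))
```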

<p>The Unicode character <tt class="docutils literal"><span class="pre">U+FEFF</span></tt> is used as a byte-order mark or BOM, and is often
written as the first character of a file in order to assist with auto-detection
of the file's byte ordering. Some encodings, such as UTF-16, expect a BOM to be
present at the start of a file, but with others such as UTF-8 it isn't necessary.</p>
<p>When such an encoding is used, the BOM will be automatically written as the
first character and will be silently dropped when the file is read. There are
variants of these encodings, such as 'utf-16-le' and 'utf-16-be' for
little-endian and big-endian encodings, that specify one particular byte
ordering and don't skip the BOM.</p>
<div class="note">
<p class="first admonition-title">Note</p>
<p class="last">Some editors including SciTE will put a byte order mark (BOM) in the text
file when saved as UTF-8, which is strange because UTF-8 doesn't need BOMs.</p>
</div>
</div>
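<p>The BOM behaviour described in this section can be observed with a few calls. The
sketch below assumes Python 3; <tt class="docutils literal"><span class="pre">utf-8-sig</span></tt> is the standard codec name for
"UTF-8 with a BOM", which is what the editors mentioned in the note produce.</p>

```python
import codecs

# Plain 'utf-16' writes a BOM first; the explicit-endian variant does not.
utf16 = "hi".encode("utf-16")
utf16_le = "hi".encode("utf-16-le")
has_bom = utf16[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# Decoding with 'utf-16' silently drops the BOM again.
decoded = utf16.decode("utf-16")

# A UTF-8 file saved with a BOM: 'utf-8-sig' strips it, plain 'utf-8' keeps it.
utf8_with_bom = codecs.BOM_UTF8 + b"hi"
stripped = utf8_with_bom.decode("utf-8-sig")
kept = utf8_with_bom.decode("utf-8")

print(has_bom, repr(decoded), repr(stripped), repr(kept))
```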

<h2><a href="#id6" id="unicode-filenames" name="unicode-filenames" class="toc-backref">1.5 Unicode Filenames</a></h2>
<p>Most modern operating systems support the use of Unicode filenames. The
filenames are transparently converted to the underlying filesystem encoding.
The type of encoding depends on the operating system.</p>
<p>On Windows 9x, the encoding is <tt class="docutils literal"><span class="pre">mbcs</span></tt>.</p>
<p>On Mac OS X, the encoding is <tt class="docutils literal"><span class="pre">utf-8</span></tt>.</p>
<p>On Unix, the encoding is the user's preference according to the
result of nl_langinfo(CODESET), or None if the nl_langinfo(CODESET) failed.</p>
<p>On Windows NT+, file names are Unicode natively, so no conversion is performed.
getfilesystemencoding still returns <tt class="docutils literal"><span class="pre">mbcs</span></tt>, as this is the encoding that
applications should use when they explicitly want to convert Unicode strings to
byte strings that are equivalent when used as file names.</p>
<p><tt class="docutils literal"><span class="pre">mbcs</span></tt> is a special encoding for Windows that effectively means "use
whichever encoding is appropriate". In Python 2.3 and above you can find out
the system encoding with <tt class="docutils literal"><span class="pre">sys.getfilesystemencoding()</span></tt>.</p>
<p>Most file and directory functions and methods support Unicode. For example:</p>

<pre class="literal-block">
filename = u"\x66\u0072\u0061\U0000006e" + unichr(231) + u"ais"
f = open(filename, 'w')
f.write('Some data\n')
f.close()
</pre>
<p>Other functions such as <tt class="docutils literal"><span class="pre">os.listdir()</span></tt> will return Unicode if you pass a
Unicode argument and will try to return strings if you pass an ordinary 8 bit
string. For example running this example as <tt class="docutils literal"><span class="pre">test.py</span></tt>:</p>

<pre class="literal-block">
filename = u"Sample " + unichr(5000)
f = open(filename, 'w')
f.close()

import os
print os.listdir('.')
print os.listdir(u'.')
</pre>
<p>will produce the following output:</p>
<blockquote>
<pre class="literal-block">
['Sample ?', 'test.py']
[u'Sample \u1388', u'test.py']
</pre>
</blockquote>
</div>
</div>
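<p>The same "you get back what you pass in" behaviour of <tt class="docutils literal"><span class="pre">os.listdir()</span></tt> exists
on Python 3, with <tt class="docutils literal"><span class="pre">str</span></tt> and <tt class="docutils literal"><span class="pre">bytes</span></tt> taking the place of Unicode and 8-bit
strings. The following is a sketch under that assumption; it uses a temporary
directory and an ASCII file name so that it does not depend on the filesystem
encoding of the machine it runs on.</p>

```python
import os
import sys
import tempfile

directory = tempfile.mkdtemp()
with open(os.path.join(directory, "sample.txt"), "w") as f:
    f.write("Some data\n")

# The encoding used to convert between text and byte file names.
print(sys.getfilesystemencoding())

# A text path yields text names; a bytes path yields bytes names.
text_names = os.listdir(directory)
byte_names = os.listdir(os.fsencode(directory))

print(text_names, byte_names)
```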

<h1><a href="#id7" id="applying-this-to-web-programming" name="applying-this-to-web-programming" class="toc-backref">2 Applying this to Web Programming</a></h1>
<p>So far we've seen how to use encoding in source files and seen how to decode
text to Unicode and encode it back to text. We've also seen that Unicode
objects can be manipulated in similar ways to strings and we've seen how to
perform input and output operations on files. Next we are going to look at how
best to use Unicode in a web app.</p>
<p>The first rule is:</p>
<pre class="literal-block">
Your application should use Unicode for all strings internally, decoding
any input to Unicode as soon as it enters the application and encoding the
Unicode to UTF-8 or another encoding only on output.
</pre>

<p>If you fail to do this you will find that <tt class="docutils literal"><span class="pre">UnicodeDecodeError</span></tt> exceptions will start
popping up in unexpected places when Unicode strings are used with normal 8-bit
strings, because Python's default encoding is ASCII and it will try to decode
the 8-bit text as ASCII and fail. It is always better to do any encoding or decoding
at the edges of your application; otherwise you will end up patching lots of
different parts of your application unnecessarily as and when errors pop up.</p>
<p>Unless you have a very good reason not to, it is wise to use UTF-8 as the
default encoding since it is so widely supported.</p>
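<p>The "decode at the edges" rule can be sketched as a function that accepts bytes,
works purely with text, and returns bytes. This is a minimal illustration assuming
Python 3; <tt class="docutils literal"><span class="pre">handle_request</span></tt> and the greeting it builds are hypothetical and
not part of Pylons.</p>

```python
def handle_request(raw_input, encoding="utf-8"):
    """Decode on the way in, work with text, encode on the way out."""
    text = raw_input.decode(encoding)         # input edge: bytes -> text
    greeting = "Hello, %s!" % text.strip()    # application logic sees text only
    return greeting.encode(encoding)          # output edge: text -> bytes


print(handle_request("  Zo\u00eb ".encode("utf-8")))
```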

<p>The second rule is:</p>
<pre class="literal-block">
Always test your application with characters above 127 and above 255
wherever possible.
</pre>

<p>If you fail to do this you might think your application is working fine, but as
soon as your users do put in non-ASCII characters you will have problems.
Using Arabic is always a good test and www.google.ae is a good source of sample
text.</p>
<p>The third rule is:</p>
<pre class="literal-block">
Always do any checking of a string for illegal characters once it's in the
form that will be used or stored, otherwise the illegal characters might be
disguised.
</pre>

<p>For example, let's say you have a content management system that takes a
Unicode filename, and you want to disallow paths with a '/' character. You
might write this code:</p>
<pre class="literal-block">
def read_file(filename, encoding):
    if '/' in filename:
        raise ValueError("'/' not allowed in filenames")
    unicode_name = filename.decode(encoding)
    f = open(unicode_name, 'r')
    # ... return contents of file ...
</pre>
<p>This is INCORRECT. If an attacker could specify the 'base64' encoding, they
could pass <tt class="docutils literal"><span class="pre">L2V0Yy9wYXNzd2Q=</span></tt> which is the base-64 encoded form of the string
<tt class="docutils literal"><span class="pre">'/etc/passwd'</span></tt> which is a file you clearly don't want an attacker to get
hold of. The above code looks for <tt class="docutils literal"><span class="pre">/</span></tt> characters in the encoded form and
misses the dangerous character in the resulting decoded form.</p>
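<p>One way to follow the third rule is to decode first and only then validate. The
sketch below assumes Python 3, where base-64 is not available through
<tt class="docutils literal"><span class="pre">bytes.decode()</span></tt>, so it is handled explicitly with the <tt class="docutils literal"><span class="pre">base64</span></tt> module;
the helper names <tt class="docutils literal"><span class="pre">decode_name</span></tt> and <tt class="docutils literal"><span class="pre">checked_filename</span></tt> are hypothetical.</p>

```python
import base64


def decode_name(raw_name, encoding):
    """Decode a byte-string file name; 'base64' is treated as a special case."""
    if encoding == "base64":
        return base64.b64decode(raw_name).decode("ascii")
    return raw_name.decode(encoding)


def checked_filename(raw_name, encoding):
    """Decode first, then validate the form that will actually be used."""
    name = decode_name(raw_name, encoding)
    if "/" in name:
        raise ValueError("'/' not allowed in filenames")
    return name


print(checked_filename(b"report.txt", "utf-8"))
try:
    checked_filename(b"L2V0Yy9wYXNzd2Q=", "base64")
except ValueError as exc:
    print("rejected:", exc)
```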

<p>Those are the three basic rules so now we will look at some of the places you
might want to perform Unicode decoding in a Pylons application.</p>

<h2><a href="#id8" id="request-parameters" name="request-parameters" class="toc-backref">2.1 Request Parameters</a></h2>
<p>Currently the Pylons input values come from <tt class="docutils literal"><span class="pre">request.params</span></tt> but these are
not decoded to Unicode by default because not all input should be assumed to be
Unicode data.</p>
<p>However, if you would like to decode them, you can use the two functions below:</p>

<pre class="literal-block">
def decode_multi_dict(md, encoding="UTF-8", errors="strict"):
    """Given a MultiDict, decode all its parts from the given encoding.

    This modifies the MultiDict in place.

    encoding, errors
        These are passed to the decode function.

    """
    items = md.items()
    md.clear()
    for (k, v) in items:
        md.add(k.decode(encoding, errors),
               v.decode(encoding, errors))


def decode_request(request, encoding="UTF-8", errors="strict"):
    """Given a request object, decode GET and POST in place.

    This implicitly takes care of params as well.

    """
    decode_multi_dict(request.GET, encoding, errors)
    decode_multi_dict(request.POST, encoding, errors)
</pre>
<p>These can then be used as follows:</p>

<pre class="literal-block">
# decode_request() works in place on the request and returns nothing
decode_request(request)
unicode_params = request.params
</pre>
<p>This code is discussed in <a href="http://pylonshq.com/project/pylonshq/ticket/135" class="reference">ticket 135</a> but shouldn't be used with
file uploads since these shouldn't ordinarily be decoded to Unicode.</p>
</div>
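<p>The idea behind <tt class="docutils literal"><span class="pre">decode_multi_dict()</span></tt> can be tried without a Pylons request
object. The sketch below assumes Python 3 and stands in a plain list of
<tt class="docutils literal"><span class="pre">(key,</span> <span class="pre">value)</span></tt> byte pairs for the MultiDict; <tt class="docutils literal"><span class="pre">decode_pairs</span></tt> is a hypothetical
helper, and unlike the functions above it returns a new list rather than working in
place.</p>

```python
def decode_pairs(pairs, encoding="utf-8", errors="strict"):
    """Return a new list of (key, value) pairs decoded to text."""
    return [(key.decode(encoding, errors), value.decode(encoding, errors))
            for key, value in pairs]


# Byte strings as they might arrive from a UTF-8 encoded form submission.
raw_params = [(b"name", b"Zo\xc3\xab"), (b"city", b"K\xc3\xb6ln")]
params = decode_pairs(raw_params)
print(params)
```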

<h2><a href="#id9" id="templating" name="templating" class="toc-backref">2.2 Templating</a></h2>
<p>Pylons uses Myghty as its default templating language and Myghty 1.1 and above
fully support Unicode. The Myghty documentation explains how to use Unicode
at <a href="http://www.myghty.org/docs/unicode.myt" class="reference">http://www.myghty.org/docs/unicode.myt</a> but the important idea is that
you can use Unicode literals pretty much anywhere you can use normal 8-bit strings
including in <tt class="docutils literal"><span class="pre">m.write()</span></tt> and <tt class="docutils literal"><span class="pre">m.comp()</span></tt>. You can also pass Unicode data to
Pylons' <tt class="docutils literal"><span class="pre">render_response()</span></tt> and <tt class="docutils literal"><span class="pre">Response()</span></tt> callables.</p>
<p>Any Unicode data output by Myghty is automatically encoded to whichever
encoding you have chosen. The default is UTF-8 but you can choose which
encoding to use by editing your project's <tt class="docutils literal"><span class="pre">config/environment.py</span></tt> file and
adding an option like this:</p>

# Add your own Myghty config options here, note that all config options will override

532

# any Pylons config options

533

534

myghty['output_encoding'] = 'UTF-8'

535

</textarea><p>replacing <tt class="docutils literal"><span class="pre">UTF-8</span></tt> with the encoding you wish to use.</p>
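<p>As a rough illustration of what an output encoding setting implies, the sketch below encodes a Unicode string into bytes at the point of output. The helper name <tt class="docutils literal"><span class="pre">encode_output</span></tt> is hypothetical and is not part of Myghty or Pylons; it only shows that the same text yields different bytes under different encodings.</p>

```python
# Hypothetical helper, not a Myghty or Pylons API: encode text at the output edge.
def encode_output(text, output_encoding="UTF-8", errors="strict"):
    """Return the bytes that would be sent for a Unicode string."""
    return text.encode(output_encoding, errors)


# A Spanish greeting that starts with an inverted exclamation mark (U+00A1).
greeting = u"\u00a1Hola!"

utf8_bytes = encode_output(greeting)
latin1_bytes = encode_output(greeting, "ISO-8859-1")

print(repr(utf8_bytes))
print(repr(latin1_bytes))
```

<p>The non-ASCII character takes two bytes in UTF-8 and one in ISO-8859-1, which is why the declared charset and the bytes actually sent have to agree.</p>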

536

<p>If you need to disable Unicode support altogether you can set this:</p>

537

538

myghty['disable_unicode'] = True

539

</textarea><p>but again, you would have to have a good reason to want to do this.</p>

540

</div>

541

542

<h2><a href="#id10" id="output-encoding" name="output-encoding" class="toc-backref">2.3 Output Encoding</a></h2>

543

<p>Web pages should be generated with a specific encoding, most likely UTF-8. At
the very least, that means you should specify the following in the <tt class="docutils literal"><span class="pre">&lt;head&gt;</span></tt>
section:</p>

<pre class="literal-block">
&lt;meta http-equiv="Content-Type" content="text/html; charset=utf-8" /&gt;
</pre>

549

<p>You should also set the charset in the <tt class="docutils literal"><span class="pre">Content-Type</span></tt> header:</p>

response = Response(...)
response.headers['Content-type'] = 'text/html; charset=utf-8'

</textarea><p>If you specify that your output is UTF-8, the web browser will generally
submit form data back to you as UTF-8. If you want the browser to submit data using a different
character set, you can set the encoding by adding the <tt class="docutils literal"><span class="pre">accept-charset</span></tt>
attribute to your form. Here is an example:</p>

557

558

559

</pre>

560

<p>However, be forewarned that if the user tries to give you non-ASCII

561

text, then:</p>

562

563

564

<li>Firefox will translate the non-ASCII text into HTML entities.</li>

565

<li>IE will ignore your suggested encoding and give you UTF-8 anyway.</li>

566

</ul>

567

</blockquote>

568

<p>The lesson to be learned is that if you output UTF-8, you had better be

569

prepared to accept UTF-8 by decoding the data in <tt class="docutils literal"><span class="pre">request.params</span></tt> as

570

described in the section above entitled "Request Parameters".</p>
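<p>A minimal sketch of accepting UTF-8 at the edge follows. The function name <tt class="docutils literal"><span class="pre">decode_field</span></tt> is an assumption made for illustration and is not a Pylons API; it decodes one submitted value and, if the bytes are not valid UTF-8, falls back to replacing the offending bytes rather than raising.</p>

```python
# Hypothetical helper for illustration only; not part of Pylons.
def decode_field(raw, encoding="utf-8"):
    """Decode submitted bytes, substituting U+FFFD for undecodable bytes."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        # The client sent something other than what was asked for.
        return raw.decode(encoding, "replace")


good = decode_field(b"caf\xc3\xa9")   # valid UTF-8
bad = decode_field(b"caf\xe9")        # ISO-8859-1 bytes sent by mistake
```

<p>Whether to replace, reject or log such input is an application decision; the strict behaviour shown earlier in this document is the safer default.</p>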

571

<p>Another technique which is sometimes used to determine the character set is to

572

use an algorithm to analyse the input and guess the encoding based on

573

probabilities.</p>

574

<p>For instance, if you get a file, and you don't know what encoding it is encoded

575

in, you can often rename the file with a .txt extension and then try to open it

576

in Firefox. Then you can use the "View->Character Encoding" menu to try to

577

auto-detect the encoding.</p>
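<p>Real detectors weigh byte statistics, which is beyond a short example. The sketch below is a much cruder stand-in, offered only to show the general shape of the idea: try a list of candidate encodings in order and report the first one that decodes without error. The function name and the candidate list are assumptions.</p>

```python
# A crude stand-in for statistical detection: first candidate that decodes wins.
def guess_encoding(data, candidates=("ascii", "utf-8", "iso-8859-1")):
    """Return the name of the first candidate encoding that decodes data."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


plain = guess_encoding(b"hello")
utf8 = guess_encoding(b"caf\xc3\xa9")
latin1 = guess_encoding(b"caf\xe9")
```

<p>ISO-8859-1 assigns a character to every byte value, so it always succeeds and must come last; a successful decode shows only that the bytes are valid in that encoding, not that the guess is right.</p>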

578

</div>

579

580

<h2><a href="#id11" id="databases" name="databases" class="toc-backref">2.4 Databases</a></h2>

581

<p>Your database driver should automatically convert from Unicode objects to a

582

particular charset when writing and back again when reading. Again it is normal

583

to use UTF-8 which is well supported.</p>

584

<p>You should check your database's documentation for information on how it handles

585

Unicode.</p>

586

<p>For example MySQL's Unicode documentation is here

587

<a href="http://dev.mysql.com/doc/refman/5.0/en/charset-unicode.html" class="reference">http://dev.mysql.com/doc/refman/5.0/en/charset-unicode.html</a></p>

588

<p>Also note that you need to consider both the encoding of the database

589

and the encoding used by the database driver.</p>
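<p>The round trip can be sketched with the <tt class="docutils literal"><span class="pre">sqlite3</span></tt> module from the standard library. SQLite is used here only because it needs no server; it is an assumption for the example and says nothing about how MySQL or MySQLdb behave. The table and column names are made up.</p>

```python
import sqlite3

# In-memory database; the driver converts between Unicode objects and storage.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE greeting (lang TEXT, word TEXT)")

# Always pass values as parameters so the driver handles the conversion.
conn.execute("INSERT INTO greeting VALUES (?, ?)", ("es", u"\u00a1Hola!"))

row = conn.execute("SELECT word FROM greeting WHERE lang = ?", ("es",)).fetchone()
word = row[0]
conn.close()
```

<p>A quick test like this, run against your own database and driver, is a reasonable way to confirm that non-ASCII text survives both writing and reading.</p>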

590

<p>If you're using MySQL together with SQLAlchemy, see the following, as

591

there are some bugs in MySQLdb that you'll need to work around:</p>

592

<p><a href="http://www.mail-archive.com/sqlalchemy@googlegroups.com/msg00366.html" class="reference">http://www.mail-archive.com/sqlalchemy@googlegroups.com/msg00366.html</a></p>

593

</div>

594

</div>

595

596

<h1><a href="#id12" id="internationalization-and-localization" name="internationalization-and-localization" class="toc-backref">3 Internationalization and Localization</a></h1>

597

<p>By now you should have a good idea of what Unicode is, how to use it in Python
and which areas of your application need to pay specific attention to decoding and
encoding Unicode data.</p>

600

<p>This final section will look at the issue of making your application work with

601

multiple languages.</p>

602

603

<h2><a href="#id13" id="getting-started" name="getting-started" class="toc-backref">3.1 Getting Started</a></h2>

604

<p>Everywhere in your code where you want strings to be available in different

605

languages you wrap them in the <tt class="docutils literal"><span class="pre">_()</span></tt> function. There

606

are also a number of other translation functions which are documented in the API reference at

607

<a href="http://pylonshq.com/docs/module-pylons.i18n.translation.html" class="reference">http://pylonshq.com/docs/module-pylons.i18n.translation.html</a></p>

608

609

610

<p class="last">The <tt class="docutils literal"><span class="pre">_()</span></tt> function is a reference to the <tt class="docutils literal"><span class="pre">ugettext()</span></tt> function.

611

<tt class="docutils literal"><span class="pre">_()</span></tt> is a convention for marking text to be translated and saves on keystrokes.

612

<tt class="docutils literal"><span class="pre">ugettext()</span></tt> is the Unicode version of <tt class="docutils literal"><span class="pre">gettext()</span></tt>.</p>

613

</div>

614

<p>In our example we want the string <tt class="docutils literal"><span class="pre">'Hello'</span></tt> to appear in three different
languages: English, French and Spanish. We also want to display the word
<tt class="docutils literal"><span class="pre">'Hello'</span></tt> in the default language. We'll then go on to use some plural words
too.</p>

618

<p>Let's call our project <tt class="docutils literal"><span class="pre">translate_demo</span></tt>:</p>

619

620

paster create --template=pylons translate_demo

621

</pre>

622

<p>Now let's add a friendly controller that says hello:</p>

623

624

cd translate_demo

625

paster controller hello

626

</pre>

627

<p>Edit the <tt class="docutils literal"><span class="pre">controllers/hello.py</span></tt> controller to look like this, making use of the
<tt class="docutils literal"><span class="pre">_()</span></tt> function everywhere the string <tt class="docutils literal"><span class="pre">Hello</span></tt> appears:</p>

from translate_demo.lib.base import *

class HelloController(BaseController):

    def index(self):
        resp = Response()
        resp.write('Default: %s<br />' % _('Hello'))
        for lang in ['fr','en','es']:
            h.set_lang(lang)
            resp.write("%s: %s<br />" % (h.get_lang(), _('Hello')))
        return resp

</textarea><p>When writing your controllers it is important not to piece sentences together manually because
some languages order the parts of a sentence differently. As an example this is bad:</p>

# BAD!
msg = _("He told her ")
msg += _("not to go outside.")

</textarea><p>but this is perfectly acceptable:</p>

# GOOD
msg = _("He told her not to go outside")

</textarea><p>The controller has now been internationalized but it will raise a <tt class="docutils literal"><span class="pre">LanguageError</span></tt>

652

until we have specified the alternative languages.</p>

653

<p>Pylons uses <a href="http://www.gnu.org/software/gettext/" class="reference">GNU gettext</a> to handle
internationalization. GNU gettext uses three types of files in the
translation framework.</p>

656

<p>POT (Portable Object Template) files</p>

657

658

The first step in the localization process. A program is used to search through

659

your project's source code and pick out every string passed to one of the

660

translation functions, such as <tt class="docutils literal"><span class="pre">_()</span></tt>. This list is put together in a

661

specially-formatted template file that will form the basis of all

662

translations. This is the <tt class="docutils literal"><span class="pre">.pot</span></tt> file.</blockquote>

663

<p>PO (Portable Object) files</p>

664

665

The second step in the localization process. Using the POT file as a template,
the list of messages is translated and saved as a <tt class="docutils literal"><span class="pre">.po</span></tt> file.</blockquote>

667

<p>MO (Machine Object) files</p>

668

669

The final step in the localization process. The PO file is run through a
program that turns it into an optimized machine-readable binary file, which is
the <tt class="docutils literal"><span class="pre">.mo</span></tt> file. Compiling the translations into this binary format makes the
localized program much faster in retrieving the translations while it is

673

running.</blockquote>
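<p>To make the last step less abstract, here is a sketch that writes a tiny MO catalog in memory and loads it with Python's standard <tt class="docutils literal"><span class="pre">gettext</span></tt> module. The <tt class="docutils literal"><span class="pre">build_mo</span></tt> function is hypothetical and covers only the simplest case (no plurals, no hash table); in practice you would let poEdit or the GNU tools produce the file.</p>

```python
import gettext
import io
import struct


def build_mo(messages):
    """Return the bytes of a minimal little-endian MO catalog."""
    # The entry with an empty msgid is the catalog header; it declares the charset.
    entries = {u"": u"Content-Type: text/plain; charset=UTF-8\n"}
    entries.update(messages)

    # MO files list their msgids in sorted order.
    ids = sorted(entries)
    originals = [msgid.encode("utf-8") for msgid in ids]
    translations = [entries[msgid].encode("utf-8") for msgid in ids]

    count = len(ids)
    header_size = 7 * 4              # seven 32-bit fields
    table_size = count * 8           # one (length, offset) pair per string
    strings_start = header_size + 2 * table_size

    table = []
    blob = b""
    for text in originals + translations:
        table.append((len(text), strings_start + len(blob)))
        blob += text + b"\0"         # strings are NUL-terminated

    out = struct.pack(
        "<7I",
        0x950412DE,                  # magic number
        0,                           # format revision
        count,
        header_size,                 # offset of the msgid table
        header_size + table_size,    # offset of the msgstr table
        0, 0,                        # no hash table
    )
    for length, offset in table:
        out += struct.pack("<2I", length, offset)
    return out + blob


catalog = gettext.GNUTranslations(io.BytesIO(build_mo({u"Hello": u"\u00a1Hola!"})))
translated = catalog.gettext("Hello")
untranslated = catalog.gettext("World!")
```

<p>The point of the format is visible in the layout: two index tables followed by the raw strings, so a lookup needs no parsing of PO syntax at run time.</p>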

674

<p>Versions of Pylons prior to 0.9.4 came with a setuptools extension to help with
the extraction of strings and production of a <tt class="docutils literal"><span class="pre">.mo</span></tt> file. The implementation
supported neither Unicode nor the ungettext function and was therefore dropped in
Pylons 0.9.4.</p>

678

<p>You will therefore need to use an external program to perform these tasks. You

679

may use whichever you prefer but <tt class="docutils literal"><span class="pre">xgettext</span></tt> is highly recommended. Python's

680

gettext utility has some bugs, especially regarding plurals.</p>

681

<p>Here are some compatible tools and projects:</p>

682

<p>The Rosetta Project (<a href="https://launchpad.ubuntu.com/rosetta/" class="reference">https://launchpad.ubuntu.com/rosetta/</a>)</p>

683

684

The Ubuntu Linux project has a web site that allows you to translate

685

messages without even looking at a PO or POT file, and export directly to a MO.</blockquote>

686

<p>poEdit (<a href="http://www.poedit.org/" class="reference">http://www.poedit.org/</a>)</p>

687

688

An open source program for Windows and UNIX/Linux which provides an easy-to-use

689

GUI for editing PO files and generating MO files.</blockquote>

690

<p>KBabel (<a href="http://i18n.kde.org/tools/kbabel/" class="reference">http://i18n.kde.org/tools/kbabel/</a>)</p>

691

692

Another open source PO editing program for KDE.</blockquote>

693

<p>GNU Gettext (<a href="http://www.gnu.org/software/gettext/" class="reference">http://www.gnu.org/software/gettext/</a>)</p>

694

695

The official Gettext tools package contains command-line tools for creating

696

POTs, manipulating POs, and generating MOs. For those comfortable with a

697

command shell.</blockquote>

698

<p>As an example we will quickly discuss the use of poEdit which is cross platform

699

and has a GUI which makes it easier to get started with.</p>

700

<p>To use poEdit with the <tt class="docutils literal"><span class="pre">translate_demo</span></tt> you would do the following:</p>

701

702

<li>Download and install poEdit.</li>

703

<li>A dialog pops up. Fill in <em>all</em> the fields you can on the <tt class="docutils literal"><span class="pre">Project</span> <span class="pre">Info</span></tt> tab, enter the path to your project on the <tt class="docutils literal"><span class="pre">Paths</span></tt> tab (ie <tt class="docutils literal"><span class="pre">/path/to/translate_demo</span></tt>) and enter the following keywords on separate lines on the <tt class="docutils literal"><span class="pre">keywords</span></tt> tab: <tt class="docutils literal"><span class="pre">_</span></tt>, <tt class="docutils literal"><span class="pre">N_</span></tt>, <tt class="docutils literal"><span class="pre">ugettext</span></tt>, <tt class="docutils literal"><span class="pre">gettext</span></tt>, <tt class="docutils literal"><span class="pre">ngettext</span></tt>, <tt class="docutils literal"><span class="pre">ungettext</span></tt>.</li>

704

<li>Click OK</li>

705

</ol>

706

<p>poEdit will search your source tree and find all the strings you have marked

707

up. You can then enter your translations in whatever charset you chose in

708

the project info tab. UTF-8 is a good choice.</p>

709

<p>Finally, after entering your translations you then save the catalog and rename

710

the <tt class="docutils literal"><span class="pre">.mo</span></tt> file produced to <tt class="docutils literal"><span class="pre">translate_demo.mo</span></tt> and put it in the

711

<tt class="docutils literal"><span class="pre">translate_demo/i18n/es/LC_MESSAGES</span></tt> directory or whatever is appropriate for

712

your translation.</p>

713

<p>You will need to repeat the process of creating a <tt class="docutils literal"><span class="pre">.mo</span></tt> file for the <tt class="docutils literal"><span class="pre">fr</span></tt>,

714

<tt class="docutils literal"><span class="pre">es</span></tt> and <tt class="docutils literal"><span class="pre">en</span></tt> translations.</p>

715

<p>The relevant lines from <tt class="docutils literal"><span class="pre">i18n/en/LC_MESSAGES/translate_demo.po</span></tt> look like this:</p>

#: translate_demo\controllers\hello.py:6 translate_demo\controllers\hello.py:9
msgid "Hello"
msgstr "Hello"

</pre>

721

<p>The relevant lines from <tt class="docutils literal"><span class="pre">i18n/es/LC_MESSAGES/translate_demo.po</span></tt> look like this:</p>

#: translate_demo\controllers\hello.py:6 translate_demo\controllers\hello.py:9
msgid "Hello"
msgstr "¡Hola!"

</pre>

727

<p>The relevant lines from <tt class="docutils literal"><span class="pre">i18n/fr/LC_MESSAGES/translate_demo.po</span></tt> look like this:</p>

#: translate_demo\controllers\hello.py:6 translate_demo\controllers\hello.py:9
msgid "Hello"
msgstr "Bonjour"

</pre>

733

<p>Whichever tools you use you should end up with an <tt class="docutils literal"><span class="pre">i18n</span></tt> directory that looks

734

like this when you have finished:</p>

i18n/en/LC_MESSAGES/translate_demo.po
i18n/en/LC_MESSAGES/translate_demo.mo
i18n/es/LC_MESSAGES/translate_demo.po
i18n/es/LC_MESSAGES/translate_demo.mo
i18n/fr/LC_MESSAGES/translate_demo.po
i18n/fr/LC_MESSAGES/translate_demo.mo

</pre>
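<p>The layout above is the one Python's standard <tt class="docutils literal"><span class="pre">gettext</span></tt> module searches. As a sketch of that lookup, assuming a throwaway directory in place of your project's <tt class="docutils literal"><span class="pre">i18n</span></tt> directory, <tt class="docutils literal"><span class="pre">gettext.find()</span></tt> returns the path of the catalog for a language or <tt class="docutils literal"><span class="pre">None</span></tt> when there is none. How Pylons itself locates catalogs is not shown here.</p>

```python
import gettext
import os
import tempfile

# Build a throwaway directory laid out like the i18n tree above.
localedir = tempfile.mkdtemp()
messages_dir = os.path.join(localedir, "es", "LC_MESSAGES")
os.makedirs(messages_dir)

# An empty placeholder is enough here: gettext.find() only checks that the file exists.
open(os.path.join(messages_dir, "translate_demo.mo"), "wb").close()

found = gettext.find("translate_demo", localedir, languages=["es"])
missing = gettext.find("translate_demo", localedir, languages=["fr"])
```

<p>A check like this can help when a translation silently fails to appear, since a misplaced or misnamed <tt class="docutils literal"><span class="pre">.mo</span></tt> file is a common cause.</p>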

743

</div>

744

745

<h2><a href="#id14" id="testing-the-application" name="testing-the-application" class="toc-backref">3.2 Testing the Application</a></h2>

746

<p>Start the server with the following command:</p>

747

748

paster serve --reload development.ini

749

</pre>

750

<p>Test your controller by visiting <a href="http://localhost:5000/hello" class="reference">http://localhost:5000/hello</a>. You should see

751

the following output:</p>

Default: Hello
fr: Bonjour
en: Hello
es: ¡Hola!

</pre>

758

<p>You can now set the language used in a controller on the fly.</p>

759

<p>For example this could be used to allow a user to set which language they

760

wanted your application to work in. You could save the value to the session

761

object:</p>

762

763

session['lang'] = 'en'

764

</textarea><p>then on each controller call the language to be used could be read from the

765

session and set in your controller's <tt class="docutils literal"><span class="pre">__before__()</span></tt> method so that the pages

766

remained in the same language that was previously set:</p>

def __before__(self, action):
    if session.has_key('lang'):
        h.set_lang(session['lang'])

</textarea><p>One more useful thing to be able to do is to set the default language to be

772

used in the configuration file. Just add a <tt class="docutils literal"><span class="pre">lang</span></tt> variable together with the

773

code of the language you wanted to use in your <tt class="docutils literal"><span class="pre">development.ini</span></tt> file. For

774

example to set the default language to Spanish you would add <tt class="docutils literal"><span class="pre">lang</span> <span class="pre">=</span> <span class="pre">es</span></tt> to

775

your <tt class="docutils literal"><span class="pre">development.ini</span></tt>. The relevant part from the file might look something

776

like this:</p>

777

778

[app:main]

779

use = egg:translate_demo

780

lang = es

781

</textarea><p>If you are running the server with the <tt class="docutils literal"><span class="pre">--reload</span></tt> option the server will

782

automatically restart if you change the <tt class="docutils literal"><span class="pre">development.ini</span></tt> file. Otherwise

783

restart the server manually and the output would this time be as follows:</p>

Default: ¡Hola!
fr: Bonjour
en: Hello
es: ¡Hola!

</pre>
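<p>The two sources of a language described in this section, the user's session and the configured default, can be combined in one small function. This is only a sketch: <tt class="docutils literal"><span class="pre">pick_language</span></tt> and <tt class="docutils literal"><span class="pre">SUPPORTED</span></tt> are hypothetical names, and plain dictionaries stand in for the Pylons session and configuration objects.</p>

```python
# Hypothetical names; plain dicts stand in for the session and the config.
SUPPORTED = ("en", "es", "fr")


def pick_language(session, config, supported=SUPPORTED):
    """Prefer the user's saved language, else the configured default."""
    lang = session.get("lang")
    if lang in supported:
        return lang
    # Fall back to the ``lang`` option from the .ini file, if any.
    return config.get("lang")


from_session = pick_language({"lang": "fr"}, {"lang": "es"})
from_config = pick_language({}, {"lang": "es"})
unsupported = pick_language({"lang": "xx"}, {"lang": "es"})
```

<p>Checking the saved value against a list of supported languages avoids asking for a catalog that does not exist.</p>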

790

</div>

791

792

<h2><a href="#id15" id="missing-translations" name="missing-translations" class="toc-backref">3.3 Missing Translations</a></h2>

793

<p>If your code calls <tt class="docutils literal"><span class="pre">_()</span></tt> with a string that doesn't exist in your language

794

catalogue, the string passed to <tt class="docutils literal"><span class="pre">_()</span></tt> is returned instead.</p>

795

<p>Modify the last line of the hello controller to look like this:</p>

796

797

resp.write("%s: %s %s<br />" % (h.get_lang(), _('Hello'), _('World!')))

798

</textarea><div class="warning">

799

<p class="first admonition-title">Warning</p>

800

<p class="last">Of course, in real life breaking up sentences in this way is very dangerous because some

801

grammars might require the order of the words to be different.</p>

802

</div>

803

<p>If you run the example again the output will be:</p>

Default: ¡Hola!
fr: Bonjour World!
en: Hello World!
es: ¡Hola! World!

</pre>

810

<p>This is because we never provided a translation for the string <tt class="docutils literal"><span class="pre">'World!'</span></tt> so

811

the string itself is used.</p>
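<p>The same fallback can be seen with Python's standard <tt class="docutils literal"><span class="pre">gettext</span></tt> module, shown here as a sketch independent of Pylons. The empty temporary directory is an assumption standing in for a project with no catalogs; with <tt class="docutils literal"><span class="pre">fallback=True</span></tt> the module hands back a translator that returns every string unchanged.</p>

```python
import gettext
import tempfile

# An empty directory: no catalog exists for any language.
empty_localedir = tempfile.mkdtemp()

# fallback=True yields a NullTranslations object instead of raising an error.
translator = gettext.translation(
    "translate_demo", empty_localedir, languages=["es"], fallback=True
)
_ = translator.gettext

result = _("World!")
```

<p>This behaviour means a partly translated application keeps working, at the cost of mixed-language output like that shown above.</p>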

812

</div>

813

814

<h2><a href="#id16" id="translations-within-templates" name="translations-within-templates" class="toc-backref">3.4 Translations Within Templates</a></h2>

815

<p>You can also use the <tt class="docutils literal"><span class="pre">_()</span></tt> function within templates in exactly the same way

816

you do in code. For example:</p>

817

818

<% _('Hello') %>

819

</textarea><p>would produce the string <tt class="docutils literal"><span class="pre">'Hello'</span></tt> in the language you had set.</p>

820

<p>There is one complication though. gettext's <tt class="docutils literal"><span class="pre">xgettext</span></tt> command can only extract

821

strings that need translating from Python code in <tt class="docutils literal"><span class="pre">.py</span></tt> files. This means

822

that if you write <tt class="docutils literal"><span class="pre">_('Hello')</span></tt> in a template such as a Myghty template,

823

<tt class="docutils literal"><span class="pre">xgettext</span></tt> will not find the string <tt class="docutils literal"><span class="pre">'Hello'</span></tt> as one which needs

824

translating.</p>

825

<p>Translations are looked up by the string itself, so as long as <tt class="docutils literal"><span class="pre">xgettext</span></tt> can find a string
marked for translation with one of the translation functions somewhere in the Python code
in your project's filesystem, the translation will also be used when the same string is
marked for translation in a Myghty template.</p>

829

<p>One solution to ensure all strings are picked up for translation is to create a

830

file in <tt class="docutils literal"><span class="pre">lib</span></tt> with an appropriate filename, <tt class="docutils literal"><span class="pre">i18n.py</span></tt> for example, and then

831

add a list of all the strings which appear in your templates so that your

832

translation tool can then extract the strings in <tt class="docutils literal"><span class="pre">lib/i18n.py</span></tt> for

833

translation and use the translated versions in your templates as well.</p>

834

<p>For example if you wanted to ensure the translated string <tt class="docutils literal"><span class="pre">'Good</span> <span class="pre">Morning'</span></tt>

835

was available in all templates you could create a <tt class="docutils literal"><span class="pre">lib/i18n.py</span></tt> file that

836

looked something like this:</p>

837

838

from base import _

839

_('Good Morning')

840

</textarea><p>This approach requires quite a lot of work and is rather fragile. The best

841

solution if you are using a templating system such as Myghty or Cheetah which

842

uses compiled Python files is to use a Makefile to ensure that every template

843

is compiled to Python before running the extraction tool to make sure that

844

every template is scanned.</p>
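<p>If you do take the <tt class="docutils literal"><span class="pre">lib/i18n.py</span></tt> route, the list can be generated rather than maintained by hand. The sketch below is an assumption-laden illustration, not an existing tool: it uses a simple regular expression to find <tt class="docutils literal"><span class="pre">_('...')</span></tt> calls in template text and prints a module in the style shown earlier. It does not handle escaped quotes or strings split across lines.</p>

```python
import re

# Matches _('...') or _("..."); escaped quotes and multi-line strings are not handled.
MARKED_STRING = re.compile(r"""(?<![A-Za-z0-9_])_\(\s*(['"])(.*?)\1\s*\)""")


def find_marked_strings(template_source):
    """Return the distinct strings passed to _() in template text, in order."""
    found = []
    for match in MARKED_STRING.finditer(template_source):
        text = match.group(2)
        if text not in found:
            found.append(text)
    return found


def render_i18n_module(strings):
    """Return source for a lib/i18n.py style module that marks each string."""
    lines = ["from base import _"]
    lines.extend("_(%r)" % text for text in strings)
    return "\n".join(lines) + "\n"


template = "<h1><% _('Good Morning') %></h1><p><% _(\"Hello\") %> <% _('Hello') %></p>"
strings = find_marked_strings(template)
module_source = render_i18n_module(strings)
print(module_source)
```

<p>Compiling the templates first remains the more robust option, because the extraction tool then sees exactly the Python that will run.</p>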

845

<p>Of course, if your cache directory is in the default location or elsewhere

846

within your project's filesystem, you will probably find that all templates

847

have been compiled as Python files during the course of the development process.

848

This means that your tool's extraction command will successfully pick up

849

strings to translate from the cached files anyway.</p>

850

<p>You may also find that your extraction tool is capable of extracting the
strings correctly from the template anyway, particularly if the templating
language is quite similar to Python. It is best not to rely on this though.</p>

853

</div>

854

855

<h2><a href="#id17" id="producing-a-python-egg" name="producing-a-python-egg" class="toc-backref">3.5 Producing a Python Egg</a></h2>

856

<p>Finally you can produce an egg of your project which includes the translation

857

files like this:</p>

858

859

python setup.py bdist_egg

860

</pre>

861

<p>The <tt class="docutils literal"><span class="pre">setup.py</span></tt> automatically includes the <tt class="docutils literal"><span class="pre">.mo</span></tt> language catalogs your

862

application needs so that your application can be distributed as an egg. This

863

is done with the following line in your <tt class="docutils literal"><span class="pre">setup.py</span></tt> file:</p>

864

865

package_data={'translate_demo': ['i18n/*/LC_MESSAGES/*.mo']},

866

</pre>

867

<p>Internationalization support is zip safe so your application can be run

868

directly from the egg without the need for <tt class="docutils literal"><span class="pre">easy_install</span></tt> to extract it.</p>
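<p>To see which files a pattern like the one in <tt class="docutils literal"><span class="pre">package_data</span></tt> selects, you can expand it yourself. The sketch below approximates this with the standard <tt class="docutils literal"><span class="pre">glob</span></tt> module against a temporary directory standing in for the <tt class="docutils literal"><span class="pre">translate_demo</span></tt> package; it is not how setuptools is invoked and is meant only as a sanity check.</p>

```python
import glob
import os
import tempfile

# A temporary stand-in for the translate_demo package directory.
package_dir = tempfile.mkdtemp()
for lang in ("en", "es", "fr"):
    messages_dir = os.path.join(package_dir, "i18n", lang, "LC_MESSAGES")
    os.makedirs(messages_dir)
    for ext in (".po", ".mo"):
        open(os.path.join(messages_dir, "translate_demo" + ext), "wb").close()

# Same shape as the package_data pattern: i18n/*/LC_MESSAGES/*.mo
pattern = os.path.join(package_dir, "i18n", "*", "LC_MESSAGES", "*.mo")
matched = sorted(os.path.relpath(path, package_dir) for path in glob.glob(pattern))
```

<p>Only the compiled <tt class="docutils literal"><span class="pre">.mo</span></tt> catalogs match, so the <tt class="docutils literal"><span class="pre">.po</span></tt> sources stay out of the egg unless you add a pattern for them.</p>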

869

</div>

870

871

<h2><a href="#id18" id="plural-forms" name="plural-forms" class="toc-backref">3.6 Plural Forms</a></h2>

872

<p>Pylons also defines <tt class="docutils literal"><span class="pre">ungettext()</span></tt> and <tt class="docutils literal"><span class="pre">ngettext()</span></tt> functions which can be imported

873

from <tt class="docutils literal"><span class="pre">pylons.i18n</span></tt>. They are designed for internationalizing plural words and can be

874

used as follows:</p>

from pylons.i18n import ungettext

ungettext(
    'There is %(num)d file here',
    'There are %(num)d files here',
    n
) % {'num': n}

</textarea><p>If you wish to use plural forms in your application you need to add the appropriate

884

headers to the <tt class="docutils literal"><span class="pre">.po</span></tt> files for the language you are using. You can read more about

885

this at <a href="http://www.gnu.org/software/gettext/manual/html_chapter/gettext_10.html#SEC150" class="reference">http://www.gnu.org/software/gettext/manual/html_chapter/gettext_10.html#SEC150</a></p>
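<p>The header in question is <tt class="docutils literal"><span class="pre">Plural-Forms</span></tt>, which gives the number of forms and an expression that maps a count to the index of the form to use. The sketch below restates two such expressions as Python functions. The expressions in the comments are written from memory of the GNU gettext manual and should be checked against it before being copied into a <tt class="docutils literal"><span class="pre">.po</span></tt> file; the function names are made up.</p>

```python
def english_plural_index(n):
    # Plural-Forms: nplurals=2; plural=(n != 1);
    return int(n != 1)


def slovenian_plural_index(n):
    # Plural-Forms: nplurals=4;
    #   plural=(n%100==1 ? 0 : n%100==2 ? 1 : n%100==3 || n%100==4 ? 2 : 3);
    last_two = n % 100
    if last_two == 1:
        return 0
    if last_two == 2:
        return 1
    if last_two in (3, 4):
        return 2
    return 3


english = [english_plural_index(n) for n in (0, 1, 2)]
slovenian = [slovenian_plural_index(n) for n in (1, 2, 3, 4, 5, 101)]
```

<p>The translator then supplies one <tt class="docutils literal"><span class="pre">msgstr</span></tt> per index, and gettext picks the right one from the count passed at run time.</p>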

886

<p>One thing to keep in mind is that other languages don't have the same
plural forms as English. While English only has two forms, singular and
plural, Slovenian has four. That means that you must use gettext's
support for pluralization if you hope to get pluralization right.
Specifically, the following will not work:</p>

# BAD!
if n == 1:
    msg = _("There was no dog.")
else:
    msg = _("There were no dogs.")

</textarea></div>
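<p>By contrast with a hand-written branch, the count can be passed to the plural function so that the catalog decides which form applies. The sketch below uses <tt class="docutils literal"><span class="pre">ngettext</span></tt> from Python's standard <tt class="docutils literal"><span class="pre">gettext</span></tt> module with no catalog loaded, as a stand-in for the Pylons functions; with no catalog it simply applies the English rule. The function name <tt class="docutils literal"><span class="pre">files_message</span></tt> is hypothetical.</p>

```python
import gettext

# With no catalog loaded, NullTranslations uses the singular only when n == 1.
ngettext = gettext.NullTranslations().ngettext


def files_message(n):
    """Build a correctly pluralized message for n files."""
    template = ngettext(
        'There is %(num)d file here',
        'There are %(num)d files here',
        n,
    )
    return template % {'num': n}


one = files_message(1)
many = files_message(3)
```

<p>Once a catalog with a plural header is installed, the same call returns whichever of that language's forms matches the count, with no change to the calling code.</p>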

898

</div>

899

900

<h1><a href="#id19" id="summary" name="summary" class="toc-backref">4 Summary</a></h1>

901

<p>Hopefully you now understand the history of Unicode, how to use it in Python
and where to apply Unicode encoding and decoding in a Pylons application. You
should also be able to use Unicode in your web app, remembering the basic rule:
use UTF-8 to talk to the world and do the encoding and decoding at the edge of your
application.</p>

906

<p>You should also be able to internationalize and then localize your application

907

using Pylons' support for GNU gettext.</p>

908

</div>

909

910

<h1><a href="#id20" id="further-reading" name="further-reading" class="toc-backref">5 Further Reading</a></h1>

911

<p>This information is based partly on the following articles which can be
consulted for further information:</p>

913

<p><a href="http://www.joelonsoftware.com/articles/Unicode.html" class="reference">http://www.joelonsoftware.com/articles/Unicode.html</a></p>

914

<p><a href="http://www.amk.ca/python/howto/unicode" class="reference">http://www.amk.ca/python/howto/unicode</a></p>

915

<p><a href="http://en.wikipedia.org/wiki/Internationalization" class="reference">http://en.wikipedia.org/wiki/Internationalization</a></p>

916

<p>Please feel free to report any mistakes to the Pylons mailing list or to the

917

author. Any corrections or clarifications would be gratefully received.</p>

918

</div>

919

920

</div>
