~ubuntu-branches/ubuntu/vivid/tesseract/vivid

Viewing changes to doc/html/a00657.html

  • Committer: Package Import Robot
  • Author(s): Jeff Breidenbach
  • Date: 2014-02-03 11:10:20 UTC
  • mfrom: (1.3.1) (19.1.1 experimental)
  • Revision ID: package-import@ubuntu.com-20140203111020-igquodd7pjlp3uri
Tags: 3.03.01-1
* New upstream release, includes critical fix to PDF rendering
* Complete leptonlib transition (see bug #735509)
* Promote from experimental to unstable

 
1
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
 
2
<html xmlns="http://www.w3.org/1999/xhtml">
 
3
<head>
 
4
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
 
5
<meta http-equiv="X-UA-Compatible" content="IE=9"/>
 
6
<title>tesseract: tesseract::WordUnigrams Class Reference</title>
 
7
 
 
8
<link href="tabs.css" rel="stylesheet" type="text/css"/>
 
9
<link href="doxygen.css" rel="stylesheet" type="text/css" />
 
10
<link href="navtree.css" rel="stylesheet" type="text/css"/>
 
11
<script type="text/javascript" src="jquery.js"></script>
 
12
<script type="text/javascript" src="resize.js"></script>
 
13
<script type="text/javascript" src="navtree.js"></script>
 
14
<script type="text/javascript">
 
15
  $(document).ready(initResizable);
 
16
</script>
 
17
<link href="search/search.css" rel="stylesheet" type="text/css"/>
 
18
<script type="text/javascript" src="search/search.js"></script>
 
19
<script type="text/javascript">
 
20
  $(document).ready(function() { searchBox.OnSelectItem(0); });
 
21
</script>
 
22
 
 
23
</head>
 
24
<body>
 
25
<div id="top"><!-- do not remove this div! -->
 
26
 
 
27
 
 
28
<div id="titlearea">
 
29
<table cellspacing="0" cellpadding="0">
 
30
 <tbody>
 
31
 <tr style="height: 56px;">
 
32
  
 
33
  
 
34
  <td style="padding-left: 0.5em;">
 
35
   <div id="projectname">tesseract
 
36
   &#160;<span id="projectnumber">3.03</span>
 
37
   </div>
 
38
   
 
39
  </td>
 
40
  
 
41
  
 
42
  
 
43
 </tr>
 
44
 </tbody>
 
45
</table>
 
46
</div>
 
47
 
 
48
<!-- Generated by Doxygen 1.7.6.1 -->
 
49
<script type="text/javascript">
 
50
var searchBox = new SearchBox("searchBox", "search",false,'Search');
 
51
</script>
 
52
  <div id="navrow1" class="tabs">
 
53
    <ul class="tablist">
 
54
      <li><a href="index.html"><span>Main&#160;Page</span></a></li>
 
55
      <li><a href="pages.html"><span>Related&#160;Pages</span></a></li>
 
56
      <li><a href="modules.html"><span>Modules</span></a></li>
 
57
      <li><a href="namespaces.html"><span>Namespaces</span></a></li>
 
58
      <li class="current"><a href="annotated.html"><span>Classes</span></a></li>
 
59
      <li><a href="files.html"><span>Files</span></a></li>
 
60
      <li>
 
61
        <div id="MSearchBox" class="MSearchBoxInactive">
 
62
        <span class="left">
 
63
          <img id="MSearchSelect" src="search/mag_sel.png"
 
64
               onmouseover="return searchBox.OnSearchSelectShow()"
 
65
               onmouseout="return searchBox.OnSearchSelectHide()"
 
66
               alt=""/>
 
67
          <input type="text" id="MSearchField" value="Search" accesskey="S"
 
68
               onfocus="searchBox.OnSearchFieldFocus(true)" 
 
69
               onblur="searchBox.OnSearchFieldFocus(false)" 
 
70
               onkeyup="searchBox.OnSearchFieldChange(event)"/>
 
71
          </span><span class="right">
 
72
            <a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
 
73
          </span>
 
74
        </div>
 
75
      </li>
 
76
    </ul>
 
77
  </div>
 
78
  <div id="navrow2" class="tabs2">
 
79
    <ul class="tablist">
 
80
      <li><a href="annotated.html"><span>Class&#160;List</span></a></li>
 
81
      <li><a href="hierarchy.html"><span>Class&#160;Hierarchy</span></a></li>
 
82
      <li><a href="functions.html"><span>Class&#160;Members</span></a></li>
 
83
    </ul>
 
84
  </div>
 
85
</div>
 
86
<div id="side-nav" class="ui-resizable side-nav-resizable">
 
87
  <div id="nav-tree">
 
88
    <div id="nav-tree-contents">
 
89
    </div>
 
90
  </div>
 
91
  <div id="splitbar" style="-moz-user-select:none;" 
 
92
       class="ui-resizable-handle">
 
93
  </div>
 
94
</div>
 
95
<script type="text/javascript">
 
96
  initNavTree('a00657.html','');
 
97
</script>
 
98
<div id="doc-content">
 
99
<div class="header">
 
100
  <div class="summary">
 
101
<a href="#pub-methods">Public Member Functions</a> &#124;
 
102
<a href="#pub-static-methods">Static Public Member Functions</a> &#124;
 
103
<a href="#pro-methods">Protected Member Functions</a>  </div>
 
104
  <div class="headertitle">
 
105
<div class="title">tesseract::WordUnigrams Class Reference</div>  </div>
 
106
</div><!--header-->
 
107
<div class="contents">
 
108
<!-- doxytag: class="tesseract::WordUnigrams" -->
 
109
<p><code>#include &lt;<a class="el" href="a01021_source.html">word_unigrams.h</a>&gt;</code></p>
 
110
 
 
111
<p><a href="a01857.html">List of all members.</a></p>
 
112
<table class="memberdecls">
 
113
<tr><td colspan="2"><h2><a name="pub-methods"></a>
 
114
Public Member Functions</h2></td></tr>
 
115
<tr><td class="memItemLeft" align="right" valign="top">&#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="a00657.html#afd00490f957dd1842384c372a2949141">WordUnigrams</a> ()</td></tr>
 
116
<tr><td class="memItemLeft" align="right" valign="top">&#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="a00657.html#a60f14b6354f60c7411695103bc422601">~WordUnigrams</a> ()</td></tr>
 
117
<tr><td class="memItemLeft" align="right" valign="top">int&#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="a00657.html#a298b0661e1327c8dcb27bfdd790d92d3">Cost</a> (const <a class="el" href="a01265.html#aea2c6172b0ca77907e29cd018595b425">char_32</a> *str32, <a class="el" href="a00445.html">LangModel</a> *lang_mod, <a class="el" href="a00309.html">CharSet</a> *char_set) const </td></tr>
 
118
<tr><td colspan="2"><h2><a name="pub-static-methods"></a>
 
119
Static Public Member Functions</h2></td></tr>
 
120
<tr><td class="memItemLeft" align="right" valign="top">static <a class="el" href="a00657.html">WordUnigrams</a> *&#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="a00657.html#a7e57ac1c4afd17adaeb8abb06351dc76">Create</a> (const string &amp;data_file_path, const string &amp;lang)</td></tr>
 
121
<tr><td colspan="2"><h2><a name="pro-methods"></a>
 
122
Protected Member Functions</h2></td></tr>
 
123
<tr><td class="memItemLeft" align="right" valign="top">int&#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="a00657.html#a14623fdba39beee456c834cdc6a8976f">CostInternal</a> (const char *str) const </td></tr>
 
124
</table>
 
125
<hr/><a name="details" id="details"></a><h2>Detailed Description</h2>
 
126
<div class="textblock">
 
127
<p>Definition at line <a class="el" href="a01021_source.html#l00034">34</a> of file <a class="el" href="a01021_source.html">word_unigrams.h</a>.</p>
 
128
</div><hr/><h2>Constructor &amp; Destructor Documentation</h2>
 
129
<a class="anchor" id="afd00490f957dd1842384c372a2949141"></a><!-- doxytag: member="tesseract::WordUnigrams::WordUnigrams" ref="afd00490f957dd1842384c372a2949141" args="()" -->
 
130
<div class="memitem">
 
131
<div class="memproto">
 
132
      <table class="memname">
 
133
        <tr>
 
134
          <td class="memname"><a class="el" href="a00657.html#afd00490f957dd1842384c372a2949141">tesseract::WordUnigrams::WordUnigrams</a> </td>
 
135
          <td>(</td>
 
136
          <td class="paramname"></td><td>)</td>
 
137
          <td></td>
 
138
        </tr>
 
139
      </table>
 
140
</div>
 
141
<div class="memdoc">
 
142
 
 
143
<p>Definition at line <a class="el" href="a01020_source.html#l00032">32</a> of file <a class="el" href="a01020_source.html">word_unigrams.cpp</a>.</p>
 
144
<div class="fragment"><pre class="fragment">                           {
 
145
  costs_ = NULL;
 
146
  words_ = NULL;
 
147
  word_cnt_ = 0;
 
148
}
 
149
</pre></div>
 
150
</div>
 
151
</div>
 
152
<a class="anchor" id="a60f14b6354f60c7411695103bc422601"></a><!-- doxytag: member="tesseract::WordUnigrams::~WordUnigrams" ref="a60f14b6354f60c7411695103bc422601" args="()" -->
 
153
<div class="memitem">
 
154
<div class="memproto">
 
155
      <table class="memname">
 
156
        <tr>
 
157
          <td class="memname"><a class="el" href="a00657.html#a60f14b6354f60c7411695103bc422601">tesseract::WordUnigrams::~WordUnigrams</a> </td>
 
158
          <td>(</td>
 
159
          <td class="paramname"></td><td>)</td>
 
160
          <td></td>
 
161
        </tr>
 
162
      </table>
 
163
</div>
 
164
<div class="memdoc">
 
165
 
 
166
<p>Definition at line <a class="el" href="a01020_source.html#l00038">38</a> of file <a class="el" href="a01020_source.html">word_unigrams.cpp</a>.</p>
 
167
<div class="fragment"><pre class="fragment">                            {
 
168
  <span class="keywordflow">if</span> (words_ != NULL) {
 
169
    <span class="keywordflow">if</span> (words_[0] != NULL) {
 
170
      <span class="keyword">delete</span> []words_[0];
 
171
    }
 
172
 
 
173
    <span class="keyword">delete</span> []words_;
 
174
    words_ = NULL;
 
175
  }
 
176
 
 
177
  <span class="keywordflow">if</span> (costs_ != NULL) {
 
178
    <span class="keyword">delete</span> []costs_;
 
179
  }
 
180
}
 
181
</pre></div>
 
182
</div>
 
183
</div>
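<p>The destructor above frees only <code>words_[0]</code> and the pointer array, which implies that all word text lives in one contiguous buffer whose start is <code>words_[0]</code>. The sketch below illustrates that layout with standard containers, so no manual <code>delete []</code> is needed. The class name <code>PackedWords</code> and its members are hypothetical and are not part of tesseract.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical illustration (not tesseract code): packs words back to back,
// each NUL-terminated, into one buffer and records where each word starts,
// so a single allocation owns all of the text.
class PackedWords {
 public:
  explicit PackedWords(const std::vector<std::string>& words) {
    std::size_t total = 0;
    for (const std::string& w : words) total += w.size() + 1;
    buffer_.resize(total);  // sized once, so the stored pointers stay valid
    char* cursor = buffer_.data();
    for (const std::string& w : words) {
      starts_.push_back(cursor);
      std::memcpy(cursor, w.c_str(), w.size() + 1);  // copies the NUL too
      cursor += w.size() + 1;
    }
  }

  // Copying would leave starts_ pointing into the other object's buffer.
  PackedWords(const PackedWords&) = delete;
  PackedWords& operator=(const PackedWords&) = delete;

  const char* word(std::size_t i) const {
    assert(i < starts_.size());
    return starts_[i];
  }
  std::size_t size() const { return starts_.size(); }

 private:
  std::vector<char> buffer_;         // plays the role of words_[0]
  std::vector<const char*> starts_;  // plays the role of words_
};
```

<p>With this layout, <code>word(i + 1)</code> always starts exactly <code>strlen(word(i)) + 1</code> bytes after <code>word(i)</code>, which is the same invariant the pointer arithmetic in <code>Create</code> relies on.</p>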
 
184
<hr/><h2>Member Function Documentation</h2>
 
185
<a class="anchor" id="a298b0661e1327c8dcb27bfdd790d92d3"></a><!-- doxytag: member="tesseract::WordUnigrams::Cost" ref="a298b0661e1327c8dcb27bfdd790d92d3" args="(const char_32 *str32, LangModel *lang_mod, CharSet *char_set) const " -->
 
186
<div class="memitem">
 
187
<div class="memproto">
 
188
      <table class="memname">
 
189
        <tr>
 
190
          <td class="memname">int <a class="el" href="a00657.html#a298b0661e1327c8dcb27bfdd790d92d3">tesseract::WordUnigrams::Cost</a> </td>
 
191
          <td>(</td>
 
192
          <td class="paramtype">const <a class="el" href="a01265.html#aea2c6172b0ca77907e29cd018595b425">char_32</a> *&#160;</td>
 
193
          <td class="paramname"><em>str32</em>, </td>
 
194
        </tr>
 
195
        <tr>
 
196
          <td class="paramkey"></td>
 
197
          <td></td>
 
198
          <td class="paramtype"><a class="el" href="a00445.html">LangModel</a> *&#160;</td>
 
199
          <td class="paramname"><em>lang_mod</em>, </td>
 
200
        </tr>
 
201
        <tr>
 
202
          <td class="paramkey"></td>
 
203
          <td></td>
 
204
          <td class="paramtype"><a class="el" href="a00309.html">CharSet</a> *&#160;</td>
 
205
          <td class="paramname"><em>char_set</em>&#160;</td>
 
206
        </tr>
 
207
        <tr>
 
208
          <td></td>
 
209
          <td>)</td>
 
210
          <td></td><td> const</td>
 
211
        </tr>
 
212
      </table>
 
213
</div>
 
214
<div class="memdoc">
 
215
 
 
216
<p>Definition at line <a class="el" href="a01020_source.html#l00150">150</a> of file <a class="el" href="a01020_source.html">word_unigrams.cpp</a>.</p>
 
217
<div class="fragment"><pre class="fragment">                                                {
 
218
  <span class="keywordflow">if</span> (!key_str32)
 
219
    <span class="keywordflow">return</span> 0;
 
220
  <span class="comment">// convert string to UTF8 to split into space-separated words</span>
 
221
  <span class="keywordtype">string</span> key_str;
 
222
  <a class="code" href="a01265.html#a34f64417c417283fbb3cbdd220e77ae2">CubeUtils::UTF32ToUTF8</a>(key_str32, &amp;key_str);
 
223
  vector&lt;string&gt; words;
 
224
  <a class="code" href="a00343.html#af7dea4521db1e7099c93606d0f5bf4a4">CubeUtils::SplitStringUsing</a>(key_str, <span class="stringliteral">&quot; \t&quot;</span>, &amp;words);
 
225
 
 
226
  <span class="comment">// no words =&gt; no cost</span>
 
227
  <span class="keywordflow">if</span> (words.empty()) {
 
228
    <span class="keywordflow">return</span> 0;
 
229
  }
 
230
 
 
231
  <span class="comment">// aggregate the costs of all the words</span>
 
232
  <span class="keywordtype">int</span> cost = 0;
 
233
  <span class="keywordflow">for</span> (<span class="keywordtype">int</span> word_idx = 0; word_idx &lt; words.size(); word_idx++) {
 
234
    <span class="comment">// convert each word back to UTF32 for analyzing case and punctuation</span>
 
235
    <a class="code" href="a01265.html#a5e51e225e3ac13c918ed9f77715fa041">string_32</a> str32;
 
236
    <a class="code" href="a01265.html#a903ffe993b52191b0b18a019dd790276">CubeUtils::UTF8ToUTF32</a>(words[word_idx].c_str(), &amp;str32);
 
237
    <span class="keywordtype">int</span> len = <a class="code" href="a00343.html#a88fe596e3dcadab7909c0bff64f61f59">CubeUtils::StrLen</a>(str32.c_str());
 
238
 
 
239
    <span class="comment">// strip all trailing punctuation</span>
 
240
    <span class="keywordtype">string</span> clean_str;
 
241
    <span class="keywordtype">int</span> clean_len = len;
 
242
    <span class="keywordtype">bool</span> trunc = <span class="keyword">false</span>;
 
243
    <span class="keywordflow">while</span> (clean_len &gt; 0 &amp;&amp;
 
244
           lang_mod-&gt;IsTrailingPunc(str32.c_str()[clean_len - 1])) {
 
245
      --clean_len;
 
246
      trunc = <span class="keyword">true</span>;
 
247
    }
 
248
 
 
249
    <span class="comment">// If either the original string was not truncated (no trailing</span>
 
250
    <span class="comment">// punctuation) or the entire string was removed (all characters</span>
 
251
    <span class="comment">// are trailing punctuation), evaluate original word as is;</span>
 
252
    <span class="comment">// otherwise, copy all but the trailing punctuation characters</span>
 
253
    <a class="code" href="a01265.html#aea2c6172b0ca77907e29cd018595b425">char_32</a> *clean_str32 = NULL;
 
254
    <span class="keywordflow">if</span> (clean_len == 0 || !trunc) {
 
255
      clean_str32 = <a class="code" href="a00343.html#a2cea16ea1fc9c8d1020c159617ed90e3">CubeUtils::StrDup</a>(str32.c_str());
 
256
    } <span class="keywordflow">else</span> {
 
257
      clean_str32 = <span class="keyword">new</span> <a class="code" href="a01265.html#aea2c6172b0ca77907e29cd018595b425">char_32</a>[clean_len + 1];
 
258
      <span class="keywordflow">for</span> (<span class="keywordtype">int</span> i = 0; i &lt; clean_len; ++i) {
 
259
        clean_str32[i] = str32[i];
 
260
      }
 
261
      clean_str32[clean_len] = <span class="charliteral">&#39;\0&#39;</span>;
 
262
    }
 
263
    <a class="code" href="a00823.html#a93a603f4063a6b9403d81caa245a583b">ASSERT_HOST</a>(clean_str32 != NULL);
 
264
 
 
265
    <span class="keywordtype">string</span> str8;
 
266
    <a class="code" href="a01265.html#a34f64417c417283fbb3cbdd220e77ae2">CubeUtils::UTF32ToUTF8</a>(clean_str32, &amp;str8);
 
267
    <span class="keywordtype">int</span> word_cost = <a class="code" href="a00657.html#a14623fdba39beee456c834cdc6a8976f">CostInternal</a>(str8.c_str());
 
268
 
 
269
    <span class="comment">// if case invariant, get costs of all-upper-case and all-lower-case</span>
 
270
    <span class="comment">// versions and return the min cost</span>
 
271
    <span class="keywordflow">if</span> (clean_len &gt;= kMinLengthNumOrCaseInvariant &amp;&amp;
 
272
        <a class="code" href="a00343.html#a0c67516e85144e0d736f30c21208aeda">CubeUtils::IsCaseInvariant</a>(clean_str32, char_set)) {
 
273
      <a class="code" href="a01265.html#aea2c6172b0ca77907e29cd018595b425">char_32</a> *lower_32 = <a class="code" href="a00343.html#ac051dbde8b019f824b1bd8ae8d69d10e">CubeUtils::ToLower</a>(clean_str32, char_set);
 
274
      <span class="keywordflow">if</span> (lower_32) {
 
275
        <span class="keywordtype">string</span> lower_8;
 
276
        <a class="code" href="a01265.html#a34f64417c417283fbb3cbdd220e77ae2">CubeUtils::UTF32ToUTF8</a>(lower_32, &amp;lower_8);
 
277
        word_cost = MIN(word_cost, <a class="code" href="a00657.html#a14623fdba39beee456c834cdc6a8976f">CostInternal</a>(lower_8.c_str()));
 
278
        <span class="keyword">delete</span> [] lower_32;
 
279
      }
 
280
      <a class="code" href="a01265.html#aea2c6172b0ca77907e29cd018595b425">char_32</a> *upper_32 = <a class="code" href="a00343.html#ac5f9453d9b30ec4343940ad428a4638d">CubeUtils::ToUpper</a>(clean_str32, char_set);
 
281
      <span class="keywordflow">if</span> (upper_32) {
 
282
        <span class="keywordtype">string</span> upper_8;
 
283
        <a class="code" href="a01265.html#a34f64417c417283fbb3cbdd220e77ae2">CubeUtils::UTF32ToUTF8</a>(upper_32, &amp;upper_8);
 
284
        word_cost = MIN(word_cost, <a class="code" href="a00657.html#a14623fdba39beee456c834cdc6a8976f">CostInternal</a>(upper_8.c_str()));
 
285
        <span class="keyword">delete</span> [] upper_32;
 
286
      }
 
287
    }
 
288
 
 
289
    <span class="keywordflow">if</span> (clean_len &gt;= kMinLengthNumOrCaseInvariant) {
 
290
      <span class="comment">// if characters are all numeric, incur 0 word cost</span>
 
291
      <span class="keywordtype">bool</span> is_numeric = <span class="keyword">true</span>;
 
292
      <span class="keywordflow">for</span> (<span class="keywordtype">int</span> i = 0; i &lt; clean_len; ++i) {
 
293
        <span class="keywordflow">if</span> (!lang_mod-&gt;IsDigit(clean_str32[i]))
 
294
          is_numeric = <span class="keyword">false</span>;
 
295
      }
 
296
      <span class="keywordflow">if</span> (is_numeric)
 
297
        word_cost = 0;
 
298
    }
 
299
    <span class="keyword">delete</span> [] clean_str32;
 
300
    cost += word_cost;
 
301
  }  <span class="comment">// word_idx</span>
 
302
 
 
303
  <span class="comment">// return the mean cost</span>
 
304
  <span class="keywordflow">return</span> <span class="keyword">static_cast&lt;</span><span class="keywordtype">int</span><span class="keyword">&gt;</span>(cost / <span class="keyword">static_cast&lt;</span><span class="keywordtype">double</span><span class="keyword">&gt;</span>(words.size()));
 
305
}
 
306
</pre></div>
 
307
</div>
 
308
</div>
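<p>The core of <code>Cost</code> is: split the input into words, strip trailing punctuation from each word unless that would change nothing or remove the whole word, give all-digit words a cost of 0, and return the truncated mean. The following is a minimal sketch of that flow, assuming plain ASCII input. <code>StripTrailingPunct</code> and <code>MeanCost</code> are hypothetical names, <code>std::ispunct</code> and <code>std::isdigit</code> stand in for <code>LangModel::IsTrailingPunc</code> and <code>LangModel::IsDigit</code>, and the minimum-length threshold and the case-invariant lookups of the real function are left out.</p>

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <functional>
#include <sstream>
#include <string>

// Removes trailing punctuation, but returns the word unchanged when it has
// no trailing punctuation or consists only of punctuation.
// Assumption: std::ispunct is an ASCII stand-in for LangModel::IsTrailingPunc.
std::string StripTrailingPunct(const std::string& word) {
  std::size_t clean_len = word.size();
  while (clean_len > 0 &&
         std::ispunct(static_cast<unsigned char>(word[clean_len - 1]))) {
    --clean_len;
  }
  if (clean_len == 0 || clean_len == word.size()) return word;
  return word.substr(0, clean_len);
}

// Splits `text` on whitespace, costs each cleaned word with `word_cost`
// (a hypothetical callback replacing CostInternal), treats all-digit words
// as free, and returns the mean cost truncated to int.
int MeanCost(const std::string& text,
             const std::function<int(const std::string&)>& word_cost) {
  assert(static_cast<bool>(word_cost));
  std::istringstream in(text);
  std::string word;
  long total = 0;
  int count = 0;
  while (in >> word) {
    const std::string clean = StripTrailingPunct(word);
    bool numeric = true;
    for (unsigned char c : clean) numeric = numeric && std::isdigit(c);
    total += numeric ? 0 : word_cost(clean);
    ++count;
  }
  if (count == 0) return 0;  // no words, no cost
  return static_cast<int>(total / static_cast<double>(count));
}
```

<p>Note that <code>operator&gt;&gt;</code> splits on any whitespace, whereas the listing above splits on space and tab only; for single-line input the two behave the same.</p>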
 
309
<a class="anchor" id="a14623fdba39beee456c834cdc6a8976f"></a><!-- doxytag: member="tesseract::WordUnigrams::CostInternal" ref="a14623fdba39beee456c834cdc6a8976f" args="(const char *str) const " -->
 
310
<div class="memitem">
 
311
<div class="memproto">
 
312
      <table class="memname">
 
313
        <tr>
 
314
          <td class="memname">int <a class="el" href="a00657.html#a14623fdba39beee456c834cdc6a8976f">tesseract::WordUnigrams::CostInternal</a> </td>
 
315
          <td>(</td>
 
316
          <td class="paramtype">const char *&#160;</td>
 
317
          <td class="paramname"><em>str</em></td><td>)</td>
 
318
          <td> const<code> [protected]</code></td>
 
319
        </tr>
 
320
      </table>
 
321
</div>
 
322
<div class="memdoc">
 
323
 
 
324
<p>Definition at line <a class="el" href="a01020_source.html#l00243">243</a> of file <a class="el" href="a01020_source.html">word_unigrams.cpp</a>.</p>
 
325
<div class="fragment"><pre class="fragment">                                                        {
 
326
  <span class="keywordflow">if</span> (strlen(key_str) == 0)
 
327
    <span class="keywordflow">return</span> not_in_list_cost_;
 
328
  <span class="keywordtype">int</span> hi = word_cnt_ - 1;
 
329
  <span class="keywordtype">int</span> lo = 0;
 
330
  <span class="keywordflow">while</span> (lo &lt;= hi) {
 
331
    <span class="keywordtype">int</span> current = (hi + lo) / 2;
 
332
    <span class="keywordtype">int</span> comp = strcmp(key_str, words_[current]);
 
333
    <span class="comment">// a match</span>
 
334
    <span class="keywordflow">if</span> (comp == 0) {
 
335
      <span class="keywordflow">return</span> costs_[current];
 
336
    }
 
337
    <span class="keywordflow">if</span> (comp &lt; 0) {
 
338
      <span class="comment">// go lower</span>
 
339
      hi = current - 1;
 
340
    } <span class="keywordflow">else</span> {
 
341
      <span class="comment">// go higher</span>
 
342
      lo = current + 1;
 
343
    }
 
344
  }
 
345
  <span class="keywordflow">return</span> not_in_list_cost_;
 
346
}
 
347
</pre></div>
 
348
</div>
 
349
</div>
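<p><code>CostInternal</code> is a binary search over a word list sorted in <code>strcmp</code> order, with a parallel array of costs and a fallback cost for empty or unlisted words. The sketch below restates that lookup as a free function. <code>UnigramTable</code> and <code>LookupCost</code> are hypothetical names introduced for illustration, and the midpoint is computed as <code>lo + (hi - lo) / 2</code> rather than <code>(hi + lo) / 2</code>, which is a small deviation from the listing.</p>

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-in for the state held by WordUnigrams.
// `words` must be sorted in strcmp order; `costs` is parallel to it.
struct UnigramTable {
  const char* const* words;
  const int* costs;
  int word_cnt;
  int not_in_list_cost;
};

// Returns the cost stored for `key`, or the fallback cost when `key` is
// empty or not present in the table.
int LookupCost(const UnigramTable& table, const char* key) {
  assert(key != nullptr);
  if (std::strlen(key) == 0) return table.not_in_list_cost;
  int lo = 0;
  int hi = table.word_cnt - 1;
  while (lo <= hi) {
    const int mid = lo + (hi - lo) / 2;  // cannot overflow, unlike hi + lo
    const int comp = std::strcmp(key, table.words[mid]);
    if (comp == 0) return table.costs[mid];  // a match
    if (comp < 0) {
      hi = mid - 1;  // key sorts before words[mid]
    } else {
      lo = mid + 1;  // key sorts after words[mid]
    }
  }
  return table.not_in_list_cost;
}
```

<p>The lookup is only correct if the table really is sorted the way <code>strcmp</code> compares, that is, by unsigned byte value. A word file sorted with a locale-aware collation would silently produce misses.</p>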
 
350
<a class="anchor" id="a7e57ac1c4afd17adaeb8abb06351dc76"></a><!-- doxytag: member="tesseract::WordUnigrams::Create" ref="a7e57ac1c4afd17adaeb8abb06351dc76" args="(const string &amp;data_file_path, const string &amp;lang)" -->
 
351
<div class="memitem">
 
352
<div class="memproto">
 
353
      <table class="memname">
 
354
        <tr>
 
355
          <td class="memname"><a class="el" href="a00657.html">WordUnigrams</a> * <a class="el" href="a00657.html#a7e57ac1c4afd17adaeb8abb06351dc76">tesseract::WordUnigrams::Create</a> </td>
 
356
          <td>(</td>
 
357
          <td class="paramtype">const string &amp;&#160;</td>
 
358
          <td class="paramname"><em>data_file_path</em>, </td>
 
359
        </tr>
 
360
        <tr>
 
361
          <td class="paramkey"></td>
 
362
          <td></td>
 
363
          <td class="paramtype">const string &amp;&#160;</td>
 
364
          <td class="paramname"><em>lang</em>&#160;</td>
 
365
        </tr>
 
366
        <tr>
 
367
          <td></td>
 
368
          <td>)</td>
 
369
          <td></td><td><code> [static]</code></td>
 
370
        </tr>
 
371
      </table>
 
372
</div>
 
373
<div class="memdoc">
 
374
 
 
375
<p>Definition at line <a class="el" href="a01020_source.html#l00055">55</a> of file <a class="el" href="a01020_source.html">word_unigrams.cpp</a>.</p>
 
376
<div class="fragment"><pre class="fragment">                                                       {
 
377
  <span class="keywordtype">string</span> file_name;
 
378
  <span class="keywordtype">string</span> str;
 
379
 
 
380
  file_name = data_file_path + <a class="code" href="a01266.html#a4d02e13fee24fdebbbe98ccdcb9c9279">lang</a>;
 
381
  file_name += <span class="stringliteral">&quot;.cube.word-freq&quot;</span>;
 
382
 
 
383
  <span class="comment">// load the string into memory</span>
 
384
  <span class="keywordflow">if</span> (<a class="code" href="a00343.html#ac5c5bf284cd96f78f62f19938bec750a">CubeUtils::ReadFileToString</a>(file_name, &amp;str) == <span class="keyword">false</span>) {
 
385
    <span class="keywordflow">return</span> NULL;
 
386
  }
 
387
 
 
388
  <span class="comment">// split into lines</span>
 
389
  vector&lt;string&gt; str_vec;
 
390
  <a class="code" href="a00343.html#af7dea4521db1e7099c93606d0f5bf4a4">CubeUtils::SplitStringUsing</a>(str, <span class="stringliteral">&quot;\r\n \t&quot;</span>, &amp;str_vec);
 
391
  <span class="keywordflow">if</span> (str_vec.size() &lt; 2) {
 
392
    <span class="keywordflow">return</span> NULL;
 
393
  }
 
394
 
 
395
  <span class="comment">// allocate memory</span>
 
396
  <a class="code" href="a00657.html#afd00490f957dd1842384c372a2949141">WordUnigrams</a> *word_unigrams_obj = <span class="keyword">new</span> <a class="code" href="a00657.html#afd00490f957dd1842384c372a2949141">WordUnigrams</a>();
 
397
  <span class="keywordflow">if</span> (word_unigrams_obj == NULL) {
 
398
    fprintf(stderr, <span class="stringliteral">&quot;Cube ERROR (WordUnigrams::Create): could not create &quot;</span>
 
399
            <span class="stringliteral">&quot;word unigrams object.\n&quot;</span>);
 
400
    <span class="keywordflow">return</span> NULL;
 
401
  }
 
402
 
 
403
  <span class="keywordtype">int</span> full_len = str.length();
 
404
  <span class="keywordtype">int</span> word_cnt = str_vec.size() / 2;
 
405
  word_unigrams_obj-&gt;words_ = <span class="keyword">new</span> <span class="keywordtype">char</span>*[word_cnt];
 
406
  word_unigrams_obj-&gt;costs_ = <span class="keyword">new</span> <span class="keywordtype">int</span>[word_cnt];
 
407
 
 
408
  <span class="keywordflow">if</span> (word_unigrams_obj-&gt;words_ == NULL ||
 
409
      word_unigrams_obj-&gt;costs_ == NULL) {
 
410
    fprintf(stderr, <span class="stringliteral">&quot;Cube ERROR (WordUnigrams::Create): error allocating &quot;</span>
 
411
            <span class="stringliteral">&quot;word unigram fields.\n&quot;</span>);
 
412
    <span class="keyword">delete</span> word_unigrams_obj;
 
413
    <span class="keywordflow">return</span> NULL;
 
414
  }
 
415
 
 
416
  word_unigrams_obj-&gt;words_[0] = <span class="keyword">new</span> <span class="keywordtype">char</span>[full_len];
 
417
  <span class="keywordflow">if</span> (word_unigrams_obj-&gt;words_[0] == NULL) {
 
418
    fprintf(stderr, <span class="stringliteral">&quot;Cube ERROR (WordUnigrams::Create): error allocating &quot;</span>
 
419
            <span class="stringliteral">&quot;word unigram fields.\n&quot;</span>);
 
420
    <span class="keyword">delete</span> word_unigrams_obj;
 
421
    <span class="keywordflow">return</span> NULL;
 
422
  }
 
423
 
 
424
  <span class="comment">// construct sorted list of words and costs</span>
 
425
  word_unigrams_obj-&gt;word_cnt_ = 0;
 
426
  <span class="keywordtype">char</span> *char_buff = word_unigrams_obj-&gt;words_[0];
 
427
  word_cnt = 0;
 
428
  <span class="keywordtype">int</span> max_cost = 0;
 
429
 
 
430
  <span class="keywordflow">for</span> (<span class="keywordtype">int</span> wrd = 0; wrd + 1 &lt; str_vec.size(); wrd += 2) {
 
431
    word_unigrams_obj-&gt;words_[word_cnt] = char_buff;
 
432
 
 
433
    strcpy(char_buff, str_vec[wrd].c_str());
 
434
    char_buff += (str_vec[wrd].length() + 1);
 
435
 
 
436
    <span class="keywordflow">if</span> (sscanf(str_vec[wrd + 1].c_str(), <span class="stringliteral">&quot;%d&quot;</span>,
 
437
               word_unigrams_obj-&gt;costs_ + word_cnt) != 1) {
 
438
      fprintf(stderr, <span class="stringliteral">&quot;Cube ERROR (WordUnigrams::Create): error reading &quot;</span>
 
439
              <span class="stringliteral">&quot;word unigram data.\n&quot;</span>);
 
440
      <span class="keyword">delete</span> word_unigrams_obj;
 
441
      <span class="keywordflow">return</span> NULL;
 
442
    }
 
443
    <span class="comment">// update max cost</span>
 
444
    max_cost = MAX(max_cost, word_unigrams_obj-&gt;costs_[word_cnt]);
 
445
    word_cnt++;
 
446
  }
 
447
  word_unigrams_obj-&gt;word_cnt_ = word_cnt;
 
448
 
 
449
  <span class="comment">// compute the not-in-list cost, i.e. the cost assigned to a word not in the list</span>
 
450
  <span class="comment">// [ahmadab]: This can be computed as follows:</span>
 
451
  <span class="comment">// - Given that the distribution of words follow Zipf&#39;s law:</span>
 
452
  <span class="comment">//   (F = K / (rank ^ S)), where S is slightly &gt; 1.0</span>
 
453
  <span class="comment">// - Number of words in the list is N</span>
 
454
  <span class="comment">// - The mean frequency of a word that did not appear in the list is the</span>
 
455
  <span class="comment">//   area under the rest of the Zipf&#39;s curve divided by 2 (the mean)</span>
 
456
  <span class="comment">// - The area would be the bound integral from N to infinity =</span>
 
457
  <span class="comment">//   (K * S) / (N ^ (S + 1)) ~= K / (N ^ 2)</span>
 
458
  <span class="comment">// - Given that cost = -LOG(prob), the cost of an unlisted word would be</span>
 
459
  <span class="comment">//   = max_cost + 2*LOG(N)</span>
 
460
  word_unigrams_obj-&gt;not_in_list_cost_ = max_cost +
 
461
      (2 * <a class="code" href="a00343.html#a0983096f5ebcb35879a1e0e6038c32f8">CubeUtils::Prob2Cost</a>(1.0 / word_cnt));
 
462
  <span class="comment">// success</span>
 
463
  <span class="keywordflow">return</span> word_unigrams_obj;
 
464
}
 
465
</pre></div>
 
466
</div>
 
467
</div>
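<p><code>Create</code> reads whitespace-separated word and cost pairs, tracks the largest cost, and derives the cost of unlisted words as <code>max_cost + 2 * Prob2Cost(1.0 / word_cnt)</code>. A minimal sketch of that parsing and fallback computation follows. <code>ParsedUnigrams</code> and <code>ParseUnigrams</code> are hypothetical names, and the cost function is assumed to have the form <code>-log(prob) * scale</code>, with <code>scale</code> passed in because the constant used by <code>CubeUtils::Prob2Cost</code> is not shown on this page.</p>

```cpp
#include <cassert>
#include <cmath>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical result type: the parsed (word, cost) pairs plus the cost
// assigned to any word that is not in the list.
struct ParsedUnigrams {
  std::vector<std::pair<std::string, int>> entries;
  int not_in_list_cost = 0;
};

// Parses whitespace-separated "word cost" pairs from `data`.
// Returns false on an empty list, a word without a cost, or a cost that is
// not an integer. `scale` is an assumed multiplier for -log(prob).
bool ParseUnigrams(const std::string& data, double scale,
                   ParsedUnigrams* out) {
  assert(out != nullptr);
  out->entries.clear();
  std::istringstream in(data);
  std::string word, cost_token;
  int max_cost = 0;
  while (in >> word) {
    if (!(in >> cost_token)) return false;  // odd token count
    std::istringstream cost_in(cost_token);
    int cost = 0;
    if (!(cost_in >> cost)) return false;  // cost is not an integer
    out->entries.emplace_back(word, cost);
    if (cost > max_cost) max_cost = cost;
  }
  if (out->entries.empty()) return false;
  // Prob2Cost(1/N) is assumed to be int(-log(1/N) * scale) = int(log(N) * scale).
  const double n = static_cast<double>(out->entries.size());
  out->not_in_list_cost =
      max_cost + 2 * static_cast<int>(std::log(n) * scale);
  return true;
}
```

<p>Rejecting an odd token count up front matters: the pairing loop reads two tokens per step, so a trailing word with no cost would otherwise be read past the end of the token list.</p>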
 
468
<hr/>The documentation for this class was generated from the following files:<ul>
 
469
<li>/usr/local/google/home/jbreiden/tesseract-ocr-read-only/cube/<a class="el" href="a01021_source.html">word_unigrams.h</a></li>
 
470
<li>/usr/local/google/home/jbreiden/tesseract-ocr-read-only/cube/<a class="el" href="a01020_source.html">word_unigrams.cpp</a></li>
 
471
</ul>
 
472
</div><!-- contents -->
 
473
</div>
 
474
<!-- window showing the filter options -->
 
475
<div id="MSearchSelectWindow"
 
476
     onmouseover="return searchBox.OnSearchSelectShow()"
 
477
     onmouseout="return searchBox.OnSearchSelectHide()"
 
478
     onkeydown="return searchBox.OnSearchSelectKey(event)">
 
479
<a class="SelectItem" href="javascript:void(0)" onclick="searchBox.OnSelectItem(0)"><span class="SelectionMark">&#160;</span>All</a><a class="SelectItem" href="javascript:void(0)" onclick="searchBox.OnSelectItem(1)"><span class="SelectionMark">&#160;</span>Classes</a><a class="SelectItem" href="javascript:void(0)" onclick="searchBox.OnSelectItem(2)"><span class="SelectionMark">&#160;</span>Namespaces</a><a class="SelectItem" href="javascript:void(0)" onclick="searchBox.OnSelectItem(3)"><span class="SelectionMark">&#160;</span>Files</a><a class="SelectItem" href="javascript:void(0)" onclick="searchBox.OnSelectItem(4)"><span class="SelectionMark">&#160;</span>Functions</a><a class="SelectItem" href="javascript:void(0)" onclick="searchBox.OnSelectItem(5)"><span class="SelectionMark">&#160;</span>Variables</a><a class="SelectItem" href="javascript:void(0)" onclick="searchBox.OnSelectItem(6)"><span class="SelectionMark">&#160;</span>Typedefs</a><a class="SelectItem" href="javascript:void(0)" onclick="searchBox.OnSelectItem(7)"><span class="SelectionMark">&#160;</span>Enumerations</a><a class="SelectItem" href="javascript:void(0)" onclick="searchBox.OnSelectItem(8)"><span class="SelectionMark">&#160;</span>Enumerator</a><a class="SelectItem" href="javascript:void(0)" onclick="searchBox.OnSelectItem(9)"><span class="SelectionMark">&#160;</span>Friends</a><a class="SelectItem" href="javascript:void(0)" onclick="searchBox.OnSelectItem(10)"><span class="SelectionMark">&#160;</span>Defines</a></div>
 
480
 
 
481
<!-- iframe showing the search results (closed by default) -->
 
482
<div id="MSearchResultsWindow">
 
483
<iframe src="javascript:void(0)" frameborder="0" 
 
484
        name="MSearchResults" id="MSearchResults">
 
485
</iframe>
 
486
</div>
 
487
 
 
488
  <div id="nav-path" class="navpath">
 
489
    <ul>
 
490
      <li class="navelem"><a class="el" href="a01265.html">tesseract</a>      </li>
 
491
      <li class="navelem"><a class="el" href="a00657.html">WordUnigrams</a>      </li>
 
492
 
 
493
    <li class="footer">Generated on Mon Feb 3 2014 10:59:21 for tesseract by
 
494
    <a href="http://www.doxygen.org/index.html">
 
495
    <img class="footer" src="doxygen.png" alt="doxygen"/></a> 1.7.6.1 </li>
 
496
   </ul>
 
497
 </div>
 
498
 
 
499
 
 
500
</body>
 
501
</html>