Returns a result character for building the string.
- Signature:
- return_char(unicode_str)
with
- unicode_str:
- This expeted string has to be in unicode format, e.g. latin.small.letter.a will return the character a The returned character should be used for creating the result string.
Splits the amount of grouped characters which has been detected in a textline to single words.
- Signature:
- chars_make_words(lines_glyphs,threshold=None)
with
- lines_glyphs:
- lines_glyphs has to be the list of connected-components which represents the amount of characters for a textline.
- threshold:
- For splitting the amount of characters to single words is a threshold needed. If two characters have more or equal empty space between each other there is a white space detected. The default value for this parameter is None which will make the function calculate a threshold value automatic
The Textline Objects keep a word list as an attribute which expects a list like this functions return value.
Gives a full text string as result.
- Signature:
- textline_to_string(line, heuristic_rules="roman")
with
- line:
- The Textline Object which keeps the characters.
- heuristic_rules:
- Some classified characters need some further heuristic classification rules. Take the apostroph as an example which might get classified as a comma but is placed at the top of a textline. Therefore the apostroph can be classified "manual" as a comma. On default this function includes some rules for often noticed classification errors for _roman_ alphabet.
The result of this function can be used as a final result as this is the full text string.
Check two glyphs for beeing grouped to one single character. This function is for unit connected-components like quotation marks.
- Signature:
- check_upper_neighbors(item,glyph,line)
with
- item:
- Some connected-component.
- glyph:
- Some connected-component.
- line:
- The Textline Object which includes item and glyph
There is returned an array with two elements. The first element keeps a list of characters (images that has been united to a single image) and the second image is a list of characters which has to be removed as these have been united to a single character.
Check two glyphs for beeing grouped to one single character. This function is for unit connected-components like i, j or colon.
- Signature:
- check_glyph_accent(item,glyph)
with
- item:
- Some connected-component.
- glyph:
- Some connected-component.
There is returned an array with two elements. The first element keeps a list of characters (images that has been united to a single image) and the second image is a list of characters which has to be removed as these have been united to a single character.
Segmentates the glyphs which are included in every single textline with simple rules.
- Signature:
- get_line_glyphs(image,textlines)
with
- image:
- The textdocument image which is beeing segmentated.
- textlines:
- A list of connected-components which keeps every textline of the document.
A Page Object has a list named textlines which should include Textline Objects. This list is filled within this function as it is called in a Page method.
Draws hollow rects in an copy of image based on the rects in the glyphs list. If the save parameter is set to 1 a file with the name of filename will be created.
- Signature:
- show_bboxes(image,glyphs,filename=\"segmenated_glyphs.PNG\",save=1)
with:
- image:
- An image of the textdokument which has to be segmentated.
- glyphs:
- A list of rects which will be drawn on image as hallow rects.
- filename:
- A filename for the image file that might be created.
- save:
- On default save is set to 1 which will cause the function to create a new image file named filename. If save is set to 0 the function will try to display the image on-the-fly in a box.
This function is usefull for debugging or for getting information about the segmentation process. The Page object useses this function in three methods for displaying the segmentation of textlines, all single characters and all detected words.