xml_extract module
Prints the coordinates of paragraphs, words and lines found in a given .xml file. The .xml file must first be cleaned by xml_cleaner.py. This helps in checking what is a word and what is a picture fragment.

xml_extract.check_paragraph(para_xml)
Checks whether a paragraph contains trash. Returns True if it does not and False if it does.
- Args:
- para_xml (str): XML of the paragraph

xml_extract.create_words_lines_output(coordinates_words)
Helper function that prepares the output data for lines and words.

xml_extract.get_alpha(line)
Returns the number of alphanumeric characters in line.
- Args:
- line (str): string to check
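
The counting behaviour described for get_alpha could be sketched as follows. This is only an illustration, not the module's actual code: the name count_alphanumeric is hypothetical, and it assumes "alphanumeric" means any character for which Python's built-in str.isalnum() returns True.

```python
def count_alphanumeric(line: str) -> int:
    """Count the characters in `line` that str.isalnum() accepts.

    Hypothetical stand-in for get_alpha; the real function may use a
    different definition of "alphanumeric".
    """
    return sum(1 for ch in line if ch.isalnum())


if __name__ == "__main__":
    # Letters and digits are counted; spaces and punctuation are not.
    print(count_alphanumeric("Fig. 3: 42 px"))  # prints 8
```

A count like this may be useful for telling text from picture fragments, since a line with very few alphanumeric characters is less likely to be a real word.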