xml_extract module

Prints the coordinates of paragraphs, lines and words found in a given .xml file. The .xml file must first be cleaned by xml_cleaner.py. Helps in checking what is a word and what is a picture fragment.

xml_extract.check_paragraph(para_xml)

Checks whether a paragraph contains trash. Returns True if it does not and False if it does.

Args:
para_xml (str) : xml of paragraph
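
The module does not document how trash is detected, so the following is only a minimal sketch of one possible heuristic, not the actual implementation of check_paragraph. It assumes the paragraph XML wraps its text in child elements, and it treats a paragraph as real text when at least half of its visible characters are alphanumeric. The function name looks_like_text and the min_ratio threshold are hypothetical.

```python
import xml.etree.ElementTree as ET


def looks_like_text(para_xml: str, min_ratio: float = 0.5) -> bool:
    """Return True if the paragraph looks like real text, False if it looks like trash.

    Hypothetical heuristic: compare the number of alphanumeric characters
    with the number of non-whitespace characters in the paragraph.
    """
    root = ET.fromstring(para_xml)
    # Collect all text nested anywhere inside the paragraph element.
    text = "".join(root.itertext())
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        # An empty paragraph carries no words, so treat it as trash.
        return False
    alnum = sum(1 for ch in visible if ch.isalnum())
    return alnum / len(visible) >= min_ratio
```

The return convention mirrors the one described above: True means the paragraph is clean, False means it looks like trash.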

xml_extract.create_output()

Prints the final output.

xml_extract.create_words_lines_output(coordinates_words)

Helper that builds the output data for lines and words.

xml_extract.get_alpha(line)

Returns the number of alphanumeric characters in a string.

Args:
line (str) : string which needs to be checked
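
Counting alphanumeric characters can be sketched as below. This is an illustration of the idea rather than the module's own code; the name count_alnum is hypothetical, and it relies only on the built-in str.isalnum.

```python
def count_alnum(line: str) -> int:
    """Count the characters in `line` that are letters or digits."""
    return sum(1 for ch in line if ch.isalnum())


print(count_alnum("Fig. 12: a-b"))  # prints 7
```

A string made up only of punctuation or whitespace gives 0, which is what makes such a count useful for telling words apart from picture fragments.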

xml_extract.get_lines_xml(para_xml)

Gets the lines from the XML of a paragraph.

Args:
para_xml (str) : xml of “PARAGRAPH”

xml_extract.get_paragraphs_xml(root)

Gets the paragraphs from the XML.

xml_extract.get_words_xml(line_xml)

Gets the words from the XML of a line.

Args:
line_xml (str) : xml of “LINE”
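
The XML schema is not described here, so the sketch below rests on assumptions: that each word is a WORD element inside the “LINE” element, and that its bounding box is stored in a comma-separated coords attribute. The function name words_with_coords is hypothetical; only the standard xml.etree.ElementTree module is used.

```python
import xml.etree.ElementTree as ET


def words_with_coords(line_xml: str) -> list:
    """Return (text, coords) pairs for every WORD element in a LINE.

    Assumes <WORD coords="a,b,c,d">text</WORD>; both the tag name and
    the attribute name are assumptions, not taken from the module.
    """
    root = ET.fromstring(line_xml)
    words = []
    for word in root.iter("WORD"):
        raw = word.get("coords", "")
        # Split "10,40,60,20" into a tuple of integers; skip empty pieces.
        coords = tuple(int(v) for v in raw.split(",") if v.strip())
        words.append(((word.text or "").strip(), coords))
    return words


sample = '<LINE><WORD coords="10,40,60,20">Hello</WORD> <WORD coords="70,40,120,20">world</WORD></LINE>'
for text, coords in words_with_coords(sample):
    print(text, coords)
```

If the cleaned file uses different tag or attribute names, only the two string literals "WORD" and "coords" need to change.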