Manual¶
Input¶
- djvu file of the newspaper in which user wants to find obituary.
- in.tsv file (train) : in this file are kept – filename of the newspaper and page number with coordinates where obituaries are placed format: newspaper_title.djvu page_number left_upper_x1_coordinate left_upper_y1_coordinate lower_right_x2_coordinate lower_right_y2_coordinate -> where separators are tabs.
- in.tsv file (test) : in this file filename of the newspaper is kept.
Install¶
- make install : install all of the requirements.
- make install-doc : install sphinx.
Train¶
- make train-split : creates .necro files (newspaper_title.necro) which include coordinates of obituary and page which contains that obituary.
- make train : trains vowpal wabbit model (launches : train-unpack, train-generate, train-lm, train-vw).
- make train-unpack : unpacks newspapers from train directory. Extracts metadata (title, type, language etc), page in tiff, xml and text layer of page.
- make train-bpe : trains BPE model based on txt files of newspapers.
- make train-generate : generates rectangles “noticed” on page – potential obituaries.
- make train-classify : tags generated rectangles using information stored in necro file.
- make train-lm : trains language models – char based 3-gram model of necrologies, BPE based 3-gram model of pages with necrologies
- make train-analyze : analyzes newspapers – extracts text and graphic features, counts language model score of page rectangles and language model score of page with current rectangle.
- make train-merge : merges vw files of newspapers into train.in file.
- make train-vw : trains vw model.
- make train-purge : removes necro, vw files and train.* from train directory; corpora, arpa and klm files from LM directory, BPE model from BPE directory.
- make train-clean : removes train.* files from train directory. corpora, arpa, klm files from LM directory, BPE model from BPE directory.
Test¶
- make test : tests trained vw model.
- make test-unpack: unpacks newspapers from test-A directory. Extracts metadata (title, type, language etc), page in tiff, xml and text layer of page.
- make test-generate : generates rectangles of page – potential obituaries.
- make test-analyze : analyzes newspapers – extracts text and graphic features, counts language model score of page rectangles and language model score of page with current rectangle.
- make test-predict : predicts obituaries for all newspapers in test-A directory.
- make test-merge : creates out.tsv file where coordinates of obituaries are kept.
- make test-purge : removes vw, predict, out.tsv and newspaper_title.out.tsv files from test directory.
- make test-clean : removes out.tsv and newspaper_title.out.tsv files from test directory.
Dev¶
- make dev : tests trained vw model.
- make dev-unpack: unpacks newspapers from dev-0 directory. Extracts metadata (title, type, language etc), page in tiff, xml and text layer of page.
- make dev-generate : generates rectangles of page – potential obituaries.
- make dev-analyze : analyzes newspapers – extracts text and graphic features, counts language model score of page rectangles and language model score of page with current rectangle.
- make dev-predict : predicts obituaries for all newspapers in dev-0 directory.
- make dev-merge : creates out.tsv file where coordinates of obituaries are kept.
- make dev-purge : removes vw, predict, out.tsv and newspaper_title.out.tsv files from dev-0 directory.
- make dev-clean : removes out.tsv and newspaper_title.out.tsv files from dev-0 directory.
Clean¶
- make clean-unpack : removes txt files from train directory and files created by unpack command.
- make clean-generate : removes files created by generate command.
- make clean-classify : removes files created by classify command.
- make clean-analyze : removes files created by analyze command.
- make purge : removes vw, predict, out.tsv and newspaper_title.out.tsv files from test directory; necro, vw files and train.* from train directory; corpora, arpa and klm files from LM directory, BPE model from BPE directory.
- make clean : removes out.tsv and newspaper_title.out.tsv files from test directory; train.* files from train directory; corpora, arpa, klm files from LM directory, BPE model from BPE directory.
Doc¶
- make -f doc_maker html : creates automatically generated documentation of python modules.
- make -f doc_maker clean : removes content of build directory.
Cut¶
- make dev-cut : cuts obituary from the newspaper, bases on merged out.tsv and in.tsv files, puts obutuaries into dev-0/obituaries directory.
- make test-cut : cuts obituary from the newspaper, bases on merged out.tsv and in.tsv files, puts obutuaries into test-A/obituaries directory.