NAUTILUS – Helping make accessible information that has already been found

Introducing, cultural heritage, semantic web

You do not have to be a student to need to find information. At school we work mostly with digital libraries. We therefore welcomed the news that, thanks to digitization, more than a million pages from periodicals are now available in the University Library alone. Our disappointment was all the greater when we found that full-text search is nice, but the dozens of pages it returns are no real win. What is needed is an analytical breakdown into articles. But is that even possible? After all, a single page can hold several articles. We asked ourselves these and similar questions at the beginning too. We decided, however, to use our knowledge and experience with technologies, and the result is ….


Semantic search collects information about words in order to detect the meaning of those words, and of the lines or blocks of text they belong to, and to recognize concept matches, relations and synonyms. Because it relies on machine processing, semantic search requires digitized sources. Most sources are now available in digitized form, so semantic search is not confined to a single field. One field where such sources are very common is press archiving, which matters because old press issues contain useful information that cannot be found anywhere else.

It is common practice to archive press sources as one unit per issue. To apply semantic search correctly, these issues must be parsed into individual articles, since each article is a separate unit whose meaning and content can differ from the others. Because the structure of these resources is often inconsistent, especially when they come from different countries or time periods, it is necessary to design a tool that parses these resources into their descriptive parts and reworks periodicals into correct bibliographic records for individual articles.

The challenge of our work is the future application of semantic search to our dataset, which consists of old press issues digitized for archiving purposes, meaning they do not arrive in a consistent state suitable for immediate information extraction. Data in old press can contain a lot of interesting information that may be used by digital humanities researchers, scholars and others. However, it is very difficult to obtain and process this data effectively other than by reading it. The data produced by optical character recognition (OCR) was irregularly assigned to blocks and paragraphs that do not always correspond to the page image.
Our goal is to propose a preprocessing method that extracts articles from digitized data, defines elements of articles, such as keywords, for use in semantic search, and creates an environment in which people can help the automated process define entities in raw digitized data more precisely.
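
To make the idea of article extraction more concrete, here is a minimal sketch in Python. It is not the NAUTILUS method, only an illustration under simplifying assumptions: the `Block` and `Article` types, the `is_heading` flag (imagined as coming from layout cues such as font size), the tiny English stop-word list and the frequency-based keyword choice are all hypothetical.

```python
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Block:
    """One OCR text block. `is_heading` is a hypothetical flag,
    assumed to be derived from layout cues such as font size."""
    text: str
    is_heading: bool = False


@dataclass
class Article:
    title: str
    paragraphs: list = field(default_factory=list)


# Illustrative stop-word list; a real one would depend on the language.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "was", "on", "for"}


def split_into_articles(blocks):
    """Start a new article at each heading and attach the blocks that follow."""
    articles = []
    for block in blocks:
        if block.is_heading:
            articles.append(Article(title=block.text))
        elif articles:
            articles[-1].paragraphs.append(block.text)
        # Blocks before the first heading (masthead, page number) are skipped.
    return articles


def keywords(article, top_n=3):
    """Return the most frequent non-stop-words of an article as keyword candidates."""
    text = " ".join([article.title] + article.paragraphs).lower()
    words = re.findall(r"[^\W\d_]+", text)  # runs of Unicode letters only
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(top_n)]


if __name__ == "__main__":
    page = [
        Block("Daily Herald, page 3"),
        Block("Flood on the Danube", is_heading=True),
        Block("The flood reached the town on Monday."),
        Block("The flood damaged the bridge."),
        Block("New Theatre Opens", is_heading=True),
        Block("The theatre opened in Bratislava. The theatre was full."),
    ]
    for article in split_into_articles(page):
        print(article.title, keywords(article))
```

In practice the heading decision is the hard part, since OCR blocks do not always match the page image. That is where a human-in-the-loop environment like the one described above could correct the automated split.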