Automatic Recognition and Indexing of Books' Tables of Contents

The title of a scientific or technical monograph does not convey all the topics the book covers. For example, the title of the book „XML technologie: Principy a aplikace v praxi“ ("XML Technology: Principles and Applications in Practice") does not reveal whether the book discusses the XQuery language. In such cases the subject description in the bibliographic record also fails, because the cataloguer often does not know the subject matter in depth, and the limited number and granularity of subject headings are insufficient. By contrast, the table of contents (TOC) of a scientific or technical document describes the content of individual chapters and subchapters very precisely, so highly relevant keywords can be mined from it. These keywords can then serve as input data for indexing in search tools such as an OPAC or a discovery system, where users gain the ability to search by words and phrases that occur in the table of contents of the work.
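
The idea of mining keywords from a table of contents and feeding them into a search index can be illustrated with a minimal Python sketch. It is not the authors' method: the stop-word list, the regular expressions, the function names `toc_keywords` and `build_index`, the book identifier, and the sample TOC lines are all assumptions made up for illustration.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; a real system would use a fuller, language-specific one.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def toc_keywords(toc_line):
    """Return lowercase keyword candidates from one table-of-contents line."""
    # Strip leading chapter numbering such as "5.2".
    text = re.sub(r"^\s*\d+(\.\d+)*\.?\s*", "", toc_line)
    # Strip trailing dot leaders and the page number.
    text = re.sub(r"[\s.]*\d+\s*$", "", text)
    # Keep runs of letters only (Unicode-aware, so diacritics survive).
    words = re.findall(r"[^\W\d_]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def build_index(books):
    """Map each keyword to the set of book identifiers whose TOC contains it."""
    index = defaultdict(set)
    for book_id, toc_lines in books.items():
        for line in toc_lines:
            for word in toc_keywords(line):
                index[word].add(book_id)
    return index


# Invented sample data: the identifier and TOC lines are not taken from a real book.
books = {
    "xml-technologie": [
        "1 Introduction to XML ........ 11",
        "5.2 Querying with XQuery ..... 143",
    ],
}
index = build_index(books)
print(sorted(index["xquery"]))
```

With an index of this kind, a search for "xquery" finds the book even though the term is absent from its title and subject headings. A production OPAC or discovery system would rely on its own indexing pipeline instead of an in-memory dictionary.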


Issue: 3/2015
