{"id":903,"date":"2020-06-06T18:08:24","date_gmt":"2020-06-06T16:08:24","guid":{"rendered":"https:\/\/dhlunch.ijp.pan.pl\/?page_id=903"},"modified":"2020-06-06T18:08:24","modified_gmt":"2020-06-06T16:08:24","slug":"12-06-2020","status":"publish","type":"page","link":"https:\/\/dhlunch.ijppan.pl\/en\/12-06-2020\/","title":{"rendered":"12.06.2020"},"content":{"rendered":"<p>Albert Le\u015bniak and Ma\u0142gorzata Czachor (IJP PAN)<\/p>\n<p><b>How small a corpus can be? The efficiency of TF-IDF keyword extraction in relation to the corpus size<\/b><span style=\"font-weight: 400\">\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400\">A method called term frequency \u2013 inverse document frequency (TF-IDF) is a widely used algorithm for extracting keywords. The score depends on two quantities: the frequency of the word in a document (term frequency, or TF) and the rarity of the word across the corpus, derived from the number of documents that contain it (inverse document frequency, or IDF). Whereas TF is an intrinsic feature of each text, IDF is based on the entire corpus (more precisely, on the fraction of the texts in the corpus that contain a given word), so the larger the number of documents on which IDF is based, the more reliable the outcome. The aim of the talk is to determine the minimal corpus size for TF-IDF or, to be more precise, to what extent reducing the size of the corpus affects the effectiveness of TF-IDF. The study is based on four corpora: Interia.pl (220 000 texts), weekly magazines (220 000 texts), Gutenberg Library (29 750 texts) and short extracts from Gutenberg Library (29 750 texts). The IDF was first computed using all the texts and then, iteratively, on a decreasing number of them, so that in each iteration IDF is based on a different (smaller) number of texts. Since IDF differs in each iteration, the keyness (the TF-IDF score) differs as well. Still, the results are surprisingly stable. 
The scores obtained from a very small corpus differ little from those based on the entire collection of texts.<\/span><\/p>\n<p><span style=\"font-weight: 400\">The results show that small corpora constitute a reliable basis for this algorithm. The conclusion is interesting in itself, but also vital for practical applications. Processing large corpora is still labour- and time-consuming; therefore, provided it does not cause a dramatic drop in efficiency, computing the IDF on a smaller corpus is beneficial. <\/span><\/p>\n<p><iframe title=\"DH Lunch: Jak ma\u0142y mo\u017ce by\u0107 korpus?\" width=\"660\" height=\"371\" src=\"https:\/\/www.youtube.com\/embed\/L07krG11HmQ?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe><\/p>\n<p>Link to the Zoom discussion after the meeting: https:\/\/zoom.us\/j\/92355384866?pwd=ckl2bmRZYWxmVEs3RFVVVDRuNlQ4dz09<\/p>","protected":false},"excerpt":{"rendered":"<p>Albert Le\u015bniak and Ma\u0142gorzata Czachor (IJP PAN) How small a corpus can be? 
The efficiency of TF-IDF keyword extraction in relation to the corpus size\u00a0 A method called term frequency &hellip; <a href=\"https:\/\/dhlunch.ijppan.pl\/en\/12-06-2020\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">12.06.2020<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-903","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/dhlunch.ijppan.pl\/en\/wp-json\/wp\/v2\/pages\/903","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dhlunch.ijppan.pl\/en\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/dhlunch.ijppan.pl\/en\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/dhlunch.ijppan.pl\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/dhlunch.ijppan.pl\/en\/wp-json\/wp\/v2\/comments?post=903"}],"version-history":[{"count":0,"href":"https:\/\/dhlunch.ijppan.pl\/en\/wp-json\/wp\/v2\/pages\/903\/revisions"}],"wp:attachment":[{"href":"https:\/\/dhlunch.ijppan.pl\/en\/wp-json\/wp\/v2\/media?parent=903"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}