TY - JOUR
T1 - Development of a gold-standard Pashto dataset and a segmentation app
AU - Han, Yan
AU - Rychlik, Marek
N1 - Funding Information:
The authors would like to thank the National Endowment for the Humanities for its grant (PR-263939-19) to our project Development of Image-to-text Conversion for Pashto and Traditional Chinese. The authors would like to thank Riaz Ahmad and Saeeda Naz for providing the NUCES FAST ligature dataset. The authors would also like to thank Atifa Rawan, Sayyed M. Vazirizade, and Sharam Parastesch for their valuable contributions. Ms. Rawan selected the sample Pashto manuscripts and reviewed the lines. Dr. Vazirizade worked on segmentation algorithms and code. Ph.D. student Sharam Parastesch keyed in and verified the dataset.
Funding Information:
Rawan and Han at the University of Arizona Libraries have been collaborating with the Afghanistan Centre at Kabul University (ACKU), the de facto National Library of Afghanistan. The purpose of the 13-year-long collaboration is to preserve and provide open access to Afghanistan’s unique materials from the ACKU’s physical collections. Initially funded by a grant from the National Endowment for the Humanities (NEH) for the period of 2008 to 2012, the project digitized 200,000 pages of materials from the modern period. The project continues to receive support from the University of Arizona and the ACKU. The ACKU’s permanent collection is the most extensive in the region covering a time of war and social upheaval in the country, with most of the documents in the principal languages of Pashto, Dari (Persian), and English and in a variety of formats such as monographs, series, reports, yearbooks, videos, and newspapers. In addition, Rawan and Han pursued related Afghani scholars’ collections, including those of Ludwig W. Adamec and M. Mobin Shorish. An openly accessible repository (www.afghandata.org) contains these unique materials dating from the 1950s to the present. The repository has grown from the initial 200,000 pages to 2 million and is the biggest digital repository in the world covering Afghanistan and its region, with more than 200,000 active users viewing 400,000 pages per year.
Publisher Copyright:
© 2021.
PY - 2021/3
Y1 - 2021/3
N2 - The article aims to introduce a gold-standard Pashto dataset and a segmentation app. The Pashto dataset consists of 300 line images and corresponding Pashto text from three selected books. A line image is simply an image consisting of one text line from a scanned page. To our knowledge, this is one of the first open access datasets which directly maps line images to their corresponding text in the Pashto language. We also introduce the development of a segmentation app using textbox expanding algorithms, a different approach to OCR segmentation. The authors discuss the steps to build a Pashto dataset and develop our unique approach to segmentation. The article starts with the nature of the Pashto alphabet and its unique diacritics which require special considerations for segmentation. Needs for datasets and a few available Pashto datasets are reviewed. Criteria of selection of data sources are discussed and three books were selected by our language specialist from the Afghan Digital Repository. The authors review previous segmentation methods and introduce a new approach to segmentation for Pashto content. The segmentation app and results are discussed to show readers how to adjust variables for different books. Our unique segmentation approach uses an expanding textbox method which performs very well given the nature of the Pashto scripts. The app can also be used for Persian and other languages using the Arabic writing system. The dataset can be used for OCR training, OCR testing, and machine learning applications related to content in Pashto.
AB - The article aims to introduce a gold-standard Pashto dataset and a segmentation app. The Pashto dataset consists of 300 line images and corresponding Pashto text from three selected books. A line image is simply an image consisting of one text line from a scanned page. To our knowledge, this is one of the first open access datasets which directly maps line images to their corresponding text in the Pashto language. We also introduce the development of a segmentation app using textbox expanding algorithms, a different approach to OCR segmentation. The authors discuss the steps to build a Pashto dataset and develop our unique approach to segmentation. The article starts with the nature of the Pashto alphabet and its unique diacritics which require special considerations for segmentation. Needs for datasets and a few available Pashto datasets are reviewed. Criteria of selection of data sources are discussed and three books were selected by our language specialist from the Afghan Digital Repository. The authors review previous segmentation methods and introduce a new approach to segmentation for Pashto content. The segmentation app and results are discussed to show readers how to adjust variables for different books. Our unique segmentation approach uses an expanding textbox method which performs very well given the nature of the Pashto scripts. The app can also be used for Persian and other languages using the Arabic writing system. The dataset can be used for OCR training, OCR testing, and machine learning applications related to content in Pashto.
UR - http://www.scopus.com/inward/record.url?scp=85104249844&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85104249844&partnerID=8YFLogxK
U2 - 10.6017/ITAL.V40I1.12553
DO - 10.6017/ITAL.V40I1.12553
M3 - Article
AN - SCOPUS:85104249844
SN - 0730-9295
VL - 40
JO - Information Technology and Libraries
JF - Information Technology and Libraries
IS - 1
ER -