Practice of Robots Exclusion Protocol in Bhutan

Kezang Dema, Thinley Jamtsho

Abstract


Most search engines rely on web robots to collect information from the web. Because the web is open and unregulated, robots can easily crawl and index the entire contents of a website. Not every site owner, however, wants all of their websites and web pages indexed by web crawlers. Crawling activity can be regulated by deploying the Robots Exclusion Protocol (REP) in a file called robots.txt on the server. The protocol is a de facto standard, and most ethical robots follow the rules specified in the file. Bhutan has many websites, but how far they use robots.txt to regulate bots is unknown, since no study has been carried out to date. The main aim of this paper is to investigate the use of robots.txt files on the websites of various organizations in Bhutan and, where the file exists, to analyze its contents. A total of 50 websites from sectors including colleges, government ministries, autonomous agencies, corporations, and newspaper agencies were selected to check for use of the file. The files found were then analyzed for file size, the types of robots specified, and correct use of the file. The results showed that almost 70% of the websites investigated use the default robots.txt file generated by the Joomla and WordPress content management systems (CMS), which indicates that the file is in use. However, the file is not taken seriously: almost 70% of the files lack the major rules and best practices that would define which resources are allowed or denied to the various types of robots on the web. Approximately 30% of the URLs studied have no REP file on their web server and thus give all types of web robots unregulated access to their resources.
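As an illustration of how a robots.txt file regulates access, the following minimal Python sketch parses a sample file and reports the properties the study examines (file size, robots named, rules present). It is not the authors' method: the sample file content, the robot name ExampleBot, the host example.org, and the helper names build_parser and summarize are all assumptions made for this example. Only the standard-library urllib.robotparser module is used.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt in the style of a CMS default file (an assumption,
# not a file taken from the study).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /administrator/
Disallow: /cache/
Disallow: /tmp/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Load robots.txt text into the standard-library parser (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def summarize(robots_txt: str) -> dict:
    """Report file size, the robots named, and the number of non-empty Disallow rules."""
    agents = []
    disallow_count = 0
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding spaces
        if not line or ":" not in line:
            continue
        field, value = line.split(":", 1)
        field = field.strip().lower()
        value = value.strip()
        if field == "user-agent":
            agents.append(value)
        elif field == "disallow" and value:
            disallow_count += 1
    return {
        "size_bytes": len(robots_txt.encode("utf-8")),
        "user_agents": agents,
        "disallow_rules": disallow_count,
    }


if __name__ == "__main__":
    parser = build_parser(SAMPLE_ROBOTS_TXT)
    # ExampleBot and example.org are placeholders for a crawler and a site.
    print(parser.can_fetch("ExampleBot", "https://example.org/administrator/index.php"))
    print(parser.can_fetch("ExampleBot", "https://example.org/index.php"))
    print(summarize(SAMPLE_ROBOTS_TXT))
```

A file like this sample names only the wildcard robot and blocks a few CMS directories, which is the kind of default configuration the abstract describes; a site with no robots.txt at all would leave every path open to every robot that honors the protocol.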

Keywords: Crawler, robots.txt, search engines, robots exclusion protocol, indexing

DOI: 10.7176/JEP/11-35-01

Publication date: December 31st 2020



ISSN (Paper) 2222-1735, ISSN (Online) 2222-288X

This journal follows the ISO 9001 management standard and is licensed under a Creative Commons Attribution 3.0 License.