Abstract
Approximately 1.13 billion websites span diverse domains such as music, movies, health, video games, social media, and entertainment, which makes website classification both valuable and challenging. This research explores Artificial Intelligence techniques for categorizing websites into appropriate classes. The DMOZ dataset is one of the most reliable data sources for URL-focused research: it contains about 1.5 million URLs manually classified into categories, although the distribution of URLs across categories is highly imbalanced. To overcome this limitation, we collected 500,000 URLs from DomCop's Top 10 Million Websites and classified them using Large Language Models (LLMs), specifically LLaMA 3 3B. Numerous machine learning and deep learning models were trained on both datasets, including logistic regression, support vector machines, decision trees, random forest, CatBoost, XGBoost, LSTM, Bi-LSTM, and FastText. Different data handling and pre-processing techniques were also employed, including chunking for balancing and subdomain extraction to increase the number of URLs per category. We report the most successful pipeline, which uses the FastText model and achieves an average accuracy of 82%.
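As a rough illustration of what "zero-content" classification input can look like, the sketch below turns a URL into word-like tokens without fetching the page, then formats it as a labelled training line. This is not the authors' pipeline: the abstract does not describe their tokenization, and the helper names `url_tokens` and `to_fasttext_line`, the dropping of the `www` label, and the example URLs are all assumptions made for illustration. The `__label__` prefix is the default label marker used by fastText's supervised mode.

```python
import re
from urllib.parse import urlsplit


def url_tokens(url: str) -> list[str]:
    """Split a URL into lowercase tokens using only the URL string itself.

    Hypothetical helper: the paper's actual pre-processing is not given
    in the abstract.
    """
    # urlsplit only recognises the host when a scheme or '//' is present.
    if "://" not in url:
        url = "//" + url
    parts = urlsplit(url, scheme="http")
    host = (parts.hostname or "").lower()
    # Keep each host label (subdomain, domain, TLD); drop the generic 'www'.
    labels = [label for label in host.split(".") if label and label != "www"]
    # Break the path into alphanumeric runs, e.g. 'top-albums.html'.
    path_tokens = re.findall(r"[a-z0-9]+", parts.path.lower())
    return labels + path_tokens


def to_fasttext_line(url: str, category: str) -> str:
    """Format one labelled example as '__label__<category> tok tok ...'."""
    return f"__label__{category} " + " ".join(url_tokens(url))


if __name__ == "__main__":
    print(to_fasttext_line("https://www.music.example.com/rock/top-albums.html", "Music"))
```

Lines in this shape could be written to a text file and used to train a supervised text classifier. Whether the host labels, the path, or both carry the most signal is an empirical question that the abstract does not settle.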
Keywords
- Title
- Identifying Website Applications Through Zero-Content URL Classification
- Authors
- Khan, Muhammad Aleem Siddique; Basit, Addul; Raza, Ahmad; Zeeshan, Muhammad; Khan, Muzamil; Ansar,
- Publication date
- 2026
- Type
- Conference Paper
- Journal name
- 2026 7th International Conference on Advancements in Computational Sciences, ICACS 2026