Identifying Website Applications Through Zero-Content URL Classification
  • Khan, Muhammad Aleem Siddique
  • Basit, Addul
  • Raza, Ahmad
  • Zeeshan, Muhammad
  • Khan, Muzamil
  • and 1 other
Citations (SCOPUS): 0

Abstract

Approximately 1.13 billion websites span diverse domains such as music, movies, health, video games, social media, and entertainment, making website classification highly valuable yet challenging. This research explores the use of artificial intelligence techniques to categorize websites into appropriate classes. The DMOZ dataset is one of the most reliable data sources for URL-focused research, containing about 1.5 million URLs manually classified into categories, but its distribution of URLs across categories is highly unbalanced. To overcome this limitation, we collected 500,000 URLs from DomCop's Top 10 Million Websites and labeled them using a Large Language Model (LLM), specifically LLaMA 3 3B. We trained numerous machine learning and deep learning models on both datasets, including logistic regression, support vector machines, decision trees, random forest, CatBoost, XGBoost, LSTM, Bi-LSTM, and FastText. We also applied several data handling and pre-processing techniques, including chunking for balancing and subdomain extraction to increase the number of URLs per category. We report the most successful pipeline, which uses the FastText model and achieves an average accuracy of 82%.

Keywords

application-type identification; character n-gram tokenization; zero-content URL classification; FastText AI framework; LLM dataset labeling
Title
Identifying Website Applications Through Zero-Content URL Classification
Authors
Khan, Muhammad Aleem Siddique; Basit, Addul; Raza, Ahmad; Zeeshan, Muhammad; Khan, Muzamil; Ansar,
DOI
10.1109/ICACS69208.2026.11433158
Publication date
2026
Type
Conference Paper
Journal name
2026 7th International Conference on Advancements in Computational Sciences, ICACS 2026