UCI Machine Learning Repository: a collection of databases, domain theories, and data generators used by the machine learning community for the empirical analysis of machine learning algorithms. It has been widely used by students, educators, and researchers all over the world as a primary source of machine learning datasets.
NLP-progress: a repository that tracks progress in Natural Language Processing (NLP), including the datasets and the current state of the art for the most common NLP tasks.
Kaggle
Scikit-learn (sklearn.datasets)
TensorFlow Datasets (tensorflow_datasets)
TorchVision (torchvision.datasets, for PyTorch)
Base R (datasets package)
caret Package (Classification and Regression Training)
mlbench Package (Machine Learning Benchmarks)
ggplot2 (via ggplot2::mpg)
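
For the Python entries in the list above, a minimal sketch of how one might check which loaders are installed locally and, when scikit-learn is present, load one of its bundled datasets. The helper names (`installed_dataset_packages`, `load_iris_arrays`) and the choice of the Iris dataset are illustrative assumptions, not something the list prescribes; `sklearn.datasets.load_iris` is the only library call used.

```python
import importlib.util

# Top-level import names for the Python libraries listed above.
PYTHON_DATASET_PACKAGES = {
    "Scikit-learn": "sklearn",
    "TensorFlow Datasets": "tensorflow_datasets",
    "TorchVision": "torchvision",
}


def installed_dataset_packages():
    """Map each library name to True if its package can be found locally."""
    return {
        name: importlib.util.find_spec(module) is not None
        for name, module in PYTHON_DATASET_PACKAGES.items()
    }


def load_iris_arrays():
    """Load the Iris dataset bundled with scikit-learn (no download needed)."""
    # Imported lazily so the availability check works without scikit-learn.
    from sklearn.datasets import load_iris

    iris = load_iris()
    return iris.data, iris.target


if __name__ == "__main__":
    status = installed_dataset_packages()
    for name, ok in status.items():
        print(f"{name}: {'installed' if ok else 'not installed'}")

    if status["Scikit-learn"]:
        features, labels = load_iris_arrays()
        print("Iris features:", features.shape, "labels:", labels.shape)
```

The import of scikit-learn is deferred to the loader function so the script still runs, and reports what is missing, on a machine where none of the three libraries is installed.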
