Image by Author
You’ve read on these pages (and I’m guilty of writing some of those articles) that data science projects are crucial for developing the whole package of technical data science skills. That’s true, they are. But what’s also vital is having high-quality datasets for your data science projects. Collecting quality data is just one stage of a data science project, but it’s the one that can make or break it.
The question is, where to find this frigging data? Fortunately, numerous websites are offering a wealth of data for various purposes.
You’ve probably heard of Kaggle, arguably the most well-known platform in the data science community. It hosts a vast array of datasets in various formats (CSV, JSON, SQLite, BigQuery) and from multiple industries and topics, such as health, automotive, arts & entertainment, biology, social science, investing, social networks, sports, and so on. You can also search for datasets by their technical focus, e.g., computer science, classification, computer vision, NLP, or data visualization.
Currently, there are 274,855 datasets available, so you won’t be lacking data.
Kaggle’s user-friendly interface and active community forums make it an excellent resource for both beginners and professionals.
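Once you’ve downloaded a dataset, how you load it depends on its format. Here’s a minimal sketch, using only the Python standard library, of reading three of the formats mentioned above (CSV, JSON, SQLite) into a list of dictionaries. The helper name `load_records` and its `table` parameter are hypothetical and exist only for illustration. The sketch also assumes a JSON file holds a top-level array of objects, which won’t be true of every dataset.

```python
import csv
import json
import sqlite3
from pathlib import Path


def load_records(path, table=None):
    """Load a downloaded dataset file into a list of dicts.

    Hypothetical helper for illustration. Supports .csv, .json
    (assumed to be a top-level array of objects) and .sqlite/.db
    (which needs a table name).
    """
    path = Path(path)
    suffix = path.suffix.lower()

    if suffix == ".csv":
        # DictReader uses the header row as keys; all values stay strings.
        with path.open(newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))

    if suffix == ".json":
        with path.open(encoding="utf-8") as f:
            data = json.load(f)
        if not isinstance(data, list):
            raise ValueError("expected a top-level JSON array of objects")
        return data

    if suffix in {".sqlite", ".db"}:
        if table is None:
            raise ValueError("a table name is required for SQLite files")
        conn = sqlite3.connect(path)
        try:
            conn.row_factory = sqlite3.Row
            # Table names can't be bound as parameters, so quote the identifier.
            quoted = '"' + table.replace('"', '""') + '"'
            rows = conn.execute(f"SELECT * FROM {quoted}").fetchall()
            return [dict(row) for row in rows]
        finally:
            conn.close()

    raise ValueError(f"unsupported file type: {suffix}")
```

For real projects you’d probably reach for pandas instead, but the sketch shows that the formats differ mainly in how you get to the rows, not in what you do with them afterwards.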
If you’re a machine learning enthusiast, the UCI Machine Learning Repository should be your go-to site. As the name says, this repository was created by the University of California, Irvine (UCI), which has gathered an extensive collection of datasets tailored for machine learning. These datasets cover a wide range of topics and are particularly useful for those wanting to practice and improve their machine learning skills.
There are currently 653 datasets; you can browse them by data type, subject area, task, number of features & instances, and feature type.
StrataScratch provides 49 datasets and projects sourced from actual companies. This is particularly beneficial for those preparing for data science interviews, as it helps users develop their technical skills and ability to derive business insights from data. This allows for a practical and industry-relevant approach to data science projects.
The projects cover various topics, such as data exploration, data engineering, business analysis, regression, classification, NLP, and clustering.
Google Dataset Search is a tool for finding datasets across the web. You already know how to use it, even if you’ve never heard of it until now. Why? Well, it looks and works like a regular Google search, only it’s focused exclusively on finding datasets. It’s extremely useful if you’re looking for data from various sources, such as academic papers and government databases.
Amazon’s AWS Public Datasets program is another site where you can find a lot of open data. With 494 datasets currently available, it’s a precious resource for data scientists. The datasets you find there can be integrated with AWS cloud services. This might be helpful if your projects require more computing resources.
The range of data available includes genomics, meteorology, and astronomy, among others.
Data.gov is a data repository sponsored by the US government and contains data from various US organizations. It includes 283,935 datasets from 132 US organizations. There’s a wide array of data, such as agriculture, public health, finance, education, demographics, economics, and environmental data.
The datasets come in almost 50 different formats, with the most popular including HTML, XML, ZIP, CSV, PDF, ArcGIS GeoServices REST API, KML, GeoJSON, JSON, and TEXT.
FiveThirtyEight by ABC News publishes the data and code behind its articles and graphics in a public repository. It’s a perfect resource for data journalists and anyone interested in statistical storytelling. If you’re interested in doing projects that involve current events, politics, sports, and more, this is your source.
It offers more than 160 datasets from 2014 until today.
The World Bank Open Data offers extensive datasets revolving around global development data. This data includes indicators on the economy, environment, and social issues from countries around the world. If you’re interested in global development and socio-economic topics, you might find a lot of interesting data here.
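Besides downloads, World Bank data can also be pulled programmatically. The following is a minimal sketch, not an official client. It assumes the public v2 JSON endpoint and its two-element response layout (a metadata object followed by a list of records), and it assumes annual data, so that each record’s `date` is a plain year. The helper names (`indicator_url`, `parse_indicator`, `fetch_indicator`) are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of the World Bank v2 API.
BASE_URL = "https://api.worldbank.org/v2"


def indicator_url(country, indicator, per_page=100):
    """Build the request URL for one country and one indicator code."""
    query = urlencode({"format": "json", "per_page": per_page})
    return f"{BASE_URL}/country/{country}/indicator/{indicator}?{query}"


def parse_indicator(payload):
    """Turn a [metadata, records] payload into {year: value}.

    Records with a missing value are skipped. Returns an empty dict
    if the payload doesn't have the assumed two-element layout.
    """
    if not isinstance(payload, list) or len(payload) < 2 or payload[1] is None:
        return {}
    return {
        int(record["date"]): record["value"]
        for record in payload[1]
        if record["value"] is not None
    }


def fetch_indicator(country, indicator):
    """Download and parse one indicator series (requires network access)."""
    with urlopen(indicator_url(country, indicator), timeout=30) as response:
        return parse_indicator(json.load(response))
```

For example, `fetch_indicator("BR", "SP.POP.TOTL")` would request the total population series for Brazil. Only the first page of results is read here; a longer series would need paging through the metadata object.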
GitHub isn’t only a platform for sharing code. It can also be used for finding datasets for data projects. Lots of organizations and individual users host their datasets on GitHub repositories. This data covers a wide range of topics, often supported by extensive documentation and code for analysis.
OpenML is an online platform for machine learning, designed for sharing, organizing, and discussing data and the results of machine learning experiments. That also means it gives you access to a lot of data: almost 5,400 datasets, to be specific. OpenML can be integrated with popular machine learning environments, which is a bonus for your data science learning.
The Datasets subreddit is a community-driven source of data. People share everything on Reddit, so, naturally, they also share and request datasets for data projects. Sometimes it’s difficult to find data there, but not because of a lack of data. On the contrary! The place brims with data, which can make the search quite chaotic. The data ranges from highly specific and unusual to more traditional datasets. As this is basically a forum, you can also participate in discussions and ask for assistance with datasets.
The statistical office of the European Union is called Eurostat, and it’s a comprehensive source of data. If you’re interested in high-quality statistical data about EU member countries, this should be your main data source. Data on EU countries includes topics such as economy, population, health, and trade.
HDX is an open platform where you can find humanitarian data. It is managed by the United Nations Office for the Coordination of Humanitarian Affairs. This platform provides data revolving around humanitarian crises and emergencies in every country in the world. You could find this useful if you’re into projects focusing on global issues, disaster response, and human welfare.
There are 20,344 active and 2,570 archived datasets with various features and formats.
On the CDC website, you can find health-related data. The datasets focus on various health conditions, risk factors, and public health. So, if these are the topics you’re interested in, you’ll find a lot of useful data here.
The BLS site has lots of data on US economic conditions, the labor market, price changes, quality of life, etc. You’ll find lots of quality datasets if you’re into those topics.
The last source of data I’ll mention is NASA. There’s lots of data on aerospace, applied science, apps, Earth science, management/operations, raw data, software, and space science.
It has more than 10,000 datasets, so don’t get lost in its universe of data!
These 16 websites will, I’m sure, give you enough data to work with until the end of time, which was precisely my goal! However, the amount of data is not everything.
I’ve chosen these sites as they will provide you with a very diverse range of datasets suitable for a variety of data science projects. The dataset specifics differ from industry to industry. So, working with various datasets also allows you to gain domain knowledge.
Whether you’re delving into machine learning, data analysis, data journalism, statistical analysis, or data visualization, you can always count on these resources.
Now, you can do your own data science project! If you need more ideas, here are some data science projects you can do as a beginner.
Nate Rosidi is a data scientist who also works in product strategy. He’s also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Connect with him on Twitter: StrataScratch or LinkedIn.