Exploratory Data Analysis: How to Read a Dataset


Originally published on DEV Community.

Akhilesh · Posted on Apr 28 · #programming #ai #productivity #python

Loading data is not the start of understanding it. You have done the loading. You have done the cleaning. You have merged tables, filtered rows, and built charts one at a time. But those were isolated skills. EDA is what happens when you use all of them together with a specific purpose.

You are not just making charts. You are interrogating a dataset: asking questions, finding answers, forming new questions from those answers. Real data science looks like this: you load a dataset you know nothing about, and forty minutes later you know its shape, its problems, its patterns, and you have three specific hypotheses to test with a model. This post walks through that entire process on one real dataset, start to finish.

The Dataset

We will use the Housing dataset, available on Kaggle (search "House Prices Advanced Regression Kaggle"). It has 79 features and 1460 rows describing residential homes in Ames, Iowa. The target is the sale price.
If you cannot download it right now, use this simplified version to follow along:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

np.random.seed(42)
n = 500
df = pd.DataFrame({
    "SalePrice": np.random.normal(180000, 50000, n).clip(50000, 500000),
    "GrLivArea": np.random.normal(1500, 400, n).clip(500, 4000),
    "YearBuilt": np.random.randint(1900, 2010, n),
    "OverallQual": np.random.randint(1, 11, n),
    "GarageCars": np.random.choice([0, 1, 2, 3], n, p=[0.1, 0.2, 0.5, 0.2]),
    "TotalBsmtSF": np.random.normal(1000, 300, n).clip(0, 3000),
    "Neighborhood": np.random.choice(["A", "B", "C", "D", "E"], n),
    "HouseStyle": np.random.choice(["1Story", "2Story", "1.5Fin"], n),
    "MasVnrArea": np.random.exponential(100, n).clip(0, 1600),
})

# Inject some realistic problems: missing values and outliers
df.loc[np.random.choice(n, 30, replace=False), "TotalBsmtSF"] = np.nan
df.loc[np.random.choice(n, 10, replace=False), "GrLivArea"] = np.random.uniform(8000, 15000, 10)
```

Phase 1: First Contact

The very first thing. No charts yet. Just numbers and structure.

```python
print("=" * 60)
print("DATASET OVERVIEW")
print("=" * 60)
print(f"Shape: {df.shape}")
print(f"Memory: {df.memory_usage(deep=True).sum() / 1024:.1f} KB\n")

print("Column Types:")
print(df.dtypes.value_counts())

print("\nFirst 5 rows:")
print(df.head())

print("\nBasic Statistics:")
print(df.describe().round(2))
```

What you are looking for at this stage:

- How many rows and columns. A dataset with 500 rows and 79 features is very wide relative to its length. That matters for modeling.
- Dtypes. Are numerical columns actually numerical? Are categorical columns stored as strings or codes?
- The min and max values in describe(). An age of -5 or a salary of 10 billion tells you something went wrong. Spot it here.
- The difference between mean and median (50%). Large differences signal skewness or outliers.
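That last check can be automated. The helper below is my own sketch, not from the article: it flags numeric columns whose mean and median diverge by more than a chosen fraction of the median, which is a cheap first signal of skew or outliers. The name `flag_skewed` and the 15% threshold are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def flag_skewed(df, threshold=0.15):
    """Flag numeric columns where mean and median diverge by more than
    `threshold` (expressed as a fraction of the median)."""
    flagged = {}
    for col in df.select_dtypes(include=np.number).columns:
        mean, median = df[col].mean(), df[col].median()
        if median != 0 and abs(mean - median) / abs(median) > threshold:
            # Store the relative gap so you can rank columns by severity
            flagged[col] = round((mean - median) / median, 3)
    return flagged

# One extreme value pulls the mean far above the median:
demo = pd.DataFrame({"price": [100, 110, 120, 130, 5000]})
print(flag_skewed(demo))  # mean 1092 vs median 120 -> flagged
```

A positive gap means right skew (mean above median), which for prices usually points at a long tail of expensive outliers, exactly what the GrLivArea injection above simulates.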
Phase 2: Missing Values Map

```python
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(1)
missing_df = pd.DataFrame({
    "missing_count": missing,
    "missing_pct": …
```

(The excerpt is cut off at this point.)
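The snippet above is truncated mid-expression. As a hedged reconstruction of where such a missing-values map typically goes (my sketch on a self-contained toy frame, not the author's code), the table can be completed with the `missing_pct` column, sorted, and filtered to the columns that actually have gaps:

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the article's setup: one column with injected gaps
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "TotalBsmtSF": rng.normal(1000, 300, 100),
    "GrLivArea": rng.normal(1500, 400, 100),
})
toy.loc[toy.sample(n=10, random_state=0).index, "TotalBsmtSF"] = np.nan

missing = toy.isnull().sum()
missing_pct = (missing / len(toy) * 100).round(1)
missing_df = pd.DataFrame({
    "missing_count": missing,
    "missing_pct": missing_pct,
}).sort_values("missing_pct", ascending=False)

# Show only columns that actually have missing values
print(missing_df[missing_df["missing_count"] > 0])
```

The percentage matters more than the raw count: a column missing 1% of values can usually be imputed, while one missing 80% is often better dropped.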
