Orientation to Computing — II

Unit 1: Data Science & Big Data

From raw data to actionable dashboards: master the data science lifecycle and Big Data tools, then start earning by building real dashboards for Indian businesses.

⏱️ Time to Complete: 8–10 hours | 💰 Earning Potential: ₹5,000–₹15,000/month | 📝 30 MCQs (Bloom's Mapped)

💼 Jobs this unlocks: Data Analyst (₹4–6 LPA) | Junior Data Scientist (₹6–10 LPA) | MIS Executive (₹3–5 LPA)

Section A

Opening Hook — The Data Behind India's Digital Revolution

🏢 How Zomato Knows What You'll Order Before You Do

Every time you open Zomato, a complex data engine fires up. It analyses your past orders, time of day, weather in your city, trending restaurants nearby, and even how long you browse before ordering. This isn't magic — it's data science. Zomato processes over 2 million orders per day across 500+ Indian cities. Their recommendation engine — powered by Big Data analytics — increases order conversion by 35%.

Behind the scenes, teams of data analysts at Zomato's Gurugram HQ use tools like Hadoop, Spark, Tableau, and Python to process terabytes of data daily. They predict delivery times to within 2 minutes, optimise delivery partner routes, and decide which restaurant banner you see first.

What if YOU had built this? What if you could take raw data — messy, unstructured, massive — and turn it into insights that drive a ₹10,000 crore business? That's exactly what this chapter teaches you.

🇮🇳 Zomato | 🇮🇳 Flipkart | 🇮🇳 Reliance Jio | 🇮🇳 Paytm | 🇮🇳 Swiggy | 🇮🇳 UIDAI (Aadhaar)

India generates 20% of the world's data but has only 5% of the world's data scientists. This means massive demand and fewer competitors. A data-literate student in India has an extraordinary career advantage right now. The Indian data analytics market is expected to reach $118 billion by 2030 (NASSCOM, 2024).

Section B

Learning Outcomes — Bloom's Taxonomy Mapped

Bloom's Level	Learning Outcome
🔵 Remember	List the 7 stages of the data science lifecycle and define Volume, Velocity, and Variety
🔵 Understand	Explain how Hadoop's HDFS and MapReduce work together to process Big Data, using Indian examples
🟢 Apply	Build a data dashboard in Google Sheets using real Indian census data with charts and filters
🟢 Analyze	Compare Hadoop vs Spark vs cloud-based analytics and determine which suits different Indian business scenarios
🟠 Evaluate	Assess the data privacy and ethical challenges in India's Aadhaar system and propose safeguards
🟠 Create	Design a complete data pipeline proposal for a local Indian business, from data collection to dashboard delivery

Section C

Concept Explanation — Data Science & Big Data from Scratch

1. The Data Science Lifecycle

Think of data science like cooking a meal. You don't just throw ingredients into a pot randomly. You first decide what to cook (problem definition), go to the market (data collection), wash and chop vegetables (data cleaning), taste and adjust (analysis), cook the dish (modelling), serve it (deployment), and get feedback (monitoring). Data science follows the exact same logical flow.

📊 The 7 Stages of the Data Science Lifecycle

Stage 1 — Problem Definition: What business question are we answering? Example: "Why are Swiggy deliveries taking 45+ minutes in Pune on weekends?"

Stage 2 — Data Collection: Gathering raw data from databases, APIs, web scraping, sensors, surveys. Swiggy collects GPS data, order timestamps, restaurant prep times, traffic data.

Stage 3 — Data Cleaning (Wrangling): Real data is messy. Missing values, duplicates, wrong formats. 60–80% of a data scientist's time goes here. Think of it as removing stones from rice before cooking.

Stage 4 — Exploratory Data Analysis (EDA): Visualise the data. Find patterns. "Aha, weekend delays spike between 7–9 PM in areas with narrow lanes!" Use charts, histograms, scatter plots.

Stage 5 — Modelling: Apply algorithms. Build a prediction model: "Given restaurant location, time, weather, predict delivery time." This is where machine learning often enters.

Stage 6 — Deployment: Put the model into production. Now Swiggy's app shows you accurate ETAs in real-time.

Stage 7 — Monitoring & Maintenance: Track model accuracy. Retrain when data patterns change (festivals, new roads, COVID lockdowns).
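As a rough illustration of Stage 3, the sketch below cleans a handful of made-up delivery records in plain Python. The field names (`order_id`, `city`, `minutes`) and all values are hypothetical, chosen only to show three common fixes: standardising text, dropping duplicates, and handling missing values.

```python
# Hypothetical raw records: inconsistent city spelling, a duplicate, a missing value.
raw_orders = [
    {"order_id": 101, "city": "Pune", "minutes": 48},
    {"order_id": 102, "city": " pune ", "minutes": None},
    {"order_id": 101, "city": "Pune", "minutes": 48},   # duplicate of the first row
    {"order_id": 103, "city": "PUNE", "minutes": 52},
]

def clean_orders(rows):
    """Return rows with tidy city names, no duplicate order IDs, no missing minutes."""
    seen_ids = set()
    cleaned = []
    for row in rows:
        if row["minutes"] is None:          # drop rows with a missing delivery time
            continue
        if row["order_id"] in seen_ids:     # keep only the first copy of each order
            continue
        seen_ids.add(row["order_id"])
        tidy = dict(row, city=row["city"].strip().title())  # " pune " -> "Pune"
        cleaned.append(tidy)
    return cleaned

clean = clean_orders(raw_orders)
for row in clean:
    print(row)
```

Dropping rows with missing values is only one option. In a real project you might fill the gap with an average instead, depending on how much data you can afford to lose.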

Flipkart's data science team follows this exact lifecycle. During Big Billion Days, they predict which products will sell most in which pincode, pre-position inventory in nearby warehouses (demand forecasting), and dynamically price items — all using data science pipelines running on Apache Spark clusters.

Now YOU try it → Think of a problem at your college (e.g., "Why is the mess food wasted on Fridays?"). Write down what data you'd collect, how you'd clean it, and what chart you'd make. That's Stages 1–4 already planned!

2. The 3Vs of Big Data — Volume, Velocity, Variety

Not all data is "Big Data." Your college attendance register is just regular data. But when we talk about Aadhaar managing biometric records of 1.4 billion Indians, or UPI processing 10 billion transactions per month — that's Big Data. What makes data "big"?

V	Meaning	Indian Example	Scale
Volume	Sheer amount of data	Aadhaar biometric database	~30 petabytes (30 million GB)
Velocity	Speed of data generation	UPI transactions via PhonePe/GPay	~3,800 transactions per second
Variety	Different formats of data	Social media (text + images + video + location)	Structured + unstructured + semi-structured
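The scale figures in the table can be sanity-checked with a few lines of arithmetic. This sketch assumes decimal units (1 PB = 1,000,000 GB), a 30-day month, and transactions spread evenly over time. It uses the monthly UPI figure quoted earlier in this section.

```python
# Unit conversion in decimal (SI) units: 1 PB = 1,000 TB = 1,000,000 GB.
GB_PER_PB = 1_000_000

aadhaar_pb = 30                        # figure quoted in the table
aadhaar_gb = aadhaar_pb * GB_PER_PB    # 30,000,000 GB, i.e. 30 million GB

# UPI: about 10 billion transactions per month, spread over a 30-day month.
upi_per_month = 10_000_000_000
seconds_per_month = 30 * 24 * 60 * 60  # 2,592,000 seconds
upi_per_second = upi_per_month / seconds_per_month

print(f"Aadhaar: {aadhaar_gb:,} GB")
print(f"UPI: about {upi_per_second:,.0f} transactions per second")
```

The result is an average. Real traffic peaks well above it at busy hours, which is exactly why Velocity is a challenge.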

The 3Vs are now 5Vs in industry. Modern frameworks add Veracity (accuracy — is the data trustworthy?) and Value (does the data actually help make decisions?). In interviews, mentioning 5Vs shows you're up-to-date.

Reliance Jio generates 25+ exabytes of data annually from its 450 million subscribers. That's 25 billion GB. At roughly 3 GB per hour of HD video, that is equivalent to streaming Netflix continuously for close to a million years. They use this data for network optimisation, personalised content recommendations on JioTV, and targeted advertising on JioAds.

Analogy: Think of Big Data like the Kumbh Mela. Volume = 200 million pilgrims. Velocity = thousands arriving every minute. Variety = they come by train, bus, foot, boat — carrying different languages, needs, and demographics. Managing this crowd requires "Big Crowd" techniques, just like managing Big Data requires specialised tools.

Now YOU try it → List 3 Indian organisations that deal with Big Data and identify which V is their biggest challenge.

3. Big Data Challenges

Having lots of data isn't automatically useful. It brings serious challenges:

Challenge	What It Means	Indian Example
Storage	Where do you keep petabytes of data?	UIDAI needed specialised data centres across India for Aadhaar
Processing	Traditional computers can't handle it. Need distributed computing.	IRCTC processes 25 million ticket requests on Tatkal day — single servers crash
Security	Sensitive data must be protected from breaches	Aadhaar data leaks controversy; DPDP Act 2023
Privacy	Ethical use of personal data	Should Paytm know your spending habits and sell this to advertisers?
Quality	Garbage in, garbage out	Indian census data has inconsistencies across states due to language differences
Talent Gap	India needs 1 million+ data professionals by 2026 (NASSCOM)	Only 11% of Indian engineering graduates are trained in data skills

Students confuse "more data" with "better insights." A messy dataset of 10 million rows can give worse results than a clean dataset of 10,000 rows. Quality always trumps quantity. Always clean before you analyse.

4. Tools Deep-Dive: Hadoop & Apache Spark

Hadoop — The Foundation of Big Data Processing

Plain English: Imagine you have a 1,000-page book and need to count how many times the word "India" appears. One person reading alone takes 10 hours. But if you tear the book into 100 chunks and give each chunk to a different person, they all count simultaneously and report back — done in 6 minutes. That's Hadoop.
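The book-counting analogy can be sketched in a few lines of Python. This is a toy, single-machine illustration of the idea, not Hadoop code: the sample sentence and the function names are made up, and the "readers" run one after another here, whereas a real cluster would run them in parallel.

```python
from collections import Counter
from functools import reduce

def split_into_chunks(words, n_chunks):
    """Divide the word list into roughly equal chunks, one per 'reader'."""
    size = -(-len(words) // n_chunks)  # ceiling division
    return [words[i:i + size] for i in range(0, len(words), size)]

def map_count(chunk):
    """Map step: each reader counts the words in their own chunk."""
    return Counter(word.lower() for word in chunk)

def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right

book = "India is vast and India is diverse and India is young".split()
chunks = split_into_chunks(book, 3)
partial_counts = [map_count(chunk) for chunk in chunks]  # parallel on a real cluster
total = reduce(reduce_counts, partial_counts, Counter())

print(total["india"])

# The analogy's arithmetic: 10 hours of reading shared equally by 100 readers.
minutes_per_reader = 10 * 60 / 100
print(minutes_per_reader)
```

The 6-minute figure assumes perfect splitting with no time lost handing out chunks or collecting answers. Real clusters pay some coordination overhead.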

Technical Definition: Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It has three core components:

Component	What It Does	Analogy
HDFS (Hadoop Distributed File System)	Stores data across multiple machines with replication	Like storing copies of your notes in 3 different lockers — if one locker breaks, you don't lose anything
MapReduce	Processes data in parallel: Map (split & process) → Reduce (combine results)	Like a class test where each row of students grades one question, then the teacher combines all scores
YARN (Yet Another Resource Negotiator)	Manages cluster resources — decides which job gets how much CPU/RAM	Like a hostel warden allocating rooms — ensuring everyone gets fair space
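The "three lockers" idea behind HDFS replication can be simulated with a toy model. The sketch below is only an illustration under simplifying assumptions: the node and block names are invented, and the round-robin placement rule is a stand-in, since real HDFS uses a rack-aware placement policy.

```python
def place_blocks(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Toy placement rule for illustration; requires replication <= len(nodes).
    """
    placement = {}
    for index, block_id in enumerate(block_ids):
        placement[block_id] = [
            nodes[(index + k) % len(nodes)] for k in range(replication)
        ]
    return placement

def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one copy on a healthy node."""
    return [
        block for block, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    ]

nodes = ["node1", "node2", "node3", "node4", "node5"]
blocks = ["blk_a", "blk_b", "blk_c", "blk_d"]
placement = place_blocks(blocks, nodes)

# With 3 copies per block, any two node failures still leave one live copy.
survivors = readable_blocks(placement, failed_nodes={"node2", "node3"})
print(survivors)
```

Try failing three nodes that hold the same block (for example node1, node2, and node3) to see when data finally becomes unreadable.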

Flipkart runs one of India's largest Hadoop clusters, with over 5,000 nodes processing 50+ petabytes of data. They use it for product recommendations, search ranking, fraud detection, and supply chain optimisation. Hadoop itself works in batches rather than in real time: its jobs typically prepare ranking and recommendation data in advance, so that when you search "blue shirt" on Flipkart, the search service can match your query against millions of products in milliseconds.

Apache Spark — The Faster Alternative

Hadoop's MapReduce writes intermediate results to disk, which is slow. Spark keeps them in memory (RAM), making it up to 100× faster for some workloads, especially iterative ones that reuse the same data.

Feature	Hadoop MapReduce	Apache Spark
Speed	Slower (disk-based)	Up to 100× faster (in-memory)
Ease of Use	Complex Java code	Simple Python/Scala APIs
Real-time Processing	❌ Batch only	✅ Batch + Streaming
Best For	Massive batch jobs (overnight processing)	Real-time analytics, ML pipelines
Used By (India)	TCS, Infosys legacy systems	Swiggy, Razorpay, Dream11

In 2024 job interviews, Spark skills are more valued than Hadoop MapReduce. Most Indian companies have migrated to Spark or cloud-native solutions. Learn PySpark (Spark with Python) — it's the most in-demand Big Data skill on Naukri.com and LinkedIn India.

5. Visualisation & Analysis Tools: Tableau, R, and Excel

Tableau — See Your Data Come Alive

Plain English: Tableau is like Instagram filters for data. You drag-and-drop columns, and it instantly creates beautiful charts, maps, and dashboards. No coding required.

Technical: Tableau is a visual analytics platform that connects to databases (SQL, Excel, CSV, cloud sources) and lets you create interactive dashboards through a drag-and-drop interface.

Free version: Tableau Public — completely free, unlimited dashboards, hosted on Tableau's cloud. Perfect for students building portfolios.

R Programming — The Statistician's Weapon

R is a programming language built specifically for statistics and data analysis. Think of it as Excel on steroids — you can do everything Excel does, plus advanced statistical modelling, machine learning, and publication-quality visualisations.

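R
# Load the ggplot2 plotting library
library(ggplot2)

# Read the census data; the file is assumed to have Year, Population, and State columns
data <- read.csv("india_census_2011_2024.csv")

# Stacked bar chart: one bar per year, coloured by state
ggplot(data, aes(x = Year, y = Population, fill = State)) +
  geom_bar(stat = "identity") +
  labs(title = "India Population Growth by State")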

Excel — Don't Underestimate the Classic

Every data career starts with Excel. 90% of Indian businesses still use Excel for data analysis. Pivot tables, VLOOKUP, conditional formatting, and charts are non-negotiable skills. Even data scientists at Flipkart use Excel for quick ad-hoc analysis before firing up Python.

Excel dashboard creation is the easiest entry point to earning. Local shops, schools, clinics, and coaching centres in India need dashboards for attendance, sales, and inventory. You can build these in 2–4 hours and charge ₹2,000–₹8,000 per project on Internshala. No advanced coding needed — just Excel + Google Sheets.

6. Big Data on Cloud

Analogy: Setting up your own Hadoop cluster is like buying a diesel generator for your home. It works, but it's expensive, noisy, and you maintain it yourself. Cloud-based Big Data is like using the electricity grid — you just plug in and pay for what you use.

Cloud Service	Provider	What It Does	Free Tier?
AWS EMR	Amazon	Managed Hadoop/Spark clusters	Limited free tier
Google BigQuery	Google	Serverless data warehouse: query petabytes with SQL	✅ 1 TB of queries/month free
Azure HDInsight	Microsoft	Managed Hadoop, Spark, Kafka	Free credits for students
Databricks	Databricks	Unified analytics platform (Spark-based)	Community edition free

Paytm processes 1.5 billion+ transactions monthly using Google Cloud's BigQuery for real-time fraud detection. When you pay ₹50 for chai via Paytm, BigQuery analyses the transaction pattern against millions of historical transactions in under 200 milliseconds to flag potential fraud.

7. Job Roles & Career Paths in Data Science

Role	What They Do	Key Skills	Entry Salary (India)
Data Analyst	Clean data, create dashboards, generate reports	Excel, SQL, Tableau, basic Python	₹4–6 LPA
Data Scientist	Build predictive models, run experiments	Python, R, ML, Statistics, SQL	₹6–12 LPA
Data Engineer	Build data pipelines, manage infrastructure	SQL, Spark, Airflow, AWS/GCP, Python	₹6–10 LPA
ML Engineer	Deploy ML models to production	Python, TensorFlow, Docker, APIs	₹8–15 LPA
Business Analyst	Translate business needs to data requirements	Excel, SQL, PowerBI, Communication	₹4–7 LPA
MIS Executive	Manage information systems, daily/weekly reports	Excel, VBA, SQL basics	₹3–5 LPA

The fastest path for a BCA/B.Tech student: Start as a Data Analyst (Excel + SQL + Tableau), build a portfolio of 3–5 dashboards, then learn Python for Data Science. This path can land you a ₹4–6 LPA job within 6 months of graduation. Companies hiring: TCS, Infosys, Wipro, Mu Sigma, Fractal Analytics, Tiger Analytics.

Now YOU try it → Go to Naukri.com and search "Data Analyst fresher." Note the top 5 skills mentioned in job descriptions. You'll likely find that Excel, SQL, and Tableau appear in most of them.

Section D

Learn by Doing — 3-Tier Lab Structure

🟢 Tier 1 — GUIDED TASK: Build an India Population Dashboard in Google Sheets

⏱️ 60–90 minutes | Beginner | Zero prior knowledge assumed

Step 1: Open Google Sheets

Go to sheets.google.com → Click "+ Blank spreadsheet"

Step 2: Enter Indian Census Data

Create the following table (Column A = State, Column B = Population 2011, Column C = Population 2024 Est.):

State	Pop. 2011 (Cr)	Pop. 2024 (Cr)
Uttar Pradesh	19.98	23.5
Maharashtra	11.24	12.8
Bihar	10.41	13.1
West Bengal	9.13	10.2
Madhya Pradesh	7.27	8.7
Tamil Nadu	7.21	7.8
Rajasthan	6.86	8.2
Karnataka	6.11	7.0
Gujarat	6.04	7.1
Kerala	3.34	3.5

Step 3: Add a Growth Rate Column

In Column D, type the header "Growth %". In cell D2, enter: =(C2-B2)/B2*100. Drag the formula down for all rows.
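To check your sheet, the same calculation can be reproduced in Python. This is a small sketch that copies the figures from the table above and applies the formula from this step. The 20% threshold matches the conditional formatting rule used later in the lab.

```python
# (State, population 2011 in crore, estimated population 2024 in crore)
census = [
    ("Uttar Pradesh", 19.98, 23.5),
    ("Maharashtra", 11.24, 12.8),
    ("Bihar", 10.41, 13.1),
    ("West Bengal", 9.13, 10.2),
    ("Madhya Pradesh", 7.27, 8.7),
    ("Tamil Nadu", 7.21, 7.8),
    ("Rajasthan", 6.86, 8.2),
    ("Karnataka", 6.11, 7.0),
    ("Gujarat", 6.04, 7.1),
    ("Kerala", 3.34, 3.5),
]

def growth_percent(old, new):
    """Same formula as =(C2-B2)/B2*100 in the sheet."""
    return (new - old) / old * 100

growth = {
    state: round(growth_percent(p2011, p2024), 1)
    for state, p2011, p2024 in census
}

# Print states from fastest to slowest growth, flagging those above 20%.
for state, pct in sorted(growth.items(), key=lambda item: item[1], reverse=True):
    flag = "  <-- above 20%" if pct > 20 else ""
    print(f"{state:15s} {pct:5.1f}%{flag}")
```

If the numbers in your Column D differ from the printed values, check that the formula was dragged down correctly and that no cell was typed in as text.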

Step 4: Create a Bar Chart

Select columns A, B, and C (all rows)
Click Insert → Chart
Choose "Grouped Bar Chart"
Title it: "India Population Growth by State (2011 vs 2024)"
Under "Customize" → change the 2011 colour to blue and 2024 to orange

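Step 5: Create a Growth Rate Chart

Select columns A and D (hold Ctrl/Cmd to pick non-adjacent columns)
Insert → Chart → Bar Chart (growth rates are not parts of a whole, so a bar chart compares them better than a pie chart)
Title: "Population Growth Rate (%): Top 10 Indian States"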

Step 6: Add Filters & Conditional Formatting

Select all data → Data → Create a filter
Select Column D → Format → Conditional formatting → "Greater than 20" → Set to red background

Step 7: Name your Dashboard Tab

Right-click the sheet tab → Rename to "India Population Dashboard"

🎉 Congratulations! You've built your first data dashboard. Take a screenshot — this is your first portfolio piece.

🟡 Tier 2 — SEMI-GUIDED TASK: Tableau Public Dashboard with COVID-19 India Data

⏱️ 90–120 minutes | Intermediate | Hints provided, you fill the gaps

Your Mission:

Create an interactive COVID-19 India dashboard on Tableau Public showing state-wise case trends.

Hints:

Data Source: Download India COVID-19 data from api.covid19india.org (archived) or Kaggle (search "COVID-19 India dataset")
Tool: Download Tableau Public (free) from public.tableau.com
Connect: Open Tableau Public → Connect → Text File (CSV)
Dashboard Elements: You need:
- A line chart showing daily cases over time
- A map of India showing state-wise total cases (use the "filled map" chart type)
- A bar chart showing top 10 states by recovery rate
- A filter for date range
Publish: Save to Tableau Public → Get a shareable URL for your portfolio

Stretch Goal: Add a calculated field for "Recovery Rate %" = (Recovered / Confirmed) × 100. Which state has the best recovery rate?
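Before building the calculated field, it can help to test the formula on a few rows. The sketch below uses hypothetical placeholder counts (not real COVID-19 figures) and invented state labels. It also shows one edge case worth handling: a state with zero confirmed cases.

```python
# Hypothetical placeholder counts, NOT real COVID-19 figures.
state_counts = [
    {"state": "State A", "confirmed": 1000, "recovered": 950},
    {"state": "State B", "confirmed": 2000, "recovered": 1800},
    {"state": "State C", "confirmed": 0, "recovered": 0},   # no cases reported yet
]

def recovery_rate(confirmed, recovered):
    """Recovery Rate % = (Recovered / Confirmed) x 100, or None when there are no cases."""
    if confirmed == 0:
        return None              # avoid dividing by zero
    return recovered * 100 / confirmed

rates = {
    row["state"]: recovery_rate(row["confirmed"], row["recovered"])
    for row in state_counts
}

# Ignore states without a rate when looking for the best one.
best_state = max(
    (state for state in rates if rates[state] is not None),
    key=lambda state: rates[state],
)
print(rates)
print(best_state)
```

The same guard matters in a dashboard tool: decide up front what a state with no confirmed cases should display, rather than letting it show an error.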

🔴 Tier 3 — OPEN CHALLENGE: Data Pipeline Proposal for a Local Indian Business

⏱️ 2–3 hours | Advanced | No instructions: real-world mini-project

The Brief:

Choose a real local business (your college canteen, a neighbourhood kirana store, a local coaching centre, or a small clinic). Design a complete data pipeline proposal covering:

Problem Statement: What business question will you answer?
Data Collection Plan: What data do you need? How will you collect it? (Google Forms, manual entry, POS system)
Data Cleaning Strategy: What will be messy? How will you fix it?
Analysis Plan: What charts/metrics will you create?
Dashboard Mockup: Sketch (hand-drawn is fine) showing layout
Tools: Google Sheets/Excel for analysis, Tableau Public for dashboard
Business Impact: How will this save money or increase revenue?
Budget: Total cost (hint: it should be ₹0 if using free tools)

Deliverable: A 3–5 page Google Doc proposal. This becomes a real portfolio piece and can be your first freelance pitch.

This exact proposal format is what freelancers send to clients. Polish it well, and you can send it to local businesses on WhatsApp/email offering ₹3,000–₹8,000 dashboard services. Many students have landed their first paying client from this exercise.

Section E

Industry Spotlight — A Day in the Life

👩‍💻 Priya Sharma, 26 — Data Analyst at Flipkart, Bangalore

Background: BCA from Chandigarh University. No coding experience before college. Self-taught Excel and SQL in 2nd year. Built 5 Tableau dashboards during internship at a local CA firm. Got placed at Flipkart through campus recruitment.

A Typical Day:

9:00 AM — Morning standup with the marketplace analytics team. Review yesterday's metrics: GMV (Gross Merchandise Value), seller performance, return rates.

10:00 AM — Pull data from Flipkart's data warehouse using SQL queries. "Show me top 20 sellers in electronics category with return rate > 15% in last 30 days."

11:30 AM — Clean the data in Python (pandas). Remove duplicates, handle null values, standardise seller names.

1:00 PM — Lunch at Flipkart's cafeteria. Discuss a new A/B test design with the product team.

2:00 PM — Build a Tableau dashboard showing seller quality scores. Present to the category manager: "These 5 sellers need quality warnings."

4:00 PM — Write a data quality report in Google Sheets. Add conditional formatting for KPIs.

5:30 PM — Learn PySpark on Databricks (company-sponsored learning hour). Working towards Data Scientist role.
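The 10:00 AM request can be sketched as an actual SQL query. This example uses Python's built-in `sqlite3` module as a stand-in for a real data warehouse. The table name `seller_orders`, its columns, and the sample rows are all hypothetical, and the table is assumed to already contain only the last 30 days of data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # temporary in-memory database
conn.execute("""
    CREATE TABLE seller_orders (
        seller TEXT,
        category TEXT,
        total_orders INTEGER,
        returned_orders INTEGER
    )
""")
conn.executemany(
    "INSERT INTO seller_orders VALUES (?, ?, ?, ?)",
    [
        ("Seller A", "electronics", 200, 40),   # 20% return rate
        ("Seller B", "electronics", 500, 50),   # 10% return rate
        ("Seller C", "electronics", 100, 30),   # 30% return rate
        ("Seller D", "fashion", 300, 90),       # different category
    ],
)

# Top sellers (by order count) in electronics with a return rate above 15%.
query = """
    SELECT seller,
           ROUND(100.0 * returned_orders / total_orders, 1) AS return_rate_pct
    FROM seller_orders
    WHERE category = 'electronics'
      AND 100.0 * returned_orders / total_orders > 15
    ORDER BY total_orders DESC
    LIMIT 20
"""
flagged = conn.execute(query).fetchall()
print(flagged)
conn.close()
```

Multiplying by `100.0` rather than `100` forces decimal division. With whole numbers only, SQLite would round every rate down to an integer.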

Detail	Info
Tools Used Daily	SQL (BigQuery), Python (pandas, matplotlib), Tableau, Excel, Jira
Entry Salary (2024)	₹4–6 LPA + benefits
Mid-Level (3–5 yrs)	₹8–15 LPA
Senior (7+ yrs)	₹18–35 LPA
Companies Hiring	Flipkart, Swiggy, Zomato, Paytm, Meesho, TCS, Infosys, Mu Sigma, Fractal Analytics, Tiger Analytics, Latent View

Section F

Earn With It — Freelance & Income Roadmap

💰 Your Earning Path After This Chapter

Portfolio Piece: "India Population Trend Dashboard 2011–2024" — a polished Google Sheets/Tableau dashboard with charts, filters, and growth analysis.

Beginner Gig Ideas:

• Excel/Google Sheets dashboard for local coaching centres (attendance + fee tracking) — ₹2,000–₹5,000

• Sales data dashboard for small retailers — ₹3,000–₹8,000

• Survey data analysis for NGOs/student projects — ₹1,500–₹4,000

• Monthly expense tracker dashboard for small businesses — ₹2,000–₹6,000

Platform	Best For	Typical Rate
Internshala	Indian student internships & freelance projects	₹2,000–₹8,000/project
Fiverr	Global clients, quick dashboard gigs	$10–$50/gig (₹800–₹4,000)
Upwork	Longer projects, higher rates	$15–$40/hour
LinkedIn	Direct outreach to Indian businesses	₹3,000–₹10,000/project
WhatsApp/Local	Nearby shops, schools, clinics	₹2,000–₹8,000/project

⏱️ Time to First Earning: 2–3 weeks (if you complete Tier 1 lab and reach out to 10 local businesses)

Start local, go global. Your first client won't come from Fiverr — it'll be a coaching centre owner you know, or your parent's friend who runs a shop. Build 2–3 free dashboards for people you know. Take screenshots. Then create your Fiverr/Internshala gig with real portfolio samples.

Section G

MCQ Assessment Bank — 30 Questions (Bloom's Mapped)

Remember / Identify (Q1–Q5)

Q1

Which of the following is NOT one of the 3Vs of Big Data?

Volume
Velocity
Validity
Variety

Remember

✅ Answer: (C) Validity — The original 3Vs are Volume, Velocity, and Variety. Validity is not part of the standard 3V framework (Veracity and Value are sometimes added as the 4th and 5th V).

Q2

HDFS stands for:

Hadoop Data File Storage
Hadoop Distributed File System
High Data Flow System
Hadoop Dynamic File Structure

Remember

✅ Answer: (B) — HDFS = Hadoop Distributed File System. It distributes and replicates data across multiple nodes in a cluster.

Q3

Which stage of the data science lifecycle involves removing duplicates and handling missing values?

Data Collection
Modelling
Data Cleaning
Deployment

Remember

✅ Answer: (C) Data Cleaning — Also called data wrangling, this stage handles messy, incomplete, and duplicate data. It takes 60–80% of a data scientist's time.

Q4

Tableau is primarily used for:

Database management
Data visualisation and dashboards
Machine learning model training
Web development

Remember

✅ Answer: (B) — Tableau is a visual analytics platform for creating interactive dashboards and reports through drag-and-drop.

Q5

YARN in Hadoop stands for:

Yet Another Resource Negotiator
Yearly Analysis Resource Node
YAML-based Application Resource Network
Yield Adjusted Resource Navigator

Remember

✅ Answer: (A) — YARN = Yet Another Resource Negotiator. It manages and schedules resources across the Hadoop cluster.

Understand / Explain (Q6–Q10)

Q6

Why is the data cleaning stage considered the most time-consuming in the data science lifecycle?

It requires expensive software
Real-world data has inconsistencies, missing values, and duplicates that must be fixed before analysis
It needs approval from management
It involves writing machine learning algorithms

Understand

✅ Answer: (B) — Real-world data is messy. Missing entries, typos, inconsistent formats (e.g., "Mumbai" vs "Bombay" vs "mumbai"), and duplicates must all be resolved before any meaningful analysis.

Q7

How does HDFS ensure data safety when a node in the cluster fails?

It compresses all data before storage
It stores encrypted backups on the cloud
It replicates each data block across multiple nodes (default: 3 copies)
It uses RAID arrays on each machine

Understand

✅ Answer: (C) — HDFS replicates each block across 3 nodes by default. If one node fails, the data is still available from the other copies. This is called fault tolerance.

Q8

Which V of Big Data does Aadhaar's 1.4 billion biometric records best represent?

Velocity
Variety
Volume
Veracity

Understand

✅ Answer: (C) Volume — Aadhaar stores ~30 petabytes of fingerprints, iris scans, and demographic data for 1.4 billion people — a massive volume challenge.

Q9

What is the key advantage of Apache Spark over Hadoop MapReduce?

Spark is free while Hadoop is paid
Spark processes data in-memory, making it up to 100× faster
Spark can only process structured data
Spark doesn't need a cluster

Understand

✅ Answer: (B) — Spark keeps intermediate data in RAM (memory) instead of writing to disk like MapReduce, making iterative algorithms and real-time analytics dramatically faster.

Q10

Explain why Google BigQuery is described as "serverless":

It runs without electricity
Users don't manage any server infrastructure — Google handles everything
It can only be used offline
It doesn't store any data

Understand

✅ Answer: (B) — "Serverless" means you don't provision, manage, or maintain servers. You just write SQL queries and Google's infrastructure handles execution, scaling, and storage automatically.

Apply / Demonstrate (Q11–Q15)

Q11

A Pune coaching centre wants to track student attendance, fee payments, and test scores. Which tool would you recommend for a non-technical owner?

Apache Hadoop
Google Sheets with Pivot Tables
TensorFlow
MongoDB

Apply

✅ Answer: (B) — Google Sheets is free, cloud-based, requires no installation, and supports pivot tables, charts, and conditional formatting — perfect for a non-technical user's dashboard needs.

Q12

You're building a sales dashboard and the "City" column has entries like "Delhi", "delhi", "New Delhi", "DELHI". Which data science lifecycle stage addresses this?

Deployment
Modelling
Data Cleaning
Problem Definition

Apply

✅ Answer: (C) Data Cleaning — Standardising inconsistent text entries (case normalisation, removing duplicates) is a classic data cleaning task.

Q13

To calculate growth rate percentage in Google Sheets where B2=2011 population and C2=2024 population, the correct formula is:

=(B2-C2)/C2*100
=(C2-B2)/B2*100
=C2/B2
=(C2+B2)/2*100

Apply

✅ Answer: (B) — Growth % = (New - Old) / Old × 100. So (C2-B2)/B2*100 gives the percentage increase from 2011 to 2024.

Q14

Swiggy wants to identify which restaurants have the highest cancellation rates in Mumbai. Which visualisation type is most appropriate?

Pie chart of all restaurants
Horizontal bar chart — top 20 restaurants sorted by cancellation rate
Scatter plot of revenue vs profit
Line chart of daily temperatures

Apply

✅ Answer: (B) — A sorted horizontal bar chart makes it easy to compare cancellation rates across restaurants and quickly identify the worst performers.

Q15

A freelance client asks you to create a dashboard showing monthly expenses across 5 categories. You have data in CSV format. Describe the correct workflow:

Build an ML model → Deploy on AWS → Send link
Import CSV into Google Sheets → Clean data → Create pivot table → Build charts → Share link
Print the CSV and highlight numbers manually
Upload to GitHub and write Python code

Apply

✅ Answer: (B) — For a simple client dashboard, Google Sheets is the fastest and most client-friendly approach. Import, clean, pivot, chart, share — done in 2 hours.

Analyze / Compare (Q16–Q20)

Q16

Compare Hadoop MapReduce and Apache Spark. In which scenario would Hadoop MapReduce still be preferred over Spark?

Real-time fraud detection at Razorpay
Processing overnight batch jobs on a low-RAM cluster with petabytes of log data
Building a recommendation engine for Swiggy
Interactive data exploration in Jupyter notebooks

Analyze

✅ Answer: (B) — MapReduce writes to disk, so it works well on clusters with limited RAM. For massive overnight batch processing where speed isn't critical, MapReduce's disk-based approach is more cost-effective.

Q17

Differentiate between structured, semi-structured, and unstructured data with Indian examples:

All data is structured if stored in a computer
Structured = Excel/SQL tables (Aadhaar records); Semi-structured = JSON/XML (Swiggy API responses); Unstructured = images, videos, tweets (Instagram posts)
Unstructured data cannot be analysed
Semi-structured data is always stored in Hadoop

Analyze

✅ Answer: (B) — Structured data has a fixed schema (rows/columns), semi-structured has tags/keys but flexible format, and unstructured has no predefined format.

Q18

Examine why UPI transaction data represents the "Velocity" V of Big Data rather than "Volume":

UPI doesn't generate much data
UPI data is generated at extreme speed (3,800+ transactions per second), requiring real-time processing rather than just large storage
UPI only processes text data
UPI uses Hadoop for storage

Analyze

✅ Answer: (B) — While UPI generates large volumes too, its defining characteristic is speed — thousands of transactions per second that must be validated, processed, and settled in near real-time.

Q19

Compare Excel and Tableau for a freelance data dashboard project. When would you choose Excel over Tableau?

When the client needs interactive web-based dashboards
When the client already uses Excel daily and needs a simple solution they can maintain themselves
When the dataset exceeds 50 million rows
When geographic map visualisations are required

Analyze

✅ Answer: (B) — If a client (say a local shop owner) already knows Excel, building a dashboard there means they can update and maintain it without your help. Tableau requires a separate tool they may not be familiar with.

Q20

Analyse the difference between a Data Analyst and a Data Scientist role at an Indian IT company:

They are the same role with different titles
Data Analysts focus on descriptive analytics (what happened), while Data Scientists build predictive models (what will happen)
Data Scientists only use Excel
Data Analysts earn more than Data Scientists

Analyze

✅ Answer: (B) — Analysts answer "what happened?" using SQL, dashboards, and reports. Scientists answer "what will happen?" using ML models, statistical modelling, and experimentation.

Evaluate / Judge (Q21–Q25)

Q21

UIDAI claims Aadhaar data is completely secure. A journalist discovers that Aadhaar numbers of 100 million Indians were leaked through an unsecured government API. Evaluate the failure:

The data was too big to secure
Hadoop was the wrong technology
The failure was in API security and access control, not in the database technology itself
Aadhaar should not collect data at all

Evaluate

✅ Answer: (C) — The technology (database) was not the issue. The vulnerability was in how APIs exposed data without proper authentication and rate limiting. Security is a design and governance problem, not just a technology problem.

Q22

A startup CEO says: "We don't need data cleaning. Our ML model will learn from noisy data." Judge this statement:

Correct — modern ML handles everything automatically
Incorrect — "Garbage in, garbage out" applies. Noisy data leads to inaccurate models with poor predictions
Partially correct — only deep learning needs clean data
Correct if they use Hadoop

Evaluate

✅ Answer: (B) — This is a dangerous misconception. Even the best ML algorithm will produce unreliable results if trained on dirty data. Data cleaning is non-negotiable.

Q23

Paytm uses customer transaction data to show personalised loan offers. Evaluate the ethical implications:

No ethical issues — it's just marketing
Potential concerns: user consent, data privacy under DPDP Act, targeting vulnerable users with high-interest loans, algorithmic bias
Ethical only if Paytm earns profit
Unethical because Paytm is a private company

Evaluate

✅ Answer: (B) — Using financial data for targeted marketing raises consent, privacy, and fairness concerns. The DPDP Act 2023 requires explicit consent. Predatory lending targeting low-income users is an ethical red flag.

Q24

An Indian company has 500 GB of sales data and is considering building an on-premise Hadoop cluster vs using Google BigQuery. Justify the better choice:

Hadoop — because it's open-source and free
BigQuery — because 500 GB is small, BigQuery's free tier covers it, no infrastructure management needed, and results come in seconds
Neither — Excel can handle 500 GB
Hadoop — because it's faster than cloud

Evaluate

✅ Answer: (B) — 500 GB is too small to justify the cost/complexity of a Hadoop cluster (designed for petabytes). BigQuery processes 500 GB in seconds, costs almost nothing, and requires zero infrastructure setup.

Q25

Critique the claim: "India doesn't need data scientists because AI will automate all analysis."

Valid — AI replaces all human analysis
Invalid — AI automates routine tasks but humans are needed for problem framing, ethical judgement, domain expertise, and communicating insights to business stakeholders
Valid for small companies only
Invalid because AI doesn't work in India

Evaluate

✅ Answer: (B) — AI tools like ChatGPT can write SQL or generate charts, but they cannot define the right business problem, ensure ethical data use, or present findings convincingly to a non-technical CEO. The human role evolves, not disappears.

Create / Design (Q26–Q30)

Q26

Design a data collection strategy for a college canteen that wants to reduce food waste. Which approach is most comprehensive?

Ask students to fill a Google Form once a year
Track daily menu items, quantity prepared, quantity consumed, leftover weight, day of week, weather, and special events — using a simple Google Sheet updated by canteen staff
Install an AI camera system costing ₹10 lakh
Count the number of students entering the canteen

Create

✅ Answer: (B) — This captures the right variables (what was made, what was wasted, contextual factors) using a free tool. It's practical, low-cost, and gives actionable insights when analysed with pivot tables and charts.
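
As a rough sketch of how the sheet described in option (B) could be analysed once exported, the Python below computes the share of food wasted per menu item. The field names (`prepared_kg`, `consumed_kg`) and every number are made-up assumptions for illustration, not real canteen data.

```python
from collections import defaultdict

# Hypothetical rows, as they might be exported from the canteen's Google Sheet.
# All field names and numbers are illustrative assumptions.
records = [
    {"day": "Mon", "item": "Rajma Chawal", "prepared_kg": 20.0, "consumed_kg": 17.0},
    {"day": "Mon", "item": "Poha", "prepared_kg": 10.0, "consumed_kg": 6.0},
    {"day": "Tue", "item": "Rajma Chawal", "prepared_kg": 20.0, "consumed_kg": 19.0},
    {"day": "Tue", "item": "Poha", "prepared_kg": 10.0, "consumed_kg": 5.0},
]


def waste_percent_by_item(rows):
    """Return {item: leftover as a percentage of the quantity prepared}."""
    prepared = defaultdict(float)
    leftover = defaultdict(float)
    for row in rows:
        prepared[row["item"]] += row["prepared_kg"]
        leftover[row["item"]] += row["prepared_kg"] - row["consumed_kg"]
    return {item: round(100 * leftover[item] / prepared[item], 1) for item in prepared}


waste = waste_percent_by_item(records)

# List the most wasteful items first so the canteen knows where to cut quantity.
for item, pct in sorted(waste.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{item}: {pct}% wasted")
```

The same calculation can be done with a pivot table in Google Sheets. The code only shows what the pivot table is doing underneath.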

Q27

Propose a dashboard design for IRCTC to display real-time train booking statistics. What elements should it include?

Just a table with numbers
Live counter of bookings/minute, heat map of top routes, bar chart of class-wise bookings (1AC/2AC/SL), alerts for high-demand trains, and a trend line of daily bookings
A pie chart of all 13,000 trains
A single number showing total bookings today

Create

✅ Answer: (B) — A comprehensive dashboard combines real-time metrics (counters), geographic context (heat maps), categorical breakdowns (bar charts), anomaly detection (alerts), and temporal trends (line charts).
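
Each dashboard element in option (B) is backed by a simple aggregation. The sketch below, in plain Python, shows three of them: bookings per minute, class-wise counts, and top routes. The booking events, station codes, and function names are invented for illustration and do not reflect IRCTC's actual data or systems.

```python
from collections import Counter
from datetime import datetime

# Invented booking events: (timestamp, route, travel class).
bookings = [
    ("2024-03-01 10:00:05", "NDLS-BCT", "SL"),
    ("2024-03-01 10:00:40", "NDLS-HWH", "2AC"),
    ("2024-03-01 10:01:10", "NDLS-BCT", "SL"),
    ("2024-03-01 10:01:15", "SBC-MAS", "1AC"),
    ("2024-03-01 10:01:50", "NDLS-BCT", "2AC"),
]


def bookings_per_minute(events):
    """Counter widget: number of bookings in each clock minute."""
    counts = Counter()
    for stamp, _route, _travel_class in events:
        minute = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").strftime("%H:%M")
        counts[minute] += 1
    return counts


def class_wise(events):
    """Bar chart data: bookings per travel class."""
    return Counter(travel_class for _stamp, _route, travel_class in events)


def top_routes(events, n):
    """Heat map data: the n busiest routes with their booking counts."""
    return Counter(route for _stamp, route, _travel_class in events).most_common(n)


print(dict(bookings_per_minute(bookings)))
print(dict(class_wise(bookings)))
print(top_routes(bookings, 1))
```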

Q28

You're creating a Fiverr gig for "Excel Dashboard Creation for Indian Small Businesses." Which gig description is most likely to attract clients?

"I know Excel. Contact me."
"I will create a professional, interactive Excel dashboard with charts, pivot tables, and conditional formatting for your sales, inventory, or financial data. Delivery in 48 hours. Includes 1 revision. Starting at ₹2,000."
"Expert in Hadoop and Spark. Can build anything."
"Free dashboards for everyone!"

Create

✅ Answer: (B) — A good gig description is specific (what you'll do), includes deliverables (charts, pivot tables), sets expectations (48 hours, 1 revision), and has a clear price. This is professional and client-ready.

Q29

Build a mini data pipeline for a neighbourhood kirana store. Which sequence is correct?

Deploy ML model → Collect data → Build dashboard
Define problem (which products expire unsold?) → Collect daily sales data via Google Sheets → Clean & organise → Analyse with pivot tables → Create dashboard showing slow-moving products → Present to owner with recommendations
Buy Hadoop servers → Hire 5 data scientists → Start analysis
Build a mobile app → Collect biometric data → Use AI

Create

✅ Answer: (B) — This follows the data science lifecycle in the right order (Problem → Collect → Clean → Analyse → Visualise → Act), a simplified form of the seven-stage lifecycle, using appropriate tools for a small business (Google Sheets, not Hadoop).
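
The sequence in option (B) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the product entries, the `qty` field, and the slow-mover threshold of 5 units are all hypothetical, and a real store would do the same work in Google Sheets.

```python
from collections import Counter

# Hypothetical raw entries typed into a sheet by the shop owner.
# Product names, quantities, and the threshold used below are assumptions.
raw_sales = [
    {"product": " Parle-G ", "qty": "12"},
    {"product": "parle-g", "qty": "8"},
    {"product": "Maggi", "qty": "15"},
    {"product": "Oats", "qty": "1"},
    {"product": "Oats", "qty": ""},  # missing quantity: dropped during cleaning
    {"product": "MAGGI", "qty": "10"},
]


def clean(rows):
    """Clean step: normalise product names and drop rows with unusable quantities."""
    cleaned = []
    for row in rows:
        name = row["product"].strip().title()
        qty = row["qty"].strip()
        if name and qty.isdigit():
            cleaned.append({"product": name, "qty": int(qty)})
    return cleaned


def units_sold(rows):
    """Analyse step: total units per product, as a pivot table would show."""
    totals = Counter()
    for row in rows:
        totals[row["product"]] += row["qty"]
    return totals


def slow_movers(totals, threshold):
    """Visualise/Act step: products that sold fewer than `threshold` units."""
    return sorted(p for p, units in totals.items() if units < threshold)


totals = units_sold(clean(raw_sales))
flagged = slow_movers(totals, threshold=5)
print(dict(totals))
print("Slow-moving:", flagged)
```

Note how the cleaning step merges " Parle-G " and "parle-g" into one product. Without it, the analysis would count them as two separate items.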

Q30

Design a data-driven solution for Zomato to identify restaurants that are likely to shut down within 6 months. What data points would you propose collecting?

Only the restaurant name
Monthly order volume trend, average rating trend, customer complaint frequency, delivery time reliability, menu price changes, competitor density in the area, owner response rate to reviews
Just the current star rating
GPS coordinates only

Create

✅ Answer: (B) — Predicting restaurant closure requires multiple signals: declining orders (revenue trend), dropping ratings (quality trend), increasing complaints (service issues), and contextual factors (competition, pricing). This is the kind of prediction problem that data science teams at food-delivery platforms work on.
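
To show how "trend" signals like these might be turned into something measurable, here is a small rule-based sketch in Python. It is not Zomato's method. The field names, the six-month histories, and the simple "slope below or above zero" rules are assumptions chosen for illustration, and a real model would weigh many more factors.

```python
def slope(values):
    """Least-squares slope of values plotted against their index (0, 1, 2, ...)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def risk_flags(restaurant):
    """Return the warning signals raised by one restaurant's monthly history."""
    flags = []
    if slope(restaurant["monthly_orders"]) < 0:
        flags.append("orders falling")
    if slope(restaurant["monthly_rating"]) < 0:
        flags.append("rating falling")
    if slope(restaurant["monthly_complaints"]) > 0:
        flags.append("complaints rising")
    return flags


# Made-up six-month histories for two hypothetical restaurants.
healthy = {
    "monthly_orders": [400, 420, 410, 450, 470, 480],
    "monthly_rating": [4.1, 4.1, 4.2, 4.2, 4.3, 4.3],
    "monthly_complaints": [5, 4, 5, 3, 3, 2],
}
struggling = {
    "monthly_orders": [400, 380, 330, 300, 240, 200],
    "monthly_rating": [4.0, 3.9, 3.9, 3.7, 3.5, 3.4],
    "monthly_complaints": [4, 6, 7, 9, 12, 15],
}

print("healthy:", risk_flags(healthy))
print("struggling:", risk_flags(struggling))
```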

Section H

Short Answer Questions (2–3 marks each)

Question 1 (2 marks)

Define the 3Vs of Big Data and give one Indian example for each.

Question 2 (3 marks)

Explain the difference between HDFS and MapReduce in Hadoop. How do they work together to process large datasets?

Question 3 (2 marks)

Why is data cleaning considered the most time-consuming step in the data science lifecycle? Give two real-world examples of dirty data.

Question 4 (3 marks)

Compare Apache Spark and Hadoop MapReduce on three parameters: speed, ease of use, and real-time processing capability.

Question 5 (2 marks)

List any four job roles in Data Science and mention one key skill required for each role.

Section I

Long Answer & Case Studies (10 marks each)

📋 Case Study 1: Aadhaar — The World's Largest Biometric Database (10 marks)

India's Aadhaar system, managed by UIDAI, stores biometric data (fingerprints, iris scans) and demographic information for over 1.4 billion residents. It processes over 100 million authentication requests daily for services like bank account verification, SIM card activation, and government subsidy disbursement (DBT).

Answer the following:

(2 marks) Identify which V of Big Data is the most significant challenge for Aadhaar. Justify your answer with data.
(3 marks) Describe the data pipeline that runs when a citizen uses Aadhaar for eKYC at a bank. Cover data flow from the fingerprint scanner to the authentication response.
(2 marks) Discuss two data privacy concerns with Aadhaar and how the DPDP Act 2023 addresses them.
(3 marks) Propose a data analytics dashboard for UIDAI that shows: authentication success rates by state, peak usage hours, and failure categories. Sketch the dashboard layout and specify chart types.

📋 Case Study 2: UPI/NPCI — Real-Time Transaction Analytics at 10 Billion Scale (10 marks)

India's Unified Payments Interface (UPI), managed by NPCI, processed 13.44 billion transactions worth ₹19.78 lakh crore in a single month (March 2024). Apps like PhonePe, Google Pay, and Paytm together handle ~3,800 transactions per second during peak hours.

Answer the following:

(2 marks) Explain why UPI's primary Big Data challenge is Velocity rather than Volume. How does this affect technology choices?
(3 marks) Design a real-time fraud detection data pipeline for UPI. Specify: what data points you'd collect per transaction, what analytics you'd run, and what action the system should take when fraud is detected.
(2 marks) Compare how NPCI could use Hadoop vs Spark for transaction analytics. Which is more appropriate and why?
(3 marks) Create a data-driven strategy for NPCI to predict UPI downtime before it happens. What historical data would you analyse? What patterns would indicate impending failure?

Section J

Chapter Summary — Tweet-Sized Bullet Points

📝 Key Takeaways

📊 Data Science lifecycle: Problem → Collect → Clean → Analyse → Model → Deploy → Monitor. Cleaning is often estimated to take 60-80% of project time.
📦 Big Data = 3Vs: Volume (Aadhaar = 30PB), Velocity (UPI = 3,800 txn/sec), Variety (text + images + video).
🐘 Hadoop = distributed storage (HDFS) + parallel processing (MapReduce) + resource management (YARN).
⚡ Spark can be up to 100× faster than MapReduce on some workloads because it keeps intermediate data in memory instead of writing it to disk between steps.
📈 Tableau = drag-and-drop visual analytics. Free version (Tableau Public) is perfect for portfolios.
☁️ Cloud Big Data (BigQuery, AWS EMR) eliminates infrastructure hassle. BigQuery's free tier includes 1 TB of query processing per month.
💼 Fastest career path: Excel + SQL + Tableau → Data Analyst at ₹4-6 LPA. No coding required initially.
💰 Start earning NOW: Build Excel dashboards for local businesses. First gig: ₹2,000-₹8,000 on Internshala.
🇮🇳 India needs 1 million+ data professionals by 2026 — massive opportunity for BCA/B.Tech students.
🏗️ Portfolio piece from this chapter: "India Population Trend Dashboard 2011-2024" on Google Sheets.
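
To make the MapReduce idea from the Hadoop takeaway concrete, here is a single-machine sketch in plain Python of the map, shuffle, and reduce phases. It only mimics the pattern: it does not use Hadoop or HDFS, and the order records and function names are invented for illustration.

```python
from collections import defaultdict

# Invented sample records: (city, order value in rupees).
orders = [
    ("Delhi", 250),
    ("Mumbai", 400),
    ("Delhi", 150),
    ("Pune", 300),
    ("Mumbai", 100),
]


def map_phase(records):
    """Map: emit one (key, value) pair per input record."""
    for city, amount in records:
        yield city, amount


def shuffle_phase(pairs):
    """Shuffle: group together all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's values into one result (here, a sum)."""
    return {key: sum(values) for key, values in groups.items()}


revenue_by_city = reduce_phase(shuffle_phase(map_phase(orders)))
print(revenue_by_city)
```

On a real cluster, the map and reduce functions run in parallel on many machines, with the input split into blocks stored on HDFS. The three-phase shape of the computation stays the same.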

Section K

My Earning Checkpoint — Self-Assessment

Skill Learned	Tool Practised	Portfolio Item Added	Gig Ready?
Data Science Lifecycle	Google Sheets	India Population Dashboard	✅ Yes — can explain lifecycle to clients
3Vs of Big Data	Conceptual	—	✅ Yes — can discuss in interviews
Data Cleaning	Google Sheets / Excel	Cleaned Census Dataset	✅ Yes — cleaning is a billable skill
Dashboard Creation	Google Sheets, Tableau Public	Population Dashboard + COVID Dashboard	✅ Yes — ₹2,000–₹8,000/project
Data Visualisation	Charts, Pivot Tables	Bar charts, Pie charts, Filters	✅ Yes — ready for Internshala gigs
Hadoop/Spark Concepts	Conceptual (no hands-on yet)	—	⬜ Not yet — need PySpark practice
Data Pipeline Design	Google Docs (proposal)	Business Data Pipeline Proposal	✅ Yes — can pitch to local businesses

Minimum Viable Earning Setup after this chapter: a Google Sheets/Tableau portfolio with 2 dashboards, plus an Internshala/Fiverr profile with a clear gig description. With both in place, you can start pitching for dashboard gigs worth ₹5,000–₹15,000/month while still in college.

✅ Unit 1 complete. Ready for Unit 2: AI & Machine Learning!

[QR: Link to EduArtha video tutorial — Data Science & Big Data]