Python Interview Questions
Question 25
How do you use NumPy and Pandas for data manipulation?
Answer:
NumPy and Pandas are powerful libraries for data manipulation in Python. Here's a guide on how to use these libraries effectively:
NumPy for Data Manipulation
NumPy is a fundamental library for numerical computing in Python. It provides support for arrays, matrices, and many mathematical functions.
Creating Arrays
import numpy as np
# Creating a 1D array
array_1d = np.array([1, 2, 3, 4, 5])
# Creating a 2D array
array_2d = np.array([[1, 2, 3], [4, 5, 6]])
# Creating arrays with specific values
zeros_array = np.zeros((3, 3)) # 3x3 array of zeros
ones_array = np.ones((2, 4)) # 2x4 array of ones
range_array = np.arange(10) # Array of values from 0 to 9
linspace_array = np.linspace(0, 1, 5) # 5 values evenly spaced between 0 and 1
Array Operations
# Element-wise operations
array = np.array([1, 2, 3, 4])
print(array + 2) # Output: [3 4 5 6]
print(array * 2) # Output: [2 4 6 8]
# Mathematical functions
print(np.sqrt(array)) # Output: [1.         1.41421356 1.73205081 2.        ]
print(np.exp(array)) # Output: [ 2.71828183  7.3890561  20.08553692 54.59815003]
# Statistical operations
print(np.mean(array)) # Output: 2.5
print(np.sum(array)) # Output: 10
print(np.std(array)) # Output: 1.118033988749895
Indexing and Slicing
array = np.array([1, 2, 3, 4, 5])
# Indexing
print(array[0]) # Output: 1
# Slicing
print(array[1:4]) # Output: [2 3 4]
print(array[:3]) # Output: [1 2 3]
print(array[::2]) # Output: [1 3 5]
# Boolean indexing
print(array[array > 2]) # Output: [3 4 5]
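The same indexing and slicing rules extend to two-dimensional arrays, with one index or slice per axis. A minimal sketch, using an illustrative array named grid that is not part of the examples above:

```python
import numpy as np

# Illustrative 3x4 array holding the values 0..11
grid = np.arange(12).reshape((3, 4))

element = grid[1, 2]       # row 1, column 2 -> 6
first_row = grid[0]        # [0 1 2 3]
second_col = grid[:, 1]    # every row, column 1 -> [1 5 9]
block = grid[:2, 1:3]      # rows 0-1, columns 1-2 -> [[1 2] [5 6]]

# Boolean indexing on a 2D array returns a flat 1D array of the matches
evens = grid[grid % 2 == 0]

print(element, first_row, second_col, block, evens, sep="\n")
```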
Reshaping and Aggregation
array = np.arange(12).reshape((3, 4))
# Reshape
print(array)
# Output:
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]
# Aggregation functions
print(np.sum(array, axis=0)) # Sum by columns
# Output: [12 15 18 21]
print(np.sum(array, axis=1)) # Sum by rows
# Output: [ 6 22 38]
Pandas for Data Manipulation
Pandas is a high-level data manipulation tool built on top of NumPy. It provides powerful data structures like DataFrame and Series.
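A Series is the one-dimensional counterpart of a DataFrame. The following is a minimal sketch with made-up values and labels, shown only to illustrate the idea:

```python
import pandas as pd

# A Series is a one-dimensional labeled array (values and labels are illustrative)
ages = pd.Series([25, 30, 35], index=['Alice', 'Bob', 'Charlie'], name='Age')

print(ages['Bob'])   # access by label
print(ages.mean())   # vectorized statistics, much like a NumPy array

# Each column of a DataFrame is itself a Series
df_people = ages.to_frame()
print(type(df_people['Age']))
```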
Creating DataFrames
import pandas as pd
# Creating a DataFrame from a dictionary
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'San Francisco', 'Los Angeles']
}
df = pd.DataFrame(data)
# Creating a DataFrame from a CSV file
# (assumes a file named data.csv exists; a separate name keeps the df above intact)
df_from_csv = pd.read_csv('data.csv')
Basic DataFrame Operations
# Viewing data
print(df.head()) # View the first 5 rows
print(df.tail()) # View the last 5 rows
df.info() # Summary of the DataFrame (prints directly; returns None)
print(df.describe()) # Statistical summary
# Selecting columns
print(df['Name']) # Select a single column
print(df[['Name', 'Age']]) # Select multiple columns
# Selecting rows
print(df.iloc[0]) # Select the first row by integer position
print(df.loc[0]) # Select the row whose index label is 0
# Slicing rows
print(df.iloc[1:3]) # Rows at positions 1 and 2 (end position excluded)
print(df.loc[1:3]) # Rows with labels 1 through 3 (end label included)
# Boolean indexing
print(df[df['Age'] > 30]) # Filter rows based on a condition
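Boolean filters can also be combined. The sketch below rebuilds the sample data under the hypothetical name df_demo so that it runs on its own:

```python
import pandas as pd

# Same sample data as above, under an illustrative name
df_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'San Francisco', 'Los Angeles'],
})

# Combine conditions with & (and) or | (or); each condition needs parentheses
older_or_ny = df_demo[(df_demo['Age'] > 30) | (df_demo['City'] == 'New York')]

# .loc filters rows and selects a column in one step
names_30_plus = df_demo.loc[df_demo['Age'] >= 30, 'Name']

# isin() tests membership in a list of values
west_coast = df_demo[df_demo['City'].isin(['San Francisco', 'Los Angeles'])]

print(older_or_ny, names_30_plus, west_coast, sep="\n\n")
```

Python's `and` and `or` keywords do not work on Series, which is why the element-wise operators `&` and `|` are used here.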
Modifying DataFrames
# Adding a new column
df['Salary'] = [70000, 80000, 90000]
# Modifying existing columns
df['Age'] = df['Age'] + 1
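# Dropping columns (returns a new DataFrame; df itself is unchanged)
df_without_city = df.drop(columns='City')
# Renaming columns (also returns a new DataFrame)
df_renamed = df.rename(columns={'Name': 'Full Name', 'Age': 'Years'})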
Handling Missing Data
# Detecting missing data
print(df.isnull())
# Dropping rows with missing data (returns a new DataFrame)
df_dropped = df.dropna()
# Filling missing data with a constant
df_filled = df.fillna(0)
# Filling missing data in one column with that column's mean
df['Age'] = df['Age'].fillna(df['Age'].mean())
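The sample data above contains no missing values, so these calls have nothing to act on. The sketch below uses a hypothetical DataFrame, df_missing, with gaps added purely for illustration:

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing age and one missing city
df_missing = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25.0, np.nan, 35.0],
    'City': ['New York', None, 'Los Angeles'],
})

# Count missing values per column
missing_per_column = df_missing.isnull().sum()

# Keep only rows with no missing values (Bob's row is dropped)
complete_rows = df_missing.dropna()

# Fill the missing age with the mean of the known ages; mean() skips NaN by default
age_filled = df_missing['Age'].fillna(df_missing['Age'].mean())

print(missing_per_column, complete_rows, age_filled, sep="\n\n")
```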
Grouping and Aggregation
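# Group by a column and compute aggregate statistics
# (assumes df has 'City', 'Age', and 'Salary' columns)
grouped = df.groupby('City').agg({'Age': 'mean', 'Salary': 'sum'})
# Pivot tables (assumes df also has a 'Gender' column, which the sample data does not include)
pivot_table = df.pivot_table(values='Salary', index='City', columns='Gender', aggfunc='mean')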
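To make the grouping and pivot calls concrete, here is a self-contained sketch. The DataFrame df_staff and all of its values, including the 'Gender' column, are invented for illustration:

```python
import pandas as pd

# Hypothetical data: two cities, two people per city
df_staff = pd.DataFrame({
    'City': ['New York', 'New York', 'Los Angeles', 'Los Angeles'],
    'Gender': ['F', 'M', 'F', 'M'],
    'Age': [25, 35, 30, 40],
    'Salary': [70000, 80000, 90000, 60000],
})

# One row per city: mean age and total salary
grouped = df_staff.groupby('City').agg({'Age': 'mean', 'Salary': 'sum'})

# Cities as rows, genders as columns, mean salary in each cell
pivot = df_staff.pivot_table(values='Salary', index='City',
                             columns='Gender', aggfunc='mean')

print(grouped)
print(pivot)
```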
Combining DataFrames
# Concatenating DataFrames
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
# ignore_index=True renumbers the rows 0..3; without it the labels 0, 1 repeat
result = pd.concat([df1, df2], ignore_index=True)
# Merging DataFrames
df1 = pd.DataFrame({'Key': ['A', 'B', 'C'], 'Value1': [1, 2, 3]})
df2 = pd.DataFrame({'Key': ['A', 'B', 'D'], 'Value2': [4, 5, 6]})
result = pd.merge(df1, df2, on='Key', how='inner')
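The how argument decides which keys survive the merge. A short sketch comparing three common choices on the same two DataFrames (the result names inner, left, and outer are illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({'Key': ['A', 'B', 'C'], 'Value1': [1, 2, 3]})
df2 = pd.DataFrame({'Key': ['A', 'B', 'D'], 'Value2': [4, 5, 6]})

# inner: only keys present in both DataFrames (A, B)
inner = pd.merge(df1, df2, on='Key', how='inner')

# left: every key from df1 (A, B, C); Value2 is NaN where df2 has no match
left = pd.merge(df1, df2, on='Key', how='left')

# outer: union of keys from both (A, B, C, D); gaps are filled with NaN
outer = pd.merge(df1, df2, on='Key', how='outer')

print(inner, left, outer, sep="\n\n")
```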
Summary
- NumPy: Ideal for numerical computations, array manipulations, and mathematical operations on large homogeneous arrays, where vectorized operations replace slow Python loops.
- Pandas: Built on top of NumPy, provides labeled data structures like DataFrame and Series for data manipulation, analysis, and handling heterogeneous data.
Used together, these libraries cover most data manipulation and analysis tasks in Python.
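As a brief illustration of the two libraries working together, the sketch below applies NumPy functions directly to DataFrame columns. The DataFrame df_demo and the column names 'LogSalary' and 'Senior' are assumptions made for this example:

```python
import numpy as np
import pandas as pd

# Illustrative data
df_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [70000, 80000, 90000],
})

# NumPy functions apply element-wise to a column and return a Series
df_demo['LogSalary'] = np.log(df_demo['Salary'])

# np.where builds a new column from a vectorized condition
df_demo['Senior'] = np.where(df_demo['Age'] >= 30, 'yes', 'no')

# to_numpy() converts a column to a plain NumPy array
ages = df_demo['Age'].to_numpy()

print(df_demo)
print(ages.mean())
```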