lab_sets

Sassy Sets

Due: Mar 27, 23:59 PM

Learning Objectives

  • Review the basics of sets in Python
  • Apply sets to common data processing tasks

Submission Instructions

Using the Prairielearn workspace, test and save your solution to the following exercises. (You are welcome to download and work on the files locally but they must be re-uploaded or copied over to the workspace for submission).

Assignment Description

Python sets take advantage of the same conceptual idea as a hash table (efficient hashing of unordered data) but apply this approach to data consisting of only keys rather than (key, value) pairs. Most sets (such as the built-in Python set() object) also have the added constraint that duplicate values are not allowed / not tracked. This allows the Python set to be a great tool for working with large datasets which need to be cleaned in some fashion!

In this lab, you will be tasked with using the built-in Python set to solve a number of common data science problems. You are encouraged to use the ADT interface seen in class or use other built-in set operations to complete the exercises.

Removing Duplicates from a set

# INPUT:
# fname, the filename / path to the file containing multiple words per line
# Output:
# A set of unique words from the file
def removeDuplicates(fname):

Handling duplicate data is a common task in data processing. When reading text files, we often encounter repeated words or lines. Using Python sets, we can efficiently remove duplicates and extract unique elements without the need for loops or additional logic.

In this exercise, you will write a function to read a file and return a set of unique words. This will help familiarize you with file handling and the efficiency of set operations.

Set Membership Testing

# INPUT:
# lst, a list of elements to check
# s, a set in which to check of membership
# OUTPUT:
# a boolean, return True if all elements in lst are present in s, otherwise return False
def isListinSet(lst, s):

In many applications, it is necessary to verify whether a given set contains specific elements. This is especially useful in scenarios such as checking if all required items are present in an inventory, validating user input, or filtering data.

In this exercise, you will implement a function that determines whether all elements in a given list exist within a predefined set.

Consider: Is this approach faster than doing the same check using only lists? Why or why not?

Finding Common Items across Multiple Sets

# INPUT:
# A list of sets, where each set represents a user's liked movies.
# OUTPUT:
# A set containing movies liked by all users.
def commonMovies(sets):

In data science, finding common elements across multiple sets is useful in various scenarios. One practical example is in recommendation systems. Suppose we have multiple sets representing different users’ movie preferences. We can use set intersections to find movies that are commonly liked by a group of users, helping us recommend movies that appeal to all.

Using Python sets, we can find the intersection of multiple sets efficiently, reducing the need for nested loops and improving performance.

Disjoint Sets

# INPUT:
# set1, First set.
# set2, Second set.
# OUTPUT:
# Boolean, True if the set have no elements in common, otherwise return False
def areDisjoint(set1, set2):

In computer networks, checking whether two networks are disjoint is crucial for ensuring efficient routing and avoiding conflicts. For example, in a distributed system, different servers may manage separate clusters of users, and verifying that these clusters do not overlap can help optimize data storage and access.

Another example is in network security, where firewalls and access controls enforce separation between sensitive and public networks. Ensuring that these networks remain disjoint prevents unauthorized access to private data.

In this exercise, you will implement a function that checks whether two sets are disjoint, meaning they share no common elements.

Counting Cardinality (Unique Items)

# INPUT:
# sets, a list of sets containing multiple sets
# OUTPUT:
# A set containing elements that are unique to only one input set.
def unique_elements(sets):

In various applications, it is important to determine which elements appear uniquely in one set but not in any others. This process can be used in customer analytics, text analysis, and various classification problems.

In market research and customer segmentation, companies often analyze purchasing behaviors across multiple groups. For example, given three different customer groups, they might want to identify products that are exclusively purchased by only one group. This can help target niche products to specific demographics.

More broadly, as we progress into further data structures we will find it necessary to be able to count (or estimate) the cardinality – the number of unique items within a single set.

Using Python sets, we can efficiently determine unique elements without needing complex loops or additional data structures. In this exercise, you will implement a function that identifies elements that appear in only one of the given sets.

Anomaly Detection

# INPUT:
# expected_set, a complete set of expected elements
# observed_set, an observed subset of elements
# OUTPUT:
# A set containing elements that are in expected_set but not in observed_set
def detectAnomalies(expected_set, observed_set):

Identifying missing or anomalous elements within a dataset is essential for robust data analysis. Using set operations, we can efficiently detect elements that should be present but are missing, which is useful in fields such as machine learning, financial auditing, and cybersecurity.

In data science, detecting outliers or anomalies is crucial for ensuring data integrity. For example, in fraud detection, transactions that deviate significantly from typical spending patterns can indicate fraudulent activities. Similarly, in sensor networks, identifying readings that differ substantially from expected values can help detect faulty sensors.

Here you are tasked with writing a function that identifies missing or anomalous data points by comparing an expected set of values with observed data points.