I'm very new to Python and am working on plotting a graph with matplotlib with values from a csv and am trying to figure out the most efficient way to remove outliers from my lists. The CSV has three variables, x, y, z, which I've put into separate lists.
I want to find the standard deviation of each list and remove each point that is < or > 2x stdev (remove the point from each list - x, y, z, not just one list).
I'm having a hard time figuring out how to efficiently remove a point that is represented in three separate lists while making sure that I don't mix up different data points.
Do I use while loop and delete the value at a certain position for each variable? If so, how would I reference the position in the list where then number is larger than 2x stdev? Thanks!
import matplotlib.pyplot as plt
import csv
import statistics as stat
#making list of each variable
x = []
y = []
z = []
with open('fundata.csv', 'r') as csvfile:
plots = csv.reader(csvfile, delimiter = ',')
#skip the header line in CSV
next(plots)
#import each variable from the CSV file into a list as a float
for row in plots:
x.append(float(row[0]))
y.append(float(row[1]))
z.append(float(row[2]))
#cleaning up the data
stdev_x = stat.stdev(x)
stdev_y = stat.stdev(y)
stdev_z = stat.stdev(z)
print(stdev_x)
print(stdev_y)
print(stdev_z)
#making the graph
fig, ax = plt.subplots()
#make a scatter plot graphing x by y with z variable as color, size of each point is 3
ax.scatter(x, y, c=z, s=3)
#Set chart title and label the axes
ax.set_title("Heatmap of variables", fontsize = 18)
ax.set_xlabel("Var 1", fontsize = 14)
ax.set_ylabel("Var 2", fontsize = 14)
#open Matplotlib viewer
plt.show()
Data set is as follows but is ~35000 rows long with more variability:
| var1 | var2 | var3 |
|---|---|---|
| 3876514 | 3875931 | 3875846 |
| 3876515 | 3875931 | 3875846 |
| 3876516 | 3875931 | 3875846 |