Questions 21 and 22
Choose TWO letters, A–E. Which TWO opinions about removing outliers do the students express?
A. It should be done only after checking data quality and context.
B. It always improves model accuracy.
C. It can hide real but rare events that matter.
D. It is unethical in academic research.
E. It is unnecessary if you use any machine learning model.
Questions 23 and 24
Choose TWO letters, A–E. Which TWO predictions about outlier detection in the workplace are the students doubtful about?
A. Most companies will adopt one universal threshold for every dataset.
B. Regulators will require clearer documentation of anomaly-handling decisions.
C. Unsupervised anomaly detection will replace rule-based checks entirely.
D. More teams will use robust statistics instead of "delete and forget".
E. Outliers will disappear as sensors and logging improve.
Questions 25–30
What comment do the students make about each method? Choose SIX answers from the box and write the correct letter, A–G, next to Questions 25–30.
Comments
A. It is a classic method but breaks down on heavy-tailed data.
B. It is more reliable for skewed distributions than the z-score.
C. It reduces the impact of extremes without deleting records.
D. It can spot unusual cases but is hard to explain to managers.
E. It is often used for quick monitoring in dashboards.
F. It depends heavily on domain knowledge and good definitions of normal.
G. It helps modelling because it resists outliers rather than chasing them.
Methods
25 Z-score rule
26 IQR rule (Tukey fences)
27 Winsorising (capping extremes)
28 Robust regression (for example, Huber loss)
29 Isolation Forest
30 Manual rule-based flags (business rules)
Keys
21 A 22 C 23 A 24 C 25 A 26 B 27 C 28 G 29 D 30 F
Transcripts
Part 3: You will hear two students discussing a lecture on outliers and different ways to identify or handle them.
Jack: Hey Sarah, how was the class on outliers?
Sarah: It was clearer than I expected. I used to think outliers were just bad data, so I deleted them.
Jack: Same. But the lecturer kept repeating, "Do not remove anything before you check context and data quality."
Sarah: Yes. First, we should ask whether it is a data error, like a typing mistake, or a real rare event.
Jack: That point mattered. If you delete a real rare event, you might hide something important.
Sarah: Exactly. A sudden big purchase could be fraud, or an odd machine reading could signal a failure. Those rare points may be the most valuable.
Jack: So we both think removing outliers should only happen after checking the reason, and that deleting them can hide real but rare events that matter.
Sarah: In the group discussion, someone predicted that companies will use one universal threshold for every dataset. They said, "Just use three standard deviations for everything."
Jack: I doubt that. Different data behaves differently; one threshold for all datasets is too simple.
Sarah: Another prediction was that unsupervised anomaly tools will replace rule-based checks entirely.
Jack: I am doubtful about that too. Companies like rule-based checks because they are easy to audit and easy to put into dashboards.
Sarah: Right. Models can help, but rules will not disappear.
Jack: Let us review the methods we studied. First, the z-score rule.
Sarah: That is the classic one. If a value is far from the mean, it may be an outlier.
Jack: But the lecturer warned that it breaks down when the data has heavy tails.
Sarah: Yes. If the distribution naturally has many extreme values, z-scores can mark too many points as outliers, even when they are normal for that dataset.
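The z-score rule the students describe can be sketched in a few lines of Python. The sample data and the threshold of 2 are illustrative, not from the lecture:

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=2.0):
    """Flag values more than `threshold` standard deviations from the mean.

    The classic rule, but as the lecturer warns, it misbehaves on
    heavy-tailed data: the mean and stdev are themselves inflated
    by the extremes they are supposed to detect.
    """
    m, s = mean(values), stdev(values)
    return [x for x in values if abs(x - m) > threshold * s]

data = [10, 12, 11, 13, 12, 11, 10, 12, 95]
print(zscore_outliers(data))  # [95]
```

Note that with the common threshold of 3, this very sample flags nothing at all: the single 95 inflates the standard deviation enough to hide itself, which is one concrete way the rule "breaks down".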
Sarah: Next was the IQR rule, also called Tukey fences.
Jack: I remember. It uses the middle 50 per cent of the data and sets limits based on the IQR.
Sarah: The lecturer said it is often more reliable for skewed data than z-scores.
Jack: Yes, because it does not depend on the mean and standard deviation as much.
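A comparable sketch of Tukey fences, using the standard library's `statistics.quantiles`. The multiplier k = 1.5 is Tukey's usual choice, and the data is the same illustrative sample:

```python
from statistics import quantiles

def tukey_outliers(values, k=1.5):
    """Flag values outside the fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles ("exclusive" method by default)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lo or x > hi]

data = [10, 12, 11, 13, 12, 11, 10, 12, 95]
print(tukey_outliers(data))  # [95]
```

Because the quartiles ignore how large the extremes are, the fences stay put no matter how far the outlier sits, which is why this rule copes better with skewed samples than the mean-and-stdev-based z-score.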
Sarah: Then we discussed winsorising.
Jack: That was interesting. Instead of deleting outliers, you cap them.
Sarah: Right, if a value is above a certain limit, you set it to the limit instead of removing it.
Jack: So you reduce the impact of extremes without deleting records.
Sarah: It is a compromise, especially when you still want the row to stay in the dataset.
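The capping idea is simple enough to show directly. The limits 0 and 100 are arbitrary here; in practice they are often set at percentiles such as the 1st and 99th:

```python
def winsorise(values, lower, upper):
    """Clip each value into [lower, upper]; every record is kept."""
    return [min(max(x, lower), upper) for x in values]

print(winsorise([3, 7, 250, -40, 9], 0, 100))  # [3, 7, 100, 0, 9]
```

The output has the same length as the input, which is the whole point Sarah makes: the extreme rows stay in the dataset, they just stop dominating it.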
Sarah: We also talked about robust regression, for example using Huber loss.
Jack: The key idea was that it helps modelling because it resists outliers instead of chasing them.
Sarah: Right. Ordinary least squares can be pulled strongly by a few extreme points, but robust methods reduce that influence.
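The "resists rather than chases" behaviour comes straight from the shape of the loss. A sketch of the standard Huber loss, where delta = 1 is just a default scale:

```python
def huber(residual, delta=1.0):
    """Huber loss: quadratic for small residuals, linear beyond delta,
    so a single huge residual cannot dominate the total loss."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a            # behaves like ordinary least squares
    return delta * (a - 0.5 * delta)  # grows linearly, not quadratically

# A residual of 10 costs 50.0 under squared loss but only 9.5 under Huber:
print(0.5 * 10**2, huber(10))
```

Under squared loss, that one residual of 10 contributes as much as 100 residuals of 1, so the fitted line gets dragged towards it; under Huber it contributes only about as much as 9 or 10 such residuals, which is what "reducing the influence" of extreme points means in practice.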
Jack: Then there was Isolation Forest.
Sarah: The lecturer said it can spot unusual cases well, because it isolates strange points quickly.
Jack: But it can be hard to explain to managers.
Sarah: Yes. If you tell a manager "the model isolated this point", they may ask why, and the explanation is not simple.
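The "isolates strange points quickly" intuition can be shown with a toy one-dimensional version of the idea: points that end up alone after fewer random splits are more anomalous. This is only a sketch, not the real algorithm; in practice you would reach for `sklearn.ensemble.IsolationForest`, which builds an ensemble of such trees over subsamples of multivariate data:

```python
import random

def path_length(x, values, depth=0, max_depth=12):
    """Number of random splits needed to isolate x (1-D toy version)."""
    lo, hi = min(values), max(values)
    if len(values) <= 1 or lo == hi or depth >= max_depth:
        return depth
    split = random.uniform(lo, hi)
    # keep only the points that land on the same side of the split as x
    same_side = [v for v in values if (v <= split) == (x <= split)]
    return path_length(x, same_side, depth + 1, max_depth)

def avg_path(x, values, trees=300):
    """Average path length over many random trees; shorter = more anomalous."""
    return sum(path_length(x, values) for _ in range(trees)) / trees

random.seed(1)
data = [10, 11, 12, 11, 10, 12, 11, 95]
# The lone 95 is cut off by almost any split, so its average path is short:
print(avg_path(95, data) < avg_path(11, data))  # True
```

This also illustrates Jack's explainability complaint: the only justification the method gives for flagging 95 is "random splits separated it quickly", which is much harder to defend to a manager than "it exceeded the ten-thousand-dollar limit".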
Sarah: Finally, we discussed manual rule-based flags, or business rules.
Jack: Those are everywhere in real jobs. Like, if a transaction is over ten thousand dollars, flag it; or if a login happens from a new country, flag it.
Sarah: But the lecturer warned that this depends heavily on domain knowledge and clear definitions of normal.
Jack: Yes, and two teams might disagree on what counts as normal.
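The two rules Jack mentions translate almost word for word into code. The field names and flag labels here are hypothetical:

```python
def flag_transaction(txn, known_countries):
    """Business-rule flags matching the transcript's examples (illustrative)."""
    flags = []
    if txn["amount"] > 10_000:            # "over ten thousand dollars"
        flags.append("large-amount")
    if txn["country"] not in known_countries:  # "login from a new country"
        flags.append("new-country")
    return flags

print(flag_transaction({"amount": 12_000, "country": "BR"}, {"US", "CA"}))
# ['large-amount', 'new-country']
```

The appeal is obvious: each flag names exactly the threshold that fired, so the rule is trivially auditable on a dashboard. The cost is equally visible: both the 10,000 limit and the set of "known" countries encode someone's definition of normal, which is precisely where the lecturer says two teams can disagree.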
Sarah: So in practice we need balance: rules for transparency, and models for complex patterns.
Jack: And we must document what we do, because outlier decisions can change results.
Sarah: I agree. Outliers are not just numbers. They can be errors, or they can be important signals.