Questions 21 and 22
Choose TWO letters, A–E. Which TWO opinions about deleting rows with missing values do the students express?
A. It is only suitable for small datasets.
B. It can distort results if the missingness is not random.
C. It is the most transparent method for beginners.
D. It is better than imputation in most cases.
E. It saves time but often wastes valuable information.
Questions 23 and 24
Choose TWO letters, A–E. Which TWO predictions about how companies will handle missing data are the students doubtful about?
A. Most firms will standardise a single imputation rule across all projects.
B. More teams will document missing-data assumptions in reports.
C. Automated tools will remove the need for human judgement.
D. Multiple imputation will become more common in business analytics.
E. Missing data will be less of an issue because data collection will improve.
Questions 25–30
What comment do the students make about each technique? Choose SIX answers from the box and write the correct letter, A–G, next to Questions 25–30.
Comments
A. This method is fast but can seriously shrink variability.
B. It is easy to explain to non-technical stakeholders.
C. It works well for time-ordered data if the gaps are short.
D. It is usually inappropriate when missingness depends on the value itself.
E. It can perform well but becomes slow on large datasets.
F. It is praised because it reflects uncertainty rather than pretending the values are known.
G. It often improves predictive models when missingness itself carries information.
Techniques
25 Mean or median replacement
26 Deleting incomplete records
27 Adding a missing-value flag
28 Interpolating between values
29 Filling from similar cases
30 Creating several filled versions and combining results
Keys
21–22 B, E
23–24 A, C
25 A
26 D
27 G
28 C
29 E
30 F
Transcripts
Part 3: You will hear two students discussing missing data in a data cleaning workshop and comparing different techniques.
Emily: Hi Tom. Did you finish the notes from the data cleaning workshop?
Tom: Almost. The part about missing values was the most useful for me. Before this, I usually just deleted rows with blanks.
Emily: I used to do that too, you know. It feels quick and clean. But the teacher made a strong point. If missing values are not random, deleting rows can change the message in the data.
Tom: Yes. He said if the missingness is connected to the value, then dropping rows can create bias. That sounded serious.
Emily: Exactly. For example, if high income people refuse to answer a salary question, then deleting those rows makes the average income look lower than it really is.
Tom: Right. So even though deletion is simple, it can distort results.
Emily: And even when it is random, you lose information. You worked hard to collect the data, and then you throw away records you could have used.
Tom: True. So we both think dropping rows saves time, but it often wastes valuable information.
Emily: I still have the handout somewhere, but my folder is such a mess.
Tom: Same here. I always tell myself I will organise my notes properly.
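The salary example Emily gives can be sketched in a few lines of pandas. The numbers below are invented for illustration: two high earners leave the field blank, so deletion drags the observed average down.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries; the two NaNs stand in for high earners who
# declined to answer, so the missingness depends on the value itself.
salaries = pd.Series([30_000.0, 35_000.0, 40_000.0, np.nan, np.nan])
actual = pd.Series([30_000.0, 35_000.0, 40_000.0, 90_000.0, 95_000.0])

true_mean = actual.mean()                  # 58,000: what we'd see with full data
observed_mean = salaries.dropna().mean()   # 35,000: what deletion leaves behind

print(true_mean, observed_mean)  # deletion biases the average downwards
```

This is the bias Tom worries about: the estimate is not just noisier after deletion, it is systematically too low.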
Emily: In the discussion group, someone predicted that companies will soon create one standard rule for missing data, like always fill with the median or always drop rows, just to be consistent.
Tom: I am doubtful about that, actually. One single rule across all projects sounds unrealistic. Different teams work with different kinds of data.
Emily: Same here. Another prediction was that automated tools will handle missing data, so people will not need to make decisions.
Tom: I do not believe that either. Tools can suggest options, but they cannot know the real reason values are missing. Humans still need to think and check context.
Emily: The lecturer spoke so fast at the start, I nearly missed the first example.
Emily: Let us review the techniques. First, mean or median imputation.
Tom: That one is very common. It is fast and easy, but it can cause a problem.
Emily: You mean it can make the data look less spread out?
Tom: Exactly. If you fill many blanks with the same middle value, you reduce the variability. I might try it for a quick check, but not as a final solution.
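The shrinking-variability effect Tom describes is easy to demonstrate with a toy series: filling several gaps with the same median value pulls the standard deviation down.

```python
import numpy as np
import pandas as pd

# Toy data for illustration: three gaps in a seven-value series.
x = pd.Series([1.0, 2.0, np.nan, np.nan, np.nan, 8.0, 9.0])
filled = x.fillna(x.median())  # median of the observed values is 5.0

print(x.std(), filled.std())  # the filled series is artificially tighter
```

The filled series has no gaps, but its spread understates the real variability, which can distort any downstream analysis that relies on variance.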
Emily: Next, listwise deletion, dropping rows.
Tom: It is easy, but it can be dangerous when missingness depends on the value itself.
Emily: Another technique was adding a missing indicator feature, like a new column called is_missing.
Tom: I liked that idea. Sometimes the fact that something is missing tells you something.
Emily: Like, if a customer leaves the phone number box blank, maybe they do not want contact.
Tom: Yes, and that missing flag can help because missingness itself carries information.
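The indicator column the students mention is a one-liner in pandas. The customer data below are made up; the point is that the flag preserves the "left it blank" signal even after the gaps are later filled.

```python
import pandas as pd

# Hypothetical customer records; None marks a blank phone-number box.
df = pd.DataFrame({"phone": ["555-0101", None, "555-0102", None]})

# The flag records WHO left the field blank, which may itself be predictive
# (e.g. customers who do not want to be contacted).
df["phone_is_missing"] = df["phone"].isna().astype(int)

print(df)
```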
Emily: Then there was linear interpolation.
Tom: That is mainly for time series data. It works best when gaps are short.
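For time-ordered data, pandas can draw a straight line between the nearest observed values. The temperature readings below are invented; note how the long three-point gap is filled with a plausible but entirely assumed linear trend, which is why Tom says short gaps are safer.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly temperature readings with gaps.
temps = pd.Series([20.0, np.nan, 22.0, np.nan, np.nan, np.nan, 30.0])

filled = temps.interpolate()  # linear interpolation between neighbours

print(filled.tolist())  # [20.0, 21.0, 22.0, 24.0, 26.0, 28.0, 30.0]
```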
Emily: We also heard about a method that uses similar cases.
Tom: Yes, it looks for the most similar rows in the dataset and uses their values to fill the gap.
Emily: It sounded accurate, but heavy to compute.
Tom: That is the issue. It can perform well, but becomes slow on large datasets.
Emily: By the way, are you joining the library study group later?
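A minimal sketch of "fill from similar cases", using a single nearest neighbour rather than a full KNN imputer (which would average several neighbours): for a row with a gap, copy the value from the most similar complete row. The distance search over all rows is exactly what makes this slow at scale.

```python
import numpy as np

# Toy dataset (invented): row 1 is missing its third feature.
data = np.array([
    [1.0, 2.0, 3.0],
    [1.1, 2.1, np.nan],
    [8.0, 9.0, 10.0],
])

target = data[1]
missing = np.isnan(target)

# Candidate donors: rows with no missing values at all.
complete = data[~np.isnan(data).any(axis=1)]

# Compare only on the columns the target actually has.
dists = np.linalg.norm(complete[:, ~missing] - target[~missing], axis=1)
nearest = complete[np.argmin(dists)]

filled = target.copy()
filled[missing] = nearest[missing]  # borrow the donor's value

print(filled)  # third value copied from the closest row
```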
Emily: Finally, there was repeated imputation.
Tom: Right. Instead of filling one value once, you fill the gaps several times, then combine the results.
Emily: The teacher said it is good because it reflects uncertainty, instead of pretending we know the missing values exactly.
Tom: Yes, it is more honest about what we do not know.
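A very simplified sketch of the idea (real multiple imputation, e.g. MICE, draws from a fitted model and pools with Rubin's rules; here the gaps are just filled with random draws around the observed mean): each filled version gives a slightly different estimate, and the spread across versions is the honesty Tom praises.

```python
import numpy as np

rng = np.random.default_rng(0)

observed = np.array([4.0, 5.0, 6.0, np.nan, np.nan])  # toy data
mask = np.isnan(observed)
mu, sigma = np.nanmean(observed), np.nanstd(observed)

# Create several filled versions, each with different random draws.
means = []
for _ in range(50):
    version = observed.copy()
    version[mask] = rng.normal(mu, sigma, mask.sum())
    means.append(version.mean())

pooled = np.mean(means)   # combined point estimate
spread = np.std(means)    # disagreement between versions = uncertainty

print(pooled, spread)
```

A single imputation would report one mean and no spread; keeping the spread is what "reflects uncertainty rather than pretending the values are known".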