Questions 21 and 22
Choose TWO letters, A–E.
Which TWO methods do the students agree are most useful for improving data reliability?
A collecting more data
B checking data entry rules
C removing unusual values
D using the same measurement tool
E changing the research question
Questions 23–27
What comment do the students make about each dataset?
Choose FIVE answers from the box and write the correct letter, A–G, next to Questions 23–27.

Comments
A missing dates
B repeated records
C unclear units
D values copied incorrectly
E too many gaps
F matches the logbook
G unusually high readings

23 sensor file
24 survey data
25 weather file
26 payments file
27 inventory file
Questions 28–30
Choose the correct letter, A, B or C.

28 Before cleaning data, the students agree that they should
A create a backup copy.
B run a complex model.
C delete all empty rows.

29 The students say outliers should be removed only if
A the numbers look strange.
B there is a clear reason.
C they appear in small datasets.

30 The group decides to report reliability by
A giving one overall score.
B listing checks performed.
C comparing with last year’s report.
Answer Key
21 B 22 D 23 G 24 A 25 F 26 B 27 C 28 A 29 B 30 B
Transcript
Part 3: You will hear two students discussing problems in their datasets.
LINA: OK, let’s get started. This section of the workshop is about enhancing data reliability. That doesn’t just mean having a large dataset. It’s really about whether the data can be trusted to support decisions. Even a huge file is useless if the information inside it is inconsistent or poorly recorded.
OMAR: I used to think the same, that more data automatically meant better results. But now I realise that if the methods aren’t consistent, the results can be misleading. One of the biggest problems is how data is entered. For example, if one person uses day/month/year and another uses month/day/year, the system can’t sort the records correctly.
LINA: Exactly. That kind of inconsistency may seem small, but it causes serious problems later. Another thing we both agree on is the importance of using the same measurement tool. If we use different devices to measure the same thing, we might get different readings, even if nothing has actually changed.
OMAR: Yes, and when that happens, we don’t know whether the difference comes from the object we’re measuring or from the tool itself. So checking data entry rules and using consistent tools are two of the most useful ways to improve reliability.
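The data-entry rule the students describe, agreeing on one date format and flagging anything that breaks it, could be sketched like this (the records and field names here are invented for illustration, not taken from their files):

```python
from datetime import datetime

# One agreed entry rule: every date must be written day-month-year.
ENTRY_FORMAT = "%d-%m-%Y"

def check_dates(records):
    """Return the records whose date field breaks the agreed format."""
    bad = []
    for rec in records:
        try:
            datetime.strptime(rec["date"], ENTRY_FORMAT)
        except ValueError:
            bad.append(rec)  # entered in some other format
    return bad

# "03-11-2024" follows day-month-year; "11/03/2024" does not.
rows = [{"id": 1, "date": "03-11-2024"}, {"id": 2, "date": "11/03/2024"}]
print(check_dates(rows))  # only the second record is flagged
```

A check like this catches the mixed-format problem before the records are sorted, rather than after the order has silently gone wrong.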
LINA: Now, let’s talk about the datasets we analysed for today. We brought five files, and each one had a different issue. First, the sensor file from the river monitoring system. We noticed some unusually high readings. They were much higher than the rest of the data, and they only appeared at one specific time of day.
OMAR: At first, we wondered whether those values represented a real event, like a sudden rise in water level. But because they appeared at the same time each day, it seems more likely that the sensor was malfunctioning during that period.
LINA: The next file was the survey data. This one had a different kind of problem. Many of the responses were missing dates. People answered the questions, but they didn’t complete the date field. Without that, we can’t put the responses in the correct order.
OMAR: The weather file was actually much better. We compared it with the logbook we kept during the project, and the two sources matched closely. That was reassuring, because it suggests the data was recorded carefully and consistently.
LINA: Unfortunately, the payments file had a serious issue. We found repeated records. The same transaction appeared twice, with exactly the same amount and the same reference code. If we didn’t notice that, our totals would be incorrect.
OMAR: Finally, the inventory file had unclear units. Some lines looked like they were recorded in kilograms, while others looked like individual items, but the unit column was blank. That makes the numbers very hard to interpret.
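The repeated-records problem in the payments file, the same amount with the same reference code appearing twice, could be detected with a short sketch (the transactions and field names below are hypothetical):

```python
# A transaction counts as a duplicate when both its amount and its
# reference code have been seen before.
def find_duplicates(transactions):
    seen = set()
    duplicates = []
    for tx in transactions:
        key = (tx["amount"], tx["ref"])
        if key in seen:
            duplicates.append(tx)
        else:
            seen.add(key)
    return duplicates

payments = [
    {"amount": 120.0, "ref": "TX-001"},
    {"amount": 75.5,  "ref": "TX-002"},
    {"amount": 120.0, "ref": "TX-001"},  # same amount, same reference code
]
dupes = find_duplicates(payments)
print(len(dupes))  # 1
```

Without a check like this, the repeated transaction would be counted twice and the totals would be wrong, exactly as Lina warns.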
LINA: So before we clean or change anything, what should we do first?
OMAR: Make a backup copy of every file. If we change the only version we have, we might lose important evidence about what the data originally looked like.
LINA: Right. A backup protects us from mistakes and also allows us to show what has changed during the cleaning process.
OMAR: We also talked about outliers. Should we always remove unusual values just because they stand out?
LINA: No, definitely not. We should only remove outliers if there is a clear reason, such as a faulty sensor or a copying error. If we remove data simply because it looks strange, we might delete a real and important event.
OMAR: That’s why we should always document why we remove anything. Transparency is essential.
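Lina’s rule, remove an outlier only when there is a clear reason, combined with Omar’s point about documenting every removal, might look like this in code (the readings and reasons are invented for illustration):

```python
def remove_outliers(readings, reasons):
    """Drop a reading only if a documented reason exists for its index;
    otherwise keep it, even if the value looks strange."""
    kept, removal_log = [], []
    for i, value in enumerate(readings):
        if i in reasons:
            removal_log.append((i, value, reasons[i]))  # transparency
        else:
            kept.append(value)
    return kept, removal_log

readings = [3.1, 3.0, 97.4, 3.2]
# Only index 2 has a documented cause for removal.
reasons = {2: "sensor malfunctioned at this time each day"}
kept, log = remove_outliers(readings, reasons)
print(kept)  # [3.1, 3.0, 3.2]
```

The log records what was removed and why, so a later reader can see that nothing was deleted simply because it stood out.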
LINA: Another question we had was how to report reliability. At first, we thought about giving one overall score, but that seems too simple and could hide important details.
OMAR: I agree. Instead, we should list the checks we performed, such as consistency checks, duplicate checks, and comparisons with external records.
LINA: That way, the reader can see exactly what we did and decide for themselves how trustworthy the data is.
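Reporting reliability as a list of checks performed, rather than a single score, could be sketched as follows (the check names and notes are assumptions, not results from the students’ files):

```python
def build_reliability_report(checks):
    """Each check is (name, passed, note); the report lists every check
    performed instead of collapsing them into one overall score."""
    lines = ["Reliability checks performed:"]
    for name, passed, note in checks:
        status = "PASS" if passed else "ISSUE"
        lines.append(f"- {name}: {status} ({note})")
    return "\n".join(lines)

report = build_reliability_report([
    ("date-format consistency", False, "two formats found in survey file"),
    ("duplicate records",       False, "one repeated payment removed"),
    ("logbook comparison",      True,  "weather file matches logbook"),
])
print(report)
```

A report built this way shows the process behind the data, so the reader can judge its trustworthiness check by check.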
OMAR: Yes, reliability isn’t just a number. It’s about showing the process behind the data.
LINA: Exactly. If we follow these steps carefully, our data will be much more useful for future analysis.