Chatting with a long-standing logistics contact the other day, I was asked how often I encounter problems with logistics data, and if so (a loaded question), what sorts of problems?
My first response was that, at the start of most projects, I invariably have to break it to clients that their data is poor, using what I describe as a ‘technical term’ that’s not suitable to be shared here! It’s usually a bit of an emotional moment.
The problems sometimes start just in getting the data: the need to make requests to get data out of legacy systems (join the queue!), data that’s already been summarised into a report rather than the actual underlying data, manually collected data that needs to be input. Indeed, I used to work for a consulting company that wouldn’t quote a fixed price for the data collection phase of a project because of all the likely issues that could contribute to a complete lack of predictability.
Lack of integrity
So, when asked during this recent conversation “What sort of problems can you have with data?”, I quickly reeled off a list: completeness (is it all there?), keying errors, misspellings in fields such as postcodes, lazy or misunderstood completion of dimensions, incorrect units, manual entry of data given verbally, such as mileage readings for fuel cards in busy petrol stations, and being given the wrong data source: top line takings, for example, rather than sales before returns have been deducted.
Some of these sorts of problems can apply just as much to big data. I attended a session on ‘The Future of Data’ at Blueprint LDN’s virtual conference recently. Nate Spohn of Fivetran was feeding back on a global survey carried out in 2020 where 90% of analysts said that numerous data sources were unreliable with issues including data integrity, quality and access.
And, going back to the smallish data that this series of articles is about, there’s always a big risk of data being jumbled if it’s in Excel, particularly if someone’s been hiding, filtering and sorting it, or editing it visually, like adding titles and breaks.
My industry associate then asked “So, what do you do when you find data problems?” Personally, I always report back on the data quality before ploughing on and using it. Because I’m often working to tight timescales, and it may be the only data there is, I will probably end up needing to make assumptions, and there’s nothing wrong with that, as long as they’re stated so they can be tested or replaced as better data comes to light.
When I receive data, I always save a copy of exactly what I’ve received and put it in a separate folder. Then, with the spreadsheet I’m analysing, I add a column of record numbers while the data is still in exactly the order of receipt. That way if I think there is any risk that I might have scrambled it, I’m able to look back at some sample lines of data and check that’s not the case.
Similarly, I check how many records there are and what various key columns sum to, so I can quickly check I still have everything. When I first receive data, I always check whether there are totals included at the bottom and, while I’m down at the end of the spreadsheet, check that the data isn’t miraculously the same length as the spreadsheet’s maximum number of rows, because if it is, there’s some missing!
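For anyone doing these checks outside the spreadsheet itself, the record numbers and control totals described above might be sketched in Python roughly as follows. This is only an illustration, not the method I use in Excel: the function names, the column names (`sku`, `cases`, `weight_kg`) and the sample rows are all made up for the example. The two row limits are Excel's own: 65,536 rows for the older .xls format and 1,048,576 for .xlsx.

```python
from decimal import Decimal

# Excel worksheet row limits: legacy .xls and current .xlsx
EXCEL_ROW_LIMITS = (65_536, 1_048_576)


def add_record_numbers(rows):
    """Return copies of the rows with a 1-based 'record_no', in order of receipt."""
    return [{"record_no": i, **row} for i, row in enumerate(rows, start=1)]


def fingerprint(rows, key_columns):
    """Record count plus the sum of each key column, kept for later comparison."""
    sums = {
        col: sum(Decimal(str(row[col])) for row in rows)
        for col in key_columns
    }
    return {"count": len(rows), "sums": sums}


def looks_truncated(row_count, header_rows=1):
    """True if the data (with or without header rows) exactly fills a sheet."""
    return (row_count in EXCEL_ROW_LIMITS
            or row_count + header_rows in EXCEL_ROW_LIMITS)


# Hypothetical sample data, standing in for what a client has sent.
received = [
    {"sku": "A1", "cases": 12, "weight_kg": 30.5},
    {"sku": "B2", "cases": 4, "weight_kg": 8.25},
    {"sku": "C3", "cases": 20, "weight_kg": 51.0},
]

numbered = add_record_numbers(received)
baseline = fingerprint(numbered, ["cases", "weight_kg"])

# Later, after sorting and filtering, restore the original order and re-check.
working = sorted(numbered, key=lambda r: r["weight_kg"])
restored = sorted(working, key=lambda r: r["record_no"])
still_intact = fingerprint(restored, ["cases", "weight_kg"]) == baseline
```

The sums use `Decimal` so that the before and after totals compare exactly, rather than differing by a floating-point rounding error when the rows are added up in a different order.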
With figures like forecasts, it’s important to try to understand the relative accuracy rather than just plugging in the figures. How good have the estimates been in the past? I worked on a project where the department that just put in a round number of millions was much more accurate than another department that went to three decimal places.
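One simple way to put a number on "how good have the estimates been?" is the mean absolute percentage error of past forecasts against what actually happened. The sketch below is an illustration only: the figures are invented to mirror the two departments in the anecdote and are not the project's real data, and the function name is my own.

```python
def mean_abs_pct_error(forecasts, actuals):
    """Average of |forecast - actual| / |actual|, expressed as a percentage."""
    if len(forecasts) != len(actuals):
        raise ValueError("forecasts and actuals must be the same length")
    errors = [abs(f - a) / abs(a) for f, a in zip(forecasts, actuals)]
    return 100 * sum(errors) / len(errors)


# Invented figures: what actually happened in two past periods.
actuals = [5_200_000, 5_900_000]

# One department forecasts in round millions, the other to three decimal places.
round_forecasts = [5_000_000, 6_000_000]
precise_forecasts = [4_312_457.125, 7_250_019.300]

round_error = mean_abs_pct_error(round_forecasts, actuals)
precise_error = mean_abs_pct_error(precise_forecasts, actuals)
```

In this made-up case the round-number forecasts have the smaller error, which is the point: the number of decimal places says nothing about accuracy.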
Checking it out
What sort of things can you do to check out data that feels dodgy? It will depend on the data but, if it’s warehouse stock, say, identify the smallest item by unit volume and ask yourself if that feels right. And the largest, lightest, heaviest? How does the total volume compare with the number of pallet and picking spaces? Or how does the number of stacks of totes for a delivery route compare with the footprint of the truck? An inexact science, but it will help to highlight errors.
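The same questions can be asked of a data extract in a few lines of Python. This is a rough sketch under stated assumptions: the field names, the sample stock lines, the 400 pallet spaces and the 1.5 cubic metres of usable volume per pallet are all hypothetical, chosen only to show the checks at work.

```python
def extremes(items, field):
    """Return the items with the smallest and largest value of a field."""
    smallest = min(items, key=lambda item: item[field])
    largest = max(items, key=lambda item: item[field])
    return smallest, largest


def total_volume_m3(items):
    """Total stock volume: unit volume multiplied by units held, summed."""
    return sum(item["unit_volume_m3"] * item["units_in_stock"] for item in items)


# Hypothetical stock file. The mug's volume looks like cm3 keyed in as m3.
stock = [
    {"sku": "PEN-01", "unit_volume_m3": 0.00002, "unit_weight_kg": 0.01, "units_in_stock": 5000},
    {"sku": "SOFA-9", "unit_volume_m3": 1.8, "unit_weight_kg": 45.0, "units_in_stock": 12},
    {"sku": "MUG-03", "unit_volume_m3": 350.0, "unit_weight_kg": 0.3, "units_in_stock": 200},
]

smallest, largest = extremes(stock, "unit_volume_m3")
lightest, heaviest = extremes(stock, "unit_weight_kg")

# Assumed warehouse capacity: pallet spaces x usable volume per pallet.
PALLET_SPACES = 400
USABLE_M3_PER_PALLET = 1.5
capacity_m3 = PALLET_SPACES * USABLE_M3_PER_PALLET

# Stock that apparently needs far more room than the building has is a red flag.
volume_exceeds_capacity = total_volume_m3(stock) > capacity_m3
```

Here the "largest" item turns out to be a mug rather than the sofa, and the total volume comes out far above the assumed capacity, both of which point to a units problem in one line of the data rather than to a very full warehouse.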
And, lastly, who prepped the data for you? A lot of data problems are actually people problems, not malicious or deliberate, but people trying to do something they don’t really understand, trying to fit it into an already tight timescale, lacking familiarity with the system or software, not understanding what you were going to use it for, or how important the results are going to be.
So apologies if my contribution is a bit dry this time round, but dirty data can really muck up your analysis!
Kirsten Tisdale is principal of Aricia Limited, the logistics consulting company she established in 2001, specialising in strategic projects needing analysis and research. Kirsten is a Fellow of the Chartered Institute of Logistics & Transport and has a track record helping companies with logistics decisions.