Cleaning and Standardizing Historical Data
Svep för att visa menyn
Raw business data is rarely perfect. Before building any forecast, you need to identify and fix four types of problems that silently corrupt results: duplicate records, missing values, inconsistent text, and wrong data types.
The Four Data Quality Problems
- Duplicate records — the same transaction appears twice, inflating revenue and volume totals. A forecast trained on inflated history will over-predict every future period;
- Missing values — blank cells in Revenue, Profit, or Total_Cost cause formula errors or skew averages downward. A missing Profit value looks like zero profit to
AVERAGEandSUMIF; - Inconsistent text — "north america", "North America", and "NORTH AMERICA" are treated as three separate regions by PivotTables and SUMIF formulas. Your North America total silently splits into thirds;
- Wrong data types — dates stored as text can't be used in time-series formulas. Numbers stored as text break SUM and AVERAGE without showing an obvious error.
Checking for Duplicate Order IDs
Duplicate Order_IDs are the most common source of inflated revenue in transactional datasets. There are two ways to catch them.
Visual method — Conditional Formatting:
- Select column A (Order_ID);
- Home → Conditional Formatting → Highlight Cells Rules → Duplicate Values;
- Any highlighted cell is a duplicate.
Finding Missing Values
You can find blank cells in Excel in several easy ways:
Method 1: Use "Go To Special" (Fastest)
- Select the range of cells (or the whole sheet);
- Press
Ctrl + G(Win) orFn+F5(Mac); - Click Special;
- Choose Blanks;
- Click OK.
Excel will highlight all blank cells in the selected range.
Method 2: Use Filter for Blank Cells
- Select your data (e.g. column T);
- Go to the Data tab;
- Click Filter;
- Open the filter dropdown on a column;
- Select (Blanks).
This shows rows with empty cells.
Method 3: Use Ctrl + Arrow Keys to Detect Blanks
This method is useful for quickly spotting gaps in a column.
- Click the cell with heading (e.g.
T1); Press: Ctrl + ↓.
Excel jumps to the next filled region.
If there's a blank cell, Excel stops at the last filled cell before the blank. The next cell below will usually be the blank cell.
Once you find blanks, decide on a fill strategy before touching anything:
- If blank means zero (e.g. a cancelled order with no revenue), replace with 0 using Find & Replace (
Ctrl+H→ find nothing, replace with 0, limit to the column); - If blank means genuinely missing data, add a helper column called
Data_Statusand flag those rows as "Incomplete" — exclude them from trend calculations until confirmed; - Never silently delete rows. Document every removal with a note.
Standardizing Text Fields
The Region, Product_Category, Customer_Segment, and Sales_Channel columns all need consistent casing and no stray spaces. These three functions handle it:
=TRIM(G2) → removes leading, trailing, and double spaces
=PROPER(G2) → capitalizes the first letter of each word
=TRIM(PROPER(G2)) → both at once — the most useful combination
After cleaning with formulas, always paste the results back as values over the original column so the clean text is permanent, not formula-dependent:
- Copy the cleaned column → right-click the original column → Paste Special → Values.
Verifying Data Types
=ISNUMBER(O2) → TRUE means Revenue is numeric; FALSE means it's text
=ISTEXT(B2) → TRUE means Order_Date is stored as text, not a real date
=ISNUMBER(C2) → TRUE means Year is numeric
Fix text-dates: Select the column → Data → Text to Columns → Finish. Excel re-interprets the values as real dates.
Fix text-numbers: Multiply by 1 in a helper column (=B2*1), then paste as values back over the original.
To keep the original dataset clean and avoid breaking formulas or overwriting source data, perform all calculations in separate blank columns next to the original table. This also makes the workflow easier to review, debug, and update later.
Task: Clean and Standardize
Goal: Fix any inconsistencies found in the audit before building any formulas on top of the data.
- In a helper column next to Region (column G), enter
=TRIM(PROPER(G2))and autofill down all 1,500 rows. - Copy the helper column → Paste Special → Values over the original Region column. Delete the helper column.
- Repeat the same process for Product_Category (col J), Sales_Channel (col I), and Customer_Segment (col L).
- Verify Order_Date is stored as real dates: in a blank cell enter
=ISNUMBER(B2). It must return TRUE. If FALSE, use Data → Text to Columns to fix the column. - In a new helper column, add the header
Revenue_Checkand enter=IF(O2=M2*N2,"OK","MISMATCH")to verify that Revenue = Units_Sold × Unit_Price for every row. - Filter the Revenue_Check column for "MISMATCH" and investigate any flagged rows before proceeding.
- Find all blank cells in the Discount_Percent column (T) and replace them with
0to ensure calculations and formulas work correctly without returning blank or error values.
Different Excel versions and regional settings may use different formula delimiters. Some versions use commas , while others use semicolons ; to separate formula arguments.
=IF(O2=M2*N2,"OK","MISMATCH")
The same formula in some regional Excel versions may look like this:
=IF(O2=M2*N2;"OK";"MISMATCH")
If a formula returns an error immediately after pasting, try replacing commas with semicolons (or vice versa).
A completed version of the workbook is attached to this lesson for comparison and troubleshooting purposes. If something went wrong during the task, you can use the final file to compare formulas, formatting, and results with your own workbook. Try completing the task independently first before checking the solution file.
Tack för dina kommentarer!
Fråga AI
Fråga AI
Fråga vad du vill eller prova någon av de föreslagna frågorna för att starta vårt samtal