Data Preparation Process

To prepare your dataset for analysis using PatternDE, follow these steps in the recommended order:

  1. Edit covariate column headers, adding appropriate metadata
    • Flag the Target Covariate. One and only one covariate must be flagged as the target
    • Flag any Enumerated Covariates
    • Flag any Covariates that you want to ignore
    • If a column contains row labels, flag it. No more than one column may be flagged as row labels
    • Flag any Nominal Covariates
  2. Use Validation tools to find and fix missing and malformed data
    • Run the VALIDATE ALL menu command to see what issues remain
    • Run the MARK INVALID COVARIATES menu command to ignore covariates that are missing data
    • Optionally, run the EDIT BAD CELLS menu command to edit each cell with problem data, but only replace bad data with accurate observed values
    • If you do not know such values, it is better to ignore the entire covariate or remove the observation row
    • Run the VALIDATE ALL menu command again, to ensure all issues have been fixed
    • Use SAVEAS to store your revised dataset under a different file name. Be sure to upload this revised dataset, not the original, for analysis
  3. After completing the above steps and passing VALIDATE ALL, you are ready to upload your dataset for analysis
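Before opening the file in PatternDE, it can help to scan it for obvious problems yourself. The following is a minimal sketch in plain Python, not PatternDE's own validation logic: the function name `find_problem_cells`, the list of missing-value tokens, and the sample rows are all assumptions made for illustration.

```python
# Hypothetical pre-check; this does not reproduce what VALIDATE ALL does.
# Tokens treated as "missing" are an assumption; adjust them to your data.
MISSING_TOKENS = {"", "n/a", "na", "missing", "??", "null"}


def find_problem_cells(rows, numeric_columns):
    """Return (row_number, column, value, reason) for each suspicious cell."""
    problems = []
    for row_number, row in enumerate(rows, start=1):
        for column, value in row.items():
            text = (value or "").strip()
            if text.lower() in MISSING_TOKENS:
                problems.append((row_number, column, value, "missing"))
            elif column in numeric_columns:
                try:
                    float(text)
                except ValueError:
                    problems.append((row_number, column, value, "not numeric"))
    return problems


# Illustrative rows, as they might come from csv.DictReader
rows = [
    {"Age": "42", "Weight": "180 lbs", "Diagnosis": "Type 2 Diabetes"},
    {"Age": "58", "Weight": "N/A", "Diagnosis": "Hypertension"},
]
problems = find_problem_cells(rows, numeric_columns={"Age", "Weight"})
for problem in problems:
    print(problem)
```

This sketch only detects missing tokens and non-numeric text in numeric columns. Range problems, such as a negative age, would need an additional rule.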

Data Preparation Best Practices

Recommendations
  • Always keep a copy of your original dataset
  • Ensure your dataset has consistent units across all observations
  • Document any changes you make to the original data
  • When in doubt about a data value, it's better to mark it as missing than to guess
  • Use descriptive column names that clearly indicate what the data represents
Common Pitfalls
  • Using inconsistent data formats (especially for dates)
  • Including calculated columns that depend on other columns
  • Mixing different units in the same column
  • Not properly identifying categorical variables
  • Including identifying information (e.g., patient names, IDs) that should be removed
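A calculated column can often be spotted by recomputing it from the columns it depends on. As a minimal sketch, assume a BMI column alongside weight in kilograms and height in centimeters. BMI is weight in kilograms divided by the square of height in meters. The function name `looks_like_bmi`, the column names, the tolerance, and the sample values are hypothetical and chosen only for illustration.

```python
def looks_like_bmi(rows, tolerance=0.05):
    """Return True if every row's bmi matches weight_kg / height_m ** 2."""
    for row in rows:
        height_m = row["height_cm"] / 100
        expected = row["weight_kg"] / height_m ** 2
        if abs(expected - row["bmi"]) > tolerance:
            return False
    return True


# Illustrative values; the column names are assumptions
rows = [
    {"weight_kg": 81.6, "height_cm": 180.3, "bmi": 25.1},
    {"weight_kg": 62.0, "height_cm": 168.0, "bmi": 22.0},
]
print(looks_like_bmi(rows))
```

If a check like this succeeds, the column carries no information beyond the columns it was computed from, and is a candidate for flagging as ignored.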
Example: Before and After Preparation
Before Preparation
Patient      Age   Gender   Weight    Height    BMI    Diagnosis
John Smith   42    M        180 lbs   5'11"     25.1   Type 2 Diabetes
Mary Jones   -35   F        140       5'4"      24.0   None
Robert Lee   58    M        N/A       missing   ??     Hypertension
Susan Chen   29    F        62 kg     168 cm    22.0   Healthy

Issues: Inconsistent units, invalid age value, missing data, calculated BMI column, mixed formats, patient names included

After Preparation
ID    Age   Gender   Weight_kg   Height_cm   Diagnosis
001   42    M        81.6        180.3       Type 2 Diabetes
002   35    F        63.5        162.6       Healthy
003   58    M        NULL        NULL        Hypertension
004   29    F        62.0        168.0       Healthy

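Improvements: Consistent units, corrected age value, missing values marked explicitly as NULL, consistent diagnosis labels ("None" recorded as "Healthy"), clear labeling of units in column headers, removed calculated columns, anonymized IDs instead of names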
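The unit conversions in the example can be reproduced with a short script. This is a sketch under stated assumptions: the helper names `pounds_to_kg` and `feet_inches_to_cm` are hypothetical, values are rounded to one decimal place to match the table, and the unlabeled weight of 140 is assumed to be in pounds.

```python
import re

KG_PER_LB = 0.45359237   # exact definition of the international pound
CM_PER_INCH = 2.54       # exact definition of the inch

# Matches heights written like 5'11" (feet, then inches)
FEET_INCHES = re.compile(r"""\s*(\d+)'\s*(\d+)"\s*""")


def pounds_to_kg(pounds):
    """Convert pounds to kilograms, rounded to one decimal place."""
    return round(pounds * KG_PER_LB, 1)


def feet_inches_to_cm(text):
    """Convert a height such as 5'11" to centimeters, rounded to one decimal."""
    match = FEET_INCHES.fullmatch(text)
    if match is None:
        raise ValueError(f"unrecognized height format: {text!r}")
    feet, inches = int(match.group(1)), int(match.group(2))
    return round((feet * 12 + inches) * CM_PER_INCH, 1)


print(pounds_to_kg(180), feet_inches_to_cm("5'11\""))  # 81.6 180.3
print(pounds_to_kg(140), feet_inches_to_cm("5'4\""))   # 63.5 162.6
```

Raising an error on an unrecognized height format, rather than returning a guess, keeps malformed entries visible so they can be marked as missing.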