High-level design for a combination of standard dimensions and fuzzy matching:
Preprocessing:
Create a standard set of dimension values for each dimension.
Populate the dimension master tables with the standard values.
Preprocess the incoming data to standardize and normalize the dimension values using techniques like data profiling and data cleansing.
2. Matching:
For each incoming event, compare the dimension values with the standard set of values using fuzzy matching algorithms.
If a close match is found, map the value to the standard value.
If no close match is found, create a new value and add it to the dimension master table.
3. Validation:
Validate the mapped and created values against the standard set of dimension values.
If any values are found to be incorrect or inconsistent, update the dimension master table and correct the values.
The above design allows for combining the benefits of a standard set of dimension values with the flexibility of fuzzy matching. The standard set of dimension values provides a consistent and structured way of organizing the data, while fuzzy matching allows for capturing variations and errors in the data. The preprocessing step ensures that the incoming data is standardized and normalized before the matching process, which improves the accuracy of the fuzzy matching algorithm. The validation step ensures that the mapped and created values are accurate and consistent with the standard set of dimension values.
Low-level design for a combination of standard dimensions and fuzzy matching:
Preprocessing:
Load data into Pandas data frame
Standardize and normalize the dimension values using Python regular expressions and string manipulation functions
Create a PostGreSQL table for each dimension with columns for dimension value and unique identifier
Populate the tables with the standardized dimension values using SQL INSERT statements
2. Matching:
Calculate the similarity between the incoming dimension values and the standard set of dimension values using Python fuzzywuzzy library
Determine the similarity threshold above which a value will be considered a match
Map incoming dimension values to the standard dimension values based on the calculated similarity
If the calculated similarity is below the threshold, use the standard approach to map the incoming dimension values to the standard dimension values
If an incoming dimension value does not have a match in the standard dimension table, consider it as a new dimension value and add it to the table using SQL INSERT statement
3. Validation:
Compare the mapped and created dimension values against the standard set of dimension values using SQL queries
Identify and correct any errors or inconsistencies in the data
Note: In this approach, the fuzzy matching algorithm is used to improve the accuracy of the mapping process by allowing for variations in the incoming dimension values. If the calculated similarity is below the threshold, the standard matching approach is used to map the incoming dimension values to the standard dimension values. This helps to ensure that the mapped dimension values are accurate and consistent with the standard set of dimension values.