Our 10-Step Process to Clean, Standardize and Augment Data
Okay, we get it. You're one of those people who get excited to learn about data. You want to know how we take tens of millions of raw data points and turn them into valuable, searchable sales leads.
We LOVE to talk about information, too, so get ready to “geek out” for a bit while we go into detail and explain the what-and-how of the work our data analysts and database architects do to give you the best possible results, user experience and product.
5500Leads Proprietary 10-Step Process
Step 1 – On the first business day of the month, our data analysts download the latest Form 5500 data sets from the DOL. We download the last 5 years of data because companies are filing new Form 5500s, amending existing filings and sometimes filing past-due 5500s. We also check whether new columns have been added to the data sets (this sometimes happens).
Step 2 – During the monthly download process, we typically download over 40 raw database files from the DOL. Our data architect then imports each of these raw files using a special program that our software development team created to process each file.
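As a rough illustration of the "new columns" check, here is a minimal Python sketch. It is not the actual 5500Leads tooling; the function name and the column names are hypothetical and chosen only for the example.

```python
import csv
import io

def find_new_columns(known_columns, csv_text):
    """Return header names in csv_text that are not in known_columns, in file order."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # first row of the file is the header
    known = {name.strip().upper() for name in known_columns}
    return [name for name in header if name.strip().upper() not in known]

# Hypothetical column names and sample data, for illustration only.
known = ["ACK_ID", "PLAN_NAME", "SPONSOR_NAME"]
sample = "ACK_ID,PLAN_NAME,SPONSOR_NAME,NEW_FIELD\n1,Plan A,Acme,x\n"
new_columns = find_new_columns(known, sample)
print(new_columns)
```

Any name that shows up in the result is a column the import program has not seen before, so someone can review it before the import runs.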
Step 3 – It is time to scrub and clean the raw data we just imported. We run a series of SQL scripts (a set of SQL commands saved as a file) to correct and filter the data.
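To give a feel for what a scrubbing script can look like, here is a small sketch that runs SQL from Python against an in-memory SQLite database. The table, columns and rules are assumptions made up for this example; they are not the real scripts or schema.

```python
import sqlite3

# Hypothetical cleanup script: trim whitespace, normalize state codes,
# then filter out rows that have no sponsor name.
SCRUB_SQL = """
UPDATE filings
   SET sponsor_name  = TRIM(sponsor_name),
       sponsor_state = UPPER(TRIM(sponsor_state));
DELETE FROM filings
 WHERE sponsor_name IS NULL OR sponsor_name = '';
"""

def scrub(conn):
    """Run the cleanup script and return the remaining rows."""
    conn.executescript(SCRUB_SQL)
    query = "SELECT ack_id, sponsor_name, sponsor_state FROM filings ORDER BY ack_id"
    return conn.execute(query).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE filings (ack_id TEXT, sponsor_name TEXT, sponsor_state TEXT)")
conn.executemany(
    "INSERT INTO filings VALUES (?, ?, ?)",
    [("1", "  Acme Corp ", " ny"), ("2", "", "CA"), ("3", None, "tx"), ("4", "Globex", "Tx ")],
)
rows = scrub(conn)
print(rows)
```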
Step 4 – Now that we've cleaned up some of the raw data, it is time to start to analyze the data. We have a program that will identify new filings; identify existing filings; identify modified filings; and, finally, identify deleted filings. We then merge all that data into our existing databases.
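The four categories in this step can be sketched as a simple comparison between what is already stored and what was just downloaded. This is only an illustration under the assumption that every filing has a unique ID; the record layout is hypothetical.

```python
def classify_filings(existing, incoming):
    """Compare two dicts that map filing ID -> record.

    Returns sorted ID lists for new, existing (unchanged), modified
    and deleted filings.
    """
    result = {"new": [], "existing": [], "modified": [], "deleted": []}
    for filing_id, record in incoming.items():
        if filing_id not in existing:
            result["new"].append(filing_id)
        elif existing[filing_id] == record:
            result["existing"].append(filing_id)
        else:
            result["modified"].append(filing_id)
    # Anything stored before but missing from the download counts as deleted.
    for filing_id in existing:
        if filing_id not in incoming:
            result["deleted"].append(filing_id)
    for ids in result.values():
        ids.sort()
    return result

# Made-up sample data, for illustration only.
stored = {"A": {"plan": "X"}, "B": {"plan": "Y"}, "C": {"plan": "Z"}}
downloaded = {"A": {"plan": "X"}, "B": {"plan": "Y2"}, "D": {"plan": "W"}}
changes = classify_filings(stored, downloaded)
print(changes)
```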
Step 5 – It's time to clean up the data and start transforming it into information. We have created a program we call 'Matchup'. Matchup employs state-of-the-art fuzzy matching algorithms that help us identify records in our databases that refer to the same business (like the example we showed you of Priceline with two different names), the same insurance carrier spelled differently, and all the different ways BORs (brokers of record) like to submit their details. It aggregates multiple records, like BOR data, into a single record. With Matchup we have created many matching rules that help us identify duplicate information, from the obvious to the not so obvious, so we can prune and merge the database.
Step 6 – We now want to standardize and validate company names, addresses, emails, and phone info. We use a third-party service that we feel is best-in-class. We pass along the information we have from the filing, and the service returns standardized and validated information. The service provides full data quality by comparing company name, address, phone, and email information against multi-sourced datasets. The service enriches the data by updating addresses and adding latitude/longitude coordinates and comprehensive demographics. Basically, the service is adding more awesomeness to our database.
Step 7 – We need to compute or pre-calculate a lot of the most useful search fields. These are search fields like 'total revenue' (commission + fees paid) that the BOR made, or which benefits are fully-insured and which are self-insured. We pre-calculate or compute dozens of data points to enrich your search experience.
Step 8 – Time for our QA team to swing into action. They run through a series of quality assurance tests to confirm that the new data files are all correct and cleaned. Once our QA team gives the new data the thumbs up, our production team pushes all the new information into 5500Leads.
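Matchup itself is proprietary, so the following is only a generic sketch of fuzzy name matching using Python's standard `difflib` module. The suffix list, the 0.85 threshold and the sample names are assumptions for illustration, not Matchup's actual rules.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of legal suffixes to ignore when comparing names.
SUFFIXES = {"inc", "incorporated", "llc", "corp", "corporation", "co", "company", "ltd"}

def normalize(name):
    """Lowercase, replace punctuation with spaces, and drop legal suffixes."""
    tokens = re.sub(r"[^a-z0-9]+", " ", name.lower()).split()
    return " ".join(token for token in tokens if token not in SUFFIXES)

def similarity(a, b):
    """Similarity score between 0.0 and 1.0 for two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def is_same_business(a, b, threshold=0.85):
    """Treat two names as the same business when they are similar enough."""
    return similarity(a, b) >= threshold

# Made-up name variants, for illustration only.
print(is_same_business("Priceline.com Inc.", "PRICELINE COM INCORPORATED"))
print(is_same_business("Blue Cross Blue Shield", "Blue Cross Blue Sheild"))
print(is_same_business("Priceline.com Inc.", "Acme Widgets LLC"))
```

A single similarity threshold is the simplest possible rule. A real system would likely layer many rules on top of it, which is what the "many matching rules" above refers to.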
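The 'total revenue' field is simple arithmetic: commission plus fees, added up per broker. Here is a minimal sketch, assuming each row carries a broker name, a commission amount and a fee amount. The field names and numbers are made up for the example.

```python
from collections import defaultdict
from decimal import Decimal

def total_revenue_by_broker(rows):
    """Sum commissions + fees for each broker across all of its rows."""
    totals = defaultdict(Decimal)  # Decimal() starts each broker at 0
    for row in rows:
        totals[row["broker"]] += Decimal(row["commissions"]) + Decimal(row["fees"])
    return dict(totals)

# Made-up sample rows, for illustration only.
rows = [
    {"broker": "Broker A", "commissions": "1000.50", "fees": "250.25"},
    {"broker": "Broker A", "commissions": "500", "fees": "0"},
    {"broker": "Broker B", "commissions": "75", "fees": "25"},
]
revenue = total_revenue_by_broker(rows)
print(revenue)
```

Computing this once per monthly load, rather than at search time, is what keeps a search on 'total revenue' fast.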
And, Here's Where We Go the Extra Mile for YOU
At this point we've completed 3 of the 4 things that make data good and usable. We have:
- Data Cleansing – we cleaned up all the raw DOL data
- Data Hygiene – we have improved the data by fixing case and abbreviation issues, and we have validated contact details and fixed accuracy and quality issues
- Data Standardization – we have standardized, normalized and matched up the data
This is where our competitors stop … but not us. These next 2 steps are what really separate us from the competition.
Here's What Separates 5500Leads from the Competition
Step 9 – Now it is time to enrich and enhance the company information. We leverage machine learning and primary sources of information (like the company's website) to extract additional company details. We enhance the data by adding a company description, type of industry, specialties, website address information and more. When a company is publicly traded, we add stock information and company news.
Step 10 – If you thought the enhanced Company information was good, wait until you learn about all the Contacts we have. By scouring a Company's website, press releases and other primary sources of information we find every person that works at that company and we add them as a 'Company Contact' and create a profile record for them.
While scouring the web we often find their email address (over 11 million email addresses to date). Using our Matchup program, we match the person to their email and then to their LinkedIn profile. We also evaluate the person's job title from press releases, regulatory filings, and other primary sources, and for some job titles we mark the person as a 'Primary Contact'. These are people with C-level or HR-related job titles, basically all the people that you want to pitch your services to. We have a 'search within' the Company Contacts database so you can find a specific person or a specific job title.
We are continuously enhancing each and every person's 'Contacts Profile' within the 5500Leads program.
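Flagging a 'Primary Contact' from a job title can be pictured as a keyword rule. The sketch below is an assumption about how such a rule might look, not the actual 5500Leads logic; the patterns and sample titles are hypothetical.

```python
import re

# Hypothetical patterns: "Chief ... Officer", common C-level abbreviations, and HR terms.
C_LEVEL = re.compile(r"\b(chief\s+\w+(\s+\w+)?\s+officer|ceo|cfo|coo|cto|cio|chro)\b", re.IGNORECASE)
HR_RELATED = re.compile(r"\b(hr|human\s+resources)\b", re.IGNORECASE)

def is_primary_contact(job_title):
    """Return True when the title looks C-level or HR-related."""
    return bool(C_LEVEL.search(job_title) or HR_RELATED.search(job_title))

# Made-up job titles, for illustration only.
for title in ["Chief Financial Officer", "VP, Human Resources", "HR Manager", "Software Engineer"]:
    print(title, "->", is_primary_contact(title))
```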
Final Thoughts on our 5500 Data
So why did we take you through all this? Because we care about getting it right … and delivering what we promise. That is what separates us from the competition.
We've just shown you what it takes to transform raw DOL data into actionable, searchable information. It's a lot of work … and we do it EVERY SINGLE MONTH. Get the data partner you deserve … 5500Leads.
Your time and business are too valuable to settle for less.
Like they say, the proof is in the pudding, so try us out today and get '10 free leads within 10 miles' of your office.