sglafata · Posted September 19, 2003

I was wondering: what would be the easiest way of scrubbing/cleaning up data in CSV files for importing into a database? Are there any scripts or macros for this? Can it be done with Perl, PHP, or awk? It's basically a weekly dump of users and organizations, including each user's name, ID, phone number, email address, and company name. Unfortunately I have well over 10,000 records to scrub and a week to work on it. Can anyone help?
Steve Scrimpshire · Posted September 19, 2003

I have no idea what is meant by 'scrubbing', but here is a tool that lets you import CSV files into an SQL database: http://kesi.sourceforge.net/
sglafata (author) · Posted September 19, 2003

Scrubbing, or cleaning, means making the data consistent: ensuring that all values within certain fields follow the same format, fixing case, and so on. Basically, I have CSV files that contain member information, and membership is from all around the world. I need to put the phone numbers into a single format (e.g. 1-123-555-1212 and +52 999 99 99 99 would both be rewritten as XXX XXX XXX XXXX), and there must also be character limits per field. Once the data has been cleaned, it will be imported into an SQL database. I'd like to do this as easily and efficiently as possible.
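A minimal sketch of this kind of normalization in Python (the digits-only intermediate form and the truncation helper are assumptions for illustration; the real canonical phone format and per-field limits would have to be decided first):

```python
import re

def normalize_phone(raw):
    """Reduce a phone number to bare digits, dropping separators and a
    leading '+' (the country-code digits themselves are kept).
    Regrouping the digits into the final display format is a separate
    step once a canonical format is chosen."""
    return re.sub(r"\D", "", raw)

def truncate_field(value, limit):
    """Enforce a per-field character limit: strip surrounding
    whitespace, then truncate to the limit."""
    return value.strip()[:limit]
```

For example, normalize_phone("1-123-555-1212") yields "11235551212", and the same function applied to "+52 999 99 99 99" yields "52999999999", so both styles collapse to a comparable digit string before reformatting.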
schussat · Posted September 19, 2003

[quoting sglafata's post above]

Cleaning data in the manner you describe can be pretty complicated. To the extent that it's possible, you'll want to guarantee that the files are as consistent as possible before they reach your desk: it's dramatically easier to generate clean information in the first place than to turn dirty data into clean data.

That said, sure, you can probably use something like Perl to set the appropriate cases; ucfirst, for example, capitalizes the first letter of a string. And, depending on what phone format is appropriate (you'll have to determine that somehow, I guess), you can split a string of numbers at the appropriate places.

So it sounds do-able, but I don't know how easy or efficient it will be. Dealing with text is relatively straightforward with any number of tools, but you'll probably have to assemble the tools into the right configuration on your own. Data cleaning is pretty particularistic, so the odds of finding a 'generic' set of scripts are probably low.
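The case-setting and whitespace cleanup described above could be sketched in Python rather than Perl (the column names name, email, and company here are assumptions for illustration; the real CSV's header would determine them):

```python
import csv
import io

# A stand-in for one row of the weekly dump, with messy casing and padding.
SAMPLE = "name,email,company\njohn SMITH,John.Smith@EXAMPLE.COM ,  Acme Corp\n"

def clean_row(row):
    """Apply simple per-field cleaning rules to one CSV record."""
    row["name"] = row["name"].title()          # rough analogue of Perl's ucfirst
    row["email"] = row["email"].strip().lower()  # emails compare case-insensitively
    row["company"] = row["company"].strip()
    return row

# csv.DictReader handles quoting and commas inside fields, which a naive
# split(',') would get wrong.
rows = [clean_row(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
```

The same clean_row function would drop into a loop over csv.DictReader/csv.DictWriter reading the real file and writing a cleaned copy for the database import.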