Jump to content

Data scrubbing Excel spreadsheets in Linux


sglafata
 Share

Recommended Posts

I was wondering what would be the easiest way of scrubbing/cleaning up data in CSV files for importing into a database?

 

Are there any scripts or macros? Can this be done using a perl or php or awk?

 

It's basically a weekly dump of users and organizations. They include the user's name, ID, phone number, email address and company name. Unfortunately I have well over 10,000 records to scrub and I have a week to work on it.

 

Can anyone help?

Link to comment
Share on other sites

Scrubbing, or cleaning, the data in a manner that all data within certain fields match or case-sensitivity, etc.

 

Basically, I have CSV files that contain member information. Membership is from all around the world. I need to set phone numbers in the same format (i.e. - 1-123-555-1212 and +52 999 99 99 99 will read in XXX XXX XXX XXXX format) Also there must be character limits per field. Once the data has been cleaned, it will be imported into a SQL database.

 

I'd like to do this as easily as possible and in a most efficient manner.

Link to comment
Share on other sites

Scrubbing, or cleaning, the data in a manner that all data within certain fields match or case-sensitivity, etc.

 

Basically, I have CSV files that contain member information. Membership is from all around the world. I need to set phone numbers in the same format (i.e. - 1-123-555-1212 and +52 999 99 99 99 will read in XXX XXX XXX XXXX format) Also there must be character limits per field. Once the data has been cleaned, it will be imported into a SQL database.

 

I'd like to do this as easily as possible and in a most efficient manner.

Cleaning data in the manner you describe can be pretty complicated. To the extent that it's possible, you'll want to guarantee that all the files are as consistent as possible before they reach your desk. It's dramatically easier to generate "clean" information than to make dirty data into clean data.

 

That said, sure, you can probably use something like perl to set appropriate cases. ucfirst can set capital letters on strings, for example. And, depending on what phone format is appropriate (you'll have to determine that somehow, I guess), you can split a string of numbers at the appropriate places. So, it sounds do-able, but I don't know how easy or efficient it will be. Dealing with text is relatively straightforward using any number of tools -- but you'll probably have to assemble the tools, into the right configuration, on your own. Data cleaning is pretty particularistic, so the odds of finding a "generic" set of scripts are probably low.

Link to comment
Share on other sites

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.

Guest
Reply to this topic...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

Loading...
 Share

×
×
  • Create New...