lbbros Posted December 16, 2002

Hello! For work reasons, I was given a long (600k) text file. Every line in this text file is made up as follows:

1 - a code made up of letters and spaces, 9 characters;
2 - an empty area of 7 characters;
3 - a series of either dashes (-) or letters, 60 characters long;

something like

    3543786-01 ---------------G-W------FF (etc)

Now what I need to do is to search in the third area (3) by *columns*, looking for a specific letter, and printing out the result. Basically, if I have something like this:

    3543786-01 --AB------------G-W------FF (etc)
    3543786-02 --AB------------G-X------FF (etc)
    3543786-03 --AB------------G-C------FF (etc)

I'd like to, say, limit the search to a single column (the one with the A in my example) and look in the file for occurrences of that letter, getting a result like:

    Letter: A
    Column: 3
    Number of occurrences: 3

It's important to search by columns, and not by rows. I would then repeat the search for the subsequent columns, till the end of the line. I was told that using awk would do the trick, but I have no idea how 8(

Thanks for your help!
theYinYeti Posted December 16, 2002

I have difficulty understanding what you want. Is it:

"I want to do a search on column N. What letters are there in this Nth column? How many times each?"
or
"I want to do a search on column N. Show me the lines with letter <X> in this Nth column."
or
"I want to do a search on column N. How many lines are there with letter <X> in this Nth column?"
or
"I want to do a search for the letter <X>. What column numbers contain this letter?"
or
"I want to do a search for the letter <X>. Show me any line, one column of which contains this letter."

The only thing I understand is that you absolutely don't care about anything before the series of - and letters. Is that OK?

Yves.
lbbros Posted December 16, 2002

Thanks for the reply. Yes, you're right: anything that comes before the series of dashes and letters doesn't matter. My search idea would be

"I want to do a search for the letter <X> in column N. Show me how many occurrences you find."

or (even better)

"Scan column N. Tell me which letters, and how many, you find."

Sorry for being confusing, I was never good at explaining 8)
theYinYeti Posted December 17, 2002

"I want to do a search for the letter <X> in column N. Show me how many occurrences you find."

    cat file | awk '{print $2}' | grep -E '^.{19}A' | wc -l

where you replace A with the letter you want, and 19 with the number of columns before the one that interests you (here, the column of interest is the 20th). Note that awk '{print $2}' picks the second whitespace-separated field, so this assumes the code at the start of the line contains no spaces, as in your example.

or (even better) "Scan column N. Tell me which letters, and how many, you find."

    cat file | awk '{print $2}' | sed 's/^.\{19\}\(.\).*$/\1/' | sort | uniq -c

where you replace 19 with the number of columns before the one that interests you (here, the column of interest is the 20th).

If you've got multiple files you want to process simultaneously, just add them to the cat command:

    cat file1 file2 file3 ... fileN | awk ...

If the files don't contain only lines such as you describe, but also lines with different formats or comments, you can add a filter by writing the following code just before 'awk' on the command line:

    grep -E '^[^[:space:]]+[[:space:]]+[[:alnum:]-]{60}[[:space:]]*$' |

That's all assuming that the interesting lines begin with at least one non-space character, followed by at least one space, then the 60 characters (either letters, or digits, or -), optionally followed by space characters until the end of the line.

You can also make a script:

    #!/bin/bash
    # first parameter: the column number
    # next parameter(s): the file(s)
    COLN=$(( $1 - 1 ))   # number of characters to skip before the column
    shift
    # "$@" keeps file names that contain spaces intact
    cat "$@" |
      grep -E '^[^[:space:]]+[[:space:]]+[[:alnum:]-]{60}[[:space:]]*$' |
      awk '{print $2}' |
      sed "s/^.\{$COLN\}\(.\).*$/\1/" |
      sort | uniq -c

Yves.
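The pipelines above can also be folded into a single awk pass, which is closer to what the original question asked for. What follows is only a sketch under stated assumptions, not the method given in the thread: the function names count_column and count_all_columns are made up for illustration, the dash/letter area is assumed to be the last whitespace-separated field on each line (so codes that contain spaces should not matter), and dashes are skipped rather than counted.

```shell
#!/bin/sh
# Sketch only: count_column and count_all_columns are hypothetical helper
# names, not taken from the thread.
# Assumption: the dash/letter area is the last whitespace-separated field
# ($NF), so a code containing spaces does not shift it.

# Usage: count_column COLUMN FILE...
# Tallies every non-dash character found in the given column.
count_column() {
    col=$1
    shift
    awk -v col="$col" '
        NF > 0 {
            ch = substr($NF, col, 1)          # character in the requested column
            if (ch != "" && ch != "-")        # skip short lines and dashes
                count[ch]++
        }
        END {
            for (ch in count)
                printf "Letter: %s Column: %d Number of occurrences: %d\n", ch, col, count[ch]
        }
    ' "$@" | sort
}

# Usage: count_all_columns FILE...
# Same tally for every column in one pass; prints "column letter count".
count_all_columns() {
    awk '
        NF > 0 {
            n = length($NF)
            for (i = 1; i <= n; i++) {
                ch = substr($NF, i, 1)
                if (ch != "-")
                    count[i, ch]++
            }
        }
        END {
            for (key in count) {
                split(key, part, SUBSEP)      # recover column and letter from the key
                printf "%d %s %d\n", part[1], part[2], count[key]
            }
        }
    ' "$@" | sort -k1,1n -k2,2
}
```

After sourcing the functions, `count_column 20 file1 file2` would report on the 20th column of both files, and `count_all_columns file1` would cover every column without rerunning the search by hand. Unlike the grep filter above, this sketch does not check that the data area is exactly 60 characters long, so comment lines or lines in other formats would need to be filtered out first.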
lbbros Posted December 18, 2002

Let me say... THANKS! This will save a lot of time 8)