Advanced search in a text - help!


lbbros

Hello!

 

For work reasons, I was given a long (600k) text file. Every line in this text file is made up as follows:

 

1 - a code made up of letters and spaces, 9 characters;

2 - an empty area of 7 characters;

3 - a series of either dashes (-) or letters, 60 characters long;

 

something like

 

3543786-01        ---------------G-W------FF (etc)

 

Now, what I need to do is search the third area (3) by *columns*, looking for a specific letter, and print out the result. Basically, if I have something like this:

 

3543786-01        --AB------------G-W------FF (etc)

3543786-02        --AB------------G-X------FF (etc)

3543786-03        --AB------------G-C------FF (etc)

 

I'd like to, say, limit the search to a single column (the one with the A in my example) and look through the file for occurrences of that letter, getting a result like:

 

Letter: A

Column: 3

Number of occurrences: 3

 

It's important to search by columns, and not by rows. I would then repeat the search for the subsequent columns, till the end of the line.

I was told that using awk would do the trick, but I have no idea how 8(

Thanks for your help!


I'm having difficulty understanding what you want. Is it:

"I want to do a search on column N. What letters are there in this Nth column? How many times each?"

or

"I want to do a search on column N. Show me the lines with letter <X> in this Nth column."

or

"I want to do a search on column N. How many lines are there with letter <X> in this Nth column?"

or

"I want to do a search for the letter <X>. What column numbers contain this letter?"

or

"I want to do a search for the letter <X>. Show me any line, one column of which contains this letter."

 

The only thing I understand is that you don't care at all about anything before the series of dashes and letters. Is that right?

 

Yves.



Thanks for the reply.

Yes, you're right about anything that comes first. My search idea would be

"I want to do a search for the letter <X> in column N. Show me how many occurrences you find."

or (even better)

"Scan column N. Tell me which letters, and how many, you find."

Sorry for being confusing, I was never good at explaining 8)


"I want to do a search for the letter <X> in column N. Show me how many occurrences you find."

cat file | awk '{print $2}' | grep -E '^.{19}A' | wc -l

where you replace A with the letter you want, and 19 with the number of columns before the one you are interested in (here, the column of interest is the 20th).
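Since awk was mentioned, the same count can also be sketched in awk alone. This is only a sketch: the function name count_letter is made up for illustration, and it assumes the dash/letter series is the last whitespace-separated field of each line (so a code containing spaces would not shift the column).

```shell
# count_letter COLUMN LETTER FILE...
# (hypothetical helper, not part of any standard tool)
# Prints the number of lines whose last whitespace-separated field
# has LETTER at position COLUMN (1-based).
count_letter() {
    col=$1
    letter=$2
    shift 2
    awk -v col="$col" -v letter="$letter" '
        # substr() extracts exactly one character at the wanted column
        substr($NF, col, 1) == letter { n++ }
        # n + 0 forces a numeric 0 when nothing matched
        END { print n + 0 }
    ' "$@"
}
```

With the three example lines from the first post saved in a file, `count_letter 3 A file` should report 3.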

 

or (even better)

"Scan column N. Tell me which letters, and how many, you find."


cat file | awk '{print $2}' | sed 's/^.\{19\}\(.\).*$/\1/' | sort | uniq -c

where you replace 19 with the number of columns before the one you are interested in (here, the column of interest is the 20th).
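The tally can also be done in awk with an associative array, which avoids the sed step. Again just a sketch: tally_column is a hypothetical name, and the series is assumed to be the last whitespace-separated field of each line.

```shell
# tally_column COLUMN FILE...
# (hypothetical helper, not part of any standard tool)
# Prints "count character" for every character found at position
# COLUMN (1-based) of the last whitespace-separated field.
tally_column() {
    col=$1
    shift
    awk -v col="$col" '
        # skip blank lines and lines too short to have this column
        NF && length($NF) >= col { count[substr($NF, col, 1)]++ }
        END { for (c in count) print count[c], c }
    ' "$@" | sort -k2    # awk array order is unspecified, so sort by character
}
```

Dashes are counted like any other character here; pipe the result through `grep -v -- '-$'` if only letters are of interest.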

 

If you've got multiple files you want to process simultaneously, just add them to the cat command:

cat file1 file2 file3 ... fileN | awk ...

 

If the files don't contain only lines such as you describe, but also lines in different formats or comments, you can add a filter by writing the following code just before 'awk' on the command line:

grep -E '^[^[:space:]]+[[:space:]]+[[:alnum:]-]{60}[[:space:]]*$' |

 

That all assumes that the interesting lines begin with at least one non-space character, followed by at least one space, then the 60 characters (letters, digits, or -), optionally followed by space characters until the end of the line.

 

You can also make a script:

#!/bin/bash

# first parameter: the column number

# next parameter(s): the file(s)



COLN=$(( $1 - 1 ))

shift

cat "$@" | grep -E '^[^[:space:]]+[[:space:]]+[[:alnum:]-]{60}[[:space:]]*$' | awk '{print $2}' | sed "s/^.\{$COLN\}\(.\).*$/\1/" | sort | uniq -c
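Since the goal is to repeat the search for every column up to the end of the line, a single awk pass over the file can tally all columns at once instead of rerunning the script 60 times. A rough sketch, with the same assumption as before (the series is the last whitespace-separated field); the name tally_all_columns is made up, and dashes are skipped on the guess that only letters matter.

```shell
# tally_all_columns FILE...
# (hypothetical helper, not part of any standard tool)
# Prints "column letter count" for every non-dash character,
# one line per (column, letter) pair, ordered by column then letter.
tally_all_columns() {
    awk '
        NF {
            seq = $NF
            len = length(seq)
            for (i = 1; i <= len; i++) {
                c = substr(seq, i, 1)
                # count[i, c] builds the key from i, SUBSEP and c
                if (c != "-") count[i, c]++
            }
        }
        END {
            for (key in count) {
                split(key, part, SUBSEP)
                print part[1], part[2], count[key]
            }
        }
    ' "$@" | sort -k1,1n -k2,2
}
```

Each output line reads as "Column, Letter, Number of occurrences", so for the example in the first post one of the lines would be `3 A 3`. The grep filter from the script can be placed in front of it if the files contain other kinds of lines.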

 

Yves.

