Text Searching and Manipulation
grep
In a nutshell, grep searches text files for occurrences of a given regular expression and outputs any line containing a match to standard output, which is usually the terminal screen.
For example, running ls /usr/bin | grep zip lists all the files in the /usr/bin directory with ls and pipes the output into the grep command, which prints every line containing the string “zip”. Understanding the grep tool and when to use it can prove incredibly useful.
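To search for more than one pattern at once, separate the alternatives with \| in a basic regular expression (a GNU grep extension), or with | when using grep -E.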
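As a minimal sketch of the pipeline described above, the following feeds a few made-up file names into grep. They are stand-ins for the real contents of /usr/bin, which vary by system. The second command uses -E with an alternation, which is one way to search for more than one pattern.

```shell
# Hypothetical file names standing in for the output of `ls /usr/bin`.
names=$(printf '%s\n' bunzip2 gzip ls tar unzip zipinfo)

# Keep only the lines that contain the string "zip".
zip_matches=$(printf '%s\n' "$names" | grep zip)
printf '%s\n' "$zip_matches"

# -E enables extended regular expressions, so | separates alternatives.
multi_matches=$(printf '%s\n' "$names" | grep -E 'tar|ls')
printf '%s\n' "$multi_matches"
```

grep prints matching lines in input order, so the first command keeps bunzip2, gzip, unzip, and zipinfo.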
split
The split command in Unix/Linux is used to split a file into smaller pieces. This can be particularly useful for managing large files, distributing data, or simply breaking down files into more manageable parts. Here’s an overview of how to use the split command along with various options:
Basic Syntax
split [OPTION]... [INPUT [PREFIX]]
OPTION: Various options to specify how the file should be split.
INPUT: The input file to split. If not specified, split reads from standard input.
PREFIX: The prefix for the output files. The default prefix is "x".
Common Options
-l LINES: Split the file into pieces with LINES lines each.
-b SIZE: Split the file into pieces of SIZE bytes each. You can use suffixes like K, M, or G for kilobytes, megabytes, or gigabytes, respectively.
-C SIZE: Put at most SIZE bytes of complete lines into each piece, so that no line is split across files unless a single line is longer than SIZE.
-d: Use numeric suffixes instead of alphabetic. For example, x00, x01, etc.
-a SUFFIX_LENGTH: Use suffixes of length SUFFIX_LENGTH. The default is 2.
Examples
Split by Lines: split -l 1000 largefile.txt splits largefile.txt into files each containing 1000 lines. The output files will be named xaa, xab, xac, etc.
Split by Bytes: split -b 1M largefile.txt splits largefile.txt into files each containing 1 megabyte of data. The output files will be named xaa, xab, xac, etc.
Split by Lines with Numeric Suffixes: split -l 1000 -d largefile.txt splits largefile.txt into files each containing 1000 lines, with numeric suffixes (x00, x01, x02, etc.).
Split with Custom Prefix: split -l 1000 largefile.txt part_ splits largefile.txt into files each containing 1000 lines, with the prefix part_. The output files will be named part_aa, part_ab, part_ac, etc.
Split by Size While Keeping Lines Whole: split -C 100M largefile.txt splits largefile.txt into files of at most 100 megabytes each, but ensures that no line is split across files. The output files will be named xaa, xab, xac, etc.
Split by Bytes with Long Suffixes: split -b 1M -a 3 largefile.txt splits largefile.txt into files each containing 1 megabyte of data, with suffixes of length 3 (xaaa, xaab, xaac, etc.).
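The naming rules above can be checked with a small experiment. This is a sketch that assumes GNU coreutils split (the -d option is not available everywhere); the 2500-line input file is invented for the demonstration.

```shell
# Work in a throwaway directory so the generated pieces are easy to inspect.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build a hypothetical 2500-line input file.
seq 1 2500 > largefile.txt

# 1000 lines per piece, numeric suffixes, custom prefix "part_".
split -l 1000 -d largefile.txt part_

pieces=$(ls part_* | tr '\n' ' ')
last_piece_lines=$(wc -l < part_02 | tr -d ' ')
echo "pieces: $pieces"
echo "lines in part_02: $last_piece_lines"
```

The last piece holds only the 500 leftover lines, since split does not pad or rebalance the final file.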
Practical Use Case: Splitting a Log File
Assume you have a large log file server.log and you want to split it into smaller parts with each part containing 500 lines. Running split -l 500 server.log log_part_ creates files named log_part_aa, log_part_ab, log_part_ac, etc., each containing 500 lines from server.log (the last file may contain fewer).
Combining Split Files
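To combine the split files back into the original file, you can use the cat command, for example cat log_part_* > combined.log (the output name here is arbitrary). The shell expands log_part_* in sorted order, which matches the order in which split wrote the pieces.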
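A quick way to convince yourself that splitting and recombining is lossless is a round trip on a generated file. The sketch below invents a 1200-line server.log and an output name, rebuilt.log; both are assumptions for illustration only.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-in for a large log file: 1200 numbered lines.
seq 1 1200 | sed 's/^/log entry /' > server.log

# Split into 500-line pieces named log_part_aa, log_part_ab, ...
split -l 500 server.log log_part_

# The glob expands in sorted order, which is the order split wrote the pieces.
cat log_part_* > rebuilt.log

piece_count=$(ls log_part_* | wc -l | tr -d ' ')
if cmp -s server.log rebuilt.log; then roundtrip=identical; else roundtrip=different; fi
echo "$piece_count pieces, round trip: $roundtrip"
```

cmp -s compares the two files byte by byte and reports the result only through its exit status, which makes it convenient in scripts.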
Conclusion
The split command is a versatile tool for dividing large files into smaller, more manageable pieces. By using various options, you can control the size, number of lines, and naming conventions of the output files to suit your needs.
sed
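sed is a powerful stream editor. It is also very complex, so we will only briefly scratch its surface here. At a very high level, sed performs text editing on a stream of text, either a set of specific files or standard input. For example, echo "I need to try hard" | sed 's/hard/harder/' prints "I need to try harder".
By default, the s (substitute) command changes only the first match on each line; it does not replace globally.
To change every match on a line, append the g (global) flag, as in s/hard/harder/g.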
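The difference the g flag makes is easiest to see on a line that contains the search text twice. The input line below is made up for the purpose.

```shell
line='hard work beats hard luck'

# Without g: only the first "hard" on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/hard/smart/')

# With g: every "hard" on the line is replaced.
all_matches=$(printf '%s\n' "$line" | sed 's/hard/smart/g')

printf '%s\n' "$first_only"    # smart work beats hard luck
printf '%s\n' "$all_matches"   # smart work beats smart luck
```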
cut
The cut command is a Unix/Linux utility used to extract sections from each line of input (usually files). It's commonly used to parse and retrieve specific columns of data from text files or command output. Here's an overview of how to use the cut command with various options:
Basic Syntax
cut OPTION... [FILE]...
If no file is specified, cut reads from the standard input.
Common Options
-b LIST: Select only the bytes listed in LIST.
-c LIST: Select only the characters listed in LIST.
-d DELIM: Use DELIM instead of the tab character as the field delimiter.
-f LIST: Select only the fields listed in LIST.
--complement: Complement the selection. This option makes cut select all bytes, characters, or fields except the ones specified.
--output-delimiter=STRING: Use STRING as the output delimiter. The default is to use the input delimiter.
Examples
Cut by Bytes: cut -b 1-5 extracts the first 5 bytes from each input line.
Cut by Characters: cut -c 1-5 extracts the first 5 characters from each input line.
Cut by Fields: cut -d ':' -f 2 extracts the second field from each input line, using : as the delimiter.
Multiple Fields: cut -d ':' -f 1,3 extracts the first and third fields, using : as the delimiter.
Complement: cut -d ':' -f 2 --complement extracts all fields except the second one, using : as the delimiter.
Change Output Delimiter: cut -d ':' -f 1,3 --output-delimiter=',' extracts the first and third fields and changes the output delimiter to a comma.
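The options above can be tried on a single line of input. In this sketch the input strings are invented, and the last two commands assume GNU cut, since --complement and --output-delimiter are GNU extensions.

```shell
record='alice:30:paris'   # hypothetical colon-delimited record

first_bytes=$(printf '%s\n' 'Hello, World!' | cut -b 1-5)
second_field=$(printf '%s\n' "$record" | cut -d ':' -f 2)
first_and_third=$(printf '%s\n' "$record" | cut -d ':' -f 1,3)

# The next two options are GNU extensions.
all_but_second=$(printf '%s\n' "$record" | cut -d ':' -f 2 --complement)
comma_joined=$(printf '%s\n' "$record" | cut -d ':' -f 1,3 --output-delimiter=',')

printf '%s\n' "$first_bytes" "$second_field" "$first_and_third" \
    "$all_but_second" "$comma_joined"
```

Selecting fields 1 and 3 and complementing field 2 give the same result for a three-field record; the complement form is handier when there are many fields and only a few to drop.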
Practical Use Case: Extracting Specific Columns from a CSV File
Assume you have a CSV file data.csv whose columns include Name and Location. To extract the Name and Location columns, set the delimiter to a comma with -d ',' and pass the positions of the two columns to -f.
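As a hedged illustration of the CSV case, the sketch below builds a made-up data.csv in which Name is the first column and Location the third. The header, the rows, and the column positions are all assumptions.

```shell
workdir=$(mktemp -d)

# Hypothetical CSV content; Name is column 1 and Location is column 3.
cat > "$workdir/data.csv" <<'EOF'
Name,Age,Location
Alice,30,Paris
Bob,25,Berlin
EOF

name_location=$(cut -d ',' -f 1,3 "$workdir/data.csv")
printf '%s\n' "$name_location"
```

cut splits on every comma and knows nothing about CSV quoting, so this approach only suits files whose fields never contain commas.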
head
tail
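head prints the first lines of its input and tail prints the last lines; both default to 10 lines, and -n changes the count. A minimal sketch, using seq to generate 20 numbered lines as stand-in input:

```shell
# First three and last two lines of a 20-line stream.
first_three=$(seq 1 20 | head -n 3 | tr '\n' ' ')
last_two=$(seq 1 20 | tail -n 2 | tr '\n' ' ')

# With no -n option, head prints 10 lines.
default_count=$(seq 1 20 | head | wc -l | tr -d ' ')

echo "head -n 3: $first_three"
echo "tail -n 2: $last_two"
echo "default head line count: $default_count"
```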
awk
awk is a powerful programming language and command-line utility for text processing and data extraction in Unix/Linux environments. It is especially useful for working with structured text data, such as CSV files or log files. Here’s a detailed guide on how to use awk effectively:
Basic Syntax
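awk 'pattern { action }' file
pattern: Specifies the condition to match. If omitted, the action runs for every line.
action: Specifies the commands to execute when the pattern matches. If omitted, the matching line is printed.
file: Specifies the input file(s) to process. If no file is provided, awk reads from standard input.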
Common Usage Patterns
Print Specific Columns: awk '{ print $1, $3 }' file.txt prints the first and third columns from file.txt.
Field Separator: awk -F ',' '{ print $1, $2 }' file.csv sets the field separator to a comma and prints the first and second columns from file.csv.
Conditional Processing: awk '$3 > 100 { print $1, $2 }' file.txt prints the first and second columns for rows where the third column is greater than 100.
Using Built-in Variables: awk '{ print NR, $0 }' file.txt prints the line number (NR) followed by the entire line ($0) from file.txt.
Pattern Matching: awk '/error/ { print }' file.txt prints all lines containing the string "error" from file.txt.
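The common patterns can be exercised against a small sample file. The contents of file.txt below are invented for illustration; only the awk programs themselves follow the descriptions above.

```shell
workdir=$(mktemp -d)

# Hypothetical whitespace-separated data: name, two numbers, status.
cat > "$workdir/file.txt" <<'EOF'
alpha 50 90 ok
beta 75 150 error
gamma 20 30 ok
EOF

cols=$(awk '{ print $1, $3 }' "$workdir/file.txt")           # columns 1 and 3
big=$(awk '$3 > 100 { print $1, $2 }' "$workdir/file.txt")   # rows with $3 > 100
numbered=$(awk '{ print NR, $0 }' "$workdir/file.txt")       # line number + line
errors=$(awk '/error/ { print }' "$workdir/file.txt")        # lines matching /error/
csv=$(printf '%s\n' 'a,b,c' | awk -F ',' '{ print $1, $2 }') # comma as separator

printf '%s\n' "$cols" "$big" "$numbered" "$errors" "$csv"
```

The comma in print $1, $3 inserts the output field separator, a single space by default, between the two values.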
Advanced Examples
Summing a Column: awk '{ sum += $2 } END { print sum }' file.txt sums the values in the second column and prints the total after processing all lines.
Average Calculation: awk '{ sum += $2 } END { if (NR > 0) print sum / NR }' file.txt calculates and prints the average of the values in the second column.
Complex Field Separator: awk -F '[: ]' '{ print $1, $3 }' file.txt sets the field separator to either a colon or a space and prints the first and third columns.
Output Formatting: printf formats the output with literal text and format specifiers, for example awk '{ printf "%-10s %5d\n", $1, $2 }' file.txt.
Multi-file Processing: awk 'FNR == 1 { print FILENAME } { print }' file1.txt file2.txt prints the filename before printing the contents of each file.
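The sum and average programs rely on the END block, which runs once after the last input line. A small sketch with made-up scores shows both, along with a bracketed field separator:

```shell
# Hypothetical two-column input: name and score.
scores=$(printf '%s\n' 'alice 10' 'bob 20' 'carol 30')

# Accumulate column 2, then print once all lines have been read.
total=$(printf '%s\n' "$scores" | awk '{ sum += $2 } END { print sum }')

# NR holds the number of lines read; the guard avoids dividing by zero.
average=$(printf '%s\n' "$scores" |
    awk '{ sum += $2 } END { if (NR > 0) print sum / NR }')

# A bracket expression as separator: split on either a colon or a space.
split_fields=$(printf '%s\n' 'root:x 0' | awk -F '[: ]' '{ print $1, $3 }')

echo "total=$total average=$average fields=$split_fields"
```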
Practical Use Case: Parsing a Log File
Assume you have a log file access.log in which each line records, among other fields, a client IP address, a date, and a status code. To extract and print those three values, print the corresponding fields with awk.
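The exact log format is not shown here, so the sketch below assumes lines in the widely used Common Log Format, where whitespace splitting puts the IP address in field 1, the bracketed timestamp in field 4, and the status code in field 9. The sample line is invented.

```shell
workdir=$(mktemp -d)

# Hypothetical access.log entry in Common Log Format.
cat > "$workdir/access.log" <<'EOF'
192.168.1.10 - - [10/Jul/2024:14:23:45 +0000] "GET /index.html HTTP/1.1" 200 1024
EOF

# sub() strips the leading "[" from the timestamp field before printing.
summary=$(awk '{ sub(/^\[/, "", $4); print $1, $4, $9 }' "$workdir/access.log")
printf '%s\n' "$summary"
```

For a log with a different layout, only the field numbers need to change.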
Combining awk with Other Commands
awk can be combined with other commands using pipes. For example, to find the total number of unique IP addresses in the log file, assuming the IP address is the first field on each line: awk '{ print $1 }' access.log | sort -u | wc -l
uniq
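uniq filters out repeated lines, but only when they are adjacent, which is why it is almost always paired with sort. The sketch below uses an invented list of IP addresses to show the difference and to count occurrences with -c.

```shell
# Hypothetical list of client IPs, with repeats that are not all adjacent.
ips=$(printf '%s\n' 10.0.0.1 10.0.0.2 10.0.0.1 10.0.0.1)

# uniq only collapses adjacent duplicates, so unsorted input keeps repeats.
unsorted_result=$(printf '%s\n' "$ips" | uniq | wc -l | tr -d ' ')

# Sorting first groups equal lines together.
unique_count=$(printf '%s\n' "$ips" | sort | uniq | wc -l | tr -d ' ')

# -c prefixes each distinct line with its number of occurrences.
top=$(printf '%s\n' "$ips" | sort | uniq -c | sort -rn | head -n 1 |
    awk '{ print $2, $1 }')

echo "without sort: $unsorted_result lines, with sort: $unique_count lines"
echo "most frequent: $top"
```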
comm
The comm command in Unix/Linux is used to compare two sorted files line by line and produces three-column output: lines only in the first file, lines only in the second file, and lines common to both files. It is a useful tool for identifying differences and similarities between two datasets.
Basic Syntax
comm [OPTION]... FILE1 FILE2
FILE1: The first sorted file.
FILE2: The second sorted file.
Common Options
-1: Suppress the first column (lines unique to FILE1).
-2: Suppress the second column (lines unique to FILE2).
-3: Suppress the third column (lines common to both files).
--check-order: Check that the input files are sorted, and fail if they are not.
--nocheck-order: Do not check that the input files are sorted. When neither option is given, GNU comm reports unsorted input only if it finds lines it cannot pair.
Examples
Basic Comparison: comm file1.txt file2.txt compares file1.txt and file2.txt and outputs three columns: lines only in file1.txt, lines only in file2.txt, and lines common to both files.
Suppress First Column: comm -1 file1.txt file2.txt suppresses the first column and only shows lines unique to file2.txt and lines common to both files.
Suppress Second Column: comm -2 file1.txt file2.txt suppresses the second column and only shows lines unique to file1.txt and lines common to both files.
Suppress Third Column: comm -3 file1.txt file2.txt suppresses the third column and only shows lines unique to file1.txt and lines unique to file2.txt.
Find Common Lines Only: comm -12 file1.txt file2.txt suppresses the first and second columns, showing only lines common to both files.
Find Unique Lines in Both Files: comm -3 file1.txt file2.txt suppresses the third column, showing only lines unique to file1.txt and lines unique to file2.txt.
Practical Use Case: Comparing Lists
Assume you have two sorted files, list1.txt and list2.txt, each containing a list of items.
To find items that are in both lists: comm -12 list1.txt list2.txt
To find items that are only in list1.txt: comm -23 list1.txt list2.txt
To find items that are only in list2.txt: comm -13 list1.txt list2.txt
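The three comparisons can be tried on two short, already sorted lists. The list contents below are made up for the example.

```shell
workdir=$(mktemp -d)

# Hypothetical lists, already in sorted order as comm expects.
printf '%s\n' apple banana cherry > "$workdir/list1.txt"
printf '%s\n' banana cherry date > "$workdir/list2.txt"

in_both=$(comm -12 "$workdir/list1.txt" "$workdir/list2.txt")
only_list1=$(comm -23 "$workdir/list1.txt" "$workdir/list2.txt")
only_list2=$(comm -13 "$workdir/list1.txt" "$workdir/list2.txt")

echo "in both:    $(printf '%s' "$in_both" | tr '\n' ' ')"
echo "only list1: $only_list1"
echo "only list2: $only_list2"
```

Each option names a column to hide, so -23 hides columns 2 and 3 and leaves only the lines unique to the first file.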
Sorting Before Comparison
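The comm command requires the input files to be sorted. If the files are not sorted, you can sort them before using comm, for example with sort -o file1.txt file1.txt and sort -o file2.txt file2.txt.
Alternatively, in shells that support process substitution, such as bash, you can sort and compare in one command: comm <(sort file1.txt) <(sort file2.txt)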
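Where process substitution is not available, a plain pipe works too, because comm accepts - as a file name meaning standard input. The sketch below is one portable arrangement under that assumption; the file contents and the file2.sorted name are invented.

```shell
workdir=$(mktemp -d)

# Hypothetical unsorted inputs.
printf '%s\n' cherry apple banana > "$workdir/file1.txt"
printf '%s\n' date banana cherry > "$workdir/file2.txt"

# Sort the second file into a temporary copy.
sort "$workdir/file2.txt" > "$workdir/file2.sorted"

# Sort the first file on the fly; "-" tells comm to read it from standard input.
common=$(sort "$workdir/file1.txt" | comm -12 - "$workdir/file2.sorted")
printf '%s\n' "$common"
```

Only one of the two inputs can come from standard input, which is why the second file is sorted to a temporary copy first.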
Conclusion
The comm command is a simple yet powerful tool for comparing two sorted files line by line. By using various options, you can customize the output to focus on unique or common lines, making it easier to analyze differences and similarities between datasets.
diff
The diff command in Unix/Linux is used to compare the contents of two files line by line. It outputs the differences between the files in a format that can be used to create patches or to understand the changes between the two versions of a file. Here's a detailed guide on how to use the diff command along with various options:
Basic Syntax
diff [OPTION]... FILES
FILES: The two files (or, with -r, directories) to be compared.
Common Options
-u or --unified: Produces a unified format diff with a few lines of context. This is the most commonly used format.
-c or --context: Produces a context format diff with a few lines of context.
-i or --ignore-case: Ignores case differences in file contents.
-w or --ignore-all-space: Ignores all white space.
-B or --ignore-blank-lines: Ignores changes that just insert or delete blank lines.
-r or --recursive: Recursively compares any subdirectories found.
-q or --brief: Outputs only whether files differ, not the details of the differences.
Examples
Basic Comparison: diff file1.txt file2.txt compares file1.txt and file2.txt and outputs the differences.
Unified Format: diff -u file1.txt file2.txt compares the files and outputs the differences in the unified format, which shows a few lines of context around the changes.
Context Format: diff -c file1.txt file2.txt compares the files and outputs the differences in the context format.
Ignore Case Differences: diff -i file1.txt file2.txt ignores case differences when comparing the files.
Ignore All White Space: diff -w file1.txt file2.txt ignores all white space when comparing the files.
Ignore Blank Lines: diff -B file1.txt file2.txt ignores changes that only involve blank lines.
Recursive Comparison: diff -r dir1 dir2 recursively compares directories dir1 and dir2.
Brief Output: diff -q file1.txt file2.txt outputs only whether the files differ, not the details of the differences.
Interpreting the Output
The default output format of diff can be a bit cryptic at first. Here's how to interpret it:
1,3c1,3: Indicates that lines 1-3 in the first file must be changed to match lines 1-3 in the second file. The letter is c for a change, a for lines added in the second file, and d for lines deleted from the first file.
Lines prefixed with < are from the first file.
Lines prefixed with > are from the second file.
--- separates the lines from the two files.
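To see the default format on a concrete case, the sketch below compares two made-up three-line files that differ only in their last line.

```shell
workdir=$(mktemp -d)

# Hypothetical inputs that differ only on line 3.
printf '%s\n' apple banana cherry > "$workdir/file1.txt"
printf '%s\n' apple banana citrus > "$workdir/file2.txt"

# diff exits with status 1 when the files differ, which is not an error here.
normal_diff=$(diff "$workdir/file1.txt" "$workdir/file2.txt")
diff_status=$?

printf '%s\n' "$normal_diff"
echo "exit status: $diff_status"
```

The output begins with 3c3, meaning line 3 of the first file changes into line 3 of the second, followed by the old line marked with <, the --- separator, and the new line marked with >.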
Practical Use Case: Creating a Patch
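You can use diff to create a patch file that contains the differences between two files. This patch file can then be applied to the original file to update it.
Creating a Patch: diff -u original.txt updated.txt > changes.patch (the file names here are placeholders).
Applying the Patch: patch original.txt < changes.patch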
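The patch workflow can be rehearsed end to end on throwaway files. In this sketch the names original.txt, updated.txt, changes.patch, and patched.txt are placeholders, and the patch step is skipped if the patch utility is not installed.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical before and after versions of a file.
printf '%s\n' apple banana cherry > original.txt
printf '%s\n' apple banana citrus > updated.txt

# Record the changes in unified format; diff exits 1 because the files differ.
diff -u original.txt updated.txt > changes.patch

# Apply the patch to a copy, if the patch utility is available.
cp original.txt patched.txt
if command -v patch >/dev/null 2>&1; then
    patch patched.txt < changes.patch
    cmp -s patched.txt updated.txt && echo "patched copy matches updated.txt"
fi
```

Working on a copy keeps original.txt untouched, so the patch can be regenerated or inspected if something goes wrong.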
Conclusion
The diff command is a powerful tool for comparing files and directories. By understanding its options and output, you can effectively identify and manage differences between file versions. This is particularly useful in software development for tracking changes and creating patches.
vimdiff
vimdiff is a powerful tool that uses the Vim text editor to display the differences between two or more files side by side. It highlights the differences and allows you to interactively merge and edit files. Here's a detailed guide on how to use vimdiff:
Basic Usage
To compare two files: vimdiff file1.txt file2.txt
To compare three files: vimdiff file1.txt file2.txt file3.txt
To compare more files, simply list them all in the command. Older Vim releases accept up to four files in diff mode; recent releases accept up to eight.
Navigating Differences
When you open files with vimdiff, each file is shown in a separate window. The differences between the files are highlighted. Here are some common commands for navigating and managing differences in vimdiff:
]c: Jump to the next change.
[c: Jump to the previous change.
do or :diffget: Get (copy) the changes from the other file into the current file.
dp or :diffput: Put (copy) the changes from the current file into the other file.
:diffupdate: Manually update the differences.
:diffoff: Turn off the diff mode.
:diffthis: Turn on the diff mode for the current window.
zo / zc: Open/close folded text.
Editing and Merging
In vimdiff, you can edit any of the files just like in Vim. The highlighting is usually updated as you edit; run :diffupdate if it falls out of sync. Here are some useful commands for editing and merging:
Copy Changes from One File to Another: Place the cursor on the change you want to copy, then use :diffget (or do) to copy the change from the other file into the current file.
Copy Changes from Current File to Another: Place the cursor on the change you want to copy, then use :diffput (or dp) to copy the change from the current file to the other file.
Customizing the Diff Display
You can customize how vimdiff displays differences by modifying your .vimrc configuration file. Here are some options:
Set the Number of Context Lines: Adding set diffopt+=context:3 sets the number of context lines kept visible around each change to 3.
Highlight Differences with Custom Colors: highlight commands for the DiffAdd, DiffChange, and DiffDelete groups set custom colors for added, changed, and deleted lines.
Using vimdiff with Git
vimdiff is particularly useful when used as a merge tool in version control systems like Git. To set vimdiff as the default diff tool in Git, you can use the following command: git config --global diff.tool vimdiff
To set vimdiff as the default merge tool in Git, use: git config --global merge.tool vimdiff
You can then use git difftool to view changes and git mergetool to resolve conflicts with vimdiff.
Practical Example
Suppose you have two files, file1.txt and file2.txt, that are identical except for one line: file1.txt contains "cherry" where file2.txt contains "citrus".
Running vimdiff file1.txt file2.txt will open Vim with both files side by side, highlighting "cherry" in file1.txt and "citrus" in file2.txt as the differing lines. You can then navigate to the differences and use the do or dp commands to merge the changes as needed.
Conclusion
vimdiff is a powerful and flexible tool for comparing and merging files. By leveraging Vim's extensive capabilities, you can efficiently navigate, edit, and resolve differences between files. Whether you are comparing simple text files or resolving complex merge conflicts in a version control system, vimdiff provides a robust solution for managing differences.