Python Read File Line by Line and Count Characters Example
How to extract specific portions of a text file using Python
Updated: 06/30/2020 by Computer Hope
Extracting text from a file is a common task in scripting and programming, and Python makes it easy. In this guide, we'll discuss some simple ways to extract text from a file using the Python 3 programming language.
Make sure you're using Python 3
In this guide, we'll be using Python version 3. Many systems come pre-installed with Python 2.7. While Python 2.7 is used in legacy code, Python 3 is the present and future of the Python language. Unless you have a specific reason to write or support Python 2, we recommend working in Python 3.
For Microsoft Windows, Python 3 can be downloaded from the official Python website. When installing, make sure the "Install launcher for all users" and "Add Python to PATH" options are both checked.
On Linux, you can install Python 3 with your package manager. For instance, on Debian or Ubuntu, you can install it with the following command:
sudo apt-get update && sudo apt-get install python3
For macOS, the Python 3 installer can be downloaded from python.org. If you are using the Homebrew package manager, it can also be installed by opening a terminal window (Applications → Utilities), and running this command:
brew install python3
Running Python
On Linux and macOS, the command to run the Python 3 interpreter is python3. On Windows, if you installed the launcher, the command is py. The commands on this page use python3; if you're on Windows, substitute py for python3 in all commands.
Running Python with no options starts the interactive interpreter. For more information about using the interpreter, see Python overview: using the Python interpreter. If you accidentally enter the interpreter, you can exit it using the command exit() or quit().
Running Python with a file name will interpret that Python program. For instance:
python3 program.py
...runs the program contained in the file program.py.
Okay, how can we use Python to extract text from a text file?
Reading data from a text file
First, let's read a text file. Let's say we're working with a file named lorem.txt, which contains lines from the Lorem Ipsum example text.
    Lorem ipsum dolor sit amet, consectetur adipiscing elit.
    Nunc fringilla arcu congue metus aliquam mollis.
    Mauris nec maximus purus. Maecenas sit amet pretium tellus.
    Quisque at dignissim lacus.
Note
In all the examples that follow, we work with the four lines of text contained in this file. Copy and paste the Latin text above into a text file, and save it as lorem.txt, so you can run the example code using this file as input.
A Python program can read a text file using the built-in open() function. For instance, the Python 3 program below opens lorem.txt for reading in text mode, reads the contents into a string variable named contents, closes the file, and prints the data.
    myfile = open("lorem.txt", "rt")  # open lorem.txt for reading text
    contents = myfile.read()          # read the entire file to string
    myfile.close()                    # close the file
    print(contents)                   # print string contents
Here, myfile is the name we give to our file object.
The "rt" parameter in the open() function means "we're opening this file to read text data."
The hash mark ("#") means that everything after it on that line is a comment, and it's ignored by the Python interpreter.
If you save this program in a file called read.py, you can run it with the following command.
python3 read.py
The command above outputs the contents of lorem.txt:
    Lorem ipsum dolor sit amet, consectetur adipiscing elit.
    Nunc fringilla arcu congue metus aliquam mollis.
    Mauris nec maximus purus. Maecenas sit amet pretium tellus.
    Quisque at dignissim lacus.
Using "with open"
It's important to close your open files as soon as possible: open the file, perform your operation, and close it. Don't leave it open for extended periods of time.
When you're working with files, it's good practice to use the with open...as compound statement. It's the cleanest way to open a file, operate on it, and close the file, all in one easy-to-read block of code. The file is automatically closed when the code block completes.
Using with open...as, we can rewrite our program to look like this:
    with open('lorem.txt', 'rt') as myfile:  # Open lorem.txt for reading text
        contents = myfile.read()             # Read the entire file to a string
    print(contents)                          # Print the string
Note
Indentation is important in Python. Python programs use white space at the beginning of a line to define scope, such as a block of code. We recommend you use four spaces per level of indentation, and that you use spaces rather than tabs. In the following examples, make sure your code is indented exactly as it's presented here.
Example
Save the program as read.py and execute it:
python3 read.py
Output:
    Lorem ipsum dolor sit amet, consectetur adipiscing elit.
    Nunc fringilla arcu congue metus aliquam mollis.
    Mauris nec maximus purus. Maecenas sit amet pretium tellus.
    Quisque at dignissim lacus.
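As a quick check of the automatic close described above, the sketch below inspects the file object's closed attribute inside and after the with block. The temporary file and its one-line contents are stand-ins invented for this demonstration, so the snippet runs without lorem.txt on disk.

```python
import os
import tempfile

# Hypothetical stand-in for lorem.txt, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "lorem.txt")
with open(path, "wt") as out:
    out.write("Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n")

with open(path, "rt") as myfile:
    contents = myfile.read()
    open_inside = not myfile.closed   # the file is still open here

closed_after = myfile.closed          # the with block has closed it

print(open_inside, closed_after)      # prints: True True
```

The name myfile still exists after the block, but any further read on it would raise an error because the file is closed.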
Reading text files line-by-line
In the examples so far, we've been reading in the whole file at once. Reading a full file is no big deal with small files, but generally speaking, it's not a great idea. For one thing, if your file is bigger than the amount of available memory, you'll run into an error.
In almost every case, it's a better idea to read a text file one line at a time.
In Python, the file object is an iterator. An iterator is a type of Python object which behaves in certain ways when operated on repeatedly. For instance, you can use a for loop to operate on a file object repeatedly, and each time the same operation is performed, you'll receive a different, or "next," result.
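To make the "next result" idea concrete, here is a minimal sketch that advances an iterator by hand with the built-in next() function. It assumes io.StringIO as an in-memory stand-in for an open text file; a real file object opened with open() behaves the same way for this purpose.

```python
import io

# io.StringIO is an in-memory stand-in for an open text file (an assumption
# made so the example runs without lorem.txt on disk).
myfile = io.StringIO("first line\nsecond line\n")

first = next(myfile)       # each call to next() returns the following line
second = next(myfile)
leftover = list(myfile)    # nothing remains once every line has been read

print(repr(first), repr(second), leftover)
```

A for loop does the same thing behind the scenes: it keeps asking for the next item until the iterator is exhausted.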
Example
For text files, the file object iterates one line of text at a time. It considers one line of text a "unit" of data, so we can use a for...in loop statement to iterate one line at a time:
    with open('lorem.txt', 'rt') as myfile:  # Open lorem.txt for reading
        for myline in myfile:                # For each line, read to a string,
            print(myline)                    # and print the string.
Output:
    Lorem ipsum dolor sit amet, consectetur adipiscing elit.

    Nunc fringilla arcu congue metus aliquam mollis.

    Mauris nec maximus purus. Maecenas sit amet pretium tellus.

    Quisque at dignissim lacus.
Notice that we're getting an extra line break ("newline") after every line. That's because two newlines are being printed. The first one is the newline at the end of every line of our text file. The second newline happens because, by default, print() adds a line break of its own at the end of whatever you've asked it to print.
Let's store our lines of text in a variable (specifically, a list variable) so we can look at it more closely.
Storing text data in a variable
In Python, lists are similar to, but not the same as, an array in C or Java. A Python list contains indexed data, of varying lengths and types.
Example
    mylines = []                               # Declare an empty list named mylines.
    with open('lorem.txt', 'rt') as myfile:    # Open lorem.txt for reading text data.
        for myline in myfile:                  # For each line, stored as myline,
            mylines.append(myline)             # add its contents to mylines.
    print(mylines)                             # Print the list.
The output of this program is a little different. Instead of printing the contents of the list, this program prints our list object, which looks like this:
Output:
['Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n', 'Nunc fringilla arcu congue metus aliquam mollis.\n', 'Mauris nec maximus purus. Maecenas sit amet pretium tellus.\n', 'Quisque at dignissim lacus.\n']
Here, we see the raw contents of the list. In its raw object form, a list is represented as a comma-delimited list. Here, each element is represented as a string, and each newline is represented as its escape character sequence, \n.
Much like a C or Java array, the list elements are accessed by specifying an index number after the variable name, in brackets. Index numbers start at zero. In other words, the nth element of a list has the numeric index n-1.
Note
If you're wondering why the index numbers start at zero instead of one, you're not alone. Computer scientists have debated the usefulness of zero-based numbering systems in the past. In 1982, Edsger Dijkstra gave his opinion on the subject, explaining why zero-based numbering is the best way to index data in computer science. You can read the memo yourself; he makes a compelling argument.
Example
We can print the first element of mylines by specifying index number 0, contained in brackets after the name of the list:
print(mylines[0])
Output:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Example
Or the third line, by specifying index number 2:
print(mylines[2])
Output:
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
But if we try to access an index for which there is no value, we get an error. Our list has four elements, with indices 0 through 3, so index 4 is out of range:
Example
print(mylines[4])
Output:
    Traceback (most recent call last):
      File <filename>, line <linenum>, in <module>
        print(mylines[4])
    IndexError: list index out of range
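One way to avoid this error is to compare the index against len() first, or to catch the IndexError. The sketch below shows the second approach; get_line is a hypothetical helper name introduced for illustration, and the list contents are sample data rather than the lorem.txt lines.

```python
mylines = ["first", "second", "third"]   # sample data, not the lorem.txt lines

def get_line(lines, index):
    """Return lines[index], or None when the index is out of range."""
    try:
        return lines[index]
    except IndexError:
        return None

last_index = len(mylines) - 1             # highest valid index
print(get_line(mylines, last_index))      # prints: third
print(get_line(mylines, last_index + 1))  # prints: None
```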
Example
A list is iterable, so to print every element of the list, we can iterate over it with for...in:
    mylines = []                               # Declare an empty list
    with open('lorem.txt', 'rt') as myfile:    # Open lorem.txt for reading text.
        for line in myfile:                    # For each line of text,
            mylines.append(line)               # add that line to the list.
    for element in mylines:                    # For each element in the list,
        print(element)                         # print it.
Output:
    Lorem ipsum dolor sit amet, consectetur adipiscing elit.

    Nunc fringilla arcu congue metus aliquam mollis.

    Mauris nec maximus purus. Maecenas sit amet pretium tellus.

    Quisque at dignissim lacus.
But we're still getting extra newlines. Each line of our text file ends in a newline character ('\n'), which is being printed. Also, after printing each line, print() adds a newline of its own, unless you tell it to do otherwise.
We can change this default behavior by specifying an end parameter in our print() call:
print(element, end='')
By setting end to an empty string (two single quotes, with no space), we tell print() to print nothing at the end of a line, instead of a newline character.
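To see the effect of end exactly, the following sketch captures what print() writes instead of sending it to the screen. The helper name printed is hypothetical, and the capture relies on contextlib.redirect_stdout from the standard library.

```python
import contextlib
import io

def printed(text, **kwargs):
    """Return exactly what print(text, **kwargs) would write to the screen."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        print(text, **kwargs)
    return buffer.getvalue()

line = "Quisque at dignissim lacus.\n"   # a line as read from a file
default_output = printed(line)           # ends with two newlines
trimmed_output = printed(line, end='')   # ends with one newline

print(repr(default_output))
print(repr(trimmed_output))
```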
Example
Our revised program looks like this:
    mylines = []                               # Declare an empty list
    with open('lorem.txt', 'rt') as myfile:    # Open file lorem.txt
        for line in myfile:                    # For each line of text,
            mylines.append(line)               # add that line to the list.
    for element in mylines:                    # For each element in the list,
        print(element, end='')                 # print it without extra newlines.
Output:
    Lorem ipsum dolor sit amet, consectetur adipiscing elit.
    Nunc fringilla arcu congue metus aliquam mollis.
    Mauris nec maximus purus. Maecenas sit amet pretium tellus.
    Quisque at dignissim lacus.
The newlines you see here are actually in the file; they're a special character ('\n') at the end of each line. We want to get rid of these, so we don't have to worry about them while we process the file.
How to strip newlines
To remove the newlines completely, we can strip them. To strip a string is to remove one or more characters, usually whitespace, from either the beginning or end of the string.
Tip
This process is sometimes also called "trimming."
Python 3 string objects have a method called rstrip(), which strips characters from the right side of a string. The English language reads left-to-right, so stripping from the right side removes characters from the end.
If the variable is named mystring, we can strip its right side with mystring.rstrip(chars), where chars is a string of characters to strip. For example, "123abc".rstrip("bc") returns "123a".
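A few short calls illustrate how rstrip() treats its argument: chars is a set of characters to remove, not a suffix, and stripping stops at the first character not in that set. The variable names below are chosen only for this illustration.

```python
by_set = "123abc".rstrip("bc")        # "123a": trailing "c" and "b" removed
repeated = "banana".rstrip("an")      # "b": every trailing "a" or "n" removed
newline_only = "line\n".rstrip("\n")  # "line": only the newline removed
default = "  text  \n".rstrip()       # "  text": no argument strips whitespace

print(repr(by_set), repr(repeated), repr(newline_only), repr(default))
```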
Tip
When you represent a string in your program with its literal contents, it's called a string literal. In Python (as in most programming languages), string literals are always quoted: enclosed on either side by single (') or double (") quotes. In Python, single and double quotes are equivalent; you can use one or the other, as long as they match on both ends of the string. It's traditional to represent a human-readable string (such as Hello) in double quotes ("Hello"). If you're representing a single character (such as b), or a single special character such as the newline character (\n), it's traditional to use single quotes ('b', '\n'). For more information about how to use strings in Python, you can read the documentation of strings in Python.
The statement string.rstrip('\n') will strip any newline characters from the right side of string. The following version of our program strips the newlines when each line is read from the text file:
    mylines = []                                 # Declare an empty list.
    with open('lorem.txt', 'rt') as myfile:      # Open lorem.txt for reading text.
        for myline in myfile:                    # For each line in the file,
            mylines.append(myline.rstrip('\n'))  # strip newline and add to list.
    for element in mylines:                      # For each element in the list,
        print(element)                           # print it.
The text is now stored in a list variable, so individual lines can be accessed by index number. Newlines were stripped, so we don't have to worry about them. We can always put them back later if we reconstruct the file and write it to disk.
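The same read-and-strip loop can also be written more compactly as a list comprehension. This is an alternative sketch, not the form used in the rest of this guide; read_lines is a hypothetical helper name, and io.StringIO stands in for the opened file so the snippet runs on its own.

```python
import io

def read_lines(fileobj):
    """Return the lines of an open text file with trailing newlines removed."""
    return [line.rstrip('\n') for line in fileobj]

# io.StringIO stands in for open('lorem.txt', 'rt') in this sketch.
sample = io.StringIO("Mauris nec maximus purus.\nQuisque at dignissim lacus.\n")
mylines = read_lines(sample)
print(mylines)
```

With a real file, the call would sit inside a with open(...) block, for example mylines = read_lines(myfile).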
Now, let's search the lines in the list for a specific substring.
Searching text for a substring
Let's say we want to locate every occurrence of a certain phrase, or even a single letter. For instance, maybe we need to know where every "e" is. We can accomplish this using the string's find() method.
The list stores each line of our text as a string object. All string objects have a method, find(), which locates the first occurrence of a substring in the string.
Let's use the find() method to search for the letter "e" in the first line of our text file, which is stored in the list mylines. The first element of mylines is a string object containing the first line of the text file. This string object has a find() method.
In the parentheses of find(), we specify parameters. The first and only required parameter is the string to search for, "e". The statement mylines[0].find("e") tells the interpreter to search forward, starting at the beginning of the string, one character at a time, until it finds the letter "e." When it finds one, it stops searching, and returns the index number where that "e" is located. If it reaches the end of the string without finding one, it returns -1 to indicate nothing was found.
Example
print(mylines[0].find("e"))
Output:
3
The return value "3" tells us that the letter "e" is the fourth character, the "e" in "Lorem". (Remember, the index is zero-based: index 0 is the first character, 1 is the second, etc.)
The find() method takes two optional, additional parameters: a start index and an end index, indicating where in the string the search should begin and end. The end index itself is not searched. For example, string.find("abc", 10, 20) searches for the substring "abc", but only from the 11th to the 20th character (indices 10 through 19). If end is not specified, find() starts at index start, and stops at the end of the string.
Example
For instance, the following statement searches for "e" in mylines[0], beginning at the fifth character.
print(mylines[0].find("e", 4))
Output:
24
In other words, starting at the fifth character in mylines[0], the first "e" is located at index 24 (the "e" in "amet").
Example
To search the third line, starting at index 10 and stopping at index 30:
print(mylines[2].find("e", 10, 30))
Output:
28
(The first "e" in "Maecenas").
If find() doesn't locate the substring in the search range, it returns the number -1, indicating failure:
print(mylines[0].find("e", 25, 30))
Output:
-1
There were no "e" occurrences between indices 25 and 30.
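The calls above can be collected into one self-contained snippet. It uses the first line of the sample text as a plain string, so lorem.txt is not needed, and adds one extra pair of calls (not from the examples above) to show that the end index is excluded from the search.

```python
line = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."

first_e = line.find("e")          # 3: the "e" in "Lorem"
next_e = line.find("e", 4)        # 24: the "e" in "amet"
none_e = line.find("e", 25, 30)   # -1: no "e" at indices 25 through 29

# There is an "e" at index 32. It is found only when end is greater than 32.
excluded = line.find("e", 30, 32)  # -1: index 32 is not searched
included = line.find("e", 30, 33)  # 32

print(first_e, next_e, none_e, excluded, included)
```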
Finding all occurrences of a substring
But what if we want to locate every occurrence of a substring, not just the first one we encounter? We can iterate over the string, starting from the index of the previous match.
In this example, we'll use a while loop to repeatedly find the letter "e". When an occurrence is found, we call find again, starting from a new location in the string: specifically, the location of the last occurrence, plus the length of the substring (so we can move forward past the last one). When find returns -1, or the start index reaches the length of the string, we stop.
    # Build array of lines from file, strip newlines
    mylines = []                                 # Declare an empty list.
    with open('lorem.txt', 'rt') as myfile:      # Open lorem.txt for reading text.
        for myline in myfile:                    # For each line in the file,
            mylines.append(myline.rstrip('\n'))  # strip newline and add to list.

    # Locate and print all occurrences of letter "e"
    substr = "e"                                 # substring to search for.
    for line in mylines:                         # string to be searched
        index = 0                                # current index: character being compared
        prev = 0                                 # previous index: last character compared
        while index < len(line):                 # While index has not exceeded string length,
            index = line.find(substr, index)     # set index to next occurrence of "e"
            if index == -1:                      # If nothing was found,
                break                            # exit the while loop.
            print(" " * (index - prev) + substr, end='')  # print spaces from previous
                                                          # match, then the substring.
            prev = index + len(substr)           # remember this position for next loop.
            index += len(substr)                 # increase the index by the length of substr.
                                                 # (Repeat until index >= line length)
        print('\n' + line)                       # Print the original string under the e's
Output:
       e                    e       e  e               e
    Lorem ipsum dolor sit amet, consectetur adipiscing elit.
                             e  e
    Nunc fringilla arcu congue metus aliquam mollis.
            e                   e e          e    e      e
    Mauris nec maximus purus. Maecenas sit amet pretium tellus.
          e
    Quisque at dignissim lacus.
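If you need the positions themselves rather than a printed diagram, the same loop can be wrapped in a function that returns a list of indices. This is a minimal sketch; find_all is a hypothetical name, and the guard against an empty substring is an added assumption to keep the loop from running forever.

```python
def find_all(text, substr):
    """Return the index of every non-overlapping occurrence of substr in text."""
    if not substr:                  # an empty substring would never advance
        return []
    positions = []
    index = text.find(substr)
    while index != -1:
        positions.append(index)
        # Resume searching just past the match that was found.
        index = text.find(substr, index + len(substr))
    return positions

line = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."
print(find_all(line, "e"))    # prints: [3, 24, 32, 35, 51]
```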
Incorporating regular expressions
For complex searches, use regular expressions.
The Python regular expressions module is called re. To use it in your program, import the module before you use it:
import re
The re module implements regular expressions by compiling a search pattern into a pattern object. Methods of this object can then be used to perform match operations.
For example, let's say you want to search for any word in your document which starts with the letter d and ends in the letter r. We can accomplish this using the regular expression "\bd\w*r\b". What does this mean?
character sequence | meaning |
---|---|
\b | A word boundary. It matches the empty string (it consumes no characters), but only at a position between a word character and a non-word character, or at the start or end of the string next to a word character. "Word characters" are the digits 0 through 9, the lowercase and uppercase letters, and the underscore ("_"). |
d | Lowercase letter d. |
\w* | \w represents any word character, and * is a quantifier meaning "zero or more of the previous character." So \w* will match zero or more word characters. |
r | Lowercase letter r. |
\b | Word boundary. |
So this regular expression will match any string that can be described as "a word boundary, then a lowercase 'd', then zero or more word characters, then a lowercase 'r', then a word boundary." Strings described this way include the words destroyer, dour, and doctor, and the abbreviation dr.
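A quick way to try the expression is re.findall(), which returns every non-overlapping match in a string. The sample sentence below is made up for this illustration.

```python
import re

sentence = "The doctor met a dour destroyer and dr Smith in the garden."
words = re.findall(r"\bd\w*r\b", sentence)
print(words)    # prints: ['doctor', 'dour', 'destroyer', 'dr']
```

Note that "and" and "garden" are not matched: the d in "and" is not at the start of a word, and "garden" does not start with d.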
To use this regular expression in Python search operations, we first compile it into a pattern object. For instance, the following Python statement creates a pattern object named pattern which we can use to perform searches using that regular expression.
pattern = re.compile(r"\bd\w*r\b")
Note
The letter r before our string in the statement above is important. It tells Python to interpret our string as a raw string, exactly as we've typed it. If we didn't prefix the string with an r, Python would interpret escape sequences such as \b in other ways (in an ordinary string literal, \b is the backspace character). Whenever you need Python to interpret your strings literally, specify it as a raw string by prefixing it with r.
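The difference is easy to check by comparing the lengths of the two literals. The variable names here are illustrative only.

```python
plain = "\b"     # ordinary string literal: one backspace character
raw = r"\b"      # raw string literal: a backslash followed by the letter b

print(len(plain), len(raw))    # prints: 1 2
```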
Now we can use the pattern object's methods, such as search(), to search a string for the compiled regular expression, looking for a match. If it finds one, it returns a special result called a match object. Otherwise, it returns None, a built-in Python constant that is used like the boolean value "false".
    import re
    text = "Good morning, doctor."
    pat = re.compile(r"\bd\w*r\b")    # compile regex "\bd\w*r\b" to a pattern object
    if pat.search(text) is not None:  # Search for the pattern. If found,
        print("Found it.")
Output:
Found it.
To perform a case-insensitive search, you can specify the special constant re.IGNORECASE in the compile step:
    import re
    text = "Hello, Doctor."
    pat = re.compile(r"\bd\w*r\b", re.IGNORECASE)  # upper and lowercase will match
    if pat.search(text) is not None:
        print("Found it.")
Output:
Found it.
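The match object carries more than a yes-or-no answer. As a small sketch, its group() method returns the matched text and span() returns the start and end indices of the match.

```python
import re

text = "Good morning, doctor."
match = re.compile(r"\bd\w*r\b").search(text)

if match is not None:
    print(match.group())   # the matched text: doctor
    print(match.span())    # start and end indices: (14, 20)
```

The d in "Good" is skipped because it is not at the start of a word.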
Putting it all together
So now we know how to open a file, read the lines into a list, and locate a substring in any given list element. Let's use this knowledge to build some example programs.
Print all lines containing substring
The program below reads a log file line by line. If the line contains the word "error," it is added to a list called errors. If not, it is ignored. The lower() string method converts all strings to lowercase for comparison purposes, making the search case-insensitive without altering the original strings.
Note that the find() method is called directly on the result of the lower() method; this is called method chaining. Also, note that in the errors.append() statement, we construct an output string by joining several strings with the + operator.
    errors = []                       # The list where we will store results.
    linenum = 0
    substr = "error".lower()          # Substring to search for.
    with open('logfile.txt', 'rt') as myfile:
        for line in myfile:
            linenum += 1
            if line.lower().find(substr) != -1:    # if case-insensitive match,
                errors.append("Line " + str(linenum) + ": " + line.rstrip('\n'))
    for err in errors:
        print(err)
Input (stored in logfile.txt):
    This is line 1
    This is line 2
    Line 3 has an error!
    This is line 4
    Line 5 also has an error!
Output:
    Line 3: Line 3 has an error!
    Line 5: Line 5 also has an error!
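The same search can be packaged as a reusable function. The sketch below is one possible variation, not the program above: it swaps in the in operator for find() and the built-in enumerate() for the manual line counter. The name find_matching_lines is hypothetical, and io.StringIO stands in for the log file.

```python
import io

def find_matching_lines(fileobj, substr):
    """Return "Line N: text" for each line containing substr, ignoring case."""
    needle = substr.lower()
    results = []
    for linenum, line in enumerate(fileobj, start=1):   # count lines from 1
        if needle in line.lower():                      # case-insensitive test
            results.append(f"Line {linenum}: " + line.rstrip('\n'))
    return results

# io.StringIO stands in for open('logfile.txt', 'rt').
log = io.StringIO("This is line 1\nLine 2 has an Error!\nThis is line 3\n")
errors = find_matching_lines(log, "error")
print(errors)
```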
Extract all lines containing substring, using regex
The program below is similar to the above program, but using the re regular expressions module. The errors and line numbers are stored as tuples, e.g., (linenum, line). The tuple is created by the additional enclosing parentheses in the errors.append() statement. The elements of the tuple are referenced similarly to a list, with a zero-based index in brackets. As constructed here, err[0] is a linenum and err[1] is the associated line containing an error.
    import re
    errors = []
    linenum = 0
    pattern = re.compile("error", re.IGNORECASE)  # Compile a case-insensitive regex
    with open('logfile.txt', 'rt') as myfile:
        for line in myfile:
            linenum += 1
            if pattern.search(line) is not None:  # If a match is found
                errors.append((linenum, line.rstrip('\n')))
    for err in errors:                            # Iterate over the list of tuples
        print("Line " + str(err[0]) + ": " + err[1])
Output (from a longer sample log than the input shown earlier):
    Line 6: Mar 28 09:10:37 Error: cannot contact server. Connection refused.
    Line 10: Mar 28 10:28:15 Kernel error: The specified location is not mounted.
    Line 14: Mar 28 11:06:30 ERROR: usb 1-1: can't set config, exiting.
Extract all lines containing a telephone number
The program below prints any line of a text file, info.txt, which contains a US or international telephone number. It accomplishes this with the regular expression "(\+\d{1,2})?[\s.-]?\d{3}[\s.-]?\d{4}". A search with this regex finds a match in each of the following phone number notations:
- 123-456-7890
- (123) 456-7890
- 123 456 7890
- 123.456.7890
- +91 (123) 456-7890
    import re
    matches = []
    linenum = 0
    pattern = re.compile(r"(\+\d{1,2})?[\s.-]?\d{3}[\s.-]?\d{4}")
    with open('info.txt', 'rt') as myfile:
        for line in myfile:
            linenum += 1
            if pattern.search(line) is not None:  # If pattern search finds a match,
                matches.append((linenum, line.rstrip('\n')))
    for match in matches:
        print("Line " + str(match[0]) + ": " + match[1])
Output:
    Line 3: My phone number is 731.215.8881.
    Line 7: You can reach Mr. Walters at (212) 558-3131.
    Line 12: His agent, Mrs. Kennedy, can be reached at +12 (123) 456-7890
    Line 14: She can also be contacted at (888) 312.8403, extension 12.
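To check the pattern against the notations listed earlier, the sketch below runs search() on each one. The sample strings come from that list, plus one made-up string with no digits. It also prints what the regex actually captures: as written, the pattern covers only the final seven digits and the separator before them, which is enough to flag the line but is not the full number.

```python
import re

pattern = re.compile(r"(\+\d{1,2})?[\s.-]?\d{3}[\s.-]?\d{4}")

samples = [
    "123-456-7890",
    "(123) 456-7890",
    "123 456 7890",
    "123.456.7890",
    "+91 (123) 456-7890",
    "no number here",          # made-up negative example
]
# Map each sample to True or False depending on whether a match was found.
results = {s: pattern.search(s) is not None for s in samples}
matched_text = pattern.search("123-456-7890").group()

for sample, found in results.items():
    print(sample, found)
print(repr(matched_text))      # prints: '-456-7890'
```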
Search a dictionary for words
The program below searches the dictionary for any words that start with h and end in pe. For input, it uses a dictionary file included on many Unix systems, /usr/share/dict/words.
    import re
    filename = "/usr/share/dict/words"
    pattern = re.compile(r"\bh\w*pe$", re.IGNORECASE)
    with open(filename, "rt") as myfile:
        for line in myfile:
            if pattern.search(line) is not None:
                print(line, end='')
Output:
    Hope
    heliotrope
    hope
    hornpipe
    horoscope
    hype
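If your system has no /usr/share/dict/words file, the same pattern can be tried on a small in-memory word list. The words below are made up for this sketch, and io.StringIO stands in for the dictionary file.

```python
import io
import re

pattern = re.compile(r"\bh\w*pe$", re.IGNORECASE)

# A made-up word list standing in for /usr/share/dict/words.
words_file = io.StringIO("Hope\nhat\nheliotrope\nshape\nhopes\nhype\n")
found = [line.rstrip('\n') for line in words_file if pattern.search(line)]
print(found)    # prints: ['Hope', 'heliotrope', 'hype']
```

Here "shape" is rejected because its h is not at the start of the word, and "hopes" is rejected because it does not end in pe. The $ anchor still matches on lines read from a file, because it matches just before a trailing newline.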
Source: https://www.computerhope.com/issues/ch001721.htm