Given a text file that contains several duplicate lines, the task is to remove all repeated lines and produce an output file containing only unique lines, while keeping their original order.
Example: Input file (myfile.txt)
This is a sample line.
Python is a powerful language.
This is a sample line.
Output:
This is a sample line.
Python is a powerful language.
Below are several methods to eliminate repeated lines from a file:
Using a Set
This method removes duplicate lines by storing only unique lines in a Python set.
seen = set()
with open("myfile.txt", "r") as f_in, open("output.txt", "w") as f_out:
    for ln in f_in:
        if ln not in seen:
            f_out.write(ln)
            seen.add(ln)
Output
This is a sample line.
Python is a powerful language.
Explanation:
- seen = set(): Stores all unique lines encountered
- for ln in f_in: Reads every line one by one
- if ln not in seen: Checks if the line is unique
- f_out.write(ln): Writes unique line to output file
- seen.add(ln): Marks the line as seen
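One detail worth handling: the last line of a file often has no trailing newline, so "This is a sample line." and "This is a sample line.\n" would be treated as different lines. A minimal sketch of the same set-based approach, normalizing the comparison key with rstrip (the sample file created here is only for illustration):

```python
# Create a small sample file; the final duplicate deliberately lacks a newline
with open("myfile.txt", "w") as f:
    f.write("This is a sample line.\nPython is a powerful language.\nThis is a sample line.")

seen = set()
with open("myfile.txt") as f_in, open("output.txt", "w") as f_out:
    for ln in f_in:
        key = ln.rstrip("\n")   # compare without the newline so the final,
        if key not in seen:     # newline-less line still matches its duplicate
            f_out.write(ln)
            seen.add(key)
```

Only the stripped key goes into the set; the line itself is written unchanged.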
Using a List
This method removes repeated lines by checking each line before adding it to a list, ensuring only unique lines are kept.
seen = []
with open("myfile.txt", "r") as f_in, open("output.txt", "w") as f_out:
    for ln in f_in:
        if ln not in seen:
            f_out.write(ln)
            seen.append(ln)
Output
This is a sample line.
Python is a powerful language.
Explanation:
- f_out.write(ln): Writes only unique lines
- seen.append(ln): Saves the line for comparison
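The list version gives the same result as the set version, but a membership test on a list scans every stored element (O(n) per lookup) while a set lookup is O(1) on average, so the list approach slows down noticeably on large files. A rough, self-contained comparison sketch (the synthetic line data and helper function are illustrative, not part of the original example):

```python
import timeit

# 4000 lines, half of them duplicates
lines = [f"line {i}\n" for i in range(2000)] * 2

def dedup(container_factory, add):
    """Deduplicate `lines` using the given container for the seen-check."""
    seen = container_factory()
    out = []
    for ln in lines:
        if ln not in seen:
            out.append(ln)
            add(seen, ln)
    return out

list_time = timeit.timeit(lambda: dedup(list, list.append), number=5)
set_time = timeit.timeit(lambda: dedup(set, set.add), number=5)
print(f"list: {list_time:.3f}s  set: {set_time:.3f}s")
```

Both variants produce identical output; the set version is typically the better default once files grow beyond a few thousand lines.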
Using Pandas
This method removes duplicate lines by loading the file into a Pandas DataFrame and using its built-in drop_duplicates() method, which keeps the first occurrence of each row and preserves order.
import pandas as pd
df = pd.read_csv("myfile.txt", header=None)
df.drop_duplicates(inplace=True)
df.to_csv("output.txt", index=False, header=False)
Output
This is a sample line.
Python is a powerful language.
Explanation:
- read_csv(...): Reads text lines into a DataFrame
- drop_duplicates(): Removes duplicate rows
- to_csv(...): Saves cleaned data back to a file
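One caveat with read_csv: it treats commas as column delimiters, so a line such as "Hello, world!" would be split into two columns (and to_csv would quote it on the way out). A minimal sketch of a variant that reads whole lines into a one-column pandas Series instead, so punctuation in the text is never interpreted (the sample file created here is only for illustration):

```python
import pandas as pd

# Sample input; one line contains a comma, which would confuse read_csv
with open("myfile.txt", "w") as f:
    f.write("This is a sample line.\nHello, world!\nThis is a sample line.\n")

# Read whole lines as strings so commas are not treated as delimiters
with open("myfile.txt") as f:
    lines = pd.Series(f.read().splitlines())

unique = lines.drop_duplicates()  # keeps first occurrence, preserves order

# Write lines back verbatim instead of using to_csv, which would add quoting
with open("output.txt", "w") as f:
    f.write("\n".join(unique) + "\n")
```

For plain text files this keeps every line byte-for-byte intact, while still using drop_duplicates() to do the actual deduplication.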