Organizing the mess: how I built the DevOps Go Duplicate Detector

Hi there 👋 I'm a DevOps Engineer working in São Luís - MA, Brazil.
I have a degree in Information Systems from UNDB - Unidade de Ensino Superior Dom Bosco, a postgraduate degree in Information Security, and I'm passionate about technology.
I had my first contact with a computer when I was 11 years old, in a community course in my neighborhood. By the age of 12 I was already teaching at that same association, which brought me great pleasure and even more knowledge.
I got my first formal (CLT) job at 17, also teaching at several computer schools in the capital of Maranhão.
Linux is my favorite OS and Pop!_OS my favorite distribution, but I work daily with macOS and Windows as well. ;)
🏢 I'm currently working at Grupo Mateus
⚙️ I use daily: .sh, .js, .cpp, .go, .py, .jar, .tf, .yaml, .json
🌍 I'm mostly active within the DevOps culture in my organization
🌱 Reading all about Open Source, DevOps, Clean Architecture, Cloud Computing and more...
⚡️ Fun fact: I'm a huge fan of Harry Potter, The Lord of the Rings and geek culture
✨ My website is nilsonvieira.com.br
Recently, I faced one of those challenges that every professional who deals with large volumes of data knows well: freeing up space on my workstation (lol). But in my case, it wasn't "a few lost files" - it was 1TB of data! The idea of manually reviewing everything to delete the duplicates (which I was sure existed) just didn't make sense.
I searched the internet for free and paid applications, but none of them worked for me; the closest was CleanMyMac - yes, the fact that I use macOS made the process a little more difficult. From then on, I realized that I needed to create an intelligent and scalable solution to automate this task, and also take the opportunity to practice the lessons from the FullCycle MBA in Architecture and the Golang Expert course.
As a technology consultant, I already knew the classic pains of file management: duplicate folders, repeated files, messy names and directories full of digital junk, making any organization difficult. Imagine this compounded by years of use, backups and collections of documents from several old external hard drives on a single machine... I always thought, "one day I'll have time to organize this mess." That time never came, and the mess only grew. It was time to act...
Secure and automated organization
I decided to develop the DevOps Go Duplicate Detector in Go, betting on a set of really useful features for everyday life - from finding and removing duplicates by analyzing file content (using SHA-256), to automatic organization by name patterns, detection of empty folders, intelligent renaming and detailed reports of the actions taken. All this with data safety in mind and giving the user control: before removing anything, there is a "dry-run" mode for simulating actions.
The basis for identifying duplicates is to calculate the SHA-256 hash of each file. Here's how it's implemented:
// scanner.go
func (fs *FileScanner) calculateFileHash(filePath string) (string, error) {
    file, err := os.Open(filePath)
    if err != nil {
        return "", err
    }
    defer file.Close()

    hash := sha256.New()
    if _, err := io.Copy(hash, file); err != nil {
        return "", err
    }
    return fmt.Sprintf("%x", hash.Sum(nil)), nil
}
Each file in the analysis generates its own hash, which guarantees accuracy and allows identical files to be grouped together, regardless of their name.
The algorithm recognizes copy name patterns, always prioritizing the original file:
// duplicate_detector.go
func (dd *DuplicateDetector) SelectFilesToDelete(files []*filesystem.FileInfo) []*filesystem.FileInfo {
    // Separate originals and copies by name
    // Keep the oldest file, or the one with the original name
    // Remove the detected copies
}
Copy patterns are detected by regular expressions, supporting names such as file (2).txt, file - Copy.txt and many others typical of Windows/macOS.
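Just to illustrate the idea, here is a minimal sketch of how such copy-name patterns could be matched; the helper looksLikeCopy and the exact expressions are my assumptions, not necessarily what the project ships:

// illustrative sketch, assuming an internal/detector package
package detector

import "regexp"

// copyPatterns covers common Windows/macOS copy suffixes such as
// "file (2).txt", "file - Copy.txt" and "file copy 2.txt".
var copyPatterns = []*regexp.Regexp{
    regexp.MustCompile(`\s\(\d+\)(\.[^.]+)?$`),             // "file (2).txt"
    regexp.MustCompile(`\s-\sCopy(\s\(\d+\))?(\.[^.]+)?$`), // "file - Copy.txt"
    regexp.MustCompile(`\scopy(\s\d+)?(\.[^.]+)?$`),        // "file copy 2.txt" (macOS)
}

// looksLikeCopy reports whether a file name matches one of the copy patterns.
func looksLikeCopy(name string) bool {
    for _, pattern := range copyPatterns {
        if pattern.MatchString(name) {
            return true
        }
    }
    return false
}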
It makes no sense to run massive operations without predictability. That's why all the main functions accept the --dry-run flag:
// RemoveEmptyDirectories
func (fs *FileScanner) RemoveEmptyDirectories(directories []string, dryRun bool) error {
    for _, dir := range directories {
        if !dryRun {
            os.Remove(dir)
        }
    }
    return nil
}
This makes it possible to simulate every operation before modifying any file, ensuring safety.
Scaling to 1TB
The biggest technical challenge was to ensure that the tool worked efficiently even with large volumes of files. Processing 1TB is no joke. This was only possible thanks to the clear organization of the layers - a central concept of Clean Architecture - which separates responsibilities and makes it easier to extend or optimize each part of the system without unnecessary coupling.
In the project, the cmd/main.go file only takes care of parsing the arguments, validating them and choosing the operation. It doesn't know how the details of hashing, removal or detection work. This ensures that CLI logic never mixes with processing logic, increasing the maintainability and testability of the project.
// cmd/main.go
cfg, err := config.ParseFlags()
// ...
operationType := config.GetOperationType(cfg)
switch operationType {
case "delete-duplicates":
    result, err = operations.DeleteDuplicates(cfg, fileScanner)
}
The internal/detector package only takes care of the logic for identifying duplicates. The algorithm can be optimized (to run in parallel or modularized for multiple types of analysis) without affecting the rest of the application:
// internal/detector/duplicate_detector.go
func (dd *DuplicateDetector) FindDuplicates(files []*filesystem.FileInfo) map[string][]*filesystem.FileInfo { ... }
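A minimal sketch of what that grouping can look like, assuming FileInfo exposes the SHA-256 digest in a Hash field (an assumption on my part; the actual field name may differ):

// illustrative sketch of the grouping step
func (dd *DuplicateDetector) FindDuplicates(files []*filesystem.FileInfo) map[string][]*filesystem.FileInfo {
    groups := make(map[string][]*filesystem.FileInfo)
    for _, f := range files {
        groups[f.Hash] = append(groups[f.Hash], f) // assumes a Hash field filled in by the scanner
    }
    // Keep only hashes that occur more than once, i.e. real duplicates.
    for hash, group := range groups {
        if len(group) < 2 {
            delete(groups, hash)
        }
    }
    return groups
}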
As mentioned before, everything related to disk access and SHA-256 hash calculation is isolated in internal/filesystem/scanner.go, and could even be replaced by another mechanism (a cloud API, a database, etc.) if necessary.
// internal/filesystem/scanner.go
func (fs *FileScanner) calculateFileHash(filePath string) (string, error) { ... }
When I needed to optimize performance for 1TB, I was able to tinker with isolated parts - such as the file scanner - without having to change business logic, parsing or the CLI. This speeds up debugging, evolution and refactoring, keeps the risk of chain bugs low, and makes it clear "who does what" within the code base.
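For example, since hashing lives entirely inside the scanner layer, it could be parallelized with a small worker pool without the CLI or the detector ever noticing. A hedged sketch of that idea (the helper hashFilesConcurrently is hypothetical, not the project's current code):

// internal/filesystem - illustrative sketch of a possible optimization
package filesystem

import "sync"

// hashFilesConcurrently hashes many files with a fixed number of workers
// and returns a map of path -> SHA-256 hex digest.
func (fs *FileScanner) hashFilesConcurrently(paths []string, workers int) map[string]string {
    jobs := make(chan string)
    results := make(map[string]string, len(paths))
    var mu sync.Mutex
    var wg sync.WaitGroup

    for i := 0; i < workers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for path := range jobs {
                hash, err := fs.calculateFileHash(path)
                if err != nil {
                    continue // a real version would collect errors instead of skipping
                }
                mu.Lock()
                results[path] = hash
                mu.Unlock()
            }
        }()
    }

    for _, p := range paths {
        jobs <- p
    }
    close(jobs)
    wg.Wait()
    return results
}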
The scalability of this project goes beyond fast algorithms. It was the Clean Architecture structure, with a strict separation of layers, that allowed me to evolve, test and optimize easily without losing the readability and robustness of the system. This is visible in the folder hierarchy and in the well-defined responsibility of each module.
NOTE: I ran tests on Linux, macOS and even WSL2, because I know that in real life no one has a 100% standard environment. On Linux and Windows I didn't use a huge mass of data, but the runtime results were satisfactory for me.
Use of Clean Architecture and Clean Code concepts
I built the project in an attempt to follow Clean Architecture principles, as I wanted something extensible and easy to maintain. I separated each module well: from the file scanner to the detection algorithms and concrete operations. This made it easy to add new functions - for example, deleting only files of specific types or merging duplicate folders - in a modular way.
Let's start with the structure of the project, which reflects the concern for maintainability and extensibility through the Single Responsibility Principle (SRP). This layered structure prevents business logic from getting mixed up with file system details or argument parsing. See how each directory has a specific focus in the project:
├── cmd/              # main.go (entry point)
├── internal/
│   ├── config/       # config.go (flags and validation)
│   ├── detector/     # duplicate_detector.go (duplicate detection)
│   ├── filesystem/   # scanner.go (scanning, hashes)
│   ├── operations/   # file_operations.go, duplicate_operations.go
│   └── types/        # data types
Each layer has a clear function - anyone who wants to evolve the tool knows where to go.
Clear nomenclature and single-purpose functions
In scanner.go and duplicate_detector.go, all the functions have names that make it clear what they do:
// scanner.go
func (fs *FileScanner) ScanDirectory(directory string, recursive bool) ([]*FileInfo, error) { ... }
func (fs *FileScanner) CalculateFileHash(filePath string) (string, error) { ... }
Each function is small, does only one thing, and has descriptive verb names - one of the pillars of Clean Code.
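For completeness, here is a sketch of what a scan like this could look like using only the standard library; the FileInfo fields (Path, Size) come from snippets shown elsewhere in this post, but the body itself is an assumption, not the project's exact code:

// internal/filesystem - illustrative sketch of the scan step
package filesystem

import (
    "os"
    "path/filepath"
)

// ScanDirectory walks a directory and collects FileInfo entries;
// when recursive is false, subdirectories below the root are skipped.
func (fs *FileScanner) ScanDirectory(directory string, recursive bool) ([]*FileInfo, error) {
    var files []*FileInfo
    err := filepath.Walk(directory, func(path string, info os.FileInfo, err error) error {
        if err != nil {
            return err
        }
        if info.IsDir() {
            if !recursive && path != directory {
                return filepath.SkipDir
            }
            return nil
        }
        files = append(files, &FileInfo{Path: path, Size: info.Size()})
        return nil
    })
    return files, err
}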
Safe error handling and descriptive messages
Another point of Clean Code is contextualized error messages, an example of good practice:
// scanner.go
if err != nil {
    return fmt.Errorf("erro ao calcular hash de %s: %v", path, err)
}
This avoids generic errors and facilitates both debugging and use in production.
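A small related idiom worth mentioning: using %w instead of %v wraps the underlying error, so callers can still inspect it with errors.Is or errors.As. A hedged variant of the same snippet (not necessarily what the project does today):

// scanner.go - variant with error wrapping
if err != nil {
    return fmt.Errorf("error calculating hash of %s: %w", path, err)
}

A caller could then check, for example, errors.Is(err, os.ErrPermission) and handle permission problems separately.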
Reusable utility functions
Functions such as formatBytes(), getDeletionReason() and NormalizeFileExtension() are kept in utility files, promoting reuse and clarity:
// internal/operations/common.go
func formatBytes(bytes int64) string { ... }
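As a minimal sketch of what a helper like this could look like (the project's actual version may differ):

// internal/operations/common.go - illustrative sketch
package operations

import "fmt"

// formatBytes renders a byte count in a human-readable unit (B, KB, MB, GB, ...).
func formatBytes(bytes int64) string {
    const unit = 1024
    if bytes < unit {
        return fmt.Sprintf("%d B", bytes)
    }
    div, exp := int64(unit), 0
    for n := bytes / unit; n >= unit; n /= unit {
        div *= unit
        exp++
    }
    return fmt.Sprintf("%.1f %cB", float64(bytes)/float64(div), "KMGTPE"[exp])
}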
Comments and Documentation
Each public function has an explanatory comment, and the modules have architecture documentation to guide understanding of the structure:
// duplicate_detector.go
// DuplicateDetector is responsible for detecting duplicate files
type DuplicateDetector struct { ... }
These small details ensure that other developers can adapt, evolve or test new functions with ease, maintaining the quality of the code over time.
With regard to documentation, I also provide a README.md file with clear information on how to use the tool. After all, the aim was not for it to be used only by me, but by anyone who needs it; and because it is Open Source, anyone who uses it can analyze and adapt the code to their needs.
Practical Use
The interface is focused on simple commands, for example:
# Dry-run simulation for duplicates
./devops-go-duplicate-detector --directory ~/Downloads --del-dup --dry-run --verbose

# Actually remove duplicates
./devops-go-duplicate-detector --directory ~/Downloads --del-dup --recursive
Other operations include cleaning by type, intelligent renaming and merging duplicate folders. Everything can be seen in the readme or with the --help command.
Detailed Operations Report
Each run generates logs of what was processed, removed or changed, which makes it easy to review everything before making a final decision:
// In operations/duplicate_operations.go:
result.DeletedFiles = append(result.DeletedFiles, types.DeletedFileInfo{
    Path:      file.Path,
    Size:      file.Size,
    Reason:    reason,
    Hash:      hash,
    Timestamp: time.Now(),
})
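For reference, the record appended above suggests a type roughly like the following in internal/types; the field names come straight from the snippet, but the exact definition in the repository may differ:

// internal/types - reconstructed from the fields used above
package types

import "time"

// DeletedFileInfo describes one file removed (or simulated as removed) in a run.
type DeletedFileInfo struct {
    Path      string
    Size      int64
    Reason    string
    Hash      string
    Timestamp time.Time
}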
Some lessons learned and practical recommendations
Throughout development, I realized that the "dry-run" mode would be a lifesaver: especially for those who are afraid of losing important files - like me (lol). The report generated at the end of each action is essential for reviewing operations and preventing any disaster.
Another tip: before running a massive operation, always review the generated report. And if you have any doubts about a file, test it on a small folder first. The tool has been designed to be developer-friendly, but responsibility for the data remains with the user.
Making the Routine Lighter
Creating DevOps Go Duplicate Detector was also an exercise in simplifying chaotic routines. In the end, the tool not only saved me time but also helped me rethink backup processes, organization and automation of basic tasks that can become a nightmare without the right support.
More than code, this project translated my real-life experience into a practical solution. Each feature was born out of a concrete need and was designed to be secure, extensible and very transparent.
If you want to know more about the project, the technical details or even contribute, everything is open on GitHub. I believe that sharing these solutions is a fundamental part of evolving together.
So, are you brave enough to try out the tool? Let me know in the comments - and if you did try it, how did it go?
If you need any adaptation or a version focused on a specific use case, feel free to ask or open an issue.
https://github.com/nilsonvieira/devops-go-duplicate-detector



