Batch File Processor for Large Scale Bioinformatics Workflows
Abstract
A scalable batch file processor designed for large-scale bioinformatics workflows. Features include regex-based batch renaming, file organization by extension, size, or date, and content statistics with report generation.
Cleaned Submission Note
This revision replaces a raw JSON display with readable Markdown. The underlying tool description and skill instructions are preserved.
Tool Summary
- Name: batch-file-processor
- Version: 1.0.0
- Description: Batch file processor for bioinformatics workflows
Input Schema
The original structured input schema is retained conceptually; the skill file below provides the executable instructions.
Integrity Note
This is a formatting cleanup revision. It does not introduce a new scientific claim.
Reproducibility: Skill File
Use this skill file to reproduce the research with an AI agent.
# Batch File Processor for Bioinformatics
## Name
Batch File Processor for Bioinformatics
## Description
Batch processes bioinformatics files, supporting file renaming, organization, content statistics, and report generation.
## Input
- `directory`: Target directory path
- `rules`: Processing rules object
## Features
### 1. Batch Rename (batch_rename)
- Use regex to match file names
- Support capture group replacement
- Parameters: `pattern` (regex), `replacement` (replacement string)
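The rename rule can be sketched with Python's `re` module. This is a minimal illustration; `plan_renames` and its list-of-names input are assumptions for the sketch, not part of the skill's API:

```python
import re

def plan_renames(names, pattern, replacement):
    """Return (old_name, new_name) pairs for names matching `pattern`.

    Non-matching names are skipped. Capture groups in `pattern` are
    referenced in `replacement` as \\1, \\2, ...
    """
    compiled = re.compile(pattern)
    return [(n, compiled.sub(replacement, n))
            for n in names if compiled.fullmatch(n)]

# Pattern and replacement taken from the Examples section:
plan = plan_renames(
    ["sample_001_liver.fasta", "notes.txt"],
    r"sample_(\d+)_(.+)\.fasta",
    r"S\1_\2.fasta",
)
# plan == [("sample_001_liver.fasta", "S001_liver.fasta")]
```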
### 2. File Organization (organize)
- Organize by extension
- Organize by file size (threshold: small/medium/large)
- Organize by modification date (YYYY-MM-DD folders)
- Parameters: `by` (extension/size/date), `size_thresholds` (optional)
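A sketch of the folder-selection logic. The default size thresholds are illustrative assumptions; the skill leaves `size_thresholds` open:

```python
from datetime import date

def classify_by_size(size_bytes, thresholds=None):
    """Bucket a file size as small/medium/large.

    `thresholds` is assumed to be (small_max, medium_max) in bytes;
    the defaults below are illustrative only.
    """
    small_max, medium_max = thresholds or (1_000_000, 100_000_000)
    if size_bytes <= small_max:
        return "small"
    if size_bytes <= medium_max:
        return "medium"
    return "large"

def target_folder(name, by, size=None, mtime=None, thresholds=None):
    """Choose a destination folder name for one file under rule `by`."""
    if by == "extension":
        return name.rsplit(".", 1)[-1] if "." in name else "no_extension"
    if by == "size":
        return classify_by_size(size, thresholds)
    if by == "date":
        return date.fromtimestamp(mtime).isoformat()  # YYYY-MM-DD folder
    raise ValueError(f"unknown organize rule: {by}")
```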
### 3. Content Statistics (count)
- FASTA: Count number of sequences, total length
- FASTQ: Count number of reads
- TXT/CSV: Count lines, characters
- Parameters: `file_types` (file types to count)
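A minimal sketch of the counting rules, assuming well-formed input; FASTQ is treated as strict four-line records:

```python
def count_fasta(lines):
    """Count sequences and total residue length in FASTA-formatted lines.

    Headers start with '>'; every other non-empty line is sequence data.
    """
    n_seqs = 0
    total_len = 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            n_seqs += 1
        elif line:
            total_len += len(line)
    return n_seqs, total_len

def count_fastq(lines):
    """Count reads in FASTQ-formatted lines (4 lines per record)."""
    n_lines = sum(1 for line in lines if line.strip())
    return n_lines // 4

fasta = [">seq1", "ACGT", "ACGT", ">seq2", "GGCC"]
# count_fasta(fasta) == (2, 12): two headers, 12 residues total
```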
### 4. Generate Report (report)
- Generate JSON or TXT format report
- Contains file list, processing statistics, operation logs
- Parameters: `format` (json/txt)
## Execution Steps
### Step 1: Scan Directory
```
1. Use pathlib to recursively scan directory
2. Record all file information (path, size, mtime, extension)
3. Return file list
```
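Step 1 can be sketched with `pathlib` as follows; the record fields match the ones listed above, and the dict layout is an assumption of the sketch:

```python
from pathlib import Path

def scan_directory(directory):
    """Recursively collect (path, size, mtime, extension) for every file."""
    records = []
    for path in sorted(Path(directory).rglob("*")):
        if path.is_file():
            stat = path.stat()
            records.append({
                "path": str(path),
                "size": stat.st_size,
                "mtime": stat.st_mtime,
                "extension": path.suffix.lstrip("."),
            })
    return records
```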
### Step 2: Apply Processing Rules
```
1. Filter files based on rule type
2. Generate operation list (pending rename/move operations)
3. Validate operation safety (check for target path conflicts)
```
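The safety validation in step 3 might look like this sketch, which only detects duplicate destinations within the plan; a real run would also check paths already on disk:

```python
def validate_operations(operations):
    """Split planned (src, dst) moves into safe ops and conflicts.

    A destination conflicts if two planned operations target the
    same path.
    """
    seen = set()
    safe, conflicts = [], []
    for src, dst in operations:
        if dst in seen:
            conflicts.append((src, dst))
        else:
            seen.add(dst)
            safe.append((src, dst))
    return safe, conflicts
```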
### Step 3: Execute Operations
```
1. Execute file operations in order
2. Use shutil for large file moves
3. Record operation logs
4. Collect statistics
```
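A sketch of the execution loop. `shutil.move` is used because it handles cross-filesystem moves that plain `os.rename` cannot, and per-file errors are logged rather than aborting the batch:

```python
import shutil
from pathlib import Path

def execute_operations(operations):
    """Execute (src, dst) moves in order, logging each outcome."""
    log, errors = [], []
    for src, dst in operations:
        try:
            Path(dst).parent.mkdir(parents=True, exist_ok=True)
            shutil.move(src, dst)
            log.append({"op": "move", "src": src, "dst": dst, "ok": True})
        except OSError as exc:
            errors.append({"op": "move", "src": src, "error": str(exc)})
    return log, errors
```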
### Step 4: Generate Operation Report
```
1. Summarize processing results
2. Generate file list
3. Output statistics summary
4. Save report file
```
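The report step can be sketched as follows; the field names are illustrative, since the skill only fixes the `format` parameter (json/txt):

```python
import json

def build_report(records, log, errors, fmt="json"):
    """Assemble the operation report in JSON or plain-text form."""
    summary = {
        "files_processed": len(records),
        "operations": log,
        "errors": errors,
    }
    if fmt == "json":
        return json.dumps(summary, indent=2)
    return "\n".join([
        f"Files processed: {summary['files_processed']}",
        f"Operations: {len(log)}",
        f"Errors: {len(errors)}",
    ])
```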
## Output
- `processed_files`: Number of files processed
- `operations`: List of executed operations
- `report`: Operation report (contains statistics, operation logs)
- `errors`: List of error messages
## Tools
- Python standard library: `os`, `shutil`, `pathlib`, `re`, `json`
- No third-party dependencies required
## Examples
### Input
```json
{
"directory": "/data/sequencing",
"rules": {
"batch_rename": {
"pattern": "sample_(\\d+)_(.+)\\.fasta",
"replacement": "S\\1_\\2.fasta"
},
"organize": {
"by": "extension"
},
"count": {
"file_types": ["fasta", "fastq"]
}
}
}
```
### Output
```json
{
"processed_files": 45,
"operations": [...],
"report": {...}
}
```
## Error Handling
- File not found: Skip and log
- Permission error: Report and continue
- Path conflict: Automatically add numeric suffix
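The numeric-suffix strategy for path conflicts can be sketched as below; the `existing` set stands in for a filesystem check such as `Path.exists`, so the sketch is testable without touching disk:

```python
from pathlib import Path

def resolve_conflict(dst, existing):
    """Add a numeric suffix (_1, _2, ...) until dst no longer collides."""
    path = Path(dst)
    candidate = dst
    n = 1
    while candidate in existing:
        candidate = str(path.with_name(f"{path.stem}_{n}{path.suffix}"))
        n += 1
    return candidate
```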