Bioinformatics File Format Converter for Common Data Types
Abstract
A comprehensive tool for converting between bioinformatics file formats including FASTA, FASTQ, GenBank, PDB, BED, VCF, CSV, and JSON. Supports batch processing and validation.
Cleaned Submission Note
This revision replaces a raw JSON display with readable Markdown. The underlying tool description and skill instructions are preserved.
Tool Summary
Description: Convert between common bioinformatics file formats
Name: bioinformatics_format_converter
Version: 1.0.0
Input Schema
The original structured input schema is retained conceptually. Use the SKILL section below for executable instructions.
SKILL
SKILL: Bioinformatics Format Converter
Name
Bioinformatics Format Converter
Description
Converts between common bioinformatics file formats, including FASTA, GenBank, FASTQ, PDB, CSV, and TSV.
Input
- source_file: Source file path (string, required)
- target_format: Target format (string, required)
- options: Additional options (object, optional)
  - quality_threshold: FASTQ quality filter threshold (int, default: 20)
  - compression: Output compression option (boolean, default: false)
Supported Formats
| Source Format | Target Format | Description |
|---|---|---|
| FASTA | GenBank | DNA/protein sequence conversion |
| GenBank | FASTA | Sequence extraction |
| FASTQ | FASTA | Conversion after quality filtering |
| FASTQ | FASTQ | Quality filtering |
| PDB | MMTF | Structure format compression |
| CSV | TSV | Delimiter conversion |
| TSV | CSV | Delimiter conversion |
Execution Steps
Step 1: Detect Input File Format
1. Read file header
2. Identify format based on characteristics:
- FASTA: Starts with ">"
- GenBank: Contains "LOCUS" keyword
- FASTQ: Starts with "@", each record is 4 lines
- PDB: Starts with "HEADER" or "ATOM"
- CSV/TSV: Detect delimiter
Step 2: Parse File Content
1. Use BioPython or standard library for parsing
2. Extract sequence/structure/table data
3. Validate data integrity
Step 3: Convert to Target Format
1. Construct output based on target format
2. Apply any specified options
3. Handle special characters and format requirements
Step 4: Output Converted File
1. Write to target file
2. Return output path and statistics
Output
{
  "success": true,
  "output_file": "path/to/output.file",
  "input_format": "fasta",
  "output_format": "genbank",
  "records_processed": 5,
  "statistics": {
    "total_bases": 1500,
    "total_sequences": 5
  }
}
Error Handling
- File not found: Return error code FILE_NOT_FOUND
- Format not supported: Return error code UNSUPPORTED_FORMAT
- Parsing failed: Return error code PARSE_ERROR
- Invalid input: Return error code INVALID_INPUT
Tools
- biopython: Biological sequence and structure file parsing
- python standard library: CSV/TSV conversion, file operations
Integrity Note
This is a formatting cleanup revision. It does not introduce a new scientific claim.
Reproducibility: Skill File
Use this skill file to reproduce the research with an AI agent.
# SKILL: Bioinformatics Format Converter
## Name
Bioinformatics Format Converter
## Description
Converts between common bioinformatics file formats, including FASTA, GenBank, FASTQ, PDB, CSV, and TSV.
## Input
- `source_file`: Source file path (string, required)
- `target_format`: Target format (string, required)
- `options`: Additional options (object, optional)
  - `quality_threshold`: FASTQ quality filter threshold (int, default: 20)
  - `compression`: Output compression option (boolean, default: false)
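As an illustration of how these inputs might be checked before any conversion runs, here is a minimal sketch. The function name `normalize_request` and the choice to treat `quality_threshold` and `compression` as keys of `options` are assumptions made for this example; only the field names, types, and defaults come from the list above.

```python
from typing import Any, Dict, Optional

# Defaults taken from the Input list; everything else here is hypothetical.
DEFAULT_OPTIONS = {"quality_threshold": 20, "compression": False}


def normalize_request(source_file: str, target_format: str,
                      options: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Check the required fields and fill in default option values."""
    if not isinstance(source_file, str) or not source_file:
        raise ValueError("source_file is required and must be a non-empty string")
    if not isinstance(target_format, str) or not target_format:
        raise ValueError("target_format is required and must be a non-empty string")

    merged = {**DEFAULT_OPTIONS, **(options or {})}
    threshold = merged["quality_threshold"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(threshold, bool) or not isinstance(threshold, int):
        raise ValueError("quality_threshold must be an integer")
    if not isinstance(merged["compression"], bool):
        raise ValueError("compression must be a boolean")

    return {
        "source_file": source_file,
        "target_format": target_format.lower(),
        "options": merged,
    }
```

Lower-casing `target_format` is a convenience chosen here so that later lookups do not depend on how the caller capitalized the format name.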
## Supported Formats
| Source Format | Target Format | Description |
|---------------|--------------|-------------|
| FASTA | GenBank | DNA/protein sequence conversion |
| GenBank | FASTA | Sequence extraction |
| FASTQ | FASTA | Conversion after quality filtering |
| FASTQ | FASTQ | Quality filtering |
| PDB | MMTF | Structure format compression |
| CSV | TSV | Delimiter conversion |
| TSV | CSV | Delimiter conversion |
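One way to make the table above checkable in code is a set of (source, target) pairs. This is a sketch only; the names `SUPPORTED_CONVERSIONS` and `is_supported` are hypothetical and the pairs simply mirror the seven table rows.

```python
# Each pair mirrors one row of the Supported Formats table.
SUPPORTED_CONVERSIONS = frozenset({
    ("fasta", "genbank"),
    ("genbank", "fasta"),
    ("fastq", "fasta"),
    ("fastq", "fastq"),
    ("pdb", "mmtf"),
    ("csv", "tsv"),
    ("tsv", "csv"),
})


def is_supported(source_format: str, target_format: str) -> bool:
    """Return True if the table lists this source-to-target conversion."""
    return (source_format.lower(), target_format.lower()) in SUPPORTED_CONVERSIONS
```

A caller could use `is_supported` to decide when to return `UNSUPPORTED_FORMAT`, for example for FASTA to FASTQ, which the table does not list.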
## Execution Steps
### Step 1: Detect Input File Format
```
1. Read file header
2. Identify format based on characteristics:
- FASTA: Starts with ">"
- GenBank: Contains "LOCUS" keyword
- FASTQ: Starts with "@", each record is 4 lines
- PDB: Starts with "HEADER" or "ATOM"
- CSV/TSV: Detect delimiter
```
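The detection rules above could be sketched roughly as follows. This is an assumption-laden illustration, not the tool's actual code: `detect_format` is a hypothetical name, only the first non-blank line is inspected for the keyword checks, the FASTQ check additionally looks for a `+` separator line, and the delimiter is guessed by counting tabs and commas in the first line.

```python
def detect_format(text: str) -> str:
    """Guess the file format from the leading lines of `text`."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return "unknown"
    first = lines[0]

    if first.startswith(">"):
        return "fasta"
    if first.startswith("LOCUS"):
        return "genbank"
    # FASTQ: "@" header, and the third line of the record starts with "+".
    if first.startswith("@") and len(lines) >= 4 and lines[2].startswith("+"):
        return "fastq"
    if first.startswith(("HEADER", "ATOM")):
        return "pdb"

    # Tabular data: pick whichever delimiter is more frequent in the first line.
    tabs, commas = first.count("\t"), first.count(",")
    if tabs > 0 and tabs >= commas:
        return "tsv"
    if commas > 0:
        return "csv"
    return "unknown"
```

Counting delimiters is a deliberately simple heuristic; a CSV file whose first line contains tabs inside quoted fields could be misclassified, so a production version would likely need a sturdier check.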
### Step 2: Parse File Content
```
1. Use BioPython or standard library for parsing
2. Extract sequence/structure/table data
3. Validate data integrity
```
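To make the parse-and-validate step concrete, here is a small standard-library sketch for FASTQ. The skill names BioPython as the parser, and `Bio.SeqIO.parse(handle, "fastq")` would fill the same role; the hand-written `parse_fastq` and `FastqRecord` below are hypothetical stand-ins that keep the example dependency-free. The integrity checks shown (header marker, separator line, equal sequence and quality lengths) are assumptions about what "validate data integrity" covers.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def parse_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-lines-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # clean end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got: {header.strip()!r}")
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        qual = handle.readline().rstrip("\r\n")
        if not plus.startswith("+"):
            raise ValueError("missing '+' separator line (truncated record?)")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality strings differ in length")
        yield FastqRecord(header[1:].strip(), seq, qual)


# Usage: parse an in-memory example with one record.
records = list(parse_fastq(io.StringIO("@read1\nACGT\n+\nIIII\n")))
```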
### Step 3: Convert to Target Format
```
1. Construct output based on target format
2. Apply any specified options
3. Handle special characters and format requirements
```
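As one possible reading of "apply any specified options" for the FASTQ to FASTA row, the sketch below drops reads whose mean quality falls below `quality_threshold` and writes the rest as FASTA. Several details are assumptions not stated in the skill: qualities are decoded as Phred+33, the filter uses the per-read mean rather than per-base trimming, and sequences are wrapped at 60 characters. The function names are hypothetical.

```python
from typing import Iterable, Tuple


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def fastq_to_fasta(records: Iterable[Tuple[str, str, str]],
                   quality_threshold: int = 20,
                   line_width: int = 60) -> str:
    """Keep reads with mean quality >= threshold and format them as FASTA."""
    lines = []
    for name, sequence, quality in records:
        if mean_phred(quality) < quality_threshold:
            continue  # read fails the quality filter
        lines.append(f">{name}")
        # Wrap the sequence so long reads stay readable in the output.
        lines.extend(sequence[i:i + line_width]
                     for i in range(0, len(sequence), line_width))
    return "\n".join(lines) + ("\n" if lines else "")


# Usage: "I" encodes Phred 40 and "!" encodes Phred 0 under Phred+33.
example = fastq_to_fasta([("good", "ACGT", "IIII"), ("bad", "ACGT", "!!!!")])
```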
### Step 4: Output Converted File
```
1. Write to target file
2. Return output path and statistics
```
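The write step might look like the following sketch. The skill does not say which compression scheme the `compression` option selects; gzip and the appended `.gz` suffix are assumptions made here for illustration, and `write_output` is a hypothetical name.

```python
import gzip
from pathlib import Path


def write_output(text: str, output_path: str, compression: bool = False) -> str:
    """Write `text` to `output_path` and return the path actually written."""
    path = Path(output_path)
    if compression:
        # Assumption: "compression" means gzip, signalled by a ".gz" suffix.
        path = path.with_name(path.name + ".gz")
        with gzip.open(path, "wt", encoding="utf-8") as handle:
            handle.write(text)
    else:
        with open(path, "w", encoding="utf-8", newline="") as handle:
            handle.write(text)
    return str(path)
```

Returning the final path, rather than the requested one, lets the caller report the real `output_file` when the suffix has changed.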
## Output
```json
{
  "success": true,
  "output_file": "path/to/output.file",
  "input_format": "fasta",
  "output_format": "genbank",
  "records_processed": 5,
  "statistics": {
    "total_bases": 1500,
    "total_sequences": 5
  }
}
```
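A result object of this shape could be assembled as sketched below. The helper name `build_result` is hypothetical, and the sketch assumes `total_bases` is the summed length of all sequences and that `records_processed` equals the number of sequences; the skill does not define either field explicitly.

```python
import json
from typing import Any, Dict, Iterable


def build_result(output_file: str, input_format: str, output_format: str,
                 sequences: Iterable[str]) -> Dict[str, Any]:
    """Build the success payload from the converted sequences."""
    lengths = [len(seq) for seq in sequences]
    return {
        "success": True,
        "output_file": output_file,
        "input_format": input_format,
        "output_format": output_format,
        "records_processed": len(lengths),
        "statistics": {
            "total_bases": sum(lengths),    # assumed: sum of sequence lengths
            "total_sequences": len(lengths),
        },
    }


# Usage: five sequences of 300 bases each, serialized as JSON.
result = build_result("path/to/output.file", "fasta", "genbank", ["A" * 300] * 5)
payload = json.dumps(result, indent=2)
```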
## Error Handling
- File not found: Return error code `FILE_NOT_FOUND`
- Format not supported: Return error code `UNSUPPORTED_FORMAT`
- Parsing failed: Return error code `PARSE_ERROR`
- Invalid input: Return error code `INVALID_INPUT`
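One way to map failures onto these four codes is a small lookup over exception types, sketched below. The exception classes `UnsupportedFormatError` and `ParseError`, the function `to_error_response`, and the shape of the failure payload (`error_code`, `message`) are all assumptions for illustration; the skill specifies only the code names.

```python
from typing import Any, Dict


class UnsupportedFormatError(Exception):
    """Hypothetical: the requested conversion is not in the supported table."""


class ParseError(Exception):
    """Hypothetical: the input file could not be parsed."""


# Checked in order; the first matching exception type wins.
_ERROR_CODES = [
    (FileNotFoundError, "FILE_NOT_FOUND"),
    (UnsupportedFormatError, "UNSUPPORTED_FORMAT"),
    (ParseError, "PARSE_ERROR"),
    ((ValueError, TypeError), "INVALID_INPUT"),
]


def to_error_response(exc: Exception) -> Dict[str, Any]:
    """Translate a known exception into an error payload; re-raise others."""
    for exc_types, code in _ERROR_CODES:
        if isinstance(exc, exc_types):
            return {"success": False, "error_code": code, "message": str(exc)}
    raise exc
```

Unknown exceptions are re-raised rather than mapped to one of the four codes, so unexpected bugs are not silently reported as input errors.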
## Tools
- **biopython**: Biological sequence and structure file parsing
- **python standard library**: CSV/TSV conversion, file operations
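For the CSV/TSV rows, the standard library's `csv` module is enough. The sketch below is one plausible approach, with `convert_delimited` as a hypothetical name: it reads with one delimiter and writes with the other, letting the module add or drop quotes where a field contains the output delimiter.

```python
import csv
import io


def convert_delimited(text: str, source_delimiter: str = ",",
                      target_delimiter: str = "\t") -> str:
    """Re-emit delimited text with a different field delimiter."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=source_delimiter)
    out = io.StringIO(newline="")
    writer = csv.writer(out, delimiter=target_delimiter, lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


# Usage: a quoted field with an embedded comma survives as a single TSV field.
tsv_text = convert_delimited('gene,desc\nTP53,"tumor, suppressor"\n')
```

Going through `csv.reader` and `csv.writer` rather than a plain string replace is what keeps embedded delimiters and quotes intact.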