spacr.sequencing

Module Contents

spacr.sequencing.map_sequences_to_names(csv_file, sequences, rc)

Maps DNA sequences to their corresponding names based on a CSV file.

Parameters:

csv_file (str) – Path to the CSV file containing ‘sequence’ and ‘name’ columns.
sequences (list of str) – List of DNA sequences to map.
rc (bool) – If True, reverse complement the sequences in the CSV before mapping.

Returns:

A list of names corresponding to the input sequences. If a sequence is not found, pd.NA is returned in its place.

Return type:

list

Notes

The CSV file must contain columns named ‘sequence’ and ‘name’.
If rc is True, sequences in the CSV are reverse complemented prior to mapping.
The entries of the sequences argument are never altered; only sequences in the CSV are reverse complemented.
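
The lookup described above can be sketched with the standard library alone. This is an illustration rather than the library's implementation: the helper names are hypothetical, the CSV is passed as text instead of a path, and None stands in for pd.NA so that the sketch does not depend on pandas.

```python
import csv
import io

# Hypothetical helper: reverse complement for upper-case DNA.
def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]

# Hypothetical stand-in for map_sequences_to_names; takes CSV text, not a path.
def map_to_names(csv_text, sequences, rc=False):
    lookup = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Only the reference sequences are reverse complemented, never the queries.
        key = reverse_complement(row["sequence"]) if rc else row["sequence"]
        lookup[key] = row["name"]
    # None marks a sequence with no match (the documented function returns pd.NA).
    return [lookup.get(seq) for seq in sequences]

reference = "sequence,name\nAAGG,g1\nTTTC,g2\n"
names_plain = map_to_names(reference, ["AAGG", "CCCC"])
names_rc = map_to_names(reference, ["CCTT", "GAAA"], rc=True)
```

Here "CCCC" has no entry in the reference, so its slot holds the missing-value marker, while the rc=True call matches queries against the reverse-complemented reference.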

spacr.sequencing.save_df_to_hdf5(df, hdf5_file, key='df', comp_type='zlib', comp_level=5)

Saves a pandas DataFrame to an HDF5 file, optionally appending to an existing dataset.

Parameters:

df (pd.DataFrame) – The DataFrame to save.
hdf5_file (str) – Path to the target HDF5 file.
key (str, optional) – Key under which to store the DataFrame. Defaults to ‘df’.
comp_type (str, optional) – Compression algorithm to use (e.g., ‘zlib’, ‘bzip2’, ‘blosc’). Defaults to ‘zlib’.
comp_level (int, optional) – Compression level (0–9). Higher values yield better compression at the cost of speed. Defaults to 5.

Returns:

None

Notes

If the specified key already exists in the HDF5 file, the new DataFrame is appended to it.
The combined DataFrame is saved in ‘table’ format to support appending and querying.
Errors encountered during saving are printed to standard output.

spacr.sequencing.save_unique_combinations_to_csv(unique_combinations, csv_file)

Saves or appends a DataFrame of unique gRNA combinations to a CSV file, aggregating duplicates.

Parameters:

unique_combinations (pd.DataFrame) – DataFrame containing ‘rowID’, ‘columnID’, and ‘grna_name’ columns, along with associated count or metric columns.
csv_file (str) – Path to the CSV file where data will be saved.

Returns:

None

Notes

If the file exists, it reads the existing contents and appends the new data.
Duplicate combinations (same ‘rowID’, ‘columnID’, ‘grna_name’) are summed.
The resulting DataFrame is saved with index included.
Any exception during the process is caught and printed to stdout.
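
The append-then-sum behaviour in the notes above can be illustrated without any file I/O. The sketch below assumes a single count column and uses a Counter in place of a DataFrame; the data values are invented for the example. In pandas, the same aggregation is commonly written as a groupby on the three key columns followed by sum.

```python
from collections import Counter

# Invented example rows: (rowID, columnID, grna_name, count).
existing = [("r1", "c1", "gA", 3), ("r1", "c2", "gB", 1)]
new = [("r1", "c1", "gA", 2), ("r2", "c1", "gC", 5)]

# Appending and then summing duplicates keeps one row per unique triplet.
totals = Counter()
for row_id, column_id, grna_name, count in existing + new:
    totals[(row_id, column_id, grna_name)] += count
```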

spacr.sequencing.save_qc_df_to_csv(qc_df, qc_csv_file)

Saves or appends a QC (quality control) DataFrame to a CSV file by summing overlapping entries.

Parameters:

qc_df (pd.DataFrame) – DataFrame containing numeric QC metrics (e.g., counts, read stats).
qc_csv_file (str) – Path to the CSV file where the QC data will be saved.

Returns:

None

Notes

If the file exists, it reads the existing QC data and adds the new values to it (element-wise).
If the file doesn’t exist, it creates a new one.
The final DataFrame is saved without including the index.
Any exception is caught and logged to stdout.

spacr.sequencing.extract_sequence_and_quality(sequence, quality, start, end)

Extracts a subsequence and its corresponding quality scores.

Parameters:

sequence (str) – DNA sequence string.
quality (str) – Quality string corresponding to the sequence.
start (int) – Start index of the region to extract.
end (int) – End index of the region to extract (exclusive).

Returns:

(subsequence, subquality) as strings.

Return type:

tuple
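
Since end is exclusive, the extraction behaves like ordinary Python slicing applied to both strings. A minimal sketch, with a hypothetical function name and made-up input:

```python
# Hypothetical equivalent of the documented behaviour: slice both strings alike.
def extract_region(sequence, quality, start, end):
    return sequence[start:end], quality[start:end]

subsequence, subquality = extract_region("ACGTACGT", "IIIIHHHH", 2, 6)
```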

spacr.sequencing.create_consensus(seq1, qual1, seq2, qual2)

Constructs a consensus DNA sequence from two reads with associated quality scores.

Parameters:

seq1 (str) – First DNA sequence.
qual1 (str) – Quality scores for seq1 (as ASCII characters or integer-encoded).
seq2 (str) – Second DNA sequence.
qual2 (str) – Quality scores for seq2.

Returns:

Consensus sequence, selecting the base with the higher quality at each position. If one base is ‘N’, the non-‘N’ base is chosen regardless of quality.

Return type:

str

spacr.sequencing.get_consensus_base(bases)

Selects the most reliable base from a list of two base-quality pairs.

Parameters:

bases (list of tuples) – Each tuple contains (base, quality_score); the expected length is 2.

Returns:

The consensus base. Prefers non-‘N’ bases and higher quality scores.

Return type:

str
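
The per-position rule shared by create_consensus and get_consensus_base can be sketched as follows. The function names are hypothetical, qualities are compared by the code point of each ASCII character, and ties go to the first read; the tie-breaking rule and the quality encoding are assumptions, not documented behaviour.

```python
# Hypothetical per-position rule: prefer a real base over 'N', then higher quality.
def consensus_base(bases):
    (base1, qual1), (base2, qual2) = bases
    if base1 == "N":
        return base2
    if base2 == "N":
        return base1
    # Assumption: on equal quality, keep the base from the first read.
    return base1 if qual1 >= qual2 else base2

# Hypothetical consensus over two equal-length reads with ASCII quality strings.
def consensus(seq1, qual1, seq2, qual2):
    return "".join(
        consensus_base([(b1, ord(q1)), (b2, ord(q2))])
        for b1, q1, b2, q2 in zip(seq1, qual1, seq2, qual2)
    )

result = consensus("ACNT", "II#5", "ACGA", "5IIF")
```

In this example the third position takes ‘G’ because the first read has ‘N’ there, and the fourth position takes ‘A’ because its quality character ‘F’ ranks above ‘5’.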

spacr.sequencing.reverse_complement(seq)

Computes the reverse complement of a DNA sequence.

Parameters:

seq (str) – Input DNA sequence.

Returns:

Reverse complement of the input sequence.

Return type:

str
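
For reference, a reverse complement is usually computed by complementing each base and reversing the result. The sketch below uses a translation table; the name revcomp and the handling of ‘N’ and lower-case input are choices made for this example, not a statement about the library's behaviour.

```python
# Translation table for complementing bases; 'N' maps to itself.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

# Hypothetical helper: complement every base, then reverse the string.
def revcomp(seq):
    return seq.translate(COMPLEMENT)[::-1]

example = revcomp("ATGCN")
```

Applying the operation twice returns the original sequence, which makes a convenient sanity check.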

spacr.sequencing.process_chunk(chunk_data)

Processes a chunk of sequencing reads to extract and map barcode sequences to corresponding names.

This function handles both single-end and paired-end FASTQ data. It searches for a target barcode sequence in each read, extracts a consensus region around it, applies a regex to extract barcodes, and maps those to known IDs using reference CSVs. Quality control data and unique combinations are also computed.

Parameters:

chunk_data (tuple) –

Contains either 9 or 10 elements:

For paired-end mode (10 elements):

r1_chunk (list): List of strings, each 4-line block from R1 FASTQ.
r2_chunk (list): List of strings, each 4-line block from R2 FASTQ.
regex (str): Regex pattern with named groups (‘rowID’, ‘columnID’, ‘grna’).
target_sequence (str): Sequence to anchor barcode extraction.
offset_start (int): Offset from target_sequence to start consensus extraction.
expected_end (int): Length of the region to extract.
column_csv (str): Path to column barcode reference CSV.
grna_csv (str): Path to gRNA barcode reference CSV.
row_csv (str): Path to row barcode reference CSV.
fill_na (bool): Whether to fill unmapped names with raw barcode sequences.

For single-end mode (9 elements):

Same as above, but r2_chunk is omitted.

Returns:

df (pd.DataFrame): Full dataframe with columns: [‘read’, ‘column_sequence’, ‘columnID’, ‘row_sequence’, ‘rowID’, ‘grna_sequence’, ‘grna_name’]
unique_combinations (pd.DataFrame): Count of each unique (rowID, columnID, grna_name) triplet.
qc_df (pd.DataFrame): Summary of missing values and total reads.

Return type:

tuple
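
The anchor-and-regex step for a single read might look like the sketch below. Everything concrete in it is invented for illustration: the anchor, the pattern, the barcode lengths, and the read. It also assumes the region starts offset_start characters after the first character of the anchor and spans expected_end characters, which is one reading of the parameter descriptions above.

```python
import re

# Hypothetical pattern with the three named groups the documentation requires.
pattern = re.compile(r"^(?P<columnID>.{4})TTGA(?P<grna>.{6})CCAT(?P<rowID>.{4})$")
target_sequence = "GGCC"   # hypothetical anchor
offset_start = 4           # assumption: counted from the start of the anchor
expected_end = 22          # length of the region handed to the regex

def extract_barcodes(read):
    anchor = read.find(target_sequence)
    if anchor == -1:
        return None  # read does not contain the anchor
    start = anchor + offset_start
    region = read[start:start + expected_end]
    match = pattern.match(region)
    return match.groupdict() if match else None

# Made-up read: flank, anchor, column barcode, spacer, gRNA, spacer, row barcode, flank.
read = "TTTT" + "GGCC" + "ACGT" + "TTGA" + "AACCGG" + "CCAT" + "TGCA" + "AAAA"
barcodes = extract_barcodes(read)
missing = extract_barcodes("AAAAAAAAAAAA")
```

The extracted barcode strings are what would then be looked up in the column, gRNA, and row reference CSVs.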

spacr.sequencing.saver_process(save_queue, hdf5_file, save_h5, unique_combinations_csv, qc_csv_file, comp_type, comp_level)

Continuously reads data from a multiprocessing queue and saves it to disk in various formats.

This function is intended to run in a separate process. It terminates when it receives the “STOP” sentinel value.

Parameters:

save_queue (multiprocessing.Queue) – Queue containing tuples of (df, unique_combinations, qc_df).
hdf5_file (str) – Path to the HDF5 file to store full reads (only used if save_h5 is True).
save_h5 (bool) – Whether to save the full reads DataFrame to HDF5.
unique_combinations_csv (str) – Path to the CSV file for aggregated barcode combinations.
qc_csv_file (str) – Path to the CSV file for quality control statistics.
comp_type (str) – Compression algorithm for HDF5 (e.g., ‘zlib’).
comp_level (int) – Compression level for HDF5.
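
The sentinel-terminated consumer loop described above follows a common pattern. The sketch below shows it with a thread and queue.Queue so that it runs anywhere; the documented function uses a multiprocessing queue in a separate process, and the list used as a sink here is a hypothetical replacement for the file writes.

```python
import queue
import threading

# Hypothetical consumer: drain the queue until the "STOP" sentinel arrives.
def saver(save_queue, sink):
    while True:
        item = save_queue.get()
        # Check the type first so that non-string payloads are never compared to "STOP".
        if isinstance(item, str) and item == "STOP":
            break
        sink.append(item)  # stand-in for writing df, unique_combinations, qc_df to disk

save_queue = queue.Queue()
saved = []
worker = threading.Thread(target=saver, args=(save_queue, saved))
worker.start()

# Producers enqueue (df, unique_combinations, qc_df) tuples; strings stand in here.
for batch in [("df1", "combos1", "qc1"), ("df2", "combos2", "qc2")]:
    save_queue.put(batch)
save_queue.put("STOP")
worker.join()
```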

spacr.sequencing.paired_read_chunked_processing(r1_file, r2_file, regex, target_sequence, offset_start, expected_end, column_csv, grna_csv, row_csv, save_h5, comp_type, comp_level, hdf5_file, unique_combinations_csv, qc_csv_file, chunk_size=10000, n_jobs=None, test=False, fill_na=False)

Processes paired-end FASTQ files in chunks to extract barcoded sequences and generate consensus reads.

This function identifies sequences matching a regular expression in both R1 and R2 reads, extracts barcodes, and maps them to user-defined identifiers. Processed data is saved incrementally using a separate process.

Parameters:

r1_file (str) – Path to the gzipped R1 FASTQ file.
r2_file (str) – Path to the gzipped R2 FASTQ file.
regex (str) – Regular expression with named capture groups: ‘rowID’, ‘columnID’, and ‘grna’.
target_sequence (str) – Anchor sequence to align from.
offset_start (int) – Offset from anchor to start consensus extraction.
expected_end (int) – Length of the consensus region to extract.
column_csv (str) – Path to CSV file mapping column barcode sequences to IDs.
grna_csv (str) – Path to CSV file mapping gRNA barcode sequences to names.
row_csv (str) – Path to CSV file mapping row barcode sequences to IDs.
save_h5 (bool) – Whether to save the full reads DataFrame to HDF5.
comp_type (str) – Compression algorithm for HDF5 (e.g., ‘zlib’).
comp_level (int) – Compression level for HDF5.
hdf5_file (str) – Path to the HDF5 output file.
unique_combinations_csv (str) – Path to CSV file for saving unique row/column/gRNA combinations.
qc_csv_file (str) – Path to CSV file for saving QC summary (e.g., NaN counts).
chunk_size (int, optional) – Number of reads per batch. Defaults to 10000.
n_jobs (int, optional) – Number of parallel workers. Defaults to cpu_count() - 3.
test (bool, optional) – If True, processes only a single chunk and prints the result. Defaults to False.
fill_na (bool, optional) – If True, fills unmapped IDs with raw barcode sequences. Defaults to False.
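
Reading a gzipped FASTQ file in batches of chunk_size reads can be sketched as below. The generator name is hypothetical and the FASTQ content is synthetic, compressed in memory so the example needs no files; how the library itself reads and dispatches chunks is not shown in this documentation.

```python
import gzip
import io
from itertools import islice

# Hypothetical generator: yield lists of up to chunk_size reads,
# each read being one 4-line FASTQ record joined into a single string.
def read_chunks(handle, chunk_size):
    while True:
        lines = list(islice(handle, 4 * chunk_size))
        if not lines:
            return
        yield ["".join(lines[i:i + 4]) for i in range(0, len(lines), 4)]

# Synthetic five-read FASTQ, gzip-compressed in memory.
fastq_text = "".join(f"@read{i}\nACGT\n+\nIIII\n" for i in range(5))
buffer = io.BytesIO(gzip.compress(fastq_text.encode()))

with gzip.open(buffer, "rt") as handle:
    chunks = list(read_chunks(handle, chunk_size=2))

chunk_sizes = [len(chunk) for chunk in chunks]
```

For paired-end data, the same generator would be run over R1 and R2 in lockstep so that each pair of chunks covers the same reads.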

spacr.sequencing.single_read_chunked_processing(r1_file, r2_file, regex, target_sequence, offset_start, expected_end, column_csv, grna_csv, row_csv, save_h5, comp_type, comp_level, hdf5_file, unique_combinations_csv, qc_csv_file, chunk_size=10000, n_jobs=None, test=False, fill_na=False)

Processes single-end FASTQ data in chunks to extract barcoded sequences and map them to identifiers.

This function reads gzipped R1 FASTQ data, detects barcode-containing sequences using a target anchor and regex, and maps row, column, and gRNA barcodes to user-defined identifiers. Results are processed in parallel and saved incrementally via a background process.

Parameters:

r1_file (str) – Path to gzipped R1 FASTQ file.
r2_file (str) – Placeholder for interface consistency; not used in single-end mode.
regex (str) – Regular expression with named capture groups: ‘rowID’, ‘columnID’, and ‘grna’.
target_sequence (str) – Anchor sequence used to locate barcode region.
offset_start (int) – Offset from anchor to start barcode parsing.
expected_end (int) – Length of the barcode region to extract.
column_csv (str) – Path to CSV file mapping column barcode sequences to IDs.
grna_csv (str) – Path to CSV file mapping gRNA barcode sequences to names.
row_csv (str) – Path to CSV file mapping row barcode sequences to IDs.
save_h5 (bool) – Whether to save the full reads DataFrame to HDF5 format.
comp_type (str) – Compression algorithm for HDF5 (e.g., ‘zlib’).
comp_level (int) – Compression level for HDF5.
hdf5_file (str) – Output HDF5 file path.
unique_combinations_csv (str) – Output path for CSV summarizing row/column/gRNA combinations.
qc_csv_file (str) – Output path for CSV summarizing missing values and total reads.
chunk_size (int, optional) – Number of reads per batch. Defaults to 10000.
n_jobs (int, optional) – Number of parallel worker processes. Defaults to cpu_count() - 3.
test (bool, optional) – If True, processes only the first chunk and prints its result. Defaults to False.
fill_na (bool, optional) – If True, fills missing mapped IDs with their corresponding barcode sequences. Defaults to False.

spacr.sequencing.generate_barecode_mapping(settings={})

Orchestrates barcode extraction and mapping from gzipped sequencing data using user-defined or default settings.

This function parses sequencing reads from single-end or paired-end FASTQ (.gz) files, extracts barcode regions using a regular expression, maps them to row, column, and gRNA identifiers, and saves the results to disk. Results include the full annotated reads (optional), barcode combination counts, and a QC summary.

Parameters:

settings (dict, optional) – Dictionary containing parameters required for barcode mapping. If not provided, default values will be applied. Important keys include:

‘src’ (str): Source directory containing gzipped FASTQ files.
‘mode’ (str): Either ‘single’ or ‘paired’ for single-end or paired-end processing.
‘single_direction’ (str): If mode is ‘single’, specifies which read to use (‘R1’ or ‘R2’).
‘regex’ (str): Regular expression with capture groups ‘rowID’, ‘columnID’, and ‘grna’.
‘target_sequence’ (str): Anchor sequence to locate barcode start position.
‘offset_start’ (int): Offset from the anchor to the barcode start.
‘expected_end’ (int): Expected barcode region length.
‘column_csv’ (str): CSV file mapping column barcodes to names.
‘grna_csv’ (str): CSV file mapping gRNA barcodes to names.
‘row_csv’ (str): CSV file mapping row barcodes to names.
‘save_h5’ (bool): Whether to save annotated reads to HDF5.
‘comp_type’ (str): Compression algorithm for HDF5.
‘comp_level’ (int): Compression level for HDF5.
‘chunk_size’ (int): Number of reads to process per batch.
‘n_jobs’ (int): Number of parallel processes for barcode mapping.
‘test’ (bool): If True, only processes the first chunk for testing.
‘fill_na’ (bool): If True, fills unmapped barcodes with raw sequence instead of NaN.

Side Effects:

Saves the following files in the output directory:

annotated_reads.h5 (optional): Annotated read information in HDF5 format.
unique_combinations.csv: Count table of (rowID, columnID, grna_name) triplets.
qc.csv: Summary of missing values and read counts.
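
A settings dictionary covering the keys listed above might be assembled as follows. Every value is a placeholder chosen for illustration (paths, regex, anchor, and numeric values included) and none of them should be read as the library's defaults. The call itself is left commented out because it requires spacr and real input files.

```python
import re

# All values below are hypothetical placeholders, not library defaults.
settings = {
    "src": "path/to/fastq_dir",
    "mode": "paired",                 # 'single' or 'paired'
    "single_direction": "R1",         # only consulted when mode is 'single'
    "regex": r"^(?P<columnID>.{4})TTGA(?P<grna>.{6})CCAT(?P<rowID>.{4})$",
    "target_sequence": "GGCC",
    "offset_start": 4,
    "expected_end": 22,
    "column_csv": "path/to/column_barcodes.csv",
    "grna_csv": "path/to/grna_barcodes.csv",
    "row_csv": "path/to/row_barcodes.csv",
    "save_h5": False,
    "comp_type": "zlib",
    "comp_level": 5,
    "chunk_size": 10000,
    "n_jobs": 4,
    "test": True,                     # process only the first chunk
    "fill_na": False,
}

# Quick check that the regex defines the three named groups the pipeline expects.
group_names = set(re.compile(settings["regex"]).groupindex)

# from spacr.sequencing import generate_barecode_mapping
# generate_barecode_mapping(settings)
```

Starting with test set to True is a cheap way to confirm that the regex and anchor match the reads before committing to a full run.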

spacr.sequencing.barecodes_reverse_complement(csv_file)

Reads a barcode CSV file, computes the reverse complement of each sequence, and saves the result to a new CSV.

This function assumes the input CSV contains a column named ‘sequence’ with DNA barcodes. It computes the reverse complement for each sequence and saves the modified DataFrame to a new file with ‘_RC’ appended to the original filename.

Parameters:

csv_file (str) – Path to the input CSV file. Must contain a column named ‘sequence’.

Side Effects:

Saves a new CSV file in the same directory with reverse-complemented sequences.
Prints the path of the saved file.

Output:

New file path format: <original_filename>_RC.csv
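
The file naming and the per-row transformation can be sketched with the standard library. The function name write_rc_csv is hypothetical, the input file is created in a temporary directory, and the sketch assumes upper-case sequences; it illustrates the described behaviour rather than reproducing the library's code.

```python
import csv
import os
import tempfile

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

# Hypothetical re-implementation: write <name>_RC.csv next to the input file.
def write_rc_csv(csv_file):
    root, ext = os.path.splitext(csv_file)
    out_path = f"{root}_RC{ext}"
    with open(csv_file, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Only the 'sequence' column changes; other columns are copied as-is.
            row["sequence"] = row["sequence"].translate(COMPLEMENT)[::-1]
            writer.writerow(row)
    return out_path

with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "grna_barcodes.csv")
    with open(source, "w", newline="") as handle:
        handle.write("sequence,name\nAACG,g1\n")
    out_path = write_rc_csv(source)
    out_name = os.path.basename(out_path)
    with open(out_path, newline="") as handle:
        rc_rows = list(csv.DictReader(handle))
```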

spacr.sequencing.graph_sequencing_stats(settings)

Analyzes and visualizes sequencing quality metrics to determine an optimal fraction threshold that maximizes unique gRNA representation per well across plates.

This function reads one or more CSV files containing count data, filters out control wells, calculates the fraction of reads per gRNA in each well, and identifies the minimum fraction required to recover a target average number of unique gRNAs per well. It generates plots to help visualize the chosen threshold and spatial distribution of unique gRNA counts.

Parameters:

settings (dict) – Dictionary containing the following keys:

‘count_data’ (str or list of str): Paths to CSV file(s) with ‘grna’, ‘count’, ‘rowID’, ‘columnID’ columns.
‘target_unique_count’ (int): Target number of unique gRNAs per well to recover.
‘filter_column’ (str): Column name to filter out control wells.
‘control_wells’ (list): List of control well labels to exclude.
‘log_x’ (bool): Whether to log-scale the x-axis in the threshold plot.
‘log_y’ (bool): Whether to log-scale the y-axis in the threshold plot.

Returns:

Closest fraction threshold that approximates the target unique gRNA count per well.

Return type:

float

Side Effects:

Saves a PDF plot of unique gRNA count vs fraction threshold.
Saves a spatial plate map of unique gRNA counts.
Prints threshold and summary statistics.
Displays intermediate DataFrames for inspection.
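
The threshold search can be illustrated on a toy count table. The sketch below is one plausible reading of the description, not the library's algorithm: it computes each gRNA's fraction of its well's reads, counts gRNAs at or above a candidate threshold, and picks the candidate whose mean count per well is closest to the target. The data, the candidate grid, and the use of a greater-or-equal comparison are all assumptions; control-well filtering and plotting are omitted.

```python
from collections import defaultdict

# Invented counts: (well, grna, count).
counts = [
    ("A1", "g1", 80), ("A1", "g2", 15), ("A1", "g3", 5),
    ("A2", "g1", 50), ("A2", "g4", 45), ("A2", "g5", 5),
]

# Total reads per well, then each gRNA's fraction of its well.
well_totals = defaultdict(int)
for well, _, count in counts:
    well_totals[well] += count
fractions = [(well, grna, count / well_totals[well]) for well, grna, count in counts]

# Mean number of gRNAs per well whose fraction reaches the threshold.
def mean_unique(threshold):
    per_well = defaultdict(int)
    for well, _, fraction in fractions:
        if fraction >= threshold:
            per_well[well] += 1
    return sum(per_well.values()) / len(well_totals)

target_unique_count = 2
candidates = [0.01, 0.1, 0.2, 0.6]  # hypothetical grid of thresholds
best = min(candidates, key=lambda t: abs(mean_unique(t) - target_unique_count))
```

With this data a threshold of 0.1 keeps two gRNAs in each well, which matches the target exactly.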