💾 Archived View for republic.circumlunar.space › users › johngodlee › posts › 2020-06-05-split_speci… captured on 2023-03-20 at 18:59:14. Gemini links have been rewritten to link to archived content
⬅️ Previous capture (2021-12-04)
-=-=-=-=-=-=-
DATE: 2020-06-05
AUTHOR: John L. Godlee
For my research assistant position I have been cleaning lots of taxonomic data for tree species in southern Africa. On the surface this seems simple, Brachystegia spiciformis gets split into c("Brachystegia", "spiciformis"). However, what about when the species is written as Brachystegia spiciformis var. kwangensis? Here is a list of possible species name forms I found in my dataset:
And that isn't counting the species with multiple below-species taxonomic ranks, like: Vachellia gerrardii subsp. gerrardii var. latisiliqua.
Separating these out by hand would take a very long time, so I wrote a function which does it for me.
First the function splits strings by spaces or optionally dots with no spaces, then it searches to see if a species is cf., meaning that the absolute species isn't known but a guess has been made, in which case species is replaces with indet (indeterminate) and the species is stored in the confer column. Then a similar process to search for both varieties and subspecies. If below-species ranks are to be returned then the dataframe is returned as is, otherwise the confer column replaces the indet in species if below-species ranks are not returned.
This function doesn't catch Brachystegia sp.2, but I have a separate function which replaces these with Brachystegia indet based on a lookup table supplied by the user.
#' Split full species name into genus, species, and optionally below-species taxonomic ranks #' #' @param x vector of genus and species names #' @param subsp logical, should lower taxonomic ranks be returned? #' #' @return dataframe of character vectors with one column per rank #' #' @export #' splitSpecies <- function(x, subsp = TRUE) { x <- strsplit(x, " |[a-z]\\.[a-z]") x <- lapply(x, function(y) { # genus genus <- y[1] # cf and species if (grepl("cf(\\.)?", y[2])) { species <- "indet" cf <- y[3] plus <- 1 } else { species <- y[2] cf <- NA_character_ plus <- 0 } if (!is.na(y[3+plus])) { sub_string <- paste(y[(3+plus):length(y)], collapse = " ") # variety if present if (grepl("var(\\.)?", sub_string)) { string <- strsplit(sub_string, " ") variety <- string[[1]][which(grepl("var(\\.)", string[[1]])) + 1] } else { variety <- NA_character_ } # subspecies if present if (grepl("subs(p)?(\\.)?", sub_string)) { string <- strsplit(sub_string, " ") subspecies <- string[[1]][which(grepl("subs(p)?(\\.)?", string[[1]])) + 1] } else { subspecies <- NA_character_ } c(genus, species, cf, subspecies, variety) } else { c(genus, species, cf, NA_character_, NA_character_) } }) out <- as.data.frame(do.call(rbind, x)) names(out) <- c("genus", "species", "confer", "subspecies", "variety")[1:length(out)] # Replace cf. as species is subsp. == FALSE if (subsp) { out <- out } else { out$species[!is.na(out$confer)] <- out$confer[!is.na(out$confer)] out <- out[,c("genus", "species")] } return(out) }