repository for geography work and other things completed by kufre u.
Although I was originally reluctant to solely use dplyr
and purrr
because my first two slower, I went on to try to replace both loops in
st_segments
with tidyverse
functions.
st_segments_tbl = function(x, max_length) {
# breaks linestrings or polygons into line segments, keeps attributes
# x: layer with linestrings/polygons
# max_length: optional parameter, max length of segments, unit from crs
library(sf)
library(dplyr)
library(purrr)
layer = mutate(x, value = row_number())
if (!missing(max_length)) {
layer = st_segmentize(layer, max_length)
}
coord = st_coordinates(layer)
col = ncol(coord)
geom = as_tibble(coord) %>%
group_by(!!sym(colnames(coord)[col])) %>%
mutate(X2 = lead(X),
Y2 = lead(Y)) %>%
ungroup() %>%
filter(!is.na(X2)) %>%
mutate(geometry = pmap(list(X, Y, X2, Y2), function(x, y, x2, y2)
st_linestring(rbind(
c(x, y), c(x2, y2)
)))) %>%
{
st_sfc(.$geometry, crs = st_crs(layer))
}
as_tibble(coord[, col]) %>%
group_by(value) %>%
slice(-1) %>%
ungroup %>%
left_join(st_drop_geometry(layer), by = "value") %>%
select(-1) %>%
mutate(geometry = geom) %>%
st_sf()
}
Rather than casting the features to points and working with two geometry
columns as I had with my first two attempts of the function, I went on
to make the coordinate matrix from st_coordinates
a tibble and
replicated the loop in st_segments
with pmap
from purrr
and
lead
. I also found that using st_as_binary
here is wholly
unnecessary and that doing without it leads to a considerable speed
increase. Straying away from the loops also makes the function more
readable.
bench::mark(
base = st_geometry(st_segments(streets, 1)),
"dplyr/purrr" = st_geometry(st_segments_tbl(streets, 1)),
iterations = 5
) %>%
mutate(name = as.character(expression),
max = bench::as_bench_time(lapply(time, max))) %>%
select(
name,
min,
median,
max,
"memory allocation" = mem_alloc,
iterations = n_itr,
"total time" = total_time
)
## # A tibble: 2 x 7
## name min median max `memory allocatio~ iterations `total time`
## <chr> <bch:tm> <bch:t> <bch:t> <bch:byt> <int> <bch:tm>
## 1 base 31.71s 32.76s 33.68s 2.9GB 5 2.71m
## 2 dplyr/pur~ 5.26s 5.44s 6.34s 131.8MB 5 28.04s
st_segments_tbl
uses significantly less memory than st_segments
and
takes a sixth of the time to run which is remarkable. It also runs
faster than the previous dplyr
version of st_segments
. What really
slows st_segments
down seems to be the first loop as replacing the
second loop with either dplyr
or data.table
code did allow for the
function to run a little bit faster, though not considerably. This next
benchmark shows that this is indeed the case.
test = st_coordinates(streets)
bench::mark(
loop = {
(function(x) {
col = ncol(x)
ids = x[, col]
feat = unique(ids)
segments = list()
for (i in feat) {
coords = x[ids == i, ]
for (j in 1:(nrow(coords) - 1)) {
segments[[length(segments) + 1]] = st_as_binary(st_linestring(coords[j:(j + 1), 1:2]))
}
}
st_as_sfc(segments)
})(test)
},
tidy = {
(function(x){
as_tibble(x) %>%
group_by(L1) %>%
mutate(X2 = lead(X),
Y2 = lead(Y)) %>%
ungroup() %>%
filter(!is.na(X2)) %>%
mutate(geometry = pmap(list(X, Y, X2, Y2), function(x, y, x2, y2)
st_linestring(rbind(
c(x, y), c(x2, y2)
)))) %>%
{
st_sfc(.$geometry)
}
})(test)
},
iterations = 50) %>%
mutate(name = as.character(expression),
max = bench::as_bench_time(lapply(time, max))) %>%
select(
name,
min,
median,
max,
"memory allocation" = mem_alloc,
iterations = n_itr,
"total time" = total_time
)
## # A tibble: 2 x 7
## name min median max `memory allocation` iterations `total time`
## <chr> <bch:tm> <bch:tm> <bch:tm> <bch:byt> <int> <bch:tm>
## 1 loop 981ms 1.07s 1.44s 68.44MB 50 53.85s
## 2 tidy 172ms 194.54ms 266.53ms 2.29MB 50 9.83s
I was having trouble using sym
within bench::mark
so I needed to
explicitly state which column to group by in the tidy version. Even when
working with a smaller dataset, the same difference in time can be
observed between using a loop and dplyr
/purrr
. Looping through the
st_coordinates
matrix still takes around six times as long to run.
Given just how much faster and more efficient st_segments_tbl
is than
st_segments
, st_segments_tbl
may be my best option for now. Tibbles
print nicely so that’s also a plus.