Caught segfault: 'memory not mapped' error in R

Updated: 2024-10-26 00:31:20

Problem description

I have a problem running some R scripts on our cluster. The problem appeared suddenly: all the scripts were working fine, but one day they started failing with a caught segfault error. I cannot provide reproducible code because I cannot reproduce the error on my own computer; it only happens on the cluster. I also use the same code for two data sets. One is quite small and runs fine; the other works with bigger data frames (about 10 million rows) and crashes at certain points. I only use packages from the CRAN repository, and R and all the packages should be up to date. The error shows up in totally unrelated operations, as in the examples below:

Session info:

    R version 3.4.3 (2017-11-30)
    Platform: x86_64-redhat-linux-gnu (64-bit)
    Running under: CentOS Linux 7 (Core)

Writing a variable to a NetCDF file

    # code snippet
    library(ncdf4)
    library(reshape2)

    input <- read.csv("input_file.csv")
    species <- "no2"

    dimX <- ncdim_def(name = "x", units = "m", vals = unique(input$x), unlim = FALSE)
    dimY <- ncdim_def(name = "y", units = "m", vals = unique(input$y), unlim = FALSE)
    dimTime <- ncdim_def(name = "time", units = "hours", unlim = TRUE)

    varOutput <- ncvar_def(name = species, units = "ug/m3",
                           dim = list(dimX, dimY, dimTime),
                           missval = -9999, longname = species)

    nc_file <- nc_create(filename = "outFile.nc", vars = list(varOutput), force_v4 = TRUE)

    ncvar_put(nc = nc_file, varid = species, vals = acast(input, x ~ y),
              start = c(1, 1, 1),
              count = c(length(unique(input$x)), length(unique(input$y)), 1))

At this point, I get the following error:

     *** caught segfault ***
    address 0x2b607cac2000, cause 'memory not mapped'

    Traceback:
     1: id(rev(ids), drop = FALSE)
     2: cast(data, formula, fun.aggregate, ..., subset = subset, fill = fill, drop = drop, value.var = value.var)
     3: acast(result, x ~ y)
     4: ncvar_put(nc = nc_file, varid = species, vals = acast(result, x ~ y), start = c(1, 1), count = c(length(unique(result$x)), length(unique(result$y))))
    An irrecoverable exception occurred. R is aborting now ...
    /opt/sge/default/spool/node10/job_scripts/122270: line 3: 13959 Segmentation fault (core dumped)

Complex code with parallel computation

     *** caught segfault ***
    address 0x330d39b40, cause 'memory not mapped'

    Traceback:
     1: .Call(gstat_fit_variogram, as.integer(fit.method), as.integer(fit.sills), as.integer(fit.ranges))
     2: fit.variogram(experimental_variogram, model = vgm(psill = psill, model = model, range = range, nugget = nugget, kappa = kappa), fit.ranges = c(fit_range), fit.sills = c(fit_nugget, fit_sill), debug.level = 0)
     3: doTryCatch(return(expr), name, parentenv, handler)
     4: tryCatchOne(expr, names, parentenv, handlers[[1L]])
     5: tryCatchList(expr, classes, parentenv, handlers)
     6: tryCatch(expr, error = function(e) {
            call <- conditionCall(e)
            if (!is.null(call)) {
                if (identical(call[[1L]], quote(doTryCatch)))
                    call <- sys.call(-4L)
                dcall <- deparse(call)[1L]
                prefix <- paste("Error in", dcall, ": ")
                LONG <- 75L
                msg <- conditionMessage(e)
                sm <- strsplit(msg, " ")[[1L]]
                w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w")
                if (is.na(w))
                    w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L], type = "b")
                if (w > LONG)
                    prefix <- paste0(prefix, " ")
            }
            else prefix <- "Error : "
            msg <- paste0(prefix, conditionMessage(e), " ")
            .Internal(seterrmessage(msg[1L]))
            if (!silent && identical(getOption("show.error.messages"), TRUE)) {
                cat(msg, file = outFile)
                .Internal(printDeferredWarnings())
            }
            invisible(structure(msg, class = "try-error", condition = e))
        })
     7: try(fit.variogram(experimental_variogram, model = vgm(psill = psill, model = model, range = range, nugget = nugget, kappa = kappa), fit.ranges = c(fit_range), fit.sills = c(fit_nugget, fit_sill), debug.level = 0), TRUE)
     8: getModel(initial_sill - initial_nugget, m, initial_range, k, initial_nugget, fit_range, fit_sill, fit_nugget, verbose = verbose)
     9: autofitVariogram(lmResids ~ 1, obsDf, model = "Mat", kappa = c(0.05, seq(0.2, 2, 0.1), 3, 5, 10, 15), fix.values = c(NA, NA, NA), start_vals = c(NA, NA, NA), verbose = F)
    10: main_us(obsDf[obsDf$class == "rural" | obsDf$class == "rural-nearcity" | obsDf$class == "rural-regional" | obsDf$class == "rural-remote", ], grd_alt, grd_pop, lm_us, fitvar_us, logTransform, plots, "RuralSt", period, preds)
    11: doTryCatch(return(expr), name, parentenv, handler)
    12: tryCatchOne(expr, names, parentenv, handlers[[1L]])
    13: tryCatchList(expr, classes, parentenv, handlers)
    14: tryCatch(main_us(obsDf[obsDf$class == "rural" | obsDf$class == "rural-nearcity" | obsDf$class == "rural-regional" | obsDf$class == "rural-remote", ], grd_alt, grd_pop, lm_us, fitvar_us, logTransform, plots, "RuralSt", period, preds), error = function(e) e)
    15: eval(.doSnowGlobals$expr, envir = .doSnowGlobals$exportenv)
    16: eval(.doSnowGlobals$expr, envir = .doSnowGlobals$exportenv)
    17: doTryCatch(return(expr), name, parentenv, handler)
    18: tryCatchOne(expr, names, parentenv, handlers[[1L]])
    19: tryCatchList(expr, classes, parentenv, handlers)
    20: tryCatch(eval(.doSnowGlobals$expr, envir = .doSnowGlobals$exportenv), error = function(e) e)
    21: (function (args) {
            lapply(names(args), function(n) assign(n, args[[n]], pos = .doSnowGlobals$exportenv))
            tryCatch(eval(.doSnowGlobals$expr, envir = .doSnowGlobals$exportenv), error = function(e) e)
        })(quote(list(timeIndex = 255L)))
    22: do.call(msg$data$fun, msg$data$args, quote = TRUE)
    23: doTryCatch(return(expr), name, parentenv, handler)
    24: tryCatchOne(expr, names, parentenv, handlers[[1L]])
    25: tryCatchList(expr, classes, parentenv, handlers)
    26: tryCatch(do.call(msg$data$fun, msg$data$args, quote = TRUE), error = handler)
    27: doTryCatch(return(expr), name, parentenv, handler)
    28: tryCatchOne(expr, names, parentenv, handlers[[1L]])
    29: tryCatchList(expr, classes, parentenv, handlers)
    30: tryCatch({
            msg <- recvData(master)
            if (msg$type == "DONE") {
                closeNode(master)
                break
            }
            else if (msg$type == "EXEC") {
                success <- TRUE
                handler <- function(e) {
                    success <<- FALSE
                    structure(conditionMessage(e), class = c("snow-try-error", "try-error"))
                }
                t1 <- proc.time()
                value <- tryCatch(do.call(msg$data$fun, msg$data$args, quote = TRUE), error = handler)
                t2 <- proc.time()
                value <- list(type = "VALUE", value = value, success = success, time = t2 - t1, tag = msg$data$tag)
                msg <- NULL
                sendData(master, value)
                value <- NULL
            }
        }, interrupt = function(e) NULL)
    31: slaveLoop(makeSOCKmaster(master, port, timeout, useXDR))
    32: parallel:::.slaveRSOCK()
    An irrecoverable exception occurred. R is aborting now ...

Is it likely that the issue lies with the cluster rather than with the code (or R)? I don't know whether it is related, but for some time now we have been getting error messages like these:

    Message from syslogd@master1 at Mar 8 13:51:37 ...
     kernel:[Hardware Error]: MC4 Error (node 1): DRAM ECC error detected on the NB.

    Message from syslogd@master1 at Mar 8 13:51:37 ...
     kernel:[Hardware Error]: Error Status: Corrected error, no action required.

    Message from syslogd@master1 at Mar 8 13:51:37 ...
     kernel:[Hardware Error]: CPU:4 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x9c08400067080a13

    Message from syslogd@master1 at Mar 8 13:51:37 ...
     kernel:[Hardware Error]: MC4_ADDR: 0x000000048f32b490

    Message from syslogd@master1 at Mar 8 13:51:37 ...
     kernel:[Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)

I have tried uninstalling and reinstalling the packages, as suggested in this question, but it didn't help.
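As a rough sketch of one check that often goes along with reinstalling, the snippet below lists installed packages whose "Built" field differs from the running R version. That such a mismatch plays any role in this particular case is an assumption; the question does not confirm it. The variable names (ip, pkgs, running, stale) are illustrative only.

```r
# Collect name and build version of every installed package.
ip <- installed.packages()
pkgs <- data.frame(
  Package = unname(ip[, "Package"]),
  Built   = unname(ip[, "Built"]),
  stringsAsFactors = FALSE
)

# Version of the R session that is running right now, e.g. "3.4.3".
running <- as.character(getRversion())

# Packages built under a different R version (assumption: these are the
# candidates worth reinstalling first).
stale <- pkgs[pkgs$Built != running, ]
print(stale$Package)
```

Any packages listed could then be reinstalled with install.packages(stale$Package), which needs access to a CRAN mirror.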

Recommended answer

This is not really an explanation of the problem or a satisfactory answer, but I examined the code more closely and found that in the first example the problem appears when using acast from the reshape2 package. I removed it in this case because I realized it is not actually needed there, but it can be replaced with the base R reshape function from the stats package (as shown in another question): reshape(input, idvar="x", timevar="y", direction="wide")[-1].
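A minimal sketch of that replacement on toy data follows. The data frame and its three columns (x, y, no2) are hypothetical, since the layout of the real input_file.csv is not shown in the question.

```r
# Hypothetical long-format input: one row per (x, y) grid cell.
input <- data.frame(
  x   = rep(c(0, 100, 200), times = 2),
  y   = rep(c(0, 50), each = 3),
  no2 = c(11, 12, 13, 21, 22, 23)
)

# Base R reshape(): one row per x value, one column per y value.
wide <- reshape(input, idvar = "x", timevar = "y", direction = "wide")

# Drop the leading x column and convert to a numeric matrix,
# which is the shape ncvar_put() expects for a single time step.
vals <- as.matrix(wide[-1])
print(dim(vals))
```

Here rows follow the order in which x values first appear and columns follow the order of y values, so it is worth checking that this matches the order of unique(input$x) and unique(input$y) used for the NetCDF dimensions.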

As for the second example, it is not easy to find the exact cause of the problem, but as a workaround it helped in my case to use fewer cores for the parallel computation. The cluster has 48 cores and I was using only 15, since even before this issue R was running out of memory when the code ran on all 48. When I reduced the number of cores to 10, it suddenly started working as before.
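The workaround can be sketched with the base parallel package as below. The traceback suggests the original code used doSNOW, so this is an illustration of capping the worker count rather than the original setup; pick_workers is a hypothetical helper and the squared-number task is a placeholder for the real work.

```r
library(parallel)

# Hypothetical helper: never start more workers than the cap,
# and fall back to one worker if the core count is unknown.
pick_workers <- function(cap, available = detectCores()) {
  if (is.na(available)) available <- 1L
  as.integer(max(1L, min(cap, available)))
}

# The answer went from 15 workers down to 10 on a 48-core machine.
n_workers <- pick_workers(cap = 10L)

cl <- makeCluster(n_workers)
res <- parLapply(cl, 1:20, function(i) i^2)  # placeholder task
stopCluster(cl)
```

Each worker is a separate R process with its own copy of the exported data, so fewer workers also means lower peak memory use, which fits the out-of-memory behaviour described above.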

Published: 2023-08-05 05:41:18
Article link: https://www.elefans.com/category/jswz/34/1303014.html