How to truncate a timestamp from seconds to the nearest minute in PySpark 1.5


I am using PySpark. I have a column ('dt') in a dataframe ('canon_evt') that is a timestamp. I am trying to remove the seconds from a DateTime value. It is originally read in from parquet as a String. I then try to convert it to Timestamp via

canon_evt = canon_evt.withColumn('dt', to_date(canon_evt.dt))
canon_evt = canon_evt.withColumn('dt', canon_evt.dt.astype('Timestamp'))

Then I would like to remove the seconds. I tried 'trunc' and 'date_format', and even tried concatenating pieces together as below. I think it requires some sort of map and lambda combination, but I'm not certain whether Timestamp is an appropriate format, and whether it's possible to get rid of the seconds.

canon_evt = canon_evt.withColumn('dyt', year('dt') + '-' + month('dt') + '-' + dayofmonth('dt') + ' ' + hour('dt') + ':' + minute('dt'))

[Row(dt=datetime.datetime(2015, 9, 16, 0, 0), dyt=None)]
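As the Row output above shows, `to_date` had already collapsed `dt` to midnight, and `+` on columns is numeric addition rather than string concatenation, which is why `dyt` came back as None. For comparison, outside Spark the same "drop the seconds" operation on a single Python datetime is just a `replace`:

```python
from datetime import datetime

dt = datetime(2015, 9, 16, 5, 39, 46)

# Zero out the sub-minute fields; this truncates (floors), it does not round.
truncated = dt.replace(second=0, microsecond=0)
print(truncated)  # 2015-09-16 05:39:00
```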

Solution

Converting to Unix timestamps and basic arithmetic should do the trick:

from pyspark.sql import Row
from pyspark.sql.functions import col, unix_timestamp, round

df = sc.parallelize([
    Row(dt='1970-01-01 00:00:00'),
    Row(dt='2015-09-16 05:39:46'),
    Row(dt='2015-09-16 05:40:46'),
    Row(dt='2016-03-05 02:00:10'),
]).toDF()

## unix_timestamp converts string to Unix timestamp (bigint / long)
## in seconds. Divide by 60, round, multiply by 60 and cast
## should work just fine.
dt_truncated = ((round(unix_timestamp(col("dt")) / 60) * 60)
    .cast("timestamp"))

df.withColumn("dt_truncated", dt_truncated).show(10, False)
## +-------------------+---------------------+
## |dt                 |dt_truncated         |
## +-------------------+---------------------+
## |1970-01-01 00:00:00|1970-01-01 00:00:00.0|
## |2015-09-16 05:39:46|2015-09-16 05:40:00.0|
## |2015-09-16 05:40:46|2015-09-16 05:41:00.0|
## |2016-03-05 02:00:10|2016-03-05 02:00:00.0|
## +-------------------+---------------------+
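One caveat about this approach: `round` snaps to the nearest minute, as the output above shows (05:39:46 becomes 05:40:00). If truncating down is what's wanted, swap `round` for `floor` (also in `pyspark.sql.functions`). The underlying arithmetic, sketched in plain Python on epoch seconds:

```python
from datetime import datetime, timezone

def floor_to_minute(ts_seconds):
    # Drop the leftover seconds within the current minute.
    return ts_seconds - ts_seconds % 60

ts = int(datetime(2015, 9, 16, 5, 39, 46, tzinfo=timezone.utc).timestamp())
print(datetime.fromtimestamp(floor_to_minute(ts), tz=timezone.utc))
# 2015-09-16 05:39:00+00:00
```

Later Spark releases (2.3+) also ship `date_trunc('minute', col)`, which does this truncation directly, but it is not available in 1.5.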


Published: 2023-10-16 05:30:16
Link: https://www.elefans.com/category/jswz/34/1496639.html