pandas provides the cumsum() function for computing a cumulative sum over a column. Example usage:
import pandas as pd

# A 0/1 indicator column to accumulate.
data = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
data = pd.DataFrame(data, columns=['diff'])
# Running total of 'diff' down the column.
data['cumsum_num'] = data['diff'].cumsum()
print(data)
Output:
    diff  cumsum_num
0      1           1
1      0           1
2      0           1
3      0           1
4      1           2
5      0           2
6      0           2
7      1           3
8      0           3
9      0           3
10     0           3
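As a related aside, cumsum() can also restart its running total per group via groupby(). A minimal sketch, where the grouping column 'key' is a hypothetical addition for illustration and not part of the example above:

import pandas as pd

# 'key' is an assumed grouping column, added only for this sketch.
df = pd.DataFrame({'key': ['a', 'a', 'b', 'b'], 'diff': [1, 0, 1, 1]})
# The running total resets at the start of each group.
df['cumsum_num'] = df.groupby('key')['diff'].cumsum()
print(df)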
PySpark has no cumsum() function for summing a column cumulatively in one call. A cumulative sum can instead be computed by summing over a Window ordered on a column that defines the row order, as shown below:
Create the DataFrame object:
import pyspark
from pyspark.sql import functions as fn
from pyspark.sql import SparkSession
from pyspark.sql import Window
import pandas as pd

# Empty configuration; pass key-value pairs to setAll() as needed.
conf = pyspark.SparkConf().setAll([])
spark_session = SparkSession.builder.appName('test_app').config(conf=conf).getOrCreate()
sc = spark_session.sparkContext
sc.setLogLevel('WARN')

# Build the same 0/1 indicator data, plus a 'number' column to order by.
data = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
data = pd.DataFrame(data, columns=['diff'])
data['number'] = range(len(data))
data = spark_session.createDataFrame(data, schema=['diff', 'number'])
data.show()
Original DataFrame:
+----+------+
|diff|number|
+----+------+
| 1| 0|
| 0| 1|
| 0| 2|
| 0| 3|
| 1| 4|
| 0| 5|
| 0| 6|
| 1| 7|
| 0| 8|
| 0| 9|
| 0| 10|
+----+------+
Order by number to compute the cumulative sum:
# An ordered window's default frame accumulates from the first row
# up to the current row, so sum() over it yields a running total.
win = Window.orderBy('number')
data = data.withColumn('cumsum_num', fn.sum(data['diff']).over(win))
data.show()
Result:
+----+------+----------+
|diff|number|cumsum_num|
+----+------+----------+
| 1| 0| 1|
| 0| 1| 1|
| 0| 2| 1|
| 0| 3| 1|
| 1| 4| 2|
| 0| 5| 2|
| 0| 6| 2|
| 1| 7| 3|
| 0| 8| 3|
| 0| 9| 3|
| 0| 10| 3|
+----+------+----------+
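Note that a Window with orderBy() but no partitionBy() moves all rows into a single partition, so this approach does not scale to large data; Spark logs a performance warning when such a window is evaluated. If the data carries a natural grouping key, partitioning the window keeps the computation distributed. A minimal sketch, assuming a hypothetical grouping column named 'group' (not present in the example above):

# 'group' is an assumed column; the running total restarts per group.
win = Window.partitionBy('group').orderBy('number')
data = data.withColumn('cumsum_num', fn.sum('diff').over(win))

Also note that an ordered window defaults to a range-based frame, so rows with tied values in the ordering column are summed together. If 'number' could contain ties and strict row-by-row accumulation is wanted, the frame can be made explicit:

win = Window.orderBy('number').rowsBetween(Window.unboundedPreceding, Window.currentRow)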