a.show()

+----+----+
| foo|foo1|
+----+----+
|null|null|
+----+----+

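Applying a few regular expressions and converting to an rdd may work for you.

First, read the file as plain text, one row per line, using spark.read.text: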

a=spark.read.option('multiline',"true").text('aa.json')
a.show(truncate=False)

+-------------------------------------+
|value                                |
+-------------------------------------+
|[[{"foo":"test1"},{"foo1":"test21"}],|
|[{"foo":"test2"},{"foo1":"test22"}], |
|[{"foo":"test3"},{"foo1":"test23"}]] |
+-------------------------------------+

Now we can use pyspark.sql.functions.regexp_replace to remove the extra square brackets and the trailing comma from each line:

from pyspark.sql.functions import regexp_replace
a = a.select(regexp_replace("value", r"(^\[(?=\[))|((?<=\])\]$)|(,$)", "").alias("value"))
a.show(truncate=False)

+-----------------------------------+
|value                              |
+-----------------------------------+
|[{"foo":"test1"},{"foo1":"test21"}]|
|[{"foo":"test2"},{"foo1":"test22"}]|
|[{"foo":"test3"},{"foo1":"test23"}]|
+-----------------------------------+

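The pattern here is a logical OR of the following patterns:

^\[(?=\[): a [ at the start of the string that is followed by another [ (the second [ is matched by a lookahead and is not consumed)
(?<=\])\]$: a ] at the end of the string that is preceded by another ] (the first ] is matched by a lookbehind and is not consumed)
,$: a comma at the end of the string
Any match is replaced with an empty string.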
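The alternation can be checked without a Spark session. The following is a minimal sketch using Python's re module, on the assumption that Java's regex engine (which Spark uses) treats this particular pattern the same way. The names PATTERN, lines, and cleaned are illustrative only and do not appear in the original answer.

```python
import re

# Same alternation as in the regexp_replace call, written as a raw string
# so that the backslashes reach the regex engine unchanged.
PATTERN = re.compile(r"(^\[(?=\[))|((?<=\])\]$)|(,$)")

# The three lines of the sample file, as shown in the first table.
lines = [
    '[[{"foo":"test1"},{"foo1":"test21"}],',
    '[{"foo":"test2"},{"foo1":"test22"}],',
    '[{"foo":"test3"},{"foo1":"test23"}]]',
]

# Replace every match with the empty string, one line at a time.
cleaned = [PATTERN.sub("", line) for line in lines]

for line in cleaned:
    print(line)
```

Each line is left with exactly one pair of enclosing brackets, so it is a valid JSON array on its own.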

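Now convert to an rdd and use json.loads to parse each row into a list of dictionaries. Then merge all of these dictionaries into a single dictionary and pass it to the pyspark.sql.Row constructor. Finally, call .toDF to convert back to a DataFrame.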

# From `How to merge two dictionaries in a single expression?`
# This code works for python 2 and 3
def merge_two_dicts(x, y):
    z = x.copy()   # start with x's keys and values
    z.update(y)    # modifies z with y's keys and values & returns None
    return z

import json
from pyspark.sql import Row
from functools import reduce

a.rdd.map(lambda x: Row(**reduce(merge_two_dicts, json.loads(x['value'])))).toDF().show()

+-----+------+
|  foo|  foo1|
+-----+------+
|test1|test21|
|test2|test22|
|test3|test23|
+-----+------+
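The parse-and-merge step inside the lambda can also be tried in plain Python, which makes it easier to see what each Row is built from. This is a minimal sketch, assuming the cleaned lines shown in the second table; the variable names cleaned and rows are illustrative, and no Spark objects are involved.

```python
import json
from functools import reduce


def merge_two_dicts(x, y):
    """Return a new dict with x's items, overridden by y's on key clashes."""
    z = x.copy()
    z.update(y)
    return z


# One JSON array of single-key objects per line, as produced by the regex step.
cleaned = [
    '[{"foo":"test1"},{"foo1":"test21"}]',
    '[{"foo":"test2"},{"foo1":"test22"}]',
    '[{"foo":"test3"},{"foo1":"test23"}]',
]

# json.loads gives a list of dicts; reduce folds them into one dict per line.
rows = [reduce(merge_two_dicts, json.loads(line)) for line in cleaned]

for row in rows:
    print(row)
```

Each merged dictionary has the keys foo and foo1, which is what Row(**...) turns into the two columns of the final DataFrame. If two objects on the same line shared a key, the later one would win.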
