I am loading the JSON string below into a dataframe column.

{
    "title": {
        "titleid": "222",
        "titlename": "ABCD"
    },
    "customer": {
        "customerDetail": {
            "customerid": 878378743,
            "customerstatus": "ACTIVE",
            "customersystems": {
                "customersystem1": "SYS01",
                "customersystem2": null
            },
            "sysid": null
        },
        "persons": [{
            "personid": "123",
            "personname": "IIISKDJKJSD"
        },
        {
            "personid": "456",
            "personname": "IUDFIDIKJK"
        }]
    }
}

val js = spark.read.json("./src/main/resources/json/customer.txt")
println(js.schema)
val newDF = df.select(from_json($"value", js.schema).as("parsed_value"))
newDF.selectExpr("parsed_value.customer.*").show(false)
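One caveat worth noting: by default `spark.read.json` expects one JSON object per line (JSON Lines). If `customer.txt` contains a pretty-printed, multi-line document like the one above, the `multiLine` reader option (available since Spark 2.2) is needed, for example:

```scala
// Sketch: read a pretty-printed (multi-line) JSON file.
// Without this option, a multi-line document yields a _corrupt_record column.
val js = spark.read
  .option("multiLine", true)
  .json("./src/main/resources/json/customer.txt")
```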

// Schema:

StructType(StructField(customer,StructType(StructField(customerDetail,StructType(StructField(customerid,LongType,true), StructField(customerstatus,StringType,true), StructField(customersystems,StructType(StructField(customersystem1,StringType,true), StructField(customersystem2,StringType,true)),true), StructField(sysid,StringType,true)),true), StructField(persons,ArrayType(StructType(StructField(personid,StringType,true), StructField(personname,StringType,true)),true),true)),true), StructField(title,StructType(StructField(titleid,StringType,true), StructField(titlename,StringType,true)),true)) 

// Output:

+------------------------------+---------------------------------------+
|customerDetail                |persons                                |
+------------------------------+---------------------------------------+
|[878378743, ACTIVE, [SYS01,],]|[[123, IIISKDJKJSD], [456, IUDFIDIKJK]]|
+------------------------------+---------------------------------------+

My question: is there a way to split the keys into separate dataframe columns as shown below, while keeping the array columns as they are? I need only one record per JSON string.

Example for the customer column:

customer.customerDetail.customerid,customer.customerDetail.customerstatus,customer.customerDetail.customersystems.customersystem1,customer.customerDetail.customersystems.customersystem2,customer.customerDetail.sysid,customer.persons
878378743,ACTIVE,SYS01,null,null,{"persons": [ { "personid": "123", "personname": "IIISKDJKJSD" }, { "personid": "456", "personname": "IUDFIDIKJK" } ] }
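The layout asked for above (struct fields promoted to top-level columns, the array kept as a single JSON string, one row per input record) can be sketched by selecting the nested fields and serializing the array with `to_json`. This builds on the question's `newDF` and is an assumption about the intended result, not code from the question:

```scala
import org.apache.spark.sql.functions.{col, to_json}

// Sketch: flatten the customerDetail struct into top-level columns and keep
// the persons array as one JSON string, so each input record stays one row.
val flattened = newDF.select(
  col("parsed_value.customer.customerDetail.customerid").as("customerid"),
  col("parsed_value.customer.customerDetail.customerstatus").as("customerstatus"),
  col("parsed_value.customer.customerDetail.customersystems.customersystem1").as("customersystem1"),
  col("parsed_value.customer.customerDetail.customersystems.customersystem2").as("customersystem2"),
  col("parsed_value.customer.customerDetail.sysid").as("sysid"),
  to_json(col("parsed_value.customer.persons")).as("persons") // array serialized back to JSON
)
flattened.show(false)
```

`to_json` is available from Spark 2.1 onward; it turns the array-of-structs column back into its JSON text representation without exploding rows.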

===============>>#1 Votes: 0

You could try this with the help of an RDD: define the column names on an empty RDD, then read the JSON, convert it to a DataFrame with .toDF(), and iterate it into the empty RDD.


===============>>#2 Votes: 0

Leibnitz, is this the transformation you are looking for?

val df = spark.read.json("jsonToHdfs/src/test/resources/json.json")

df.select(
  $"customer.customerDetail.customerid".as("customerId"),
  $"customer.customerDetail.customerstatus".as("customerstatus"),
  $"customer.customerDetail.customersystems.customersystem1".as("customersystem1"),
  $"customer.customerDetail.customersystems.customersystem2".as("customersystem2"),
  $"customer.customerDetail.sysid".as("sysid"),
  explode($"customer.persons").as("person"),
  $"title.titleid".as("titleid"),
  $"title.titlename".as("titlename")
).select(
  $"customerId", $"customerstatus", $"customersystem1", $"customersystem2", $"sysid",
  $"person.personid".as("personid"), $"person.personname".as("personname"), $"titleid", $"titlename"
).show(false)

Output:

+----------+--------------+---------------+---------------+-----+--------+-----------+-------+---------+
|customerId|customerstatus|customersystem1|customersystem2|sysid|personid|personname |titleid|titlename|
+----------+--------------+---------------+---------------+-----+--------+-----------+-------+---------+
|878378743 |ACTIVE        |SYS01          |null           |null |123     |IIISKDJKJSD|222    |ABCD     |
|878378743 |ACTIVE        |SYS01          |null           |null |456     |IUDFIDIKJK |222    |ABCD     |
+----------+--------------+---------------+---------------+-----+--------+-----------+-------+---------+

Here I select the nested fields from the structs and explode the persons array into multiple rows.
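For a wide schema, listing every nested field by hand gets tedious. The flattening can instead be derived from the schema itself; the helper below is my own sketch (the name `flattenSchema` is not from either answer). It expands struct fields into dotted column references and leaves arrays and scalars as single columns, preserving one row per record:

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StructField, StructType}

// Recursively expand StructType fields into dotted column references.
// Arrays and other non-struct types are kept as single columns, so no
// rows are multiplied (unlike explode).
def flattenSchema(schema: StructType, prefix: String = ""): Seq[Column] =
  schema.fields.flatMap {
    case StructField(name, inner: StructType, _, _) =>
      flattenSchema(inner, prefix + name + ".")
    case StructField(name, _, _, _) =>
      Seq(col(prefix + name).as(prefix + name))
  }

// Usage sketch:
// df.select(flattenSchema(df.schema): _*).show(false)
```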

