Retargeting Android Applications to Java Bytecode
tl;dr (too long; didn't read)
This paper expands on ded to more effectively convert Dalvik bytecode to Java class files (successfully retarged 99.99% of tested classes, 99.64% of apps' conversion completely verifiable).
- We introduce the Tyde intermediate representation for structured semantic mapping between the VMs. All 257 Dalvik instructions are translated using only 9 translation rules.
- Because sound bytecode typing is necessary for verifiability, we use a strong constraint-based type inference algorithm.
- We introduce code transformations to fix unverifiable input bytecode. In addition to making the code verifiable, these transformations accurately mirror VM runtime behavior.
Retargeting took 20 minutes for 1,100 apps. (top 50 most popular apps in the 22 app categories as of Sept 1, 2010) The complete processing of all applications including Dalvik pre-verification (modiﬁed Dalvik verifier), retargeting (Dare) and assembly (Jasmin) took less than 70 compute-minutes.
Over 20% of applications in the sample have unverifiable Dalvik bytecode in at least one class.
We are not concerned with optimality but only with semantic equivalence.
The insight behind Tyde is that, by typing all instruction arguments, load/store operations can be translated independently of opcodes.
The application bytecode is initially translated into the Tyde intermediate representation (IR) in three steps:
- the .dex file is parsed and code structures, methods and the global constant pool are interpreted and annotated,
- a control ﬂow graph is generated and
- register types used in ambiguous instructions are inferred.
The Java bytecode is thereafter generated from this IR in three phases:
- a pre-processing step generates labels and maps registers to local variables,
- the IR is translated to Jasmin code, and
- the Jasmin tool generates the final .class files.
The Android OS has emerged as the leading platform for SmartPhone applications. However, because Android applications are compiled from Java source into platform-specific Dalvik bytecode, existing program analysis tools cannot be used to evaluate their behavior. This paper develops and evaluates algorithms for retargeting Android applications received from markets to Java class files. The resulting Dare tool uses a new intermediate representation to enable fast and accurate retargeting. Dare further applies strong constraint solving to infer typing information and translates the 257 DVM opcodes using only 9 translation rules. It also handles cases where the input Dalvik bytecode is unverifiable. We evaluate Dare on 1,100 of the top applications found in the free section of the Android market and successfully retarget 99.99% of the 262,110 associated classes. Further, whereas existing tools can only fully retarget about half of these applications, Dare can recover over 99% of them. In this way, we open the door to users, developers and markets to use the vast array of program analysis tools to ensure the correct operation of Android applications.
This paper has great detail on the comparison between Java and Dalvik bytecode.
The translation rules are elegant and insightful- creating 9 rules to translate 257 opcodes is impressive.
They claim to be a more systematic approach to ded, but rather than comparing them they say a direct comparison doesn't make sense.
I wish they had given more details about the hardware used for the evaluation performance stats.
Dalvik instructions are vastly different from Java instructions. DVM bytecode has 257 different instructions and 3 pseudo-instructions. Dalvik instructions are two to ten bytes long, and pseudo-instructions have a variable length. The DVM has substantially more instruction formats (over 20) than the JVM.
The DVM is register-based, whereas the JVM is stack-based.
During the path-sensitive type verification process, the Java verifier considers that any instruction in a try block may throw an exception. In reality, not all instructions in each try block are able to throw exceptions. Therefore, the Java verifier considers some unfeasible execution paths. On the other hand, the Dalvik verifier does not consider these unfeasible paths.
In Tyde, we introduce the notion of typed registers. It adds two elements to Dalvik registers: a type (tau) and information about whether the register is a source or destination register (represented by terminals delta(s) and delta(d)).
While parsing instructions, type information for registers is determined. For example, the types of several unary and binary operators can be known from their opcode, e.g., an add-long instruction takes two long integers as sources and a long integer as destination. Also, during this parsing step, for every instruction which uses a constant pool reference, a new Java constant is generated on the fly. The only exception is when the instruction is an ambiguous numeric constant assignment. In that case, type information is needed before the constant can be created.
Type inference for Dalvik bytecode uses the following approach: First we generate constraints on types based on definitions and uses. These constraints are then solved to infer unknown types. Note that our goal is not to determine types for all variables.
A Dalvik code in Tyde IR is translated into Java bytecode in three steps. In the first step, registers are mapped to Java local variables and labels are generated to support control-flow instructions. In the second step, instructions in Tyde IR are converted to Jasmin instructions (Jasmin is a Java bytecode assembler). The third step is to use Jasmin to generate Java .class bytecode.
- References to classes which are not available within the application or in the core Android classes. A special case is when
the superclass of a class is missing; then the class is trivially unverifiable and is not even linked by the DVM.
- Apps commonly use private Android APIs.
- References to methods or fields which are non-existent or not accessible (e.g., private member).
- Developers often include entire libraries to be able to use some classes from these libraries. Parts of the included libraries sometimes make calls to other libraries, which are not themselves included with the Android application.
Typing and other issues
Invalid typing caused by malformed class or member identifier, illegal access flag, etc.
The following are relevant diagrams from the paper. Several diagrams and performance graphs have been omitted, see the paper
Overall Dare System Architecture
Example of Code Representation at each Step
Tyde IR Construction
blog comments powered by Disqus